[Big Data Analysis Engineer (빅데이터분석기사) Practical Exam] Preparing for Task Type 2

2024. 11. 26. 19:12

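Preparing for Task Type 2

Introduction

This post organizes my notes for Task Type 2 of the Big Data Analysis Engineer practical exam.
Task Type 2 covers building and evaluating a data model.
It collects variations of past exam problems from the 2nd through the 8th exam, together with worked solutions.
- Every problem is modeled with a random forest (Random Forest), which tends to give decent performance.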

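Random Forest

Concept

One of the ensemble learning (Ensemble Learning) methods
An algorithm that builds many decision trees (Decision Tree) and combines their results to improve predictive performance
It is flexible and powerful, but performance can suffer if the data is not prepared properly.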
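
To make "combining the results of many trees" concrete, here is a minimal sketch. It assumes synthetic data from `make_classification` (not one of the exam files) and checks that, in scikit-learn, the forest's class probabilities equal the average of the probabilities from its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (an assumption, not exam data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

# Average the class probabilities predicted by each individual tree
tree_probas = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
forest_probas = forest.predict_proba(X)

# True when the forest output matches the averaged tree outputs
print(np.allclose(tree_probas, forest_probas))
```

The fitted trees are available through `estimators_`, which is handy when checking how much the individual trees disagree.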

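Data preprocessing

1️⃣ The dependent variable does not need a separate conversion.

Random forests support both regression (Regression) and classification (Classification), so a categorical dependent variable is handled as classification and a continuous one as regression.
A categorical dependent variable therefore does not need to be converted before training.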
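
As a small illustration of this point, the sketch below fits `RandomForestClassifier` directly on string labels and `RandomForestRegressor` on a continuous target. The data is randomly generated and the `'Y'`/`'N'` labels are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # toy features (hypothetical)

# Categorical target given as strings: no manual label encoding is needed
y_class = np.where(X[:, 0] > 0, 'Y', 'N')
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y_class)
print(clf.classes_)         # classes learned from the string labels
print(clf.predict(X[:3]))   # predictions come back as the original labels

# Continuous target: use the regressor instead
y_reg = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y_reg)
print(reg.predict(X[:3]))
```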

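2️⃣ Feature scaling is not required.

Random forests are tree-based: each split compares one feature against a threshold, so the magnitude or scale of the data (standardization, normalization, and so on) has little effect on model performance.
Scaling is therefore an optional step.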
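
One way to check this claim is to train the same forest on raw and on standardized features and compare the predictions. This is only a sketch on synthetic data; with the same `random_state` the two models should agree on nearly every validation row, although exact equality is not guaranteed because of floating-point rounding.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption); one feature is put on a much larger scale
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[:, 0] *= 1000.0

# Fit the scaler on the training rows only, then transform everything
scaler = StandardScaler().fit(X[:200])
X_scaled = scaler.transform(X)

raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:200], y[:200])
scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled[:200], y[:200])

# Share of validation rows where both models predict the same class
agreement = np.mean(raw.predict(X[200:]) == scaled.predict(X_scaled[200:]))
print(agreement)
```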

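3️⃣ Categorical variables must be converted to numeric variables.

scikit-learn's random forest accepts only numeric input features, so categorical variables must be converted to numeric ones first.
One-hot encoding (One-Hot Encoding) or label encoding (Label Encoding) is typically used.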
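
The sketch below shows the one-hot encoding pattern used throughout this post: encode train and test separately with `pd.get_dummies`, then align the test columns to the training columns with `reindex`. The column names are borrowed from Problem 1, but the rows are made-up toy values.

```python
import pandas as pd

# Toy rows (hypothetical); 'Flight' and 'Road' never appear in the test data
train = pd.DataFrame({'Mode_of_Shipment': ['Ship', 'Flight', 'Road'],
                      'Weight_in_gms': [1200, 3000, 4500]})
test = pd.DataFrame({'Mode_of_Shipment': ['Ship', 'Ship'],
                     'Weight_in_gms': [2000, 2500]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# Categories absent from the test data leave columns missing; add them as zeros
# and put all columns in the same order as the training data
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(test_encoded.shape)
```

Without the `reindex` step, the model would see a different number of columns at prediction time than at training time and raise an error.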

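4️⃣ Missing values must be handled.

Older scikit-learn versions of the random forest reject missing values (native support was added only in recent releases), so the safe approach is to fill in missing values before training the model.
They are usually replaced with 0, the mode, the mean, or the median.
When many values are missing, choosing an appropriate strategy matters.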
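
A minimal imputation sketch follows, using the mode for a categorical column and the median for a numeric one. The column names echo Problem 7, but the values are invented, and computing the fill values from the training data only is my own choice here rather than something the exam prescribes.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical values)
train = pd.DataFrame({
    'station_name': ['A', 'B', None, 'A'],
    'visibility': [10.0, np.nan, 14.0, 12.0],
})
test = pd.DataFrame({'station_name': ['B', None], 'visibility': [np.nan, 9.0]})

# Compute the fill values from the training data only
fill_values = {
    'station_name': train['station_name'].mode()[0],   # most frequent category
    'visibility': train['visibility'].median(),        # median of the numeric column
}

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

# Remaining missing values in each frame
print(train_filled.isna().sum().sum(), test_filled.isna().sum().sum())
```

Note that `mode()` returns a Series, so `[0]` is needed to get a single fill value.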

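Problems

📎 Problem 1 (2021, 2nd exam)

Order data generated by a company
Build a model on the data in data_q1-01.csv that predicts whether an order can arrive on time, then create a CSV that records the predicted probability of on-time arrival for data_q1-02.csv.

The task is to predict whether an order arrives on time (Y/N), so this is modeled as classification (Classification).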


			
			
			
		
import pandas as pd
 
df1 = pd.read_csv('./datasets/data_q1-01.csv')
df2 = pd.read_csv('./datasets/data_q1-02.csv')
 
print(df1.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8009 entries, 0 to 8008
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ID                   8009 non-null   int64 
 1   Warehouse_block      8009 non-null   object
 2   Mode_of_Shipment     8009 non-null   object
 3   Customer_care_calls  8009 non-null   int64 
 4   Customer_rating      8009 non-null   int64 
 5   Cost_of_the_Product  8009 non-null   int64 
 6   Prior_purchases      8009 non-null   int64 
 7   Product_importance   8009 non-null   object
 8   Gender               8009 non-null   object
 9   Discount_offered     8009 non-null   int64 
 10  Weight_in_gms        8009 non-null   int64 
 11  Reached.on.Time_Y.N  8009 non-null   int64 
dtypes: int64(8), object(4)
memory usage: 751.0+ KB
"""
 
print(df2.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2990 entries, 0 to 2989
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ID                   2990 non-null   int64 
 1   Warehouse_block      2990 non-null   object
 2   Mode_of_Shipment     2990 non-null   object
 3   Customer_care_calls  2990 non-null   int64 
 4   Customer_rating      2990 non-null   int64 
 5   Cost_of_the_Product  2990 non-null   int64 
 6   Prior_purchases      2990 non-null   int64 
 7   Product_importance   2990 non-null   object
 8   Gender               2990 non-null   object
 9   Discount_offered     2990 non-null   int64 
 10  Weight_in_gms        2990 non-null   int64 
dtypes: int64(7), object(4)
memory usage: 257.1+ KB
"""
 
# (1) Handle missing values
## Not needed.

# (2) Drop unneeded variables
## Drop the ID column
X_train = df1.copy()
X_test = df2.copy()

X_train = X_train.drop('ID', axis=1)
X_test = X_test.drop('ID', axis=1)

# (3) Separate the dependent and independent variables
y = X_train['Reached.on.Time_Y.N']
X = X_train.drop('Reached.on.Time_Y.N', axis=1)

# (4) One-hot encoding
X_encoded = pd.get_dummies(X)
X_test_encoded = pd.get_dummies(X_test)

X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)   # align the test columns with the training columns

# (5) Split the data
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

""" Output
(6407, 19) (1602, 19) (6407,) (1602,)
"""
 
# (6) Modeling
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_tr, y_tr)

# (7) Predict
pred = model.predict(X_val)

# (8) Evaluate
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

cm = confusion_matrix(y_val, pred)
print(cm)

""" Output
[[388 176]
 [289 749]]
"""

acc_score = accuracy_score(y_val, pred)
print(acc_score)

""" Output
0.7097378277153558
"""

auc = roc_auc_score(y_val, pred)   # named auc so the imported roc_auc_score function is not overwritten
print(auc)

""" Output
0.704761611937851
"""

# (9) Predict on the test data (predicted probabilities)
pred = model.predict_proba(X_test_encoded)
print(pred)

""" Output
[[0.3  0.7 ]
 [0.14 0.86]
 [0.56 0.44]
 ...
 [0.33 0.67]
 [0.26 0.74]
 [0.25 0.75]]
"""
 
# (10) Export to CSV
result = pd.DataFrame({
    'ID': df2['ID'],
    'pred': pred[:, 1]   # we are predicting on-time arrival, so select class 1 (arrived on time)
})

print(result[:1000])

""" Output
       ID  pred
0    8010  0.70
1    8011  0.86
2    8012  0.44
3    8013  0.54
4    8014  0.38
..    ...   ...
995  9005  0.27
996  9006  0.76
997  9007  0.38
998  9008  0.47
999  9009  0.30
 
[1000 rows x 2 columns]
"""
 
result.to_csv('./outputs/result_q1.csv', index=False)
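
Step (8) above passes hard class labels to `roc_auc_score`. ROC AUC measures how well scores rank the positive class, so it is usually computed from predicted probabilities instead. The following sketch, on synthetic data rather than the exam file, computes both variants side by side so the difference is visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption, not the exam data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: the ROC curve collapses to a single threshold
auc_from_labels = roc_auc_score(y_val, model.predict(X_val))

# AUC from the predicted probability of class 1
auc_from_proba = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(auc_from_labels, auc_from_proba)
```

Since the problem asks for predicted probabilities in the submission, the probability-based value is probably the closer estimate of how the submission would be scored.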

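📎 Problem 2 (2021, 3rd exam)

Data describing customers' reservation status
Build a model on the training data in data_q2-01.csv that predicts whether a customer buys travel insurance, then predict travel insurance package purchase for the test data in data_q2-02.csv and submit a CSV file in the same format as the sample result file.

The task is to predict whether a customer buys travel insurance (Y/N), so this is modeled as classification (Classification).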


			
			
			
		
import pandas as pd
 
df1 = pd.read_csv('./datasets/data_q2-01.csv')
df2 = pd.read_csv('./datasets/data_q2-02.csv')
 
print(df1.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1491 entries, 0 to 1490
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   X                    1491 non-null   int64 
 1   Age                  1491 non-null   int64 
 2   Employment Type      1491 non-null   object
 3   GraduateOrNot        1491 non-null   object
 4   AnnualIncome         1491 non-null   int64 
 5   FamilyMembers        1491 non-null   int64 
 6   ChronicDiseases      1491 non-null   int64 
 7   FrequentFlyer        1491 non-null   object
 8   EverTravelledAbroad  1491 non-null   object
 9   TravelInsurance      1491 non-null   int64 
dtypes: int64(6), object(4)
memory usage: 116.6+ KB
"""
 
print(df2.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496 entries, 0 to 495
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   X                    496 non-null    int64 
 1   Age                  496 non-null    int64 
...
 8   EverTravelledAbroad  496 non-null    object
dtypes: int64(5), object(4)
memory usage: 35.0+ KB
"""
 
# (1) Handle missing values
## Not needed.

# (2) Drop unneeded variables
## Drop the X column
X_train = df1.copy()
X_test = df2.copy()

X_train = X_train.drop('X', axis=1)
X_test = X_test.drop('X', axis=1)

# (3) Separate the independent and dependent variables
y = X_train['TravelInsurance']
X = X_train.drop('TravelInsurance', axis=1)

# (4) One-hot encoding
X_encoded = pd.get_dummies(X)
X_test_encoded = pd.get_dummies(X_test)

X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)   # align the test columns with the training columns

# (5) Split the data
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

""" Output
(1192, 12) (299, 12) (1192,) (299,)
"""
 
# (6) Modeling
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_tr, y_tr)

# (7) Predict
pred = model.predict(X_val)
print(pred)

""" Output
[0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0
 1 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0
 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0
 1 1 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 1 1 0
 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0
 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0
 1 1 0 0 0 0 0 1 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1
 0 0 1]
"""
 
# (8) Evaluate
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score

cm = confusion_matrix(y_val, pred)
print(cm)

""" Output
[[165  25]
 [ 38  71]]
"""

acc_score = accuracy_score(y_val, pred)
print(acc_score)

""" Output
0.8160535117056856
"""

auc = roc_auc_score(y_val, pred)   # named auc so the imported roc_auc_score function is not overwritten
print(auc)

""" Output
0.7813800277392512
"""

# (9) Predict on the test data (predicted probabilities)
pred = model.predict_proba(X_test_encoded)
print(pred[:50])

""" Output
[[0.15       0.85      ]
 [0.14       0.86      ]
 [0.76833333 0.23166667]
 [0.9175     0.0825    ]
 [0.975      0.025     ]
 [0.72332468 0.27667532]
 [0.23       0.77      ]
 [0.9        0.1       ]
 [0.65852381 0.34147619]
 [0.34021429 0.65978571]
 [0.84589216 0.15410784]
 [0.99       0.01      ]
 [0.76916667 0.23083333]
 [0.98333333 0.01666667]
 [0.76156349 0.23843651]
 [0.99       0.01      ]
 [0.95533333 0.04466667]
 [0.35       0.65      ]
 [0.61083333 0.38916667]
 [0.9975     0.0025    ]
 [0.78328571 0.21671429]
 [0.93916667 0.06083333]
 [0.         1.        ]
 [0.03       0.97      ]
 [0.87216667 0.12783333]
 [0.90203571 0.09796429]
 [0.         1.        ]
 [0.87666667 0.12333333]
 [1.         0.        ]
 [0.871      0.129     ]
 [0.         1.        ]
 [1.         0.        ]
 [0.95533333 0.04466667]
 [0.33797619 0.66202381]
 [0.33471429 0.66528571]
 [0.32       0.68      ]
 [0.235      0.765     ]
 [0.28416667 0.71583333]
 [0.995      0.005     ]
 [0.92366667 0.07633333]
 [0.95533333 0.04466667]
 [0.51345238 0.48654762]
 [0.27       0.73      ]
 [0.33       0.67      ]
 [0.79       0.21      ]
 [0.91       0.09      ]
 [0.76916667 0.23083333]
 [1.         0.        ]
 [0.42       0.58      ]
 [0.55413095 0.44586905]]
"""
 
# (10) Export to CSV
result = pd.DataFrame({
    'index': df2['X'],
    'y_pred': pred[:, 1]   # class 1 (buys travel insurance)
})
print(result[:1000])

""" Output
     index    y_pred
0     1491  0.860000
1     1492  0.834444
2     1493  0.169000
3     1494  0.160000
4     1495  0.039333
..     ...       ...
491   1982  0.930000
492   1983  0.810000
493   1984  0.022381
494   1985  0.620000
495   1986  0.480881
 
[496 rows x 2 columns]
"""
 
result.to_csv('./outputs/result_q2.csv', index=False)
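
The `train_test_split` calls in these solutions do not set `random_state`, so the validation scores change on every run. A small sketch of a reproducible, stratified split follows; the toy arrays and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)   # imbalanced toy target (hypothetical)

# Same seed and same stratification give the same split both times
split_a = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_tr, X_val, y_tr, y_val = split_a
print(np.array_equal(X_val, split_b[1]))   # identical validation rows
print(y_val.mean())                        # class-1 share in the validation set
```

With `stratify=y` the class proportions of the full data are kept in both parts, which makes validation scores on imbalanced targets less noisy.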

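📎 Problem 3 (2022, 4th exam)

An auto insurance company has segmented its customers into four groups (A, B, C, D) to develop a new strategy.
Using the segmentation of existing customers (data_q3-01.csv), predict which segment each new customer (data_q3-02.csv) belongs to and submit the result.

Evaluation: Macro F1-Score
Value to predict: Segmentation
The submitted file must have the same number of rows as the test data.

ID	pred
1	A
2	B
3	C
...	...
1500	D

Customers are divided into four segments (A, B, C, D) and the task is to predict which segment a new customer belongs to, so this is modeled as classification (Classification). (This is a multiclass classification problem.)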
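
Since this problem is scored with Macro F1, it helps to see what that number is. The sketch below uses a handful of made-up labels and checks that the macro score is the unweighted mean of the per-class F1 scores.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for illustration only
y_true = ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D']
y_pred = ['A', 'B', 'B', 'B', 'C', 'D', 'D', 'D']

# Per-class F1 scores, in label order A, B, C, D
per_class = f1_score(y_true, y_pred, labels=['A', 'B', 'C', 'D'], average=None)

# Macro F1: every class counts equally, regardless of how many rows it has
macro = f1_score(y_true, y_pred, average='macro')

print(per_class)
print(macro, np.mean(per_class))
```

Because every class has the same weight, a segment that the model predicts poorly pulls the score down even if it is small.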


			
			
			
		
import pandas as pd
 
df1 = pd.read_csv('./datasets/data_q3-01.csv')
df2 = pd.read_csv('./datasets/data_q3-02.csv')
 
print(df1.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6665 entries, 0 to 6664
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID               6665 non-null   int64 
 1   Gender           6665 non-null   object
 2   Ever_Married     6665 non-null   object
 3   Age              6665 non-null   int64 
 4   Graduated        6665 non-null   object
 5   Profession       6665 non-null   object
 6   Work_Experience  6665 non-null   int64 
 7   Spending_Score   6665 non-null   object
 8   Family_Size      6665 non-null   int64 
 9   Var_1            6665 non-null   object
 10  Segmentation     6665 non-null   object
dtypes: int64(4), object(7)
memory usage: 572.9+ KB
"""
 
print(df2.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2154 entries, 0 to 2153
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID               2154 non-null   int64 
 1   Gender           2154 non-null   object
 2   Ever_Married     2154 non-null   object
 3   Age              2154 non-null   int64 
 4   Graduated        2154 non-null   object
 5   Profession       2154 non-null   object
 6   Work_Experience  2154 non-null   int64 
 7   Spending_Score   2154 non-null   object
 8   Family_Size      2154 non-null   int64 
 9   Var_1            2154 non-null   object
dtypes: int64(4), object(6)
memory usage: 168.4+ KB
"""
 
# (1) Handle missing values
## Not needed.

# (2) Drop unneeded variables
## Drop the ID column
X_train = df1.copy()
X_test = df2.copy()

X_train = X_train.drop('ID', axis=1)
X_test = X_test.drop('ID', axis=1)

# (3) Separate the independent and dependent variables
y = X_train['Segmentation']
X = X_train.drop('Segmentation', axis=1)

# (4) One-hot encoding
X_encoded = pd.get_dummies(X)
X_test_encoded = pd.get_dummies(X_test)

X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)   # align the test columns with the training columns

# (5) Split the data
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

# (6) Modeling
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_tr, y_tr)

# (7) Predict
pred = model.predict(X_val)
print(pred)

""" Output
['B' 'C' 'D' ... 'D' 'D' 'C']
"""
 
# (8) Evaluate
from sklearn.metrics import confusion_matrix, f1_score   # confusion_matrix must be imported as well

cm = confusion_matrix(y_val, pred, labels=['A', 'B', 'C', 'D'])
print(cm)

""" Output
[[137  79  34  91]
 [ 81 104  90  31]
 [ 48  83 179  39]
 [ 58  28  23 228]]
"""

macro_f1 = f1_score(y_val, pred, average='macro')   # named macro_f1 so the imported f1_score function is not overwritten
print(macro_f1)

""" Output
0.6792915714446815
"""

# (9) Predict on the test data
pred = model.predict(X_test_encoded)
print(pred)


""" Output
['B' 'C' 'C' ... 'B' 'C' 'D']
"""
 
# (10) Export to CSV
result = pd.DataFrame({
    'ID': df2['ID'],
    'pred': pred
})
print(result)

""" Output
          ID pred
0     458989    A
1     458994    C
2     459000    C
3     459003    C
4     459005    A
...      ...  ...
2149  467950    A
2150  467954    D
2151  467958    A
2152  467961    C
2153  467968    D
 
[2154 rows x 2 columns]
"""
 
result.to_csv('./outputs/result_q3.csv', index=False)

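📎 Problem 4 (2022, 5th exam)

Build a model on the given training data that predicts used car prices (price), then predict prices for the test data and submit the result.

Evaluation: RMSE
The submitted file must have the same number of rows as the test data.

pred
1230
2562
...
3761

The task is to predict a used car price (a numeric variable), so this is modeled as regression (Regression).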
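
RMSE is the square root of the mean squared error, so it can also be computed by hand. The sketch below uses four invented prices and compares the manual value with one derived from scikit-learn's `mean_squared_error`, which is a workable fallback on installations that lack `root_mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented prices for illustration only
y_true = np.array([12000.0, 18500.0, 9900.0, 23000.0])
y_pred = np.array([11500.0, 19000.0, 10400.0, 22000.0])

# RMSE computed directly from its definition
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same value via the square root of scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, the single 1000 error in this example weighs more than the three 500 errors taken individually.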


			
			
			
		
import pandas as pd
 
df1 = pd.read_csv('./datasets/data_q4-01.csv')
df2 = pd.read_csv('./datasets/data_q4-02.csv')
 
print(df1.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6899 entries, 0 to 6898
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         6899 non-null   object 
 1   year          6899 non-null   int64  
 2   price         6899 non-null   int64  
 3   transmission  6899 non-null   object 
 4   mileage       6899 non-null   int64  
 5   fuelType      6899 non-null   object 
 6   tax           6899 non-null   int64  
 7   mpg           6899 non-null   float64
 8   engineSize    6899 non-null   float64
dtypes: float64(2), int64(4), object(3)
memory usage: 485.2+ KB
"""
 
print(df2.info())
 
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3882 entries, 0 to 3881
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         3882 non-null   object 
 1   year          3882 non-null   int64  
 2   transmission  3882 non-null   object 
 3   mileage       3882 non-null   int64  
 4   fuelType      3882 non-null   object 
 5   tax           3882 non-null   int64  
 6   mpg           3882 non-null   float64
 7   engineSize    3882 non-null   float64
dtypes: float64(2), int64(3), object(3)
memory usage: 242.8+ KB
"""
 
# (1) Handle missing values
## Not needed.

# (2) Drop unneeded variables
## Not needed.
X_train = df1.copy()
X_test = df2.copy()

# (3) Separate the independent and dependent variables
y = X_train['price']
X = X_train.drop('price', axis=1)

# (4) One-hot encoding
X_encoded = pd.get_dummies(X)
X_test_encoded = pd.get_dummies(X_test)

X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)   # align the test columns with the training columns

# (5) Split the data
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)
print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

""" Output
(5519, 35) (1380, 35) (5519,) (1380,)
"""
 
# (6) Modeling
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_tr, y_tr)

# (7) Predict
pred = model.predict(X_val)
print(pred)

""" Output
[29677.32 21173.22 33422.73 ... 26821.71 19052.17 14527.45]
"""
 
# (8) Evaluate
from sklearn.metrics import root_mean_squared_error   # available in scikit-learn 1.4 and later

rmse = root_mean_squared_error(y_val, pred)
print(rmse)

""" Output
3870.9560766020772
"""

# (9) Predict on the test data
pred = model.predict(X_test_encoded)
print(pred)

""" Output
[17652.87 29504.3  23954.44 ... 16623.11 10795.43 16917.17]
"""
 
# (10) Export to CSV
result = pd.DataFrame({
    'pred': pred
})

print(result)

""" Output
          pred
0     18179.28
1     29664.95
2     24153.54
3     23039.82
4     20186.82
...        ...
3877  19082.91
3878  15840.25
3879  16553.06
3880  10671.41
3881  17202.90
 
[3882 rows x 1 columns]
"""
 
result.to_csv('./outputs/result_q4.csv', index=False)

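📎 Problem 5 (2023, 6th exam)

Mobile phone dataset
A classification model is used to predict the price_range value.
Train a model on the training data in data_q5-01.csv and make predictions for the evaluation data in data_q5-02.csv.

Evaluation: Macro F1 Score
Feature engineering, hyperparameter optimization, and similar steps are allowed; note that overfitting can occur.

pred
2
3
0
...

price_range takes one of four classes (0, 1, 2, 3) and the task is to predict which class applies, so this is modeled as classification (Classification). (This is a multiclass classification problem.)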
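
The problem statement warns about overfitting. One way to look for it is to compare the Macro F1 on the training rows with a cross-validated Macro F1. This is a sketch on synthetic four-class data (an assumption standing in for the exam files), not a tuned solution.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

# Synthetic 4-class data standing in for the exam files (an assumption)
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Macro F1 on each of 5 held-out folds
cv_scores = cross_val_score(model, X, y, cv=5, scoring='f1_macro')

# Macro F1 on the same rows the model was fitted on, for comparison
train_score = f1_score(y, model.fit(X, y).predict(X), average='macro')

print(cv_scores.mean(), train_score)
```

A large gap between the two numbers suggests the model memorizes the training rows, in which case limiting `max_depth` or raising `min_samples_leaf` would be reasonable things to try.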


			
			
			
		
import pandas as pd
 
df1 = pd.read_csv('./datasets/data_q5-01.csv')
df2 = pd.read_csv('./datasets/data_q5-02.csv')
 
print(df1.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_screen   2000 non-null   int64  
 19  wifi           2000 non-null   int64  
 20  price_range    2000 non-null   int64  
dtypes: float64(2), int64(19)
memory usage: 328.3 KB
"""
 
print(df2.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             1000 non-null   int64  
 1   battery_power  1000 non-null   int64  
 2   blue           1000 non-null   int64  
 3   clock_speed    1000 non-null   float64
 4   dual_sim       1000 non-null   int64  
 5   fc             1000 non-null   int64  
 6   four_g         1000 non-null   int64  
 7   int_memory     1000 non-null   int64  
 8   m_dep          1000 non-null   float64
 9   mobile_wt      1000 non-null   int64  
 10  n_cores        1000 non-null   int64  
 11  pc             1000 non-null   int64  
 12  px_height      1000 non-null   int64  
 13  px_width       1000 non-null   int64  
 14  ram            1000 non-null   int64  
 15  sc_h           1000 non-null   int64  
 16  sc_w           1000 non-null   int64  
 17  talk_time      1000 non-null   int64  
 18  three_g        1000 non-null   int64  
 19  touch_screen   1000 non-null   int64  
 20  wifi           1000 non-null   int64  
dtypes: float64(2), int64(19)
memory usage: 164.2 KB
"""
 
# (1) Handle missing values
## Not needed.

# (2) Drop unneeded variables
## Drop the id column (only the test data has one)
X_train = df1.copy()
X_test = df2.copy()

X_test = X_test.drop('id', axis=1)

# (3) Separate the independent and dependent variables
y = X_train['price_range']
X = X_train.drop('price_range', axis=1)

# (4) One-hot encoding
X_encoded = pd.get_dummies(X)
X_test_encoded = pd.get_dummies(X_test)

X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)    # align the test columns with the training columns

# (5) Split the data
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

""" Output
(1600, 20) (400, 20) (1600,) (400,)
"""
 
# (6) Modeling
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_tr, y_tr)

# (7) Predict
pred = model.predict(X_val)
print(pred)

""" Output
[0 3 1 1 1 3 2 1 2 0 2 1 0 0 0 0 3 2 0 1 1 1 3 3 0 1 1 2 0 1 1 3 3 0 0 1 1
 2 3 0 0 2 1 3 0 1 3 1 0 1 1 2 3 2 1 2 2 0 3 3 2 1 3 3 1 0 1 3 2 2 3 0 0 0
 3 1 0 3 1 2 0 1 2 2 0 0 2 2 2 1 3 1 2 0 1 1 1 0 0 2 0 0 2 3 3 3 3 3 0 0 0
 1 3 2 0 2 3 0 1 0 2 1 3 2 1 3 0 2 1 3 0 3 0 3 0 3 2 1 2 1 1 3 0 1 2 3 1 2
 1 2 3 0 2 0 2 2 1 0 0 2 0 2 2 0 0 3 3 1 3 3 1 1 1 3 0 0 0 2 2 1 0 2 1 0 1
 1 0 1 3 1 3 3 1 2 2 1 3 3 1 0 2 0 3 0 2 2 3 1 3 2 2 0 1 2 1 0 2 1 2 2 1 2
 0 1 2 1 1 1 0 2 3 0 2 3 2 3 2 2 0 0 2 0 3 1 2 0 1 1 1 2 3 0 1 1 0 2 1 2 2
 3 1 0 2 1 2 2 1 3 0 1 0 1 0 0 3 3 0 0 3 1 3 3 1 1 3 3 1 1 0 3 1 0 0 1 1 0
 3 1 3 3 3 1 1 0 2 2 2 2 3 3 3 3 3 3 0 3 2 2 1 2 1 2 3 0 0 3 3 2 3 1 0 3 1
 3 0 0 3 3 1 1 0 3 3 3 3 2 0 3 0 3 2 2 0 1 0 2 1 2 0 0 2 1 3 2 3 0 2 2 0 2
 2 3 1 2 1 1 3 2 2 1 0 1 1 2 1 0 0 2 3 3 2 3 1 3 1 0 1 1 3 1]
"""
 
# (8) Evaluate
from sklearn.metrics import confusion_matrix, f1_score

cm = confusion_matrix(y_val, pred, labels=[0, 1, 2, 3])
print(cm)

""" Output
[[111   7   0   0]
 [ 13  89   3   0]
 [  0   6  82  12]
 [  0   0   5  72]]
"""

macro_f1 = f1_score(y_val, pred, average='macro')   # named macro_f1 so the imported f1_score function is not overwritten
print(macro_f1)

""" Output
0.8636209242150752
"""

# (9) Predict on the test data
pred = model.predict(X_test_encoded)
print(pred)

""" Output
[3 3 2 3 1 3 3 1 3 0 3 3 0 0 2 0 2 1 3 2 1 3 1 1 3 0 2 0 2 0 2 0 3 0 0 1 3
 1 2 1 1 2 0 0 0 1 0 3 1 2 1 0 2 0 3 1 3 1 1 3 3 2 0 1 1 1 1 2 1 1 1 2 2 3
 3 0 2 0 2 3 0 3 3 0 3 0 3 1 3 0 1 1 2 1 2 1 0 2 1 3 1 0 0 3 1 2 0 1 2 3 3
 3 1 3 3 3 3 1 3 0 0 3 2 1 1 0 3 2 3 1 0 2 1 1 3 1 1 0 3 2 1 3 2 2 2 3 3 2
 2 3 2 3 0 0 2 2 3 3 3 3 2 2 3 3 3 3 1 0 3 0 0 0 1 0 0 1 0 0 1 2 1 0 0 1 2
 2 2 1 0 0 0 1 0 3 1 0 2 2 2 3 1 2 3 3 3 1 2 1 0 0 1 2 0 2 3 3 0 2 0 3 2 3
 3 0 0 1 0 3 0 1 0 2 2 1 3 0 2 0 3 1 2 0 0 2 1 3 3 3 1 1 3 0 0 2 3 3 1 3 2
 1 3 2 1 2 3 3 3 1 0 1 2 3 1 1 3 2 0 3 0 1 2 0 0 3 2 3 3 2 0 3 3 2 3 1 2 1
 1 0 2 3 1 0 0 3 0 3 0 1 2 0 2 3 1 3 2 2 1 2 0 0 0 1 3 2 0 0 0 3 2 0 3 3 1
 2 3 2 3 1 3 3 2 2 2 3 3 0 3 0 3 1 3 1 3 3 0 1 1 3 1 3 2 3 0 0 0 0 2 0 0 1
 1 1 2 3 2 0 1 0 0 3 3 0 3 1 2 2 1 2 3 1 1 2 2 1 2 0 1 1 0 3 2 0 0 1 0 0 1
 1 0 0 0 2 2 3 2 3 0 3 0 3 0 1 1 1 2 0 3 2 3 3 1 3 1 3 1 2 2 1 2 2 1 1 0 0
 0 1 2 1 0 3 3 1 2 3 0 0 3 1 1 1 2 2 3 0 3 0 2 3 3 3 0 2 0 2 2 0 1 1 0 0 1
 1 1 3 3 3 2 3 1 2 2 3 3 3 1 0 2 2 2 2 1 0 2 2 0 0 0 3 1 1 2 2 2 0 3 0 2 2
 0 3 0 2 3 0 2 1 3 3 1 1 2 3 2 0 2 1 3 0 3 3 1 2 3 2 3 0 1 2 3 1 3 2 3 1 0
 1 0 3 1 0 3 2 3 2 0 3 3 3 2 3 3 1 2 0 2 3 3 0 0 1 1 2 2 2 0 0 2 2 3 2 0 2
 1 3 3 0 1 3 1 2 1 0 0 0 2 1 0 1 1 2 2 1 2 2 1 0 3 0 0 3 2 0 0 0 0 0 3 0 3
 1 3 2 1 3 2 0 1 1 3 2 3 1 0 3 0 2 0 2 0 0 1 1 1 2 1 3 1 3 2 2 1 3 2 0 1 3
 0 3 3 0 2 1 1 2 0 3 2 0 3 2 3 0 0 3 0 1 2 3 2 2 2 2 1 2 3 0 1 0 2 2 1 0 0
 1 0 0 3 0 1 1 0 1 1 0 3 0 3 3 3 0 0 1 2 2 1 0 1 1 0 1 1 0 0 3 3 0 3 1 2 3
 0 1 0 2 2 0 3 1 0 3 0 1 0 2 3 3 2 3 0 3 2 0 1 0 3 3 2 0 2 1 3 1 0 3 3 0 3
 1 2 1 1 1 3 1 1 2 2 0 0 1 2 0 2 0 1 0 0 3 3 3 3 0 1 2 2 1 0 0 2 1 0 2 0 2
 2 2 1 2 0 2 1 3 0 0 3 1 3 0 0 2 3 2 1 3 2 1 0 0 2 3 0 3 0 0 0 2 2 1 2 0 3
 2 1 2 3 3 0 1 1 2 1 2 2 0 1 3 1 1 3 1 2 3 2 1 1 2 3 3 0 2 3 0 2 3 2 2 2 3
 2 0 1 2 0 2 1 1 2 2 2 1 2 0 0 1 3 1 0 1 1 3 1 0 0 3 2 2 3 0 3 3 2 1 3 0 1
 3 1 2 1 2 2 2 0 3 0 2 3 0 3 2 3 3 1 0 2 3 1 0 1 1 2 1 2 0 2 2 0 2 3 2 3 0
 2 1 1 2 2 3 3 0 2 1 2 1 3 0 1 3 0 1 0 0 3 2 2 0 0 0 0 3 2 3 3 0 0 2 1 0 2
 2]
"""
 
# (10) Export to CSV
result = pd.DataFrame({
    'pred': pred
})

print(result)

""" Output
     pred
0       3
1       3
2       2
3       3
4       1
..    ...
995     2
996     1
997     0
998     2
999     2
 
[1000 rows x 1 columns]
"""
 
result.to_csv('./outputs/result_q5.csv', index=False)

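📎 Problem 6 (2023, 7th exam)

Jeju card usage data by business category
Dependent variable: 이용금액 (amount spent)
Evaluation metric: RMSE

The task is to predict the amount spent (a numeric variable), so this is modeled as regression (Regression).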


			
			
			
		
import pandas as pd
 
df1 = pd.read_csv('./datasets/data_q6-01.csv')
df2 = pd.read_csv('./datasets/data_q6-02.csv')
 
print(df1.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2245 entries, 0 to 2244
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      2245 non-null   object
 1   연월      2245 non-null   int64 
 2   업종명     2245 non-null   object
 3   이용자구분   2245 non-null   object
 4   성별      2245 non-null   object
 5   이용자수    2245 non-null   int64 
 6   이용건수    2245 non-null   int64 
 7   이용금액    2245 non-null   int64 
dtypes: int64(4), object(4)
memory usage: 140.4+ KB
"""
 
print(df2.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5020 entries, 0 to 5019
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      5020 non-null   object
 1   연월      5020 non-null   int64 
 2   업종명     5020 non-null   object
 3   이용자구분   5020 non-null   object
 4   성별      5020 non-null   object
 5   이용자수    5020 non-null   int64 
 6   이용건수    5020 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 274.7+ KB
"""
 
# (1) Handle missing values
## Not needed.

# (2) Drop unneeded variables
## Drop the ID column
X_train = df1.copy()
X_test = df2.copy()

X_train = X_train.drop("ID", axis=1)
X_test = X_test.drop("ID", axis=1)

# (3) Separate the independent and dependent variables
y = X_train["이용금액"]
X = X_train.drop("이용금액", axis=1)

# (4) One-hot encoding
X_encoded = pd.get_dummies(X)
X_test_encoded = pd.get_dummies(X_test)

X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)   # align the test columns with the training columns

# (5) Split the data
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

""" Output
(1796, 28) (449, 28) (1796,) (449,)
"""
 
# (6) Modeling
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_tr, y_tr)

# (7) Predict
pred = model.predict(X_val)
print(pred[:100])

""" Output
[1.90008207e+09 4.65955478e+08 2.95193996e+09 3.90638913e+08
 2.35675097e+08 2.65152463e+07 4.65501828e+07 4.64963900e+05
 8.01738458e+08 1.13109676e+07 9.39969541e+08 2.55429366e+06
 3.52700899e+07 2.25255431e+07 2.40363667e+08 5.24261291e+08
 6.82240277e+07 2.10644493e+09 3.35211418e+08 3.99681682e+08
 1.85156922e+09 1.54739128e+08 1.41116640e+09 4.26778200e+05
 4.94774415e+08 4.79368719e+09 8.20259763e+08 3.31353793e+09
 1.87295992e+09 3.40675936e+08 2.29268637e+08 9.35716985e+08
 1.70437333e+09 4.60506068e+06 8.81625451e+08 2.96526898e+08
 4.08634663e+08 5.22350997e+09 4.30011570e+08 2.47635484e+08
 8.01235334e+06 3.48341015e+08 1.35211088e+09 6.67125585e+09
 3.13282845e+06 7.17754028e+08 5.57982725e+07 1.78484304e+07
 4.72766825e+08 9.62785377e+08 3.33496207e+08 1.45653110e+08
 6.50008131e+08 1.89599953e+09 1.23010364e+09 9.15219653e+07
 5.70047836e+08 1.78863103e+09 8.56080821e+08 9.51140759e+09
 7.75300879e+08 4.19912040e+08 1.05095150e+06 5.01733880e+09
 4.46393427e+07 9.94192375e+07 2.61651650e+06 4.19109183e+08
 1.56882579e+08 1.25811332e+09 5.71898606e+08 1.01670072e+10
 1.28331156e+08 1.31681496e+08 2.52039078e+08 2.58716089e+07
 1.44338939e+09 4.09569719e+08 2.43890504e+08 1.69064174e+06
 4.62218252e+07 4.92879416e+07 1.48273870e+06 2.31906067e+08
 1.54132082e+07 8.24023964e+08 3.50660196e+09 4.02912682e+08
 1.12496411e+07 1.65915469e+08 1.01712096e+08 1.23282782e+09
 3.31261303e+09 1.52742089e+09 3.49351562e+09 9.05551270e+07
 1.29130468e+09 9.74110522e+08 2.34913754e+07 4.35480839e+08]
"""
 
# (8) Evaluate
from sklearn.metrics import root_mean_squared_error

rmse = root_mean_squared_error(y_val, pred)
print(rmse)

""" Output
182519728.18885794
"""

# (9) Predict on the test data
pred = model.predict(X_test_encoded)
print(pred)

""" Output
[5.79965803e+09 4.09976141e+07 2.43990300e+06 ... 5.42755781e+09
 7.45447840e+08 6.56516711e+08]
"""
 
# (10) Export to CSV
result = pd.DataFrame({
    'ID': df2['ID'],
    'pred': pred
})

print(result)

""" Output
           ID          pred
0     ID_2575  5.467737e+09
1     ID_6637  4.008152e+07
2     ID_5704  2.338352e+06
3     ID_3606  1.783806e+06
4     ID_6443  4.369645e+05
...       ...           ...
5015  ID_4523  3.886178e+08
5016  ID_3483  1.018352e+08
5017   ID_453  5.447200e+09
5018   ID_998  1.152426e+09
5019  ID_3237  5.857870e+08
 
[5020 rows x 2 columns]
"""
 
result.to_csv('./outputs/result_q6.csv', index=False)

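📎 Problem 7 (2024, 8th exam)

Dependent variable: number of people at a subway station
Evaluation metric: MAE (Mean Absolute Error)

The task is to predict the number of people at a subway station (a numeric variable), so this is modeled as regression (Regression).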
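
MAE is the average of the absolute differences between the actual and predicted values. A tiny sketch with invented passenger counts, comparing a manual calculation with scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented passenger counts for illustration only
y_true = np.array([12000, 13500, 9800, 15000])
y_pred = np.array([12500, 13000, 10000, 14200])

# MAE from its definition: mean of |actual - predicted|
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Unlike RMSE, MAE does not square the errors, so a single large miss affects it less.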


			
			
			
		
import pandas as pd
 
df1 = pd.read_csv('./datasets/data_q7-01.csv')
df2 = pd.read_csv('./datasets/data_q7-02.csv')
 
print(df1.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           900 non-null    object 
 1   day_of_week    900 non-null    object 
 2   month          900 non-null    int64  
 3   station_name   895 non-null    object 
 4   visibility     892 non-null    float64
 5   precipitation  900 non-null    float64
 6   temperature    900 non-null    float64
 7   num_people     900 non-null    int64  
dtypes: float64(3), int64(2), object(3)
memory usage: 56.4+ KB
"""
 
print(df2.info())
 
""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           300 non-null    object 
 1   day_of_week    300 non-null    object 
 2   month          300 non-null    int64  
 3   station_name   300 non-null    object 
 4   visibility     300 non-null    float64
 5   precipitation  300 non-null    float64
 6   temperature    300 non-null    float64
dtypes: float64(3), int64(1), object(3)
memory usage: 16.5+ KB
"""
 
# (1) Handle missing values
X_train = df1.copy()
X_test = df2.copy()

target_column1 = X_train['station_name']
target_column2 = X_train['visibility']

## Replace with the mode (mode() returns a Series, so take its first value)
X_train['station_name'] = target_column1.fillna(target_column1.mode()[0])
X_train['visibility'] = target_column2.fillna(target_column2.mode()[0])

# (2) Drop unneeded variables
## Not needed.

# (3) Separate the independent and dependent variables
y = X_train['num_people']
X = X_train.drop('num_people', axis=1)

# (4) One-hot encoding
X_encoded = pd.get_dummies(X)
X_test_encoded = pd.get_dummies(X_test)

X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)   # align the test columns with the training columns

# (5) Split the data
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)
print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

""" Output
(720, 916) (180, 916) (720,) (180,)
"""
 
# (6) Modeling
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_tr, y_tr)

# (7) Predict
pred = model.predict(X_val)
print(pred)

""" Output
[12236.46 12733.06  9336.26 13332.44  8553.55 14111.59 11656.91 13240.12
 11802.8  13838.01 10012.24  9335.87 12446.76 13129.15 12942.51 13564.35
 11506.63 11582.89 12720.87  9484.63 11178.24 12422.58 10516.45 11982.67
 11057.6  15171.71  9680.88 11961.17 13791.65 11144.45 12174.55 11860.55
 12054.97 13899.41 15109.53 11254.85 13535.48 13576.75 11024.56  8897.38
 14461.47 13247.12 12323.59  9413.27 12437.45 12031.8  11940.07 10894.57
 10830.42 13728.16 11830.95 13622.64 11791.73 11853.71 12275.16 15121.62
 11812.97 12685.03 12356.14 14686.27 14524.01 12977.51 12367.66 11177.6
  9392.41 11030.1  12833.11 11300.76 12465.19 13866.05 14231.53 11574.67
 12709.23 15923.4  13073.99 10711.67 13892.44 13151.95 13618.13 13082.44
 14719.13 12044.77 14165.7  11126.25 15306.02 10878.04 11838.32 10307.7
 12501.42 12603.72  9504.5  10612.06 11305.31  9766.67 14549.17 13617.26
 11941.98 15287.39 11708.24 13083.42 12053.59 12916.26 10926.3  15131.65
 10840.68 10896.91 14725.57 11345.14 11859.78 12316.63  9486.17 14127.73
 10503.94 13129.13 12327.28 13155.95 16055.36 12609.08 12220.94 13935.74
 11013.08 15427.32 13283.78 14681.94 12883.63 16854.11 12417.27 13419.06
 13944.63 12128.41 12846.43 13973.34 13042.27  9215.26 14014.56 13895.13
 11449.83 11291.52 11712.63 11867.93 14125.03 14614.49  9370.99 13336.32
 11844.1  13490.19 15247.68 12180.57 15175.84 11628.74 11133.86 13937.42
 11595.94 15576.15 13869.88 14176.7  11274.85 10927.77 12752.74 11683.27
 10707.67 14043.11 14405.62 11629.11  9986.95 11341.57 11402.65 12307.42
 12606.05 11840.43 13645.18 10603.76 13334.25 13896.73 12306.21 11569.46
 16055.71 13872.51 11135.61 11484.46]
"""
 
# (8) Evaluation
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_val, pred)   # Mean absolute error (MAE)
print(mae)

""" Output
629.1398888888889
"""
 
# (9) Predict on the test data
pred = model.predict(X_test_encoded)
print(pred)

""" Output
[12070.17 11837.78 12487.12 13741.28 13921.6  12806.1  12948.99 12181.78
 10363.02 13993.09 12530.8  13532.8  13312.56 13874.18 13947.93 12502.65
 13577.14 12124.83 12851.14 11886.71 13303.52 13422.19 13941.75 13821.01
 14186.37 12203.9  11470.13 10991.97 11915.85  9743.88 14051.56 12532.26
 12481.29  9768.91 12619.09 13856.84 10663.67 12207.56 12422.26 11277.
 13433.01 11496.41 11887.91 14201.2  12078.11 13310.41 13584.11 13648.09
 12133.83 11759.32 12802.43 11294.08 16222.76 13173.95 10372.62  9221.57
 12197.82 11588.91 12580.89 13634.82 13683.65 13026.51 14542.45 11039.62
 12613.09 10925.29 14673.69  9060.45  9234.86 12017.93 15047.73 14234.73
 12791.03 13606.   12274.   14743.8  10581.5  12496.64 12980.62 10429.97
 14575.65 12616.09 11999.2  14104.46 10435.33 15434.21 14557.23 12109.58
 14681.67 10052.42 13076.44 10712.79 13403.83 15092.52 15574.82 14658.26
 11598.32 10963.11 13458.29 14363.23 14825.67  8947.51 15564.43 13259.87
 11442.01 13767.38 15013.07 12343.67 10090.5  11489.02 13938.28 12861.66
 13658.69 14889.84 13746.61 13862.68 12458.25 12886.42 10664.74 11238.15
 14278.66 12131.63 12765.67 14939.48 12385.35 13002.15 11617.37 12261.03
 15125.96 12791.93 15149.38 10920.4  12877.92 12037.61 15158.12 13978.45
 13690.88 14017.78 11535.3  12278.11  9851.28 12080.31 12488.82 12302.05
 13884.63 12421.07 10938.66 12968.42 12664.09 12703.78 10755.93 11475.61
 12403.18 11503.27 13759.99 11976.77 10080.32  8824.29 11917.5  11377.46
 13920.9  15540.7  13059.64 10909.93 14191.86 11981.46 10923.34 10475.18
 14382.32 16308.99  9016.27 12148.03 13619.67 11503.18 13703.97 11599.52
 13627.66 11265.62 13441.59 13334.74 12704.51 13559.1  12253.13 11081.88
 15250.2  12708.14 11314.68 14044.41 10453.28 15837.55 12298.65 10426.15
 14720.06 11752.15 12812.51 12516.28 13891.   13620.81 11834.42 13884.91
 13320.04 12862.61 15074.58 12925.26 11862.35 13940.3  15179.27 11365.45
 12549.91 15724.7  14703.01 11117.68 11424.55 13110.28 14208.   14195.6
 13308.58 12227.53 12350.51 11996.04 12705.86  9491.18 15335.24  9433.83
 13504.78 12113.44 14539.61 12426.03 11408.18 12705.72 11053.25 12616.64
 14614.25  9145.26 13305.27 11757.91 11768.79 14263.87 11014.09 11949.22
 13928.54 14379.71 12685.12 10804.89 12974.54 13794.75 12420.   12649.26
 14134.92 11515.95 14193.01 11853.54 11696.01 10853.63 13179.79 10549.92
 11985.93 12459.23 12369.26 12367.93 13452.42 12488.63  8839.12 16069.07
 14908.72 15293.89 12985.79 14797.71 12789.99 12017.57 12029.58 11229.91
 12126.57  8761.5  12800.93 14219.93 12596.58 15352.03 12108.7  15307.53
 11358.01 13951.89 14165.46 12285.25  9815.1  12473.16 12877.   15298.89
 14479.22 16406.25 14685.8  10965.84 10915.15 13624.83 12935.05 13890.44
 16332.45 15962.11 13871.73 14502.21]
"""
 
# (10) Export to CSV
result = pd.DataFrame({
    'pred': pred
})
print(result)

""" Output
         pred
0    11767.44
1    11823.30
2    12558.62
3    13440.26
4    13793.40
..        ...
295  14024.25
296  16082.65
297  15066.40
298  13964.19
299  14439.47
 
[300 rows x 1 columns]
"""
 
result.to_csv('./outputs/result_q7.csv', index=False)
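
The two preprocessing steps used above, filling missing values with the most frequent value and aligning one-hot encoded columns between train and test, can be sketched on a small made-up table. The toy data and the `fill_with_mode` helper are hypothetical and exist only for illustration. Note that `Series.mode()` returns a Series, so the first entry is taken with `.iloc[0]`, and the fill value is computed from the training data and reused for the test data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real train/test files
train = pd.DataFrame({
    "station_name": ["A", "B", None, "A"],
    "visibility": [10.0, None, 8.0, 10.0],
    "num_people": [120, 95, 101, 130],
})
test = pd.DataFrame({
    "station_name": ["B", "C"],   # "C" never appears in train
    "visibility": [9.0, 10.0],
})


def fill_with_mode(df, columns, reference):
    """Fill NaN in `columns` with the most frequent value found in `reference`."""
    out = df.copy()
    for col in columns:
        most_frequent = reference[col].mode().iloc[0]  # mode() returns a Series
        out[col] = out[col].fillna(most_frequent)
    return out


cols = ["station_name", "visibility"]
X_train = fill_with_mode(train, cols, reference=train)
X_test = fill_with_mode(test, cols, reference=train)

y = X_train["num_people"]
X = X_train.drop("num_people", axis=1)

# One-hot encode, then force the test columns to match the training columns
X_encoded = pd.get_dummies(X)
X_test_encoded = pd.get_dummies(X_test).reindex(columns=X_encoded.columns, fill_value=0)

print(X_train.isna().sum().sum())                               # 0
print(list(X_encoded.columns) == list(X_test_encoded.columns))  # True
```

Categories that appear only in the test data (here `"C"`) are dropped by `reindex`, and categories that appear only in the training data are added as all-zero columns, so the model always sees the column layout it was trained on.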

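📎 Problem 8 (Exam Environment Practice Example)

Attribute data on products purchased by department store customers over one year
Using the provided training data (customer_train.csv), develop a model that predicts the gender of department store customers, then apply that model to the evaluation data (customer_test.csv) to predict gender

Predictions are evaluated with the ROC-AUC metric
Predicted gender column name : pred
Number of columns to submit : 1
The number of predictions must match the number of evaluation rows : 2,482
Rows in the pred column : 2,482
Training data : 3,500 rows
Evaluation data : 2,482 rows

Submission format example (a single pred column):

pred
0
1
...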

Because this problem predicts the gender (0, 1) of department store customers, it is treated as a classification task.

import pandas as pd
 
train = pd.read_csv("data/customer_train.csv")
test = pd.read_csv("data/customer_test.csv")
 
# User code

print(train.info())

""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   회원ID     3500 non-null   int64  
 1   총구매액     3500 non-null   int64  
 2   최대구매액    3500 non-null   int64  
 3   환불금액     1205 non-null   float64
 4   주구매상품    3500 non-null   object 
 5   주구매지점    3500 non-null   object 
 6   방문일수     3500 non-null   int64  
 7   방문당구매건수  3500 non-null   float64
 8   주말방문비율   3500 non-null   float64
 9   구매주기     3500 non-null   int64  
 10  성별       3500 non-null   int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 300.9+ KB
"""
 
print(test.info())

""" Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2482 entries, 0 to 2481
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   회원ID     2482 non-null   int64  
 1   총구매액     2482 non-null   int64  
 2   최대구매액    2482 non-null   int64  
 3   환불금액     871 non-null    float64
 4   주구매상품    2482 non-null   object 
 5   주구매지점    2482 non-null   object 
 6   방문일수     2482 non-null   int64  
 7   방문당구매건수  2482 non-null   float64
 8   주말방문비율   2482 non-null   float64
 9   구매주기     2482 non-null   int64  
dtypes: float64(3), int64(5), object(2)
memory usage: 194.0+ KB
"""
 
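# (1) Handle missing values
X_train = train.copy()
X_test = test.copy()

# Many customers have no refund amount, so fill missing values with 0 (in both train and test)
X_train['환불금액'] = X_train['환불금액'].fillna(0)
X_test['환불금액'] = X_test['환불금액'].fillna(0)

# (2) Drop unneeded variables
X_train = X_train.drop('회원ID', axis=1)
X_test = X_test.drop('회원ID', axis=1)

# (3) Separate independent and dependent variables
# Split from X_train (not train) so the steps above are actually applied
y = X_train["성별"]
X = X_train.drop("성별", axis=1)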
 
# (4) One-hot encoding
X_encoded = pd.get_dummies(X)
X_test_encoded = pd.get_dummies(X_test)

X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)   # Align train/test columns

# (5) Split the data
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)
print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

""" Output
(2800, 74) (700, 74) (2800,) (700,)
"""
 
# (6) Modeling
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_tr, y_tr)

# (7) Prediction
pred = model.predict(X_val)

# (8) Evaluation
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_val, pred)   # Separate name so the imported function is not overwritten
print(auc)

""" Output
0.5846884367456296
"""
 
# (9) Predict on the test data
pred = model.predict(X_test_encoded)
print(pred)

""" Output
[0 0 0 ... 0 0 1]
"""
 
# (10) Export to CSV
result = pd.DataFrame({
    'pred': pred
})
print(result)

""" Output
      pred
0        0
1        0
2        0
3        0
4        0
...    ...
2477     0
2478     0
2479     1
2480     1
2481     0
 
[2482 rows x 1 columns]
"""
 
result.to_csv("./outputs/result_q8.csv", index=False)
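
Because this problem is scored with ROC-AUC, it may be worth comparing the score computed from hard labels (`predict`) with the score computed from class-1 probabilities (`predict_proba`). The sketch below uses synthetic data from `make_classification` rather than the customer data, and the variable names `auc_label` and `auc_proba` are assumptions for illustration. Whether the submission should hold labels or probabilities depends on the exam instructions, so check the problem statement first.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the exam data)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_tr, y_tr)

# AUC from hard 0/1 labels: the ROC curve has a single operating point
auc_label = roc_auc_score(y_val, model.predict(X_val))

# AUC from the predicted probability of class 1: every sample is ranked
proba = model.predict_proba(X_val)[:, 1]
auc_proba = roc_auc_score(y_val, proba)

print(auc_label, auc_proba)
```

Column 1 of `predict_proba` corresponds to `model.classes_[1]`, which is the label 1 in this binary setup. Passing probabilities lets ROC-AUC use the full ranking of the samples instead of a single 0.5 cut-off.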

License: Attribution, NonCommercial, NoDerivatives

	import pandas as pd

	df1 = pd.read_csv('./datasets/data_q1-01.csv')
	df2 = pd.read_csv('./datasets/data_q1-02.csv')

	print(df1.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 8009 entries, 0 to 8008
	Data columns (total 12 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 ID 8009 non-null int64
	1 Warehouse_block 8009 non-null object
	2 Mode_of_Shipment 8009 non-null object
	3 Customer_care_calls 8009 non-null int64
	4 Customer_rating 8009 non-null int64
	5 Cost_of_the_Product 8009 non-null int64
	6 Prior_purchases 8009 non-null int64
	7 Product_importance 8009 non-null object
	8 Gender 8009 non-null object
	9 Discount_offered 8009 non-null int64
	10 Weight_in_gms 8009 non-null int64
	11 Reached.on.Time_Y.N 8009 non-null int64
	dtypes: int64(8), object(4)
	memory usage: 751.0+ KB
	"""

	print(df2.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 2990 entries, 0 to 2989
	Data columns (total 11 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 ID 2990 non-null int64
	1 Warehouse_block 2990 non-null object
	2 Mode_of_Shipment 2990 non-null object
	3 Customer_care_calls 2990 non-null int64
	4 Customer_rating 2990 non-null int64
	5 Cost_of_the_Product 2990 non-null int64
	6 Prior_purchases 2990 non-null int64
	7 Product_importance 2990 non-null object
	8 Gender 2990 non-null object
	9 Discount_offered 2990 non-null int64
	10 Weight_in_gms 2990 non-null int64
	dtypes: int64(7), object(4)
	memory usage: 257.1+ KB
	"""

	# (1) 결측치 처리
	## 필요 없음.

	# (2) 필요 없는 변수 제거
	## ID 컬럼 삭제
	X_train = df1.copy()
	X_test = df2.copy()

	X_train = X_train.drop('ID', axis=1)
	X_test = X_test.drop('ID', axis=1)

	# (3) 종속 변수, 독립 변수 분리
	y = X_train['Reached.on.Time_Y.N']
	X = X_train.drop('Reached.on.Time_Y.N', axis=1)

	# (4) 원-핫 인코딩
	X_encoded = pd.get_dummies(X)
	X_test_encoded = pd.get_dummies(X_test)

	X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0) # 훈련/테스트 데이터 열 구성 맞추기

	# (5) 데이터 분할
	from sklearn.model_selection import train_test_split

	X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

	print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

	""" 출력 결과
	(6407, 19) (1602, 19) (6407,) (1602,)
	"""

	# (6) 모델링
	from sklearn.ensemble import RandomForestClassifier

	model = RandomForestClassifier()
	model.fit(X_tr, y_tr)

	# (7) 예측
	pred = model.predict(X_val)

	# (8) 평가
	from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

	cm = confusion_matrix(y_val, pred)
	print(cm)

	""" 출력 결과
	[[388 176]
	[289 749]]
	"""

	acc_score = accuracy_score(y_val, pred)
	print(acc_score)

	""" 출력 결과
	0.7097378277153558
	"""

	roc_auc_score = roc_auc_score(y_val, pred)
	print(roc_auc_score)

	""" 출력 결과
	0.704761611937851
	"""

	# (9) 테스트 데이터 예측 (예측 확률)
	pred = model.predict_proba(X_test_encoded)
	print(pred)

	""" 출력 결과
	[[0.3 0.7 ]
	[0.14 0.86]
	[0.56 0.44]
	...
	[0.33 0.67]
	[0.26 0.74]
	[0.25 0.75]]
	"""

	# (10) CSV 내보내기
	result = pd.DataFrame({
	'ID': df2['ID'],
	'pred': pred[:, 1] # 정시 도착 여부를 예측하는 것이므로, 클래스1(정시에 도착)을 선택한다.
	})

	print(result[:1000])

	""" 출력 결과
	ID pred
	0 8010 0.70
	1 8011 0.86
	2 8012 0.44
	3 8013 0.54
	4 8014 0.38
	.. ... ...
	995 9005 0.27
	996 9006 0.76
	997 9007 0.38
	998 9008 0.47
	999 9009 0.30

	[1000 rows x 2 columns]
	"""

	result.to_csv('./outputs/result_q1.csv', index=False)

	import pandas as pd

	df1 = pd.read_csv('./datasets/data_q2-01.csv')
	df2 = pd.read_csv('./datasets/data_q2-02.csv')

	print(df1.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 1491 entries, 0 to 1490
	Data columns (total 10 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 X 1491 non-null int64
	1 Age 1491 non-null int64
	2 Employment Type 1491 non-null object
	3 GraduateOrNot 1491 non-null object
	4 AnnualIncome 1491 non-null int64
	5 FamilyMembers 1491 non-null int64
	6 ChronicDiseases 1491 non-null int64
	7 FrequentFlyer 1491 non-null object
	8 EverTravelledAbroad 1491 non-null object
	9 TravelInsurance 1491 non-null int64
	dtypes: int64(6), object(4)
	memory usage: 116.6+ KB
	"""

	print(df2.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 496 entries, 0 to 495
	Data columns (total 9 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 X 496 non-null int64
	1 Age 496 non-null int64
	...
	8 EverTravelledAbroad 496 non-null object
	dtypes: int64(5), object(4)
	memory usage: 35.0+ KB
	"""

	# (1) 결측치 처리
	## 필요 없음.

	# (2) 필요없는 변수 제거
	## X 컬럼 삭제
	X_train = df1.copy()
	X_test = df2.copy()

	X_train = X_train.drop('X', axis=1)
	X_test = X_test.drop('X', axis=1)

	# (3)독립 변수, 종속 변수 분리
	y = X_train['TravelInsurance']
	X = X_train.drop('TravelInsurance', axis=1)

	# (4) 원-핫 인코딩
	X_encoded = pd.get_dummies(X)
	X_test_encoded = pd.get_dummies(X_test)

	X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0) # 훈련/테스트 데이터 열 구성 맞추기

	# (5) 데이터 분할
	from sklearn.model_selection import train_test_split

	X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

	print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

	""" 출력 결과
	(1192, 12) (299, 12) (1192,) (299,)
	"""

	# (6) 모델링
	from sklearn.ensemble import RandomForestClassifier

	model = RandomForestClassifier()
	model.fit(X_tr, y_tr)

	# (7) 예측
	pred = model.predict(X_val)
	print(pred)

	""" 출력 결과
	[0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0
	1 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0
	0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0
	1 1 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 1 1 0
	0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
	1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0
	0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0
	1 1 0 0 0 0 0 1 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1
	0 0 1]
	"""

	# (8) 평가
	from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score

	cm = confusion_matrix(y_val, pred)
	print(cm)

	""" 출력 결과
	[[165 25]
	[ 38 71]]
	"""

	acc_score = accuracy_score(y_val, pred)
	print(acc_score)

	""" 출력 결과
	0.8160535117056856
	"""

	roc_auc_score = roc_auc_score(y_val, pred)
	print(roc_auc_score)

	""" 출력 결과
	0.7813800277392512
	"""

	# (9) 테스트 데이터 예측 (예측 확률)
	pred = model.predict_proba(X_test_encoded)
	print(pred[:50])

	""" 출력 결과
	[[0.15 0.85 ]
	[0.14 0.86 ]
	[0.76833333 0.23166667]
	[0.9175 0.0825 ]
	[0.975 0.025 ]
	[0.72332468 0.27667532]
	[0.23 0.77 ]
	[0.9 0.1 ]
	[0.65852381 0.34147619]
	[0.34021429 0.65978571]
	[0.84589216 0.15410784]
	[0.99 0.01 ]
	[0.76916667 0.23083333]
	[0.98333333 0.01666667]
	[0.76156349 0.23843651]
	[0.99 0.01 ]
	[0.95533333 0.04466667]
	[0.35 0.65 ]
	[0.61083333 0.38916667]
	[0.9975 0.0025 ]
	[0.78328571 0.21671429]
	[0.93916667 0.06083333]
	[0. 1. ]
	[0.03 0.97 ]
	[0.87216667 0.12783333]
	[0.90203571 0.09796429]
	[0. 1. ]
	[0.87666667 0.12333333]
	[1. 0. ]
	[0.871 0.129 ]
	[0. 1. ]
	[1. 0. ]
	[0.95533333 0.04466667]
	[0.33797619 0.66202381]
	[0.33471429 0.66528571]
	[0.32 0.68 ]
	[0.235 0.765 ]
	[0.28416667 0.71583333]
	[0.995 0.005 ]
	[0.92366667 0.07633333]
	[0.95533333 0.04466667]
	[0.51345238 0.48654762]
	[0.27 0.73 ]
	[0.33 0.67 ]
	[0.79 0.21 ]
	[0.91 0.09 ]
	[0.76916667 0.23083333]
	[1. 0. ]
	[0.42 0.58 ]
	[0.55413095 0.44586905]]
	"""

	# (10) CSV 내보내기
	result = pd.DataFrame({
	'index': df2['X'],
	'y_pred': pred[:, 1] # 클래스1 (여행 보험에 가입)
	})
	print(result[:1000])

	""" 출력 결과
	index y_pred
	0 1491 0.860000
	1 1492 0.834444
	2 1493 0.169000
	3 1494 0.160000
	4 1495 0.039333
	.. ... ...
	491 1982 0.930000
	492 1983 0.810000
	493 1984 0.022381
	494 1985 0.620000
	495 1986 0.480881

	[496 rows x 2 columns]
	"""

	result.to_csv('./outputs/result_q2.csv', index=False)

	import pandas as pd

	df1 = pd.read_csv('./datasets/data_q3-01.csv')
	df2 = pd.read_csv('./datasets/data_q3-02.csv')

	print(df1.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 6665 entries, 0 to 6664
	Data columns (total 11 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 ID 6665 non-null int64
	1 Gender 6665 non-null object
	2 Ever_Married 6665 non-null object
	3 Age 6665 non-null int64
	4 Graduated 6665 non-null object
	5 Profession 6665 non-null object
	6 Work_Experience 6665 non-null int64
	7 Spending_Score 6665 non-null object
	8 Family_Size 6665 non-null int64
	9 Var_1 6665 non-null object
	10 Segmentation 6665 non-null object
	dtypes: int64(4), object(7)
	memory usage: 572.9+ KB
	"""

	print(df2.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 2154 entries, 0 to 2153
	Data columns (total 10 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 ID 2154 non-null int64
	1 Gender 2154 non-null object
	2 Ever_Married 2154 non-null object
	3 Age 2154 non-null int64
	4 Graduated 2154 non-null object
	5 Profession 2154 non-null object
	6 Work_Experience 2154 non-null int64
	7 Spending_Score 2154 non-null object
	8 Family_Size 2154 non-null int64
	9 Var_1 2154 non-null object
	dtypes: int64(4), object(6)
	memory usage: 168.4+ KB
	"""

	# (1) 결측치 처리
	## 필요 없음.

	# (2) 필요 없는 변수 제거
	## ID 컬럼 삭제
	X_train = df1.copy()
	X_test = df2.copy()

	X_train = X_train.drop('ID', axis=1)
	X_test = X_test.drop('ID', axis=1)

	# (3) 독립 변수, 종속 변수 분리
	y = X_train['Segmentation']
	X = X_train.drop('Segmentation', axis=1)

	## (4) 원-핫 인코딩
	X_encoded = pd.get_dummies(X)
	X_test_encoded = pd.get_dummies(X_test)

	X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0) # 훈련/테스트 데이터 열 구성 맞추기

	## (5) 데이터 분할
	from sklearn.model_selection import train_test_split

	X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

	## (6) 모델링
	from sklearn.ensemble import RandomForestClassifier

	model = RandomForestClassifier()
	model.fit(X_tr, y_tr)

	## (7) 예측
	pred = model.predict(X_val)
	print(pred)

	""" 출력 결과
	['B' 'C' 'D' ... 'D' 'D' 'C']
	"""

	## (8) 평가
	from sklearn.metrics import f1_score

	cm = confusion_matrix(y_val, pred, labels=['A', 'B', 'C', 'D'])
	print(cm)

	""" 출력 결과
	[[137 79 34 91]
	[ 81 104 90 31]
	[ 48 83 179 39]
	[ 58 28 23 228]]
	"""

	f1_score = f1_score(y_val, pred, average='macro')
	print(f1_score)

	""" 출력 결과
	0.6792915714446815
	"""

	## (9) 테스트 데이터 예측
	pred = model.predict(X_test_encoded)
	print(pred)


	""" 출력 결과
	['B' 'C' 'C' ... 'B' 'C' 'D']
	"""

	## (10) CSV 내보내기
	result = pd.DataFrame({
	'ID': df2['ID'],
	'pred': pred
	})
	print(result)

	"""출력 결과
	ID pred
	0 458989 A
	1 458994 C
	2 459000 C
	3 459003 C
	4 459005 A
	... ... ...
	2149 467950 A
	2150 467954 D
	2151 467958 A
	2152 467961 C
	2153 467968 D

	[2154 rows x 2 columns]
	"""

	result.to_csv('./outputs/result_q3.csv', index=False)

	import pandas as pd

	df1 = pd.read_csv('./datasets/data_q4-01.csv')
	df2 = pd.read_csv('./datasets/data_q4-02.csv')

	print(df1.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 6899 entries, 0 to 6898
	Data columns (total 9 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 model 6899 non-null object
	1 year 6899 non-null int64
	2 price 6899 non-null int64
	3 transmission 6899 non-null object
	4 mileage 6899 non-null int64
	5 fuelType 6899 non-null object
	6 tax 6899 non-null int64
	7 mpg 6899 non-null float64
	8 engineSize 6899 non-null float64
	dtypes: float64(2), int64(4), object(3)
	memory usage: 485.2+ KB
	"""

	print(df2.info())


	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 3882 entries, 0 to 3881
	Data columns (total 8 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 model 3882 non-null object
	1 year 3882 non-null int64
	2 transmission 3882 non-null object
	3 mileage 3882 non-null int64
	4 fuelType 3882 non-null object
	5 tax 3882 non-null int64
	6 mpg 3882 non-null float64
	7 engineSize 3882 non-null float64
	dtypes: float64(2), int64(3), object(3)
	memory usage: 242.8+ KB
	"""

	# (1) 결측치 처리
	## 필요 없음.

	# (2) 필요 없는 변수 제거
	## 필요 없음.
	X_train = df1.copy()
	X_test = df2.copy()

	# (3) 독립 변수, 종속 변수 분리
	y = X_train['price']
	X = X_train.drop('price', axis=1)

	# (4) 원-핫 인코딩
	X_encoded = pd.get_dummies(X)
	X_test_encoded = pd.get_dummies(X_test)

	X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0) # 훈련/테스트 데이터 열 구성 맞추기

	# (5) 데이터 분할
	from sklearn.model_selection import train_test_split

	X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)
	print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

	""" 출력 결과
	(5519, 35) (1380, 35) (5519,) (1380,)
	"""

	# (6) 모델링
	from sklearn.ensemble import RandomForestRegressor

	model = RandomForestRegressor()
	model.fit(X_tr, y_tr)

	# (7) 예측
	pred = model.predict(X_val)
	print(pred)

	""" 출력 결과
	[29677.32 21173.22 33422.73 ... 26821.71 19052.17 14527.45]
	"""

	# (8) 평가
	from sklearn.metrics import root_mean_squared_error

	rmse = root_mean_squared_error(y_val, pred)
	print(rmse)

	""" 출력 결과
	3870.9560766020772
	"""

	# (9) 테스트 데이터 예측
	pred = model.predict(X_test_encoded)
	print(pred)

	""" 출력 결과
	[17652.87 29504.3 23954.44 ... 16623.11 10795.43 16917.17]
	"""

	# (10) CSV 내보내기
	result = pd.DataFrame({
	'pred': pred
	})

	print(result)

	""" 출력 결과
	pred
	0 18179.28
	1 29664.95
	2 24153.54
	3 23039.82
	4 20186.82
	... ...
	3877 19082.91
	3878 15840.25
	3879 16553.06
	3880 10671.41
	3881 17202.90

	[3882 rows x 1 columns]
	"""

	result.to_csv('./outputs/result_q4.csv', index=False)

	import pandas as pd

	df1 = pd.read_csv('./datasets/data_q5-01.csv')
	df2 = pd.read_csv('./datasets/data_q5-02.csv')

	print(df1.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 2000 entries, 0 to 1999
	Data columns (total 21 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 battery_power 2000 non-null int64
	1 blue 2000 non-null int64
	2 clock_speed 2000 non-null float64
	3 dual_sim 2000 non-null int64
	4 fc 2000 non-null int64
	5 four_g 2000 non-null int64
	6 int_memory 2000 non-null int64
	7 m_dep 2000 non-null float64
	8 mobile_wt 2000 non-null int64
	9 n_cores 2000 non-null int64
	10 pc 2000 non-null int64
	11 px_height 2000 non-null int64
	12 px_width 2000 non-null int64
	13 ram 2000 non-null int64
	14 sc_h 2000 non-null int64
	15 sc_w 2000 non-null int64
	16 talk_time 2000 non-null int64
	17 three_g 2000 non-null int64
	18 touch_screen 2000 non-null int64
	19 wifi 2000 non-null int64
	20 price_range 2000 non-null int64
	dtypes: float64(2), int64(19)
	memory usage: 328.3 KB
	"""

	print(df2.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 1000 entries, 0 to 999
	Data columns (total 21 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 id 1000 non-null int64
	1 battery_power 1000 non-null int64
	2 blue 1000 non-null int64
	3 clock_speed 1000 non-null float64
	4 dual_sim 1000 non-null int64
	5 fc 1000 non-null int64
	6 four_g 1000 non-null int64
	7 int_memory 1000 non-null int64
	8 m_dep 1000 non-null float64
	9 mobile_wt 1000 non-null int64
	10 n_cores 1000 non-null int64
	11 pc 1000 non-null int64
	12 px_height 1000 non-null int64
	13 px_width 1000 non-null int64
	14 ram 1000 non-null int64
	15 sc_h 1000 non-null int64
	16 sc_w 1000 non-null int64
	17 talk_time 1000 non-null int64
	18 three_g 1000 non-null int64
	19 touch_screen 1000 non-null int64
	20 wifi 1000 non-null int64
	dtypes: float64(2), int64(19)
	memory usage: 164.2 KB
	"""

	# (1) 결측치 처리
	## 필요 없음.

	# (2) 필요 없는 변수 제거
	## id 컬럼 삭제
	X_train = df1.copy()
	X_test = df2.copy()

	X_test = X_test.drop('id', axis=1)

	# (3) 독립 변수, 종속 변수 분리
	y = X_train['price_range']
	X = X_train.drop('price_range', axis=1)

	# (4) 원-핫 인코딩
	X_encoded = pd.get_dummies(X)
	X_test_encoded = pd.get_dummies(X_test)

	X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0) # 훈련/테스트 데이터 열 구성 맞추기

	# (5) 데이터 분할
	from sklearn.model_selection import train_test_split

	X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

	print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

	""" 출력 결과
	(1600, 20) (400, 20) (1600,) (400,)
	"""

	# (6) 모델링
	from sklearn.ensemble import RandomForestClassifier

	model = RandomForestClassifier()
	model.fit(X_tr, y_tr)

	# (7) 예측
	pred = model.predict(X_val)
	print(pred)

	""" 출력 결과
	[0 3 1 1 1 3 2 1 2 0 2 1 0 0 0 0 3 2 0 1 1 1 3 3 0 1 1 2 0 1 1 3 3 0 0 1 1
	2 3 0 0 2 1 3 0 1 3 1 0 1 1 2 3 2 1 2 2 0 3 3 2 1 3 3 1 0 1 3 2 2 3 0 0 0
	3 1 0 3 1 2 0 1 2 2 0 0 2 2 2 1 3 1 2 0 1 1 1 0 0 2 0 0 2 3 3 3 3 3 0 0 0
	1 3 2 0 2 3 0 1 0 2 1 3 2 1 3 0 2 1 3 0 3 0 3 0 3 2 1 2 1 1 3 0 1 2 3 1 2
	1 2 3 0 2 0 2 2 1 0 0 2 0 2 2 0 0 3 3 1 3 3 1 1 1 3 0 0 0 2 2 1 0 2 1 0 1
	1 0 1 3 1 3 3 1 2 2 1 3 3 1 0 2 0 3 0 2 2 3 1 3 2 2 0 1 2 1 0 2 1 2 2 1 2
	0 1 2 1 1 1 0 2 3 0 2 3 2 3 2 2 0 0 2 0 3 1 2 0 1 1 1 2 3 0 1 1 0 2 1 2 2
	3 1 0 2 1 2 2 1 3 0 1 0 1 0 0 3 3 0 0 3 1 3 3 1 1 3 3 1 1 0 3 1 0 0 1 1 0
	3 1 3 3 3 1 1 0 2 2 2 2 3 3 3 3 3 3 0 3 2 2 1 2 1 2 3 0 0 3 3 2 3 1 0 3 1
	3 0 0 3 3 1 1 0 3 3 3 3 2 0 3 0 3 2 2 0 1 0 2 1 2 0 0 2 1 3 2 3 0 2 2 0 2
	2 3 1 2 1 1 3 2 2 1 0 1 1 2 1 0 0 2 3 3 2 3 1 3 1 0 1 1 3 1]
	"""

	# (8) 평가
	from sklearn.metrics import confusion_matrix, f1_score

	cm = confusion_matrix(y_val, pred, labels=[0, 1, 2, 3])
	print(cm)

	""" 출력 결과
	[[111 7 0 0]
	[ 13 89 3 0]
	[ 0 6 82 12]
	[ 0 0 5 72]]
	"""

	f1_score = f1_score(y_val, pred, average='macro')
	print(f1_score)

	""" 출력 결과
	0.8636209242150752
	"""

	# (9) 테스트 데이터 예측
	pred = model.predict(X_test_encoded)
	print(pred)

	""" 출력 결과
	[3 3 2 3 1 3 3 1 3 0 3 3 0 0 2 0 2 1 3 2 1 3 1 1 3 0 2 0 2 0 2 0 3 0 0 1 3
	1 2 1 1 2 0 0 0 1 0 3 1 2 1 0 2 0 3 1 3 1 1 3 3 2 0 1 1 1 1 2 1 1 1 2 2 3
	3 0 2 0 2 3 0 3 3 0 3 0 3 1 3 0 1 1 2 1 2 1 0 2 1 3 1 0 0 3 1 2 0 1 2 3 3
	3 1 3 3 3 3 1 3 0 0 3 2 1 1 0 3 2 3 1 0 2 1 1 3 1 1 0 3 2 1 3 2 2 2 3 3 2
	2 3 2 3 0 0 2 2 3 3 3 3 2 2 3 3 3 3 1 0 3 0 0 0 1 0 0 1 0 0 1 2 1 0 0 1 2
	2 2 1 0 0 0 1 0 3 1 0 2 2 2 3 1 2 3 3 3 1 2 1 0 0 1 2 0 2 3 3 0 2 0 3 2 3
	3 0 0 1 0 3 0 1 0 2 2 1 3 0 2 0 3 1 2 0 0 2 1 3 3 3 1 1 3 0 0 2 3 3 1 3 2
	1 3 2 1 2 3 3 3 1 0 1 2 3 1 1 3 2 0 3 0 1 2 0 0 3 2 3 3 2 0 3 3 2 3 1 2 1
	1 0 2 3 1 0 0 3 0 3 0 1 2 0 2 3 1 3 2 2 1 2 0 0 0 1 3 2 0 0 0 3 2 0 3 3 1
	2 3 2 3 1 3 3 2 2 2 3 3 0 3 0 3 1 3 1 3 3 0 1 1 3 1 3 2 3 0 0 0 0 2 0 0 1
	1 1 2 3 2 0 1 0 0 3 3 0 3 1 2 2 1 2 3 1 1 2 2 1 2 0 1 1 0 3 2 0 0 1 0 0 1
	1 0 0 0 2 2 3 2 3 0 3 0 3 0 1 1 1 2 0 3 2 3 3 1 3 1 3 1 2 2 1 2 2 1 1 0 0
	0 1 2 1 0 3 3 1 2 3 0 0 3 1 1 1 2 2 3 0 3 0 2 3 3 3 0 2 0 2 2 0 1 1 0 0 1
	1 1 3 3 3 2 3 1 2 2 3 3 3 1 0 2 2 2 2 1 0 2 2 0 0 0 3 1 1 2 2 2 0 3 0 2 2
	0 3 0 2 3 0 2 1 3 3 1 1 2 3 2 0 2 1 3 0 3 3 1 2 3 2 3 0 1 2 3 1 3 2 3 1 0
	1 0 3 1 0 3 2 3 2 0 3 3 3 2 3 3 1 2 0 2 3 3 0 0 1 1 2 2 2 0 0 2 2 3 2 0 2
	1 3 3 0 1 3 1 2 1 0 0 0 2 1 0 1 1 2 2 1 2 2 1 0 3 0 0 3 2 0 0 0 0 0 3 0 3
	1 3 2 1 3 2 0 1 1 3 2 3 1 0 3 0 2 0 2 0 0 1 1 1 2 1 3 1 3 2 2 1 3 2 0 1 3
	0 3 3 0 2 1 1 2 0 3 2 0 3 2 3 0 0 3 0 1 2 3 2 2 2 2 1 2 3 0 1 0 2 2 1 0 0
	1 0 0 3 0 1 1 0 1 1 0 3 0 3 3 3 0 0 1 2 2 1 0 1 1 0 1 1 0 0 3 3 0 3 1 2 3
	0 1 0 2 2 0 3 1 0 3 0 1 0 2 3 3 2 3 0 3 2 0 1 0 3 3 2 0 2 1 3 1 0 3 3 0 3
	1 2 1 1 1 3 1 1 2 2 0 0 1 2 0 2 0 1 0 0 3 3 3 3 0 1 2 2 1 0 0 2 1 0 2 0 2
	2 2 1 2 0 2 1 3 0 0 3 1 3 0 0 2 3 2 1 3 2 1 0 0 2 3 0 3 0 0 0 2 2 1 2 0 3
	2 1 2 3 3 0 1 1 2 1 2 2 0 1 3 1 1 3 1 2 3 2 1 1 2 3 3 0 2 3 0 2 3 2 2 2 3
	2 0 1 2 0 2 1 1 2 2 2 1 2 0 0 1 3 1 0 1 1 3 1 0 0 3 2 2 3 0 3 3 2 1 3 0 1
	3 1 2 1 2 2 2 0 3 0 2 3 0 3 2 3 3 1 0 2 3 1 0 1 1 2 1 2 0 2 2 0 2 3 2 3 0
	2 1 1 2 2 3 3 0 2 1 2 1 3 0 1 3 0 1 0 0 3 2 2 0 0 0 0 3 2 3 3 0 0 2 1 0 2
	2]
	"""

	# (10) CSV 내보내기
	result = pd.DataFrame({
	'pred': pred
	})

	print(result)

	""" 출력 결과
	pred
	0 3
	1 3
	2 2
	3 3
	4 1
	.. ...
	995 2
	996 1
	997 0
	998 2
	999 2

	[1000 rows x 1 columns]
	"""

	result.to_csv('./outputs/result_q5.csv', index=False)

	import pandas as pd

	df1 = pd.read_csv('./datasets/data_q6-01.csv')
	df2 = pd.read_csv('./datasets/data_q6-02.csv')

	print(df1.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 2245 entries, 0 to 2244
	Data columns (total 8 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 ID 2245 non-null object
	1 연월 2245 non-null int64
	2 업종명 2245 non-null object
	3 이용자구분 2245 non-null object
	4 성별 2245 non-null object
	5 이용자수 2245 non-null int64
	6 이용건수 2245 non-null int64
	7 이용금액 2245 non-null int64
	dtypes: int64(4), object(4)
	memory usage: 140.4+ KB
	"""

	print(df2.info())

	""" 출력 결과
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 5020 entries, 0 to 5019
	Data columns (total 7 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 ID 5020 non-null object
	1 연월 5020 non-null int64
	2 업종명 5020 non-null object
	3 이용자구분 5020 non-null object
	4 성별 5020 non-null object
	5 이용자수 5020 non-null int64
	6 이용건수 5020 non-null int64
	dtypes: int64(3), object(4)
	memory usage: 274.7+ KB
	"""

	# (1) 결측치 처리
	## 필요 없음.

	# (2) 필요 없는 변수 제거
	## ID 컬럼 삭제
	X_train = df1.copy()
	X_test = df2.copy()

	X_train = X_train.drop("ID", axis=1)
	X_test = X_test.drop("ID", axis=1)

	# (3) 독립 변수, 종속 변수 분리
	y = X_train["이용금액"]
	X = X_train.drop("이용금액", axis=1)

	# (4) 원-핫 인코딩
	X_encoded = pd.get_dummies(X)
	X_test_encoded = pd.get_dummies(X_test)

	X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0) # 훈련/테스트 데이터 열 구성 맞추기

	# (5) 데이터 분할
	from sklearn.model_selection import train_test_split

	X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)

	print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

	""" 출력 결과
	(1796, 28) (449, 28) (1796,) (449,)
	"""

	# (6) 모델링
	from sklearn.ensemble import RandomForestRegressor

	model = RandomForestRegressor()
	model.fit(X_tr, y_tr)

	# (7) 예측
	pred = model.predict(X_val)
	print(pred[:100])

	""" 출력 결과
	[1.90008207e+09 4.65955478e+08 2.95193996e+09 3.90638913e+08
	2.35675097e+08 2.65152463e+07 4.65501828e+07 4.64963900e+05
	8.01738458e+08 1.13109676e+07 9.39969541e+08 2.55429366e+06
	3.52700899e+07 2.25255431e+07 2.40363667e+08 5.24261291e+08
	6.82240277e+07 2.10644493e+09 3.35211418e+08 3.99681682e+08
	1.85156922e+09 1.54739128e+08 1.41116640e+09 4.26778200e+05
	4.94774415e+08 4.79368719e+09 8.20259763e+08 3.31353793e+09
	1.87295992e+09 3.40675936e+08 2.29268637e+08 9.35716985e+08
	1.70437333e+09 4.60506068e+06 8.81625451e+08 2.96526898e+08
	4.08634663e+08 5.22350997e+09 4.30011570e+08 2.47635484e+08
	8.01235334e+06 3.48341015e+08 1.35211088e+09 6.67125585e+09
	3.13282845e+06 7.17754028e+08 5.57982725e+07 1.78484304e+07
	4.72766825e+08 9.62785377e+08 3.33496207e+08 1.45653110e+08
	6.50008131e+08 1.89599953e+09 1.23010364e+09 9.15219653e+07
	5.70047836e+08 1.78863103e+09 8.56080821e+08 9.51140759e+09
	7.75300879e+08 4.19912040e+08 1.05095150e+06 5.01733880e+09
	4.46393427e+07 9.94192375e+07 2.61651650e+06 4.19109183e+08
	1.56882579e+08 1.25811332e+09 5.71898606e+08 1.01670072e+10
	1.28331156e+08 1.31681496e+08 2.52039078e+08 2.58716089e+07
	1.44338939e+09 4.09569719e+08 2.43890504e+08 1.69064174e+06
	4.62218252e+07 4.92879416e+07 1.48273870e+06 2.31906067e+08
	1.54132082e+07 8.24023964e+08 3.50660196e+09 4.02912682e+08
	1.12496411e+07 1.65915469e+08 1.01712096e+08 1.23282782e+09
	3.31261303e+09 1.52742089e+09 3.49351562e+09 9.05551270e+07
	1.29130468e+09 9.74110522e+08 2.34913754e+07 4.35480839e+08]
	"""

	# (8) Evaluate (root_mean_squared_error requires scikit-learn 1.4 or later)
	from sklearn.metrics import root_mean_squared_error

	rmse = root_mean_squared_error(y_val, pred)
	print(rmse)

	""" Output
	182519728.18885794
	"""

	# (9) Predict on the test data
	pred = model.predict(X_test_encoded)
	print(pred)

	""" Output
	[5.79965803e+09 4.09976141e+07 2.43990300e+06 ... 5.42755781e+09
	7.45447840e+08 6.56516711e+08]
	"""

	# (10) Export to CSV
	result = pd.DataFrame({
	    'ID': df2['ID'],
	    'pred': pred
	})

	print(result)

	""" Output
	ID pred
	0 ID_2575 5.467737e+09
	1 ID_6637 4.008152e+07
	2 ID_5704 2.338352e+06
	3 ID_3606 1.783806e+06
	4 ID_6443 4.369645e+05
	... ... ...
	5015 ID_4523 3.886178e+08
	5016 ID_3483 1.018352e+08
	5017 ID_453 5.447200e+09
	5018 ID_998 1.152426e+09
	5019 ID_3237 5.857870e+08

	[5020 rows x 2 columns]
	"""

	result.to_csv('./outputs/result_q6.csv', index=False)
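
Step (4) above one-hot encodes the training and test frames separately and then reindexes the test frame to the training columns. The following is a minimal sketch of what that alignment does, using a made-up `city` column and toy frames (all names and values here are hypothetical, not from the exam data).

```python
import pandas as pd

# Hypothetical toy data: the test set contains a category ("D") never seen in training
train = pd.DataFrame({"city": ["A", "B", "C"], "n": [1, 2, 3]})
test = pd.DataFrame({"city": ["A", "D"], "n": [4, 5]})

train_enc = pd.get_dummies(train)  # columns: n, city_A, city_B, city_C
test_enc = pd.get_dummies(test)    # columns: n, city_A, city_D

# Keep exactly the training columns: missing ones are added as 0,
# and columns that exist only in the test set (city_D) are dropped
aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(aligned.columns))
```

Without this step the model would receive a test matrix whose columns differ from the ones it was fitted on, and `predict` would fail or silently misalign features.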

	import pandas as pd

	df1 = pd.read_csv('./datasets/data_q7-01.csv')
	df2 = pd.read_csv('./datasets/data_q7-02.csv')

	print(df1.info())

	""" Output
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 900 entries, 0 to 899
	Data columns (total 8 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 date 900 non-null object
	1 day_of_week 900 non-null object
	2 month 900 non-null int64
	3 station_name 895 non-null object
	4 visibility 892 non-null float64
	5 precipitation 900 non-null float64
	6 temperature 900 non-null float64
	7 num_people 900 non-null int64
	dtypes: float64(3), int64(2), object(3)
	memory usage: 56.4+ KB
	"""

	print(df2.info())

	""" Output
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 300 entries, 0 to 299
	Data columns (total 7 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 date 300 non-null object
	1 day_of_week 300 non-null object
	2 month 300 non-null int64
	3 station_name 300 non-null object
	4 visibility 300 non-null float64
	5 precipitation 300 non-null float64
	6 temperature 300 non-null float64
	dtypes: float64(3), int64(1), object(3)
	memory usage: 16.5+ KB
	"""

	# (1) Handle missing values
	X_train = df1.copy()
	X_test = df2.copy()

	target_column1 = X_train['station_name']
	target_column2 = X_train['visibility']

	## Replace with the mode (mode() returns a Series, so take its first value)
	X_train['station_name'] = target_column1.fillna(target_column1.mode()[0])
	X_train['visibility'] = target_column2.fillna(target_column2.mode()[0])

	# (2) Drop unneeded variables
	## Not needed.

	# (3) Separate features and target
	y = X_train['num_people']
	X = X_train.drop('num_people', axis=1)

	# (4) One-hot encoding
	X_encoded = pd.get_dummies(X)
	X_test_encoded = pd.get_dummies(X_test)

	X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0) # Align test columns with the training columns

	# (5) Split the data
	from sklearn.model_selection import train_test_split

	X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)
	print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

	""" Output
	(720, 916) (180, 916) (720,) (180,)
	"""

	# (6) Modeling
	from sklearn.ensemble import RandomForestRegressor

	model = RandomForestRegressor()
	model.fit(X_tr, y_tr)

	# (7) Predict on the validation set
	pred = model.predict(X_val)
	print(pred)

	""" Output
	[12236.46 12733.06 9336.26 13332.44 8553.55 14111.59 11656.91 13240.12
	11802.8 13838.01 10012.24 9335.87 12446.76 13129.15 12942.51 13564.35
	11506.63 11582.89 12720.87 9484.63 11178.24 12422.58 10516.45 11982.67
	11057.6 15171.71 9680.88 11961.17 13791.65 11144.45 12174.55 11860.55
	12054.97 13899.41 15109.53 11254.85 13535.48 13576.75 11024.56 8897.38
	14461.47 13247.12 12323.59 9413.27 12437.45 12031.8 11940.07 10894.57
	10830.42 13728.16 11830.95 13622.64 11791.73 11853.71 12275.16 15121.62
	11812.97 12685.03 12356.14 14686.27 14524.01 12977.51 12367.66 11177.6
	9392.41 11030.1 12833.11 11300.76 12465.19 13866.05 14231.53 11574.67
	12709.23 15923.4 13073.99 10711.67 13892.44 13151.95 13618.13 13082.44
	14719.13 12044.77 14165.7 11126.25 15306.02 10878.04 11838.32 10307.7
	12501.42 12603.72 9504.5 10612.06 11305.31 9766.67 14549.17 13617.26
	11941.98 15287.39 11708.24 13083.42 12053.59 12916.26 10926.3 15131.65
	10840.68 10896.91 14725.57 11345.14 11859.78 12316.63 9486.17 14127.73
	10503.94 13129.13 12327.28 13155.95 16055.36 12609.08 12220.94 13935.74
	11013.08 15427.32 13283.78 14681.94 12883.63 16854.11 12417.27 13419.06
	13944.63 12128.41 12846.43 13973.34 13042.27 9215.26 14014.56 13895.13
	11449.83 11291.52 11712.63 11867.93 14125.03 14614.49 9370.99 13336.32
	11844.1 13490.19 15247.68 12180.57 15175.84 11628.74 11133.86 13937.42
	11595.94 15576.15 13869.88 14176.7 11274.85 10927.77 12752.74 11683.27
	10707.67 14043.11 14405.62 11629.11 9986.95 11341.57 11402.65 12307.42
	12606.05 11840.43 13645.18 10603.76 13334.25 13896.73 12306.21 11569.46
	16055.71 13872.51 11135.61 11484.46]
	"""

	# (8) Evaluate
	from sklearn.metrics import mean_absolute_error

	mae = mean_absolute_error(y_val, pred) # Mean absolute error (MAE)
	print(mae)

	""" Output
	629.1398888888889
	"""

	# (9) Predict on the test data
	pred = model.predict(X_test_encoded)
	print(pred)

	""" Output
	[12070.17 11837.78 12487.12 13741.28 13921.6 12806.1 12948.99 12181.78
	10363.02 13993.09 12530.8 13532.8 13312.56 13874.18 13947.93 12502.65
	13577.14 12124.83 12851.14 11886.71 13303.52 13422.19 13941.75 13821.01
	14186.37 12203.9 11470.13 10991.97 11915.85 9743.88 14051.56 12532.26
	12481.29 9768.91 12619.09 13856.84 10663.67 12207.56 12422.26 11277.
	13433.01 11496.41 11887.91 14201.2 12078.11 13310.41 13584.11 13648.09
	12133.83 11759.32 12802.43 11294.08 16222.76 13173.95 10372.62 9221.57
	12197.82 11588.91 12580.89 13634.82 13683.65 13026.51 14542.45 11039.62
	12613.09 10925.29 14673.69 9060.45 9234.86 12017.93 15047.73 14234.73
	12791.03 13606. 12274. 14743.8 10581.5 12496.64 12980.62 10429.97
	14575.65 12616.09 11999.2 14104.46 10435.33 15434.21 14557.23 12109.58
	14681.67 10052.42 13076.44 10712.79 13403.83 15092.52 15574.82 14658.26
	11598.32 10963.11 13458.29 14363.23 14825.67 8947.51 15564.43 13259.87
	11442.01 13767.38 15013.07 12343.67 10090.5 11489.02 13938.28 12861.66
	13658.69 14889.84 13746.61 13862.68 12458.25 12886.42 10664.74 11238.15
	14278.66 12131.63 12765.67 14939.48 12385.35 13002.15 11617.37 12261.03
	15125.96 12791.93 15149.38 10920.4 12877.92 12037.61 15158.12 13978.45
	13690.88 14017.78 11535.3 12278.11 9851.28 12080.31 12488.82 12302.05
	13884.63 12421.07 10938.66 12968.42 12664.09 12703.78 10755.93 11475.61
	12403.18 11503.27 13759.99 11976.77 10080.32 8824.29 11917.5 11377.46
	13920.9 15540.7 13059.64 10909.93 14191.86 11981.46 10923.34 10475.18
	14382.32 16308.99 9016.27 12148.03 13619.67 11503.18 13703.97 11599.52
	13627.66 11265.62 13441.59 13334.74 12704.51 13559.1 12253.13 11081.88
	15250.2 12708.14 11314.68 14044.41 10453.28 15837.55 12298.65 10426.15
	14720.06 11752.15 12812.51 12516.28 13891. 13620.81 11834.42 13884.91
	13320.04 12862.61 15074.58 12925.26 11862.35 13940.3 15179.27 11365.45
	12549.91 15724.7 14703.01 11117.68 11424.55 13110.28 14208. 14195.6
	13308.58 12227.53 12350.51 11996.04 12705.86 9491.18 15335.24 9433.83
	13504.78 12113.44 14539.61 12426.03 11408.18 12705.72 11053.25 12616.64
	14614.25 9145.26 13305.27 11757.91 11768.79 14263.87 11014.09 11949.22
	13928.54 14379.71 12685.12 10804.89 12974.54 13794.75 12420. 12649.26
	14134.92 11515.95 14193.01 11853.54 11696.01 10853.63 13179.79 10549.92
	11985.93 12459.23 12369.26 12367.93 13452.42 12488.63 8839.12 16069.07
	14908.72 15293.89 12985.79 14797.71 12789.99 12017.57 12029.58 11229.91
	12126.57 8761.5 12800.93 14219.93 12596.58 15352.03 12108.7 15307.53
	11358.01 13951.89 14165.46 12285.25 9815.1 12473.16 12877. 15298.89
	14479.22 16406.25 14685.8 10965.84 10915.15 13624.83 12935.05 13890.44
	16332.45 15962.11 13871.73 14502.21]
	"""

	# (10) Export to CSV
	result = pd.DataFrame({
	    'pred': pred
	})
	print(result)

	""" Output
	pred
	0 11767.44
	1 11823.30
	2 12558.62
	3 13440.26
	4 13793.40
	.. ...
	295 14024.25
	296 16082.65
	297 15066.40
	298 13964.19
	299 14439.47

	[300 rows x 1 columns]
	"""

	result.to_csv('./outputs/result_q7.csv', index=False)
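
In the listing above, the `date` column is an object column, so `pd.get_dummies` turns every distinct date into its own column. This is why the split in step (5) reports 916 columns for 900 rows. Dates in the test file will generally not match any of those columns. One possible alternative, sketched here on made-up rows, is to parse the date and keep a few numeric parts instead. The date format and the derived column names `day` and `dayofyear` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; the real file's date format is assumed to be parseable by pd.to_datetime
df = pd.DataFrame({
    "date": ["2022-01-03", "2022-02-14"],
    "temperature": [1.5, 3.0],
})

parsed = pd.to_datetime(df["date"])

# Replace the raw string with numeric features a tree model can split on
df["day"] = parsed.dt.day
df["dayofyear"] = parsed.dt.dayofyear
df = df.drop("date", axis=1)

print(df)
```

The same transformation would have to be applied to both the training and the test frame before encoding the remaining categorical columns.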

	import pandas as pd

	train = pd.read_csv("data/customer_train.csv")
	test = pd.read_csv("data/customer_test.csv")

	# User code

	print(train.info())

	""" Output
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 3500 entries, 0 to 3499
	Data columns (total 11 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 회원ID 3500 non-null int64
	1 총구매액 3500 non-null int64
	2 최대구매액 3500 non-null int64
	3 환불금액 1205 non-null float64
	4 주구매상품 3500 non-null object
	5 주구매지점 3500 non-null object
	6 방문일수 3500 non-null int64
	7 방문당구매건수 3500 non-null float64
	8 주말방문비율 3500 non-null float64
	9 구매주기 3500 non-null int64
	10 성별 3500 non-null int64
	dtypes: float64(3), int64(6), object(2)
	memory usage: 300.9+ KB
	"""

	print(test.info())

	""" Output
	<class 'pandas.core.frame.DataFrame'>
	RangeIndex: 2482 entries, 0 to 2481
	Data columns (total 10 columns):
	# Column Non-Null Count Dtype
	--- ------ -------------- -----
	0 회원ID 2482 non-null int64
	1 총구매액 2482 non-null int64
	2 최대구매액 2482 non-null int64
	3 환불금액 871 non-null float64
	4 주구매상품 2482 non-null object
	5 주구매지점 2482 non-null object
	6 방문일수 2482 non-null int64
	7 방문당구매건수 2482 non-null float64
	8 주말방문비율 2482 non-null float64
	9 구매주기 2482 non-null int64
	dtypes: float64(3), int64(5), object(2)
	memory usage: 194.0+ KB
	"""

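	# (1) Handle missing values
	X_train = train.copy()
	X_test = test.copy()

	## Most customers have no refund, so replace missing values with 0 (write the result back to both frames)
	X_train['환불금액'] = X_train['환불금액'].fillna(0)
	X_test['환불금액'] = X_test['환불금액'].fillna(0)

	# (2) Drop unneeded variables
	X_train = X_train.drop('회원ID', axis=1)
	X_test = X_test.drop('회원ID', axis=1)

	# (3) Separate features and target
	## Use X_train (not train) so the fill and the 회원ID drop above take effect;
	## the encoded frame then has one column fewer than when 회원ID is kept
	y = X_train["성별"]
	X = X_train.drop("성별", axis=1)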

	# (4) One-hot encoding
	X_encoded = pd.get_dummies(X)
	X_test_encoded = pd.get_dummies(X_test)

	X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)

	# (5) Split the data
	from sklearn.model_selection import train_test_split

	X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2)
	print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

	""" Output
	(2800, 74) (700, 74) (2800,) (700,)
	"""

	# (6) Modeling
	from sklearn.ensemble import RandomForestClassifier

	model = RandomForestClassifier()
	model.fit(X_tr, y_tr)

	# (7) Predict on the validation set
	pred = model.predict(X_val)

	# (8) Evaluate
	from sklearn.metrics import roc_auc_score

	auc = roc_auc_score(y_val, pred) # Separate name so the imported function is not overwritten
	print(auc)

	""" Output
	0.5846884367456296
	"""

	# (9) Predict on the test data
	pred = model.predict(X_test_encoded)
	print(pred)

	""" Output
	[0 0 0 ... 0 0 1]
	"""

	# (10) Export to CSV
	result = pd.DataFrame({
	    'pred': pred
	})
	print(result)

	""" Output
	pred
	0 0
	1 0
	2 0
	3 0
	4 0
	... ...
	2477 0
	2478 0
	2479 1
	2480 1
	2481 0

	[2482 rows x 1 columns]
	"""

	result.to_csv("./outputs/result_q8.csv", index=False)
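
Step (8) above passes hard 0/1 labels from `predict` to `roc_auc_score`. ROC-AUC is a ranking metric, so it is normally computed from scores such as the positive-class probability returned by `predict_proba`. The sketch below compares the two on synthetic data from `make_classification`. The data, the `random_state` values, and the variable names are assumptions for illustration and are not part of the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the customer data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of model.classes_[1] (here, label 1)
proba = model.predict_proba(X_val)[:, 1]

auc_from_proba = roc_auc_score(y_val, proba)                  # uses the full ranking
auc_from_labels = roc_auc_score(y_val, model.predict(X_val))  # only two distinct scores

print(auc_from_proba, auc_from_labels)
```

If the task asks for a predicted probability, the same `predict_proba(...)[:, 1]` column would also be the value to write to the submission file, rather than the 0/1 labels.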