Pima Indian Diabetes: Predicting Diabetes in Pima Indians

BASHA TECH

Basha 2022. 10. 13. 12:36

# Library imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Train/test splitting
from sklearn.model_selection import train_test_split


# Evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve

# Standardization
from sklearn.preprocessing import StandardScaler # rescales each feature to zero mean and unit variance

# Model to train
from sklearn.linear_model import LogisticRegression

# Load the data
dia_df = pd.read_csv('../data/diabetes.csv')
dia_df.head(3)

dia_df.info()

dia_df.describe()

dia_df.describe().T # transpose; the attribute is an uppercase T, lowercase t does not work

dia_df.isnull().sum()

# Check for class imbalance => this has to be checked on the label column
print(dia_df['Outcome'].value_counts())
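Raw counts are easier to judge as proportions. Here is a minimal sketch on a small made-up label column (`outcome` is a hypothetical stand-in for `dia_df['Outcome']`, not the real data):

```python
import pandas as pd

# Hypothetical labels standing in for dia_df['Outcome'] (0 = negative, 1 = positive)
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Outcome")

counts = outcome.value_counts()                # absolute count per class: 7 zeros, 3 ones
ratios = outcome.value_counts(normalize=True)  # fraction of rows per class: 0.7 and 0.3

print(counts)
print(ratios)
```

On the real dataset, the same `normalize=True` call on `dia_df['Outcome']` would show how skewed the labels are.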

# Extract the feature dataset X and the label dataset y
# The last column is Outcome => label
X = dia_df.iloc[:,:-1]
y = dia_df.iloc[:,-1]

# Split into train/test data => 8:2
X_train, X_test, y_train, y_test = train_test_split(
      X # features to split
    , y # labels to split
    , test_size=0.2 # fraction held out for testing
    , random_state=156
    , stratify=y # stratified split that preserves the label ratio
)
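To see what `stratify=y` does in the split above, here is a small sketch on made-up data (`X_toy` and `y_toy` are hypothetical, with 35% positive labels). The positive ratio should come out the same in both halves:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 35 positive and 65 negative
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 35 + [0] * 65)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=156, stratify=y_toy
)

# With stratification the class ratio is preserved in each split
print("overall positive ratio:", y_toy.mean())
print("train positive ratio  :", y_tr.mean())
print("test positive ratio   :", y_te.mean())
```

Without `stratify`, the test ratio would vary with the random seed, which matters more when one class is rare.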

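def get_clf_eval(y_test, pred=None, pred_proba=None):
    ''' The docstring is part of the function body, so it must be indented too
    # name
        get_clf_eval
    # description
        Print model evaluation metrics
    # parameters
    - y_test : labels (ground truth) of the test data
    - pred : predicted labels for the test data
    - pred_proba : predicted positive-class probabilities for the test data
    '''
    # Confusion matrix
    confusion = confusion_matrix(y_test, pred)
    # Accuracy
    acc = accuracy_score(y_test, pred)
    # Precision
    precision = precision_score(y_test, pred)
    # Recall
    recall = recall_score(y_test, pred)
    # F1 score
    f1 = f1_score(y_test, pred)
    # AUC score
    roc_auc = roc_auc_score(y_test, pred_proba) # note: this takes probabilities, not predicted labels
    # Print the confusion matrix
    print(confusion)

    # Print the metrics (precision uses index 1; the original reused index 0 and printed accuracy twice)
    print("Accuracy: {0:.4f}, Precision: {1:.4f}, "
          "Recall: {2:.4f}, F1: {3:.4f}, AUC: {4:.4f}".format(acc, precision, recall, f1, roc_auc))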
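The metrics printed by `get_clf_eval` all derive from the four cells of the confusion matrix. Here is a small worked sketch on made-up labels (`y_true` and `y_pred` are hypothetical) that recomputes precision, recall and F1 by hand and compares them with scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tn, fp, fn, tp)
print(np.isclose(precision, precision_score(y_true, y_pred)))
print(np.isclose(recall, recall_score(y_true, y_pred)))
print(np.isclose(f1, f1_score(y_true, y_pred)))
```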

# Training
# 1. Create the estimator object
lr_clf = LogisticRegression(solver='liblinear')

# 2. Fit => produces the trained model
lr_clf.fit(X_train, y_train)

# 3. Predicted labels => predict on the test data
pred = lr_clf.predict(X_test)

# 4. Predicted probabilities => on the test data (column 1 is the positive class)
pred_proba = lr_clf.predict_proba(X_test)[:,1]

get_clf_eval(y_test, pred, pred_proba)
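Steps 3 and 4 above are two views of the same model output: for a binary logistic regression, `predict` amounts to thresholding the class-1 probability at 0.5. Here is a sketch on synthetic data (`X_toy`, `y_toy` and `clf` are hypothetical stand-ins, not the diabetes data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data standing in for the diabetes features
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=156)

clf = LogisticRegression(solver='liblinear')
clf.fit(X_toy, y_toy)

pred_toy = clf.predict(X_toy)               # hard class labels (0 or 1)
proba_toy = clf.predict_proba(X_toy)[:, 1]  # probability of class 1

# Thresholding the class-1 probability at 0.5 reproduces predict()
manual_pred = (proba_toy > 0.5).astype(int)
print(np.array_equal(manual_pred, pred_toy))
```

This is why `roc_auc_score` needs `pred_proba` rather than `pred`: the hard labels have already discarded the ranking information.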

dia_df.columns

# Data preprocessing
# 1. Identify and handle invalid values (features recorded as 0)
zero_features = ['Glucose', 'BloodPressure'
                 , 'SkinThickness', 'Insulin','BMI']
# Total number of rows => used to compute the ratios
# dia_df.shape[0] => 768
total_count = dia_df['Glucose'].count()

# Share of zero values in each of the 5 columns
for feature in zero_features:
    zero_count = dia_df[dia_df[feature]==0][feature].count()
    print('{0}: {1} zero values, {2:.2f}% of rows'.format(
        feature, zero_count, zero_count/total_count * 100
    ))
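The loop above can also be written without iterating, because comparing a DataFrame with 0 gives a boolean frame whose column sums and means are the zero counts and zero fractions. Here is a sketch on a tiny made-up frame (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the diabetes table
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 0, 94],
})

is_zero = toy_df == 0             # boolean frame, True where the value is 0
zero_counts = is_zero.sum()       # True counts per column: Glucose 1, Insulin 3
zero_pct = is_zero.mean() * 100   # share of zeros per column: 25.0 and 75.0

print(zero_counts)
print(zero_pct)
```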

# Replace the 0 values in each column with the column mean
# 1. Compute the mean of each column.
mean_zero_features = dia_df[zero_features].mean()
mean_zero_features

# 2. Replace
dia_df[zero_features] = \
    dia_df[zero_features].replace(0, mean_zero_features)

for feature in zero_features:
    zero_count = dia_df[dia_df[feature]==0][feature].count()
    print('{0}: {1} zero values, {2:.2f}% of rows'.format(
        feature, zero_count, zero_count/total_count * 100
    ))
# All zeros have been replaced with the column mean.
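Note that `mean_zero_features` above is computed with the zeros still included, which pulls the mean down. One possible alternative, not what this post does, is to treat the zeros as missing first and take the mean of the remaining values. Here is a sketch on a made-up column (`toy_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column where 0 means "not measured"
toy_df = pd.DataFrame({"Insulin": [0.0, 0.0, 100.0, 200.0]})

mean_with_zeros = toy_df["Insulin"].mean()         # 75.0, zeros included

as_missing = toy_df["Insulin"].replace(0, np.nan)  # mark zeros as missing
mean_without_zeros = as_missing.mean()             # 150.0, NaN is skipped by default
imputed = as_missing.fillna(mean_without_zeros)    # fill the gaps with that mean

print(mean_with_zeros, mean_without_zeros)
print(imputed.tolist())
```

Whether this is better depends on the data. It is shown only to make the difference between the two means visible.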

dia_df.describe()

dia_df.describe()[zero_features] # fancy indexing

# Standardization
X = dia_df.iloc[:,:-1] # features
y = dia_df.iloc[:,-1] # labels

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into train/test data
X_train, X_test, y_train, y_test = train_test_split(
      X_scaled
    , y
    , random_state=156
    , test_size=0.2
    , stratify=y
    )

# Create the model => evaluate
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:,1]

get_clf_eval(y_test, pred, pred_proba)
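The cells above fit `StandardScaler` on the full dataset before splitting, so the test rows contribute to the scaler's mean and variance. A common alternative, offered here only as a sketch and not as the post's method, is to split first and wrap the scaler and model in a pipeline so the scaler is fitted on the training rows only. The data below is synthetic (`X_toy`, `y_toy` and `pipe` are hypothetical names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the diabetes features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=156)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=156, stratify=y_toy
)

# fit() learns the scaling statistics from X_tr only, then trains the model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# predict() and predict_proba() reuse the training statistics on X_te
pred_toy = pipe.predict(X_te)
proba_toy = pipe.predict_proba(X_te)[:, 1]

# The fitted scaler's mean matches the training-set mean, not the full-data mean
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
```

The resulting `pred_toy` and `proba_toy` could be passed to `get_clf_eval` the same way as `pred` and `pred_proba` above.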

References:

[깡초보를 위한] T2-2. Pima Indians Diabetes (Kaggle notebook): https://www.kaggle.com/code/ohseokkim/t2-2-pima-indians-diabetes

Diabetes (Three Ensemble Models) (Kaggle notebook): https://www.kaggle.com/code/ohseokkim/diabetes-three-ensemble-models/notebook
