Data Preprocessing and Model Training for Heart Disease Prediction

Aug 7, 2024, 9:36 PM

Uploaded file: Framingham_Kaggle_Imputed.csv

Create a classification model and report AUC, precision, recall, and F1 using all appropriate algorithms available in Python. Optimize the hyperparameters for the F1 score. Also include multiple hidden layers for the neural network.

The dataset has been successfully loaded. Here are the first few rows of the dataset:
| TenYearCHD | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 195 | 106 | 70 | 26.97 | 80 | 77 |
| 0 | 0 | 46 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 250 | 121 | 81 | 28.73 | 95 | 76 |
| 0 | 1 | 48 | 1 | 1 | 20 | 0 | 0 | 0 | 0 | 245 | 127.5 | 80 | 25.34 | 75 | 70 |
| 1 | 0 | 61 | 3 | 1 | 30 | 0 | 0 | 1 | 0 | 225 | 150 | 95 | 28.58 | 65 | 103 |
| 0 | 0 | 46 | 3 | 1 | 23 | 0 | 0 | 0 | 0 | 285 | 130 | 84 | 23.10 | 85 | 85 |
Next, we will proceed with the following steps:
  1. Data Preprocessing: Handle any missing values, encode categorical variables, and scale the features.
  2. Model Training: Train multiple classification models including logistic regression, decision trees, random forests, support vector machines, and neural networks.
  3. Hyperparameter Tuning: Optimize the hyperparameters for the F1 score using techniques such as GridSearchCV or RandomizedSearchCV.
  4. Model Evaluation: Evaluate the models using metrics such as AUC, precision, recall, and F1 score.
Let's start with data preprocessing.
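A minimal sketch of this step, assuming the uploaded file name and an 80/20 split; the `random_state` and the choice of `StandardScaler` here are assumptions, not confirmed settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the pre-imputed Framingham dataset
df = pd.read_csv("Framingham_Kaggle_Imputed.csv")

# Separate the features from the ten-year CHD target
X = df.drop(columns=["TenYearCHD"])
y = df["TenYearCHD"]

# 80/20 split -> 3392 training and 848 testing samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features; fit the scaler on the training set only to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```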
The data preprocessing step has been completed. The dataset has been split into training and testing sets with the following shapes:
  • Training set: 3392 samples, 15 features
  • Testing set: 848 samples, 15 features
Next, we will proceed with training multiple classification models. We will include logistic regression, decision trees, random forests, support vector machines, and neural networks. We will also optimize the hyperparameters for the F1 score using GridSearchCV.
Let's start with training the models.

Proceed with training multiple classification models as stated

The models have been trained and the best hyperparameters have been found for each model:
  • Logistic Regression: {'C': 1}
  • Decision Tree: {'max_depth': None}
  • Random Forest: {'max_depth': 20, 'n_estimators': 50}
  • SVM: {'C': 100, 'kernel': 'rbf'}
  • Neural Network: {'alpha': 0.001, 'hidden_layer_sizes': (50, 50, 50)}
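A sketch of a grid search consistent with the results above; the exact parameter grids and cross-validation settings are assumptions. The neural network grid includes the multi-hidden-layer configurations requested, and `SVC` is created with `probability=True` so that AUC can be computed later:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# Candidate models and illustrative hyperparameter grids
models = {
    "Logistic Regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.01, 0.1, 1, 10, 100]}),
    "Decision Tree": (DecisionTreeClassifier(random_state=42),
                      {"max_depth": [None, 5, 10, 20]}),
    "Random Forest": (RandomForestClassifier(random_state=42),
                      {"n_estimators": [50, 100, 200],
                       "max_depth": [None, 10, 20]}),
    "SVM": (SVC(probability=True),
            {"C": [1, 10, 100], "kernel": ["linear", "rbf"]}),
    "Neural Network": (MLPClassifier(max_iter=1000, random_state=42),
                       {"hidden_layer_sizes": [(50,), (50, 50), (50, 50, 50)],
                        "alpha": [0.0001, 0.001, 0.01]}),
}

# Optimize each model for F1, as requested
best_models = {}
for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, scoring="f1", cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
    print(name, search.best_params_)
```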
Next, we will evaluate these models using metrics such as AUC, precision, recall, and F1 score.
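A sketch of the evaluation loop, assuming the `best_models` dict from the previous sketch:

```python
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

results = {}
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    # AUC is computed from predicted probabilities of the positive class
    y_proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "AUC": roc_auc_score(y_test, y_proba),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
    }
```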
The evaluation results for each model are as follows:
| Model | AUC | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 0.710 | 0.533 | 0.065 | 0.116 |
| Decision Tree | 0.545 | 0.215 | 0.236 | 0.225 |
| Random Forest | 0.679 | 0.429 | 0.049 | 0.088 |
| SVM | 0.531 | 0.220 | 0.106 | 0.143 |
| Neural Network | 0.514 | 0.200 | 0.130 | 0.158 |
The Decision Tree model has the highest F1 score (0.225) among the models evaluated, though all F1 scores are low, reflecting how few of the positive CHD cases each model correctly identifies.
Would you like to proceed with further analysis or have any other questions?

Can you provide a detailed classification report for the Decision Tree model?

Here is the detailed classification report for the Decision Tree model:
              precision    recall  f1-score   support

           0       0.87      0.85      0.86       725
           1       0.21      0.24      0.22       123

    accuracy                           0.76       848
   macro avg       0.54      0.54      0.54       848
weighted avg       0.77      0.76      0.77       848
  • Precision for class 0: 0.87
  • Recall for class 0: 0.85
  • F1-score for class 0: 0.86
  • Precision for class 1: 0.21
  • Recall for class 1: 0.24
  • F1-score for class 1: 0.22
  • Overall accuracy: 0.76
  • Macro average F1-score: 0.54
  • Weighted average F1-score: 0.77
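For reference, a report like the one above comes directly from scikit-learn, assuming the hypothetical `best_models` dict from the earlier sketch:

```python
from sklearn.metrics import classification_report

# Per-class breakdown for the tuned decision tree
y_pred = best_models["Decision Tree"].predict(X_test)
print(classification_report(y_test, y_pred))
```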
The Decision Tree model has high precision and recall for class 0 but performs poorly on the minority class 1. The overall accuracy of 0.76 largely reflects performance on the majority class.
Would you like to proceed with further analysis or have any other questions?

Can you compare the performance of all models using a bar chart?

The bar chart compares the performance of all models on four metrics: AUC, precision, recall, and F1 score (a sketch to reproduce it follows the summary below).
  • Logistic Regression has the highest AUC score.
  • Decision Tree has the highest F1 score.
  • Random Forest has a relatively high AUC and Precision but low Recall.
  • SVM and Neural Network have lower scores across all metrics compared to the other models.
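A minimal matplotlib sketch that reproduces such a grouped bar chart, assuming the `results` dict from the evaluation sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

metrics = ["AUC", "Precision", "Recall", "F1 Score"]
model_names = list(results.keys())
x = np.arange(len(model_names))
width = 0.2  # four bars per model group

fig, ax = plt.subplots(figsize=(10, 5))
for i, metric in enumerate(metrics):
    scores = [results[name][metric] for name in model_names]
    ax.bar(x + i * width, scores, width, label=metric)

# Center the model labels under each group of four bars
ax.set_xticks(x + 1.5 * width)
ax.set_xticklabels(model_names, rotation=15)
ax.set_ylabel("Score")
ax.set_title("Model Comparison: AUC, Precision, Recall, F1")
ax.legend()
plt.tight_layout()
plt.show()
```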
Would you like to proceed with further analysis or have any other questions?