Data Preprocessing and Model Training for Heart Disease Prediction

Aug 7, 2024, 9:36 PM

Create a classification model and report AUC, precision, recall, and F1 using all appropriate algorithms available in Python. Optimize the hyperparameters for the F1 score. Also include multiple hidden layers for the neural network.

The dataset has been successfully loaded. Here are the first few rows of the dataset:

TenYearCHD	male	age	education	currentSmoker	cigsPerDay	prevalentHyp	totChol	sysBP	diaBP	BMI	heartRate	glucose
0	1	39	4	0	0	0	195	106	70	26.97	80	77
0	0	46	2	0	0	0	250	121	81	28.73	95	76
0	1	48	1	1	20	0	245	127.5	80	25.34	75	70
1	0	61	3	1	30	1	225	150	95	28.58	65	103
0	0	46	3	1	23	0	285	130	84	23.10	85	85

Next, we will proceed with the following steps:

Data Preprocessing: Handle any missing values, encode categorical variables, and scale the features.
Model Training: Train multiple classification models including logistic regression, decision trees, random forests, support vector machines, and neural networks.
Hyperparameter Tuning: Optimize the hyperparameters for the F1 score using techniques such as GridSearchCV or RandomizedSearchCV.
Model Evaluation: Evaluate the models using metrics such as AUC, precision, recall, and F1 score.

Let's start with data preprocessing.

The data preprocessing step has been completed. The dataset has been split into training and testing sets with the following shapes:

Training set: 3392 samples, 15 features
Testing set: 848 samples, 15 features
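
As a rough illustration of the preprocessing step, the sketch below imputes missing values, scales the features, and makes an 80/20 split, which matches the 3392 / 848 sizes reported above. The original code was not shown, so the synthetic stand-in data, the four-column subset, median imputation, the stratified split, and the random seeds are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (hypothetical values, 4240 rows).
rng = np.random.default_rng(0)
n = 4240
df = pd.DataFrame({
    "age": rng.integers(32, 70, n).astype(float),
    "totChol": rng.normal(237, 44, n),
    "sysBP": rng.normal(132, 22, n),
    "glucose": rng.normal(82, 24, n),
})
df.loc[rng.choice(n, 200, replace=False), "glucose"] = np.nan  # simulate gaps
df["TenYearCHD"] = (rng.random(n) < 0.15).astype(int)

X = df.drop(columns="TenYearCHD")
y = df["TenYearCHD"]

# 20% held out for testing; stratifying on the target is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the imputer and scaler on the training set only to avoid leakage.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)

print(X_train_prep.shape, X_test_prep.shape)
```

Fitting the transformers on the training split and only applying them to the test split keeps test-set statistics out of the model.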

Next, we will proceed with training multiple classification models. We will include logistic regression, decision trees, random forests, support vector machines, and neural networks. We will also optimize the hyperparameters for the F1 score using GridSearchCV.

Let's start with training the models.

Proceed with training multiple classification models as stated

The models have been trained and the best hyperparameters have been found for each model:

Logistic Regression: {'C': 1}
Decision Tree: {'max_depth': None}
Random Forest: {'max_depth': 20, 'n_estimators': 50}
SVM: {'C': 100, 'kernel': 'rbf'}
Neural Network: {'alpha': 0.001, 'hidden_layer_sizes': (50, 50, 50)}
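
A minimal sketch of how such a search could be set up with GridSearchCV and `scoring="f1"` follows. The grids are hypothetical (each simply includes the best value reported above), the data is a small synthetic stand-in, and 3-fold cross-validation is an assumption; the actual search settings were not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Small synthetic, imbalanced stand-in for the preprocessed training data.
X_train, y_train = make_classification(
    n_samples=400, n_features=15, weights=[0.85, 0.15], random_state=0
)

# Hypothetical grids: (estimator, parameter grid) per model.
search_spaces = {
    "Logistic Regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1, 10]},
    ),
    "Decision Tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [None, 5, 10]},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=0),
        {"max_depth": [10, 20], "n_estimators": [50, 100]},
    ),
    "SVM": (
        SVC(random_state=0),
        {"C": [1, 100], "kernel": ["rbf"]},
    ),
    "Neural Network": (
        MLPClassifier(max_iter=500, random_state=0),
        # Tuples with several entries give multiple hidden layers.
        {"alpha": [0.0001, 0.001], "hidden_layer_sizes": [(50,), (50, 50, 50)]},
    ),
}

best_params = {}
for name, (estimator, grid) in search_spaces.items():
    # scoring="f1" selects the combination with the best positive-class F1.
    search = GridSearchCV(estimator, grid, scoring="f1", cv=3)
    search.fit(X_train, y_train)
    best_params[name] = search.best_params_
    print(f"{name}: {search.best_params_}")
```

On the synthetic data the selected values will generally differ from those listed above.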

Next, we will evaluate these models using metrics such as AUC, precision, recall, and F1 score.

The evaluation results for each model are as follows:

Model	AUC	Precision	Recall	F1 Score
Logistic Regression	0.710	0.533	0.065	0.116
Decision Tree	0.545	0.215	0.236	0.225
Random Forest	0.679	0.429	0.049	0.088
SVM	0.531	0.220	0.106	0.143
Neural Network	0.514	0.200	0.130	0.158

The Decision Tree model has the highest F1 score (0.225) among the models evaluated, although its AUC (0.545) is only slightly above chance.
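
For reference, the four metrics in the table can be computed along these lines. This is a sketch on synthetic data with two of the five models, not the code that produced the numbers above; the data, the seeds, and the default model settings are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the heart disease data.
X, y = make_classification(
    n_samples=1000, n_features=15, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard labels (0.5 threshold)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    rows.append({
        "Model": name,
        "AUC": roc_auc_score(y_test, y_score),
        "Precision": precision_score(y_test, y_pred, zero_division=0),
        "Recall": recall_score(y_test, y_pred, zero_division=0),
        "F1 Score": f1_score(y_test, y_pred, zero_division=0),
    })

results = pd.DataFrame(rows).set_index("Model").round(3)
print(results)
```

AUC is computed from predicted probabilities, while precision, recall, and F1 depend on the hard labels at the default threshold. That is why a model can rank cases reasonably well (higher AUC) and still flag very few positives (low recall).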

Would you like to proceed with further analysis or have any other questions?

Can you provide a detailed classification report for the Decision Tree model?

Here is the detailed classification report for the Decision Tree model:

              precision    recall  f1-score   support

           0       0.87      0.85      0.86       725
           1       0.21      0.24      0.22       123

    accuracy                           0.76       848
   macro avg       0.54      0.54      0.54       848
weighted avg       0.77      0.76      0.77       848

Precision for class 0: 0.87
Recall for class 0: 0.85
F1-score for class 0: 0.86
Precision for class 1: 0.21
Recall for class 1: 0.24
F1-score for class 1: 0.22
Overall accuracy: 0.76
Macro average F1-score: 0.54
Weighted average F1-score: 0.77

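The Decision Tree model has high precision and recall for class 0 but performs poorly for class 1 (precision 0.21, recall 0.24). The overall accuracy of 0.76 is below the 0.85 that always predicting class 0 would achieve (725 of 848 test samples), so accuracy alone is misleading here.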
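
To make the report easier to check, the sketch below rebuilds labels from a confusion matrix that is consistent with the rounded figures reported for the Decision Tree (precision 0.215 and recall 0.236 on 123 positives, with 725 negatives). The four counts are an inference from those figures, not output from the original run.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Counts reconstructed from the rounded metrics (an inference, not the
# original model output): 725 actual negatives, 123 actual positives.
tn, fp, fn, tp = 619, 106, 94, 29

# Build label arrays that realise exactly those counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=2))

# Same report as a dictionary, convenient for programmatic checks.
report = classification_report(y_true, y_pred, output_dict=True)
```

With these counts the model flags 135 cases as positive and only 29 of them are correct, while 94 of the 123 true positives are missed.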

Would you like to proceed with further analysis or have any other questions?

Can you compare the performance of all models using a bar chart?

The bar chart compares the performance of all models using the following metrics: AUC, Precision, Recall, and F1 Score.

Logistic Regression has the highest AUC score.
Decision Tree has the highest F1 score.
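Random Forest has the second-highest AUC (0.679) and Precision (0.429) but the lowest Recall (0.049).
SVM and Neural Network have the lowest AUC (0.531 and 0.514, close to chance), although their Recall and F1 scores are higher than those of Logistic Regression and Random Forest.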
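
A grouped bar chart of this kind could be produced roughly as follows, using the values from the evaluation table. The plotting choices (figure size, labels, title) and the output file name are assumptions, since the original plotting code was not shown.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the evaluation table.
results = pd.DataFrame(
    {
        "AUC": [0.710, 0.545, 0.679, 0.531, 0.514],
        "Precision": [0.533, 0.215, 0.429, 0.220, 0.200],
        "Recall": [0.065, 0.236, 0.049, 0.106, 0.130],
        "F1 Score": [0.116, 0.225, 0.088, 0.143, 0.158],
    },
    index=[
        "Logistic Regression",
        "Decision Tree",
        "Random Forest",
        "SVM",
        "Neural Network",
    ],
)

# One group of four bars per model, one colour per metric.
ax = results.plot(kind="bar", figsize=(10, 5), rot=20)
ax.set_ylabel("Score")
ax.set_ylim(0, 1)
ax.set_title("Model comparison on the test set")
ax.legend(title="Metric")
plt.tight_layout()
plt.savefig("model_comparison.png")  # hypothetical output file name
```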

Would you like to proceed with further analysis or have any other questions?
