Exploring Life Expectancy and Health Indicators Dataset

Jul 31, 2024, 8:40 PM

Life Expectancy WHO Kaggle.csv

Summarize this dataset

The dataset contains 2938 rows and 22 columns. Here is a summary of the dataset:
  • Country: The name of the country.
  • Year: The year of the observation.
  • Status: The development status of the country (Developing/Developed).
  • Life expectancy: The life expectancy at birth in years.
  • Adult Mortality: The adult mortality rate per 1000 population.
  • infant deaths: The number of infant deaths per 1000 population.
  • Alcohol: The alcohol consumption per capita (in litres of pure alcohol).
  • percentage expenditure: The percentage of GDP spent on health.
  • Hepatitis B: The percentage of people vaccinated against Hepatitis B.
  • Measles: The number of reported cases of measles per 1000 population.
  • BMI: The average body mass index of the population.
  • under-five deaths: The number of deaths of children under five years old per 1000 population.
  • Polio: The percentage of people vaccinated against Polio.
  • Total expenditure: The total expenditure on health as a percentage of GDP.
  • Diphtheria: The percentage of people vaccinated against Diphtheria.
  • HIV/AIDS: The number of deaths due to HIV/AIDS per 1000 population.
  • GDP: The Gross Domestic Product per capita (in USD).
  • Population: The population of the country.
  • thinness 1-19 years: The prevalence of thinness among children and adolescents aged 10-19 years (the column name reads "1-19", but the Kaggle description gives ages 10 to 19).
  • thinness 5-9 years: The prevalence of thinness among children aged 5-9 years.
  • Income composition of resources: The Human Development Index in terms of income composition.
  • Schooling: The number of years of schooling.
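A minimal sketch of how such a summary might be produced with pandas. The three-row CSV excerpt below is hypothetical and only stands in for the uploaded file; the header names are taken from the column list above.

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for "Life Expectancy WHO Kaggle.csv";
# the values are made up for illustration only.
sample_csv = io.StringIO(
    "Country,Year,Status,Life expectancy,Alcohol\n"
    "A,2014,Developing,60.1,0.5\n"
    "A,2015,Developing,60.9,\n"
    "B,2015,Developed,81.2,9.8\n"
)

df = pd.read_csv(sample_csv)

# Headers may carry stray spaces, so normalise them before using them.
df.columns = df.columns.str.strip()

n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
print(df.dtypes)
```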
Would you like to explore any specific aspect of this dataset further?

Perform exploratory data analysis (EDA)

Exploratory Data Analysis (EDA)

Missing Values

The dataset contains missing values in several columns. Here is a summary of the missing values:
Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
BMI                                 34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
HIV/AIDS                             0
GDP                                448
Population                         652
thinness  1-19 years                34
thinness 5-9 years                  34
Income composition of resources    167
Schooling                          163
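
A per-column count like the one above can be obtained with `isnull().sum()`. The sketch below uses a small hypothetical frame rather than the real file and adds the share of missing rows, which is often easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Life expectancy": [60.1, np.nan, 81.2, 80.9],
    "GDP": [np.nan, np.nan, 41000.0, 42000.0],
})

missing_counts = df.isnull().sum()
missing_share = (missing_counts / len(df) * 100).round(1)

report = pd.DataFrame({"missing": missing_counts, "percent": missing_share})
print(report)
```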

Summary Statistics

               Year  Life expectancy   Adult Mortality  infant deaths  \
 count  2938.000000       2928.000000      2928.000000    2938.000000   
 mean   2007.518720         69.224932       164.796448      30.303948   
 std       4.613841          9.523867       124.292079     117.926501   
 min    2000.000000         36.300000         1.000000       0.000000   
 25%    2004.000000         63.100000        74.000000       0.000000   
 50%    2008.000000         72.100000       144.000000       3.000000   
 75%    2012.000000         75.700000       228.000000      22.000000   
 max    2015.000000         89.000000       723.000000    1800.000000   

           Alcohol  percentage expenditure  Hepatitis B       Measles   \
 count  2744.000000             2938.000000  2385.000000    2938.000000   
 mean      4.602861              738.251295    80.940461    2419.592240   
 std       4.052413             1987.914858    25.070016   11467.272489   
 min       0.010000                0.000000     1.000000       1.000000   
 25%       0.785000               16.947229    78.000000       3.000000   
 50%       3.755000               97.717030    93.000000      17.000000   
 75%       7.702500              474.612700    97.000000     298.000000   
 max      17.870000            19479.911610   100.000000  212183.000000   

                BMI   under-five deaths        Polio  Total expenditure  \
 count  2904.000000         2938.000000  2919.000000       2712.000000   
 mean     38.321247           42.663392    82.545842          6.369594   
 std       6.095593          112.276053    23.276029          2.529872   
 min      13.400000            0.000000     3.000000          0.370000   
 25%      34.500000            1.000000    80.000000          4.610000   
 50%      38.300000            8.000000    93.000000          5.755000   
 75%      42.600000           29.000000    96.000000          7.602500   
 max      87.300000         3000.000000   100.000000         17.600000   

        Diphtheria    HIV/AIDS           GDP    Population  \
 count  2919.000000  2938.000000  2.490000e+03  2.286000e+03   
 mean     82.439647     1.742463  7.748353e+03  1.275299e+07   
 std      23.279928     6.863100  1.194921e+04  4.487122e+07   
 min       3.000000     0.100000  2.499000e+00  3.400000e+01   
 25%      80.000000     0.100000  4.998775e+02  1.993000e+05   
 50%      93.000000     0.100000  2.179677e+03  1.419920e+06   
 75%      97.000000     0.800000  1.032532e+04  7.023596e+06   
 max     100.000000   150.000000  1.191727e+05  1.293859e+09   

        thinness  1-19 years   thinness 5-9 years  \
 count            2904.000000          2904.000000   
 mean                4.839704             4.870317   
 std                 4.420195             4.508882   
 min                 0.100000             0.100000   
 25%                 1.600000             1.500000   
 50%                 3.300000             3.300000   
 75%                 7.200000             7.200000   
 max                27.700000            28.600000   

       Income composition of resources    Schooling  
 count                      2771.000000  2775.000000  
 mean                          0.627551    11.992793  
 std                           0.210904     3.358920  
 min                           0.000000     0.000000  
 25%                           0.493000    10.100000  
 50%                           0.677000    12.300000  
 75%                           0.779000    14.300000  
 max                           0.948000    20.700000  

Visualizations

  1. Distribution of Life Expectancy
  2. Life Expectancy by Status
  3. Correlation Heatmap
  4. GDP vs Life Expectancy
  5. Alcohol Consumption vs Life Expectancy
Would you like to dive deeper into any specific analysis or visualization?

Report any missing data for each column

The dataset contains missing values in several columns. Here is a summary of the missing values for each column:
Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
BMI                                 34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
HIV/AIDS                             0
GDP                                448
Population                         652
thinness  1-19 years                34
thinness 5-9 years                  34
Income composition of resources    167
Schooling                          163
Would you like to handle these missing values in any specific way?

Impute the missing numerical data with the median and missing categorical data with the mode

All missing values in the dataset have been successfully imputed. Numerical data was imputed with the median, and categorical data was imputed with the mode. There are no missing values left in the dataset.
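
One way the median/mode imputation could be written is sketched below. The helper name `impute_median_mode` and the toy data are hypothetical; the transcript does not show the code that was actually run.

```python
import numpy as np
import pandas as pd

def impute_median_mode(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps set to the column median
    and non-numeric gaps set to the column mode."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores missing values; take the first mode on ties
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Hypothetical toy data
df = pd.DataFrame({
    "Status": ["Developing", None, "Developing", "Developed"],
    "Alcohol": [1.0, 3.0, np.nan, 8.0],
})
imputed = impute_median_mode(df)
print(imputed)
```

Note that the median here is computed over all rows. If the data is later split for modelling, fitting the imputation on the training rows only avoids leaking information from the test rows.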

Can you visualize the distribution of life expectancy after imputation?

Can you analyze the correlation between life expectancy and other variables?

Correlation Analysis with Life Expectancy

Here are the correlations between Life Expectancy and other variables:
  • Schooling: 0.713
  • Income composition of resources: 0.689
  • BMI: 0.557
  • Diphtheria: 0.472
  • Polio: 0.458
  • GDP: 0.430
  • Alcohol: 0.389
  • percentage expenditure: 0.381
  • Total expenditure: 0.209
  • Year: 0.171
  • Hepatitis B: 0.170
  • Population: -0.029
  • Measles: -0.158
  • infant deaths: -0.197
  • under-five deaths: -0.223
  • thinness 5-9 years: -0.462
  • thinness 1-19 years: -0.468
  • HIV/AIDS: -0.557
  • Adult Mortality: -0.696
Key Insights:
  • Positive Correlations: Higher life expectancy is positively correlated with higher schooling, better income composition, higher BMI, better vaccination rates (Diphtheria, Polio), and higher GDP.
  • Negative Correlations: Higher life expectancy is negatively correlated with higher adult mortality, higher HIV/AIDS prevalence, higher thinness rates, and higher infant and under-five deaths.
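
A ranked list like this can be sketched as follows, assuming Pearson correlation (the pandas default; the transcript does not say which method was used). The data is a hypothetical five-row frame. Non-numeric columns such as Country are dropped first, since they cannot enter a correlation matrix.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Life expectancy": [55.0, 62.0, 70.0, 76.0, 82.0],
    "Schooling": [6.0, 9.0, 11.0, 14.0, 16.0],
    "Adult Mortality": [400.0, 300.0, 180.0, 120.0, 60.0],
})

target = "Life expectancy"
correlations = (
    df.select_dtypes(include="number")  # keep numeric columns only
      .corr()[target]                   # Pearson correlation with the target
      .drop(target)                     # drop the trivial self-correlation
      .sort_values(ascending=False)
)
print(correlations.round(3))
```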
Would you like to visualize any specific correlations or dive deeper into any particular analysis?

Can you visualize the correlation between life expectancy and schooling?

Can you create a regression model to predict life expectancy and report the RMSE and R squared using linear regression and random forest?

Regression Model Performance

Linear Regression

  • RMSE: 3.92
  • R-squared: 0.82

Random Forest

  • RMSE: 1.66
  • R-squared: 0.97
The Random Forest model outperforms the Linear Regression model with a lower RMSE and a higher R-squared value, indicating better predictive accuracy and fit.
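
A minimal sketch of this comparison with scikit-learn. The synthetic data, the 80/20 split, and the hyperparameters are all assumptions; the transcript does not state how the models were trained or evaluated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with one linear and one non-linear effect.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
y = 50 + 20 * X[:, 0] + 10 * np.sin(6 * X[:, 1]) + rng.normal(0, 1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))  # root of the MSE
    r2 = float(r2_score(y_test, pred))
    scores[name] = {"rmse": rmse, "r2": r2}
    print(f"{name}: RMSE={rmse:.2f}, R-squared={r2:.2f}")
```

Both metrics are computed on held-out rows here. Scores computed on the training rows would flatter the Random Forest in particular.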
Would you like to explore further or perform any additional analysis?

Can you visualize the feature importance from the Random Forest model?

Can you provide a detailed analysis of the most important features?

Detailed Analysis of the Most Important Features

Based on the Random Forest model, the most important features influencing life expectancy are as follows:
  1. Adult Mortality
  2. Income composition of resources
  3. HIV/AIDS
  4. Schooling
  5. BMI
Let's delve into each of these features to understand their impact on life expectancy.

1. Adult Mortality

  • Description: Adult mortality rate is the probability of dying between the ages of 15 and 60 years per 1000 population.
  • Impact: Higher adult mortality rates are strongly negatively correlated with life expectancy. This is intuitive as higher mortality rates indicate poorer health outcomes and lower life expectancy.

2. Income Composition of Resources

  • Description: An index between 0 and 1 that expresses the Human Development Index in terms of income composition of resources.
  • Impact: Higher income composition is positively correlated with life expectancy. Higher income allows better access to healthcare, nutrition, and living conditions, which contribute to longer life expectancy.

3. HIV/AIDS

  • Description: Deaths due to HIV/AIDS per 1000 population, as described in the dataset summary.
  • Impact: Higher HIV/AIDS death rates are negatively correlated with life expectancy, since these deaths directly raise mortality.

4. Schooling

  • Description: Average number of years of schooling received by people aged 25 and older.
  • Impact: Higher levels of schooling are positively correlated with life expectancy. Education improves health literacy, leading to better health behaviors and access to healthcare services.

5. BMI

  • Description: Body Mass Index (BMI) is a measure of body fat based on height and weight.
  • Impact: Higher BMI values within a healthy range are positively correlated with life expectancy. However, extremely high or low BMI values can have adverse health effects.
These features highlight the multifaceted nature of factors influencing life expectancy, encompassing health, education, and socioeconomic status.
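
A ranking of this kind is typically read from the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data in which Adult Mortality is built to dominate. The column `noise` is a hypothetical filler feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; coefficients are chosen for illustration only.
rng = np.random.default_rng(1)
features = pd.DataFrame({
    "Adult Mortality": rng.uniform(50, 500, 300),
    "Schooling": rng.uniform(4, 18, 300),
    "noise": rng.normal(0, 1, 300),
})
target = (
    80
    - 0.05 * features["Adult Mortality"]
    + 0.2 * features["Schooling"]
    + rng.normal(0, 0.5, 300)
)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(features, target)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances describe how the forest used each feature, not causal influence. When features are strongly correlated, the importance can be split between them or concentrated on one of them.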
Would you like to explore any specific feature further or perform additional analysis?

Can you compare the predicted versus actual life expectancy results

Comparison of Predicted vs Actual Life Expectancy

Linear Regression

The scatter plot below shows the comparison between the actual and predicted life expectancy values using the Linear Regression model. The trendline indicates the fit of the model.
![Linear Regression Comparison](Graph / Image Displayed to User)

Random Forest

The scatter plot below shows the comparison between the actual and predicted life expectancy values using the Random Forest model. The trendline indicates the fit of the model.
![Random Forest Comparison](Graph / Image Displayed to User)
The Random Forest model shows a closer alignment to the actual values compared to the Linear Regression model, indicating better predictive performance.
Would you like to explore any other aspects or perform further analysis?

Can you calculate a Brier score for linear regression and random forest?

Brier Score Calculation

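The Brier score measures the accuracy of probabilistic predictions for categorical outcomes, most commonly in binary classification. It compares predicted probabilities with observed 0/1 outcomes. Since our target variable, life expectancy, is continuous and the models output point predictions rather than probabilities, the Brier score is not applicable in this context.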
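
For reference, a small sketch of what the Brier score needs as input: a predicted probability and an observed 0/1 outcome per case. The numbers are made up and unrelated to the dataset.

```python
import numpy as np

def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    p = np.asarray(probabilities, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - o) ** 2))

# Four hypothetical forecasts and what actually happened
score = brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0])
print(score)
```

The regression models above produce neither of these inputs, which is why RMSE and R-squared are the relevant metrics here.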
Would you like to explore another metric or perform a different analysis?

Can you perform a residual analysis for the regression models?

Residual Analysis for Regression Models

Linear Regression

The scatter plot below shows the residuals (difference between actual and predicted values) for the Linear Regression model. The trendline indicates the fit of the model.
![Linear Regression Residuals](Graph / Image Displayed to User)

Random Forest

The scatter plot below shows the residuals for the Random Forest model. The trendline indicates the fit of the model.
![Random Forest Residuals](Graph / Image Displayed to User)
Residual analysis helps in understanding the errors made by the models. Ideally, residuals should be randomly distributed around zero, indicating no systematic errors.
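
A minimal sketch of the residual calculation, using the convention stated above (actual minus predicted). The helper name `residual_summary` and the four values are hypothetical.

```python
import numpy as np

def residual_summary(actual, predicted):
    """Return the residuals (actual - predicted) and a few summary numbers."""
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    summary = {
        "mean": float(residuals.mean()),            # near 0 if there is no systematic bias
        "std": float(residuals.std()),              # typical size of an error
        "max_abs": float(np.abs(residuals).max()),  # worst single miss
    }
    return residuals, summary

residuals, summary = residual_summary(
    actual=[70.0, 65.0, 80.0, 75.0],
    predicted=[68.0, 66.0, 79.0, 77.0],
)
print(residuals, summary)
```

A mean far from zero, or residuals that grow or change sign along the predicted values, would point to a systematic error rather than random noise.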
Would you like to explore any other aspects or perform further analysis?

Can you perform a time series analysis on life expectancy data?

Time Series Analysis of Life Expectancy

The line plot below shows the average life expectancy over time, based on the available data.
![Life Expectancy Over Time](Graph / Image Displayed to User)
This visualization helps in understanding the trend of life expectancy over the years.
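
The yearly averages behind such a plot can be sketched with a `groupby` on Year. The toy data is hypothetical, and a simple unweighted mean across countries is an assumption about how the average was formed.

```python
import pandas as pd

# Hypothetical toy data: two countries over three years
df = pd.DataFrame({
    "Country": ["A", "B", "A", "B", "A", "B"],
    "Year": [2013, 2013, 2014, 2014, 2015, 2015],
    "Life expectancy": [60.0, 80.0, 61.0, 80.5, 62.0, 81.0],
})

# Unweighted mean across countries for each year
yearly_mean = df.groupby("Year")["Life expectancy"].mean()

# Year-over-year change in that mean
yearly_change = yearly_mean.diff()
print(yearly_mean)
print(yearly_change)
```

An unweighted mean gives every country the same weight regardless of population, so it describes the average country rather than the average person.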
Would you like to perform any specific analysis on this time series data or explore other aspects?

Can you perform unsupervised learning using algorithms such as k-means clustering

It seems that the file Life Expectancy WHO Kaggle.csv is not found in the current directory. Could you please ensure the file is uploaded or provide the correct file path?

Can you perform unsupervised learning using k-means clustering?

K-Means Clustering of Countries Based on Life Expectancy and GDP

The scatter plot below shows the results of k-means clustering on the dataset, with clusters differentiated by color. The x-axis represents the life expectancy, and the y-axis represents the GDP of the countries.
![K-Means Clustering](Graph / Image Displayed to User)
This visualization helps in understanding how countries are grouped based on their life expectancy and GDP.
Would you like to perform any further analysis or explore other aspects of the data?

In the image you created using K-means clustering, k = 3. Can you re-run K-means clustering with K set to 2?

K-Means Clustering of Countries Based on Life Expectancy and GDP (k=2)

The scatter plot below shows the results of k-means clustering on the dataset with k set to 2, with clusters differentiated by color. The x-axis represents the life expectancy, and the y-axis represents the GDP of the countries.
![K-Means Clustering](Graph / Image Displayed to User)
This visualization helps in understanding how countries are grouped based on their life expectancy and GDP when using 2 clusters.
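
A minimal sketch of k-means on two features with scikit-learn, using synthetic points in place of the real life expectancy and GDP columns. Standardising the features first is an assumption; the transcript does not say whether the data was scaled. Without scaling, GDP would dominate the distance calculation because its values are orders of magnitude larger.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: columns are (life expectancy, GDP per capita)
rng = np.random.default_rng(0)
low = np.column_stack([rng.normal(60, 2, 50), rng.normal(1200, 200, 50)])
high = np.column_stack([rng.normal(78, 2, 50), rng.normal(30000, 3000, 50)])
points = np.vstack([low, high])

# Put both features on a comparable scale before measuring distances
scaled = StandardScaler().fit_transform(points)

# n_clusters is the K discussed above (3 in the first run, 2 in the second)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
print(np.bincount(labels))
```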
Would you like to perform any further analysis or explore other aspects of the data?

Can you calculate the silhouette score to determine the best K value?

Silhouette Scores for Different Values of K

The plot below shows the silhouette scores for different values of K (number of clusters). The silhouette score measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, and a higher score indicates better-defined clusters.
![Silhouette Scores](Graph / Image Displayed to User)
Based on the plot, you can determine the best K value by looking for the highest silhouette score.
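
The selection step can be sketched as a loop over candidate K values, keeping the one with the highest score. The synthetic two-blob data and the range 2 to 6 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: two well-separated blobs in two dimensions
rng = np.random.default_rng(0)
centers = [(-5.0, -5.0), (5.0, 5.0)]
points = np.vstack([rng.normal(c, 1.0, size=(60, 2)) for c in centers])

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = float(silhouette_score(points, labels))

best_k = max(scores, key=scores.get)
print(scores, best_k)
```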
Would you like to proceed with clustering using the best K value, or perform any other analysis?

Can you provide more insights into the clusters formed?

Insights into the Clusters Formed

Based on the silhouette score, the best K value was determined to be 2. Here are the insights into the clusters formed:

Cluster 0:

  • Life Expectancy: 74.58 years
  • Adult Mortality: 116.88
  • Infant Deaths: 6.55
  • Alcohol Consumption: 6.09
  • Percentage Expenditure: 1098.24
  • Hepatitis B Immunization: 85.63%
  • Measles Cases: 697.98
  • BMI: 48.57
  • Under-Five Deaths: 7.66
  • Polio Immunization: 90.73%
  • Total Expenditure: 6.36
  • Diphtheria Immunization: 91.26%
  • HIV/AIDS: 0.22
  • GDP: 8532.52
  • Population: 9.33 million
  • Thinness 1-19 years: 2.61
  • Thinness 5-9 years: 2.60
  • Income Composition of Resources: 0.74
  • Schooling: 13.73 years

Cluster 1:

  • Life Expectancy: 61.53 years
  • Adult Mortality: 243.80
  • Infant Deaths: 70.83
  • Alcohol Consumption: 2.24
  • Percentage Expenditure: 111.14
  • Hepatitis B Immunization: 69.78%
  • Measles Cases: 4471.93
  • BMI: 22.76
  • Under-Five Deaths: 98.05
  • Polio Immunization: 73.02%
  • Total Expenditure: 5.36
  • Diphtheria Immunization: 73.70%
  • HIV/AIDS: 4.58
  • GDP: 1198.58
  • Population: 22.49 million
  • Thinness 1-19 years: 8.16
  • Thinness 5-9 years: 8.31
  • Income Composition of Resources: 0.47
  • Schooling: 9.75 years
These insights provide a clear differentiation between the two clusters. Cluster 0 generally represents countries with higher life expectancy, better immunization rates, higher GDP, and better schooling. Cluster 1 represents countries with lower life expectancy, higher mortality rates, and lower GDP.
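
Per-cluster averages like these can be sketched with a `groupby` on the cluster label. The frame below is hypothetical, and `cluster` is an assumed name for the column holding the k-means labels.

```python
import pandas as pd

# Hypothetical toy data with an assumed "cluster" label column
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "Life expectancy": [74.0, 76.0, 60.0, 62.0],
    "Schooling": [13.0, 14.0, 9.0, 10.0],
})

# Mean of every numeric column within each cluster
profile = df.groupby("cluster").mean(numeric_only=True)

# Difference between the two cluster profiles, column by column
gap = profile.loc[0] - profile.loc[1]
print(profile)
print(gap)
```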
Would you like to perform any further analysis or explore other aspects of the data?

Can you provide a summary of the key differences between the clusters?

Summary of Key Differences Between Clusters

Cluster 0 (Higher Life Expectancy, Better Health Indicators):

  • Life Expectancy: 74.58 years
  • Adult Mortality: 116.88
  • Infant Deaths: 6.55
  • Alcohol Consumption: 6.09
  • Percentage Expenditure: 1098.24
  • Hepatitis B Immunization: 85.63%
  • Measles Cases: 697.98
  • BMI: 48.57
  • Under-Five Deaths: 7.66
  • Polio Immunization: 90.73%
  • Total Expenditure: 6.36
  • Diphtheria Immunization: 91.26%
  • HIV/AIDS: 0.22
  • GDP: 8532.52
  • Population: 9.33 million
  • Thinness 1-19 years: 2.61
  • Thinness 5-9 years: 2.60
  • Income Composition of Resources: 0.74
  • Schooling: 13.73 years

Cluster 1 (Lower Life Expectancy, Poorer Health Indicators):

  • Life Expectancy: 61.53 years
  • Adult Mortality: 243.80
  • Infant Deaths: 70.83
  • Alcohol Consumption: 2.24
  • Percentage Expenditure: 111.14
  • Hepatitis B Immunization: 69.78%
  • Measles Cases: 4471.93
  • BMI: 22.76
  • Under-Five Deaths: 98.05
  • Polio Immunization: 73.02%
  • Total Expenditure: 5.36
  • Diphtheria Immunization: 73.70%
  • HIV/AIDS: 4.58
  • GDP: 1198.58
  • Population: 22.49 million
  • Thinness 1-19 years: 8.16
  • Thinness 5-9 years: 8.31
  • Income Composition of Resources: 0.47
  • Schooling: 9.75 years

Key Differences:

  • Life Expectancy: Cluster 0 has a significantly higher life expectancy.
  • Mortality Rates: Cluster 0 has lower adult and infant mortality rates.
  • Health Expenditure: Cluster 0 has higher percentage expenditure on health.
  • Immunization Rates: Cluster 0 has higher immunization rates for Hepatitis B, Polio, and Diphtheria.
  • Disease Burden: Cluster 1 has more measles cases and a higher HIV/AIDS value.
  • GDP: Cluster 0 has a significantly higher GDP.
  • Nutrition: Cluster 0 has lower thinness rates among children.
  • Education and Income: Cluster 0 has higher income composition of resources and more years of schooling.
These differences highlight the disparities in health, economic, and educational indicators between the two clusters.
Would you like to perform any further analysis or explore other aspects of the data?

Can you analyze the correlation between GDP and life expectancy?

Correlation Between GDP and Life Expectancy

The correlation coefficient between GDP and life expectancy is 0.44.
This indicates a moderate positive correlation, suggesting that as GDP increases, life expectancy tends to increase as well. However, this is not a very strong correlation, implying that there are other significant factors influencing life expectancy.
Would you like to visualize this correlation or perform any further analysis?

What country has the highest life expectancy

Country with the Highest Life Expectancy

The country with the highest life expectancy is Belgium with a life expectancy of 89.0 years.
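
Because the dataset has one row per country and year, "highest life expectancy" can mean the highest single row or the highest country average. The sketch below, on hypothetical data, shows that the two readings can disagree. The transcript does not say which one was used.

```python
import pandas as pd

# Hypothetical toy data: country A has one very high year
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2014, 2015, 2014, 2015],
    "Life expectancy": [89.0, 70.0, 82.0, 83.0],
})

# Reading 1: the single row with the highest value
top_row = df.loc[df["Life expectancy"].idxmax()]

# Reading 2: the country with the highest mean over all its years
country_means = (
    df.groupby("Country")["Life expectancy"].mean().sort_values(ascending=False)
)
print(top_row["Country"], country_means.index[0])
```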
Would you like to explore more about this country or perform any other analysis?

This dataset suggests that there is a positive correlation between alcohol intake and life expectancy. Can you demonstrate an upper limit cutoff where alcohol intake is associated with a reduced life expectancy?

Analysis of Alcohol Intake and Life Expectancy

Based on the initial threshold of 10 liters per capita for alcohol intake, the mean life expectancy for each category is as follows:
  • High Alcohol Intake (> 10 liters per capita): 76.39 years
  • Low Alcohol Intake (≤ 10 liters per capita): 68.19 years
This suggests that, on average, higher alcohol intake is associated with a higher life expectancy. However, this threshold is arbitrary and may not accurately represent the true cutoff point.
To refine this analysis, we can iteratively adjust the threshold to find the point where life expectancy starts to decrease with increasing alcohol intake.
Would you like to proceed with this iterative analysis?

Can you perform an iterative analysis to find the optimal cutoff point for alcohol intake?

Iterative Analysis of Alcohol Intake and Life Expectancy

The graph above shows the relationship between life expectancy and different thresholds of alcohol intake. From the visualization, we can observe the following:
  • Initially, as the alcohol intake threshold increases, the life expectancy for the high alcohol intake category also increases.
  • There is a point where the life expectancy for the high alcohol intake category starts to decrease, indicating the upper limit cutoff.
From the graph, it appears that the optimal cutoff point where alcohol intake is associated with a reduced life expectancy is around 12-13 liters per capita.
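
The threshold sweep described above can be sketched as a loop that, for each cutoff, averages life expectancy over the rows above it. The helper name `mean_life_above`, the step size, and the toy data are hypothetical.

```python
import pandas as pd

def mean_life_above(df, thresholds):
    """For each threshold, average life expectancy over rows with Alcohol above it."""
    rows = []
    for t in thresholds:
        high = df.loc[df["Alcohol"] > t, "Life expectancy"]
        if len(high) > 0:  # skip cutoffs that leave no rows
            rows.append(
                {"threshold": t, "mean_life_high": high.mean(), "n_high": len(high)}
            )
    return pd.DataFrame(rows)

# Hypothetical toy data in which life expectancy peaks at moderate intake
df = pd.DataFrame({
    "Alcohol": [1, 3, 5, 7, 9, 11, 13, 15],
    "Life expectancy": [60, 64, 68, 72, 76, 78, 74, 70],
})

curve = mean_life_above(df, thresholds=range(0, 16, 2))
peak_threshold = curve.loc[curve["mean_life_high"].idxmax(), "threshold"]
print(curve)
print(peak_threshold)
```

The `n_high` column matters when reading the curve: at high cutoffs only a few rows remain, so the average becomes noisy and a visible downturn may reflect a handful of observations.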
Would you like to perform any further analysis or explore other aspects of the dataset?

What factors or variables correlate with alcohol consumption

It seems that the dataset is not currently loaded. Could you please provide the dataset again so that I can proceed with the analysis?

Life Expectancy WHO Kaggle.csv

What variables correlate with alcohol consumption

It seems there are some non-numeric columns in the dataset that are causing issues with the correlation calculation. I will clean the data by removing non-numeric columns and then recalculate the correlation matrix.

Correlation Analysis with Alcohol Consumption

The following variables are positively correlated with alcohol consumption:
  1. Schooling: 0.547
  2. Income composition of resources: 0.450
  3. Life expectancy: 0.405
  4. GDP: 0.355
  5. Percentage expenditure: 0.341
  6. BMI: 0.330
  7. Total expenditure: 0.297
  8. Diphtheria: 0.222
  9. Polio: 0.222
  10. Hepatitis B: 0.088
Negative correlations:
  1. Thinness 1-19 years: -0.429
  2. Thinness 5-9 years: -0.417
  3. Adult Mortality: -0.196
  4. Infant deaths: -0.116
  5. Under-five deaths: -0.112
These correlations suggest that higher alcohol consumption is positively correlated with higher schooling, income composition, life expectancy, GDP, and health expenditure. Conversely, it is negatively correlated with thinness and mortality rates.
Would you like to explore any specific correlations further or perform additional analysis?

Compare life expectancy in developed countries vs developing countries when alcohol consumption is greater than 5 liters per capita

Comparison of Life Expectancy

For countries with alcohol consumption greater than 5 liters per capita:
  • Developed Countries: The average life expectancy is 79.03 years.
  • Developing Countries: The average life expectancy is 69.47 years.
This indicates that developed countries tend to have a higher life expectancy compared to developing countries when alcohol consumption is high.
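
This comparison amounts to a filter followed by a group mean. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
df = pd.DataFrame({
    "Status": ["Developed", "Developed", "Developed",
               "Developing", "Developing", "Developing"],
    "Alcohol": [8.0, 6.0, 2.0, 7.0, 5.5, 1.0],
    "Life expectancy": [80.0, 78.0, 82.0, 70.0, 68.0, 60.0],
})

# Keep only rows above the 5-litre cutoff, then average per status
high_alcohol = df[df["Alcohol"] > 5]
means = high_alcohol.groupby("Status")["Life expectancy"].mean()
print(means)
```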
Would you like to explore any other aspects of the data or perform additional analysis?

What correlates with life expectancy in developed countries?

Correlation Analysis with Life Expectancy in Developed Countries

The following variables are positively correlated with life expectancy in developed countries:
  1. Income composition of resources: 0.724
  2. Schooling: 0.395
  3. GDP: 0.353
  4. Percentage expenditure: 0.350
  5. Year: 0.333
Negative correlations:
  1. Thinness 5-9 years: -0.597
  2. Thinness 1-19 years: -0.588
  3. Adult Mortality: -0.485
  4. Alcohol: -0.288
These correlations suggest that higher life expectancy in developed countries is positively correlated with higher income composition, schooling, GDP, and health expenditures. Conversely, it is negatively correlated with thinness, adult mortality, and alcohol consumption.
Would you like to explore any specific correlations further or perform additional analysis?