Data Cleaning and Preparation for Titanic Dataset Analysis

Sep 28, 2024, 2:32 PM

titanic (1) (1).csv

Using the given Titanic dataset, perform basic data exploration and analysis to gain insights into the demographics of passengers onboard the Titanic. The project involves tasks such as data cleaning, visualization, and basic statistical analysis to understand various aspects of the passengers, such as their demographics, ticket fares, and family relations. Also handle missing values in this dataset effectively using various techniques in Python.

Basic Data Exploration and Analysis of Titanic Dataset

Let's start by performing some basic data exploration and analysis to gain insights into the demographics of passengers onboard the Titanic. This involves tasks such as data cleaning, visualization, and basic statistical analysis to understand various aspects of the passengers, such as their demographics, ticket fares, and family relations. We'll also handle missing values effectively using various techniques in Python.

Steps to be performed:

  1. Data Cleaning: Handle missing values and correct data types.
  2. Data Visualization: Visualize the demographics and other aspects of the passengers.
  3. Statistical Analysis: Perform basic statistical analysis to gain insights.
Let's start with data cleaning.

Data Cleaning

A per-column count of missing values shows the following:
  • The Age column has 177 missing values.
  • The Cabin column has 687 missing values.
  • The Embarked column has 2 missing values.
We'll handle these missing values as follows:
  1. Age: Fill missing values with the median age.
  2. Cabin: Since a large portion of the data is missing, we'll drop this column for now.
  3. Embarked: Fill missing values with the most common port of embarkation.
Let's proceed with these steps.
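The three steps above can be sketched in pandas as follows. This is a minimal sketch, not the exact code that was run in this session: the small DataFrame is made-up data that only borrows the column names `Age`, `Cabin`, and `Embarked`, and the helper name `clean` is hypothetical. With the real file, `pd.read_csv` on the uploaded CSV would replace the toy frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic CSV (made-up rows, real column names).
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q", "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86],
})

# Count missing values per column before cleaning.
missing_before = df.isnull().sum()


def clean(frame):
    """Return a cleaned copy (hypothetical helper, not from the session)."""
    out = frame.copy()
    # Age: fill gaps with the median of the observed ages.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Cabin: drop the column because most of its values are missing.
    out = out.drop(columns=["Cabin"])
    # Embarked: fill gaps with the most common port.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    return out


cleaned = clean(df)
missing_after = cleaned.isnull().sum()

print(missing_before)
print(missing_after)
```

Working on a copy keeps the raw frame available, which makes it easy to compare the missing-value counts before and after.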

Data Cleaning Completed

All missing values have been handled:
  • Age: Filled with the median age.
  • Cabin: Dropped the column due to a large number of missing values.
  • Embarked: Filled with the most common port of embarkation.
Now, let's move on to data visualization to gain insights into the demographics and other aspects of the passengers.

Can you do a more detailed, in-depth analysis?

Detailed Data Analysis

We'll perform a more detailed analysis by exploring the following aspects:
  1. Demographics: Distribution of passengers by age, gender, and class.
  2. Survival Analysis: Survival rates based on different factors such as age, gender, class, and family relations.
  3. Ticket Fares: Distribution of ticket fares and their relation to survival.
  4. Family Relations: Analysis of family size and its impact on survival.
Let's start with the demographics.

Demographics Analysis

  1. Age Distribution: The age distribution of passengers shows a wide range of ages, with a significant number of passengers in their 20s and 30s.
  2. Gender Distribution: There were more male passengers than female passengers onboard the Titanic.
  3. Class Distribution: The majority of passengers were in the third class, followed by the first and second classes.
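As a rough illustration of how these three distributions can be tabulated, here is a minimal sketch. The rows are made up for the example and only reuse the column names `Age`, `Sex`, and `Pclass`. The decade-wide age bins are an assumption chosen for readability, not necessarily the binning used for the charts in this session.

```python
import pandas as pd

# Made-up rows that only mimic the dataset's column names.
df = pd.DataFrame({
    "Age": [4, 22, 26, 29, 31, 35, 38, 54],
    "Sex": ["male", "male", "female", "male",
            "female", "male", "male", "female"],
    "Pclass": [3, 3, 3, 1, 2, 3, 1, 3],
})

# Age distribution in decade-wide bins: 0-9, 10-19, ..., 70-79.
edges = list(range(0, 90, 10))
labels = [f"{lo}-{lo + 9}" for lo in edges[:-1]]
age_band = pd.cut(df["Age"], bins=edges, labels=labels, right=False)
age_counts = age_band.value_counts().sort_index()

# Passenger counts by gender and by ticket class.
gender_counts = df["Sex"].value_counts()
class_counts = df["Pclass"].value_counts().sort_index()

print(age_counts)
print(gender_counts)
print(class_counts)
```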
Next, let's analyze the survival rates based on different factors such as age, gender, class, and family relations.

Survival Analysis

  1. Survival Rate by Gender: The survival rate for females was significantly higher than that for males.
  2. Survival Rate by Class: Passengers in the first class had the highest survival rate, followed by the second class, with the third class having the lowest survival rate.
  3. Survival Rate by Age: The survival rate varies across different age groups, with children having a higher survival rate compared to adults.
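Because `Survived` is coded as 0 or 1, the mean of that column within a group is the group's survival rate. A minimal sketch of that idea, using made-up rows (so the numbers it prints say nothing about the real passengers) and age cut-offs that are assumptions for this example:

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 2, 2],
    "Age": [22, 38, 26, 35, 35, 54, 2, 4, 8, 40],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

# Age groups with assumed cut-offs: child < 12, teen < 18, adult < 60.
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
    right=False,
)
# observed=True keeps only the groups that actually occur in the data.
rate_by_age = df.groupby("AgeGroup", observed=True)["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
print(rate_by_age)
```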
Next, let's analyze the distribution of ticket fares and their relation to survival.

Ticket Fares Analysis

  1. Distribution of Ticket Fares: The distribution of ticket fares shows that most passengers paid lower fares, with a few passengers paying significantly higher fares.
  2. Ticket Fares and Survival: Passengers who paid higher fares had a higher survival rate compared to those who paid lower fares.
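One way to check both observations numerically is to compare the mean and median fare (a mean well above the median signals a long right tail) and then to compare survival rates across fare bands. The sketch below uses made-up fares and outcomes, and the two-way split at the median is an assumption for the example. Quartiles would work the same way with `q=4`.

```python
import pandas as pd

# Made-up fares and outcomes; only the column names follow the dataset.
df = pd.DataFrame({
    "Fare": [7.25, 7.92, 8.05, 8.46, 13.0, 26.0, 51.86, 53.10, 71.28, 263.0],
    "Survived": [0, 1, 0, 0, 1, 0, 0, 1, 1, 1],
})

# Summary statistics and skewness of the fare distribution.
fare_summary = df["Fare"].describe()
fare_skew = df["Fare"].skew()

# Split passengers at the median fare and compare survival rates.
df["FareBand"] = pd.qcut(df["Fare"], q=2, labels=["lower half", "upper half"])
rate_by_band = df.groupby("FareBand", observed=True)["Survived"].mean()

# Mean fare paid by non-survivors (0) and survivors (1).
mean_fare_by_outcome = df.groupby("Survived")["Fare"].mean()

print(fare_summary)
print(rate_by_band)
print(mean_fare_by_outcome)
```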
Next, let's analyze the family relations and their impact on survival.

Family Relations Analysis

Survival Rate by Family Size: Passengers with smaller family sizes had a higher survival rate than those with larger family sizes. This suggests that travelling with fewer family members was associated with better chances of survival, although the comparison alone does not show that family size caused the difference.
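Family size is not a column in the file, so it has to be derived. The sketch below assumes the dataset has the usual `SibSp` (siblings and spouses aboard) and `Parch` (parents and children aboard) columns and uses the common convention `FamilySize = SibSp + Parch + 1`, where the `+ 1` counts the passenger. The rows are made up, and this may differ from the definition used in the session.

```python
import pandas as pd

# Made-up rows; SibSp and Parch are assumed column names.
df = pd.DataFrame({
    "SibSp": [0, 1, 1, 0, 4, 0, 3, 1],
    "Parch": [0, 0, 1, 0, 2, 0, 2, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Derived feature: relatives aboard plus the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Survival rate per family size, with the group size alongside it
# so that rates based on very few passengers are easy to spot.
rate_by_size = df.groupby("FamilySize")["Survived"].agg(["mean", "count"])

print(rate_by_size)
```

Reporting the count next to each rate matters here, because large families are rare and their rates rest on few passengers.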

Summary of Detailed Analysis

  1. Demographics:
    • Age: Wide range with many passengers in their 20s and 30s.
    • Gender: More males than females.
    • Class: Majority in third class.
  2. Survival Analysis:
    • Gender: Higher survival rate for females.
    • Class: Higher survival rate for first class passengers.
    • Age: Higher survival rate for children.
  3. Ticket Fares:
    • Most passengers paid lower fares.
    • Higher fares correlated with higher survival rates.
  4. Family Relations:
    • Smaller family sizes had higher survival rates.
Would you like to explore any other specific aspects of the dataset or perform any additional analysis?

Can you perform a logistic regression analysis to predict survival?

Logistic Regression Analysis

To perform a logistic regression analysis to predict survival, we'll follow these steps:
  1. Data Preprocessing: Handle missing values and encode categorical variables.
  2. Feature Selection: Select relevant features for the model.
  3. Model Training: Train a logistic regression model.
  4. Model Evaluation: Evaluate the model's performance using metrics such as accuracy, precision, recall, and the ROC curve.
Let's start with data preprocessing.
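The four steps can be sketched with scikit-learn as shown below. This is a sketch under stated assumptions rather than the model that was actually fitted: the data is synthetic (generated from a made-up rule), the feature list `Pclass`, `Sex`, `Age`, `Fare` and the 80/20 split are assumptions, and the printed metrics will not match the figures reported for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: NOT the Titanic passengers.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "Age": rng.uniform(1, 70, n),
    "Fare": rng.exponential(30, n),
})
df.loc[rng.choice(n, 40, replace=False), "Age"] = np.nan  # simulate gaps
# Made-up labelling rule so the model has some signal to learn.
logit = 1.5 * (df["Sex"] == "female") - 0.6 * (df["Pclass"] - 2) - 0.5
df["Survived"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# 1-2. Preprocessing and feature selection.
X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["Pclass", "Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# 3. Model training on a stratified 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# 4. Evaluation on the held-out rows.
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
print(metrics)
```

Putting imputation and encoding inside the pipeline means they are fitted on the training rows only, so no information from the test rows leaks into preprocessing.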

Logistic Regression Analysis Results

  1. Accuracy: The model achieved an accuracy of 80.45% on the test set.
  2. Precision: The precision of the model is 78.26%.
  3. Recall: The recall of the model is 72.97%.
  4. ROC AUC: The ROC AUC score is 88.19%.
These metrics indicate that the logistic regression model performs reasonably well in predicting survival on the Titanic dataset.
Next, let's visualize the ROC curve to better understand the model's performance.
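The points of a ROC curve can be computed with scikit-learn's `roc_curve`, which returns the false positive rate and true positive rate at each score threshold. The labels and scores below are made up for illustration and do not come from the model above.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Made-up true labels and predicted probabilities of survival.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

# False positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve via the trapezoidal rule.
roc_auc = auc(fpr, tpr)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC={roc_auc:.3f}")
```

Passing `fpr` and `tpr` to a line plot (for example matplotlib's `plt.plot(fpr, tpr)`) draws the curve. The diagonal from (0, 0) to (1, 1) is the usual reference for a model with no discriminative power.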