Exploratory Data Analysis of House Prices Dataset

Jul 3, 2024, 11:57 PM

Generate insights

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

# Load the data
df = pd.read_csv('HousePrices.csv')

# Display basic information about the dataset
# (info() prints its report directly and returns None, so print() adds a trailing "None")
print(df.info())

# Show the first few rows of the dataset
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   MSZoning       1460 non-null   object
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64
 5   Street         1460 non-null   object
 6   Alley          91 non-null     object
 7   LotShape       1460 non-null   object
 8   LandContour    1460 non-null   object
 9   Utilities      1460 non-null   object
 10  LotConfig      1460 non-null   object
 11  LandSlope      1460 non-null   object
 12  Neighborhood   1460 non-null   object
 13  Condition1     1460 non-null   object
 14  Condition2     1460 non-null   object
 15  BldgType       1460 non-null   object
 16  HouseStyle     1460 non-null   object
 17  OverallQual    1460 non-null   int64
 18  OverallCond    1460 non-null   int64
 19  YearBuilt      1460 non-null   int64
 20  YearRemodAdd   1460 non-null   int64
 21  RoofStyle      1460 non-null   object
 22  RoofMatl       1460 non-null   object
 23  Exterior1st    1460 non-null   object
 24  Exterior2nd    1460 non-null   object
 25  MasVnrType     588 non-null    object
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object
 28  ExterCond      1460 non-null   object
 29  Foundation     1460 non-null   object
 30  BsmtQual       1423 non-null   object
 31  BsmtCond       1423 non-null   object
 32  BsmtExposure   1422 non-null   object
 33  BsmtFinType1   1423 non-null   object
 34  BsmtFinSF1     1460 non-null   int64
 35  BsmtFinType2   1422 non-null   object
 36  BsmtFinSF2     1460 non-null   int64
 37  BsmtUnfSF      1460 non-null   int64
 38  TotalBsmtSF    1460 non-null   int64
 39  Heating        1460 non-null   object
 40  HeatingQC      1460 non-null   object
 41  CentralAir     1460 non-null   object
 42  Electrical     1459 non-null   object
 43  1stFlrSF       1460 non-null   int64
 44  2ndFlrSF       1460 non-null   int64
 45  LowQualFinSF   1460 non-null   int64
 46  GrLivArea      1460 non-null   int64
 47  BsmtFullBath   1460 non-null   int64
 48  BsmtHalfBath   1460 non-null   int64
 49  FullBath       1460 non-null   int64
 50  HalfBath       1460 non-null   int64
 51  BedroomAbvGr   1460 non-null   int64
 52  KitchenAbvGr   1460 non-null   int64
 53  KitchenQual    1460 non-null   object
 54  TotRmsAbvGrd   1460 non-null   int64
 55  Functional     1460 non-null   object
 56  Fireplaces     1460 non-null   int64
 57  FireplaceQu    770 non-null    object
 58  GarageType     1379 non-null   object
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object
 61  GarageCars     1460 non-null   int64
 62  GarageArea     1460 non-null   int64
 63  GarageQual     1379 non-null   object
 64  GarageCond     1379 non-null   object
 65  PavedDrive     1460 non-null   object
 66  WoodDeckSF     1460 non-null   int64
 67  OpenPorchSF    1460 non-null   int64
 68  EnclosedPorch  1460 non-null   int64
 69  3SsnPorch      1460 non-null   int64
 70  ScreenPorch    1460 non-null   int64
 71  PoolArea       1460 non-null   int64
 72  PoolQC         7 non-null      object
 73  Fence          281 non-null    object
 74  MiscFeature    54 non-null     object
 75  MiscVal        1460 non-null   int64
 76  MoSold         1460 non-null   int64
 77  YrSold         1460 non-null   int64
 78  SaleType       1460 non-null   object
 79  SaleCondition  1460 non-null   object
 80  SalePrice      1460 non-null   int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2
1         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      5
2         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      9
3         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2
4         Lvl    AllPub  ...         0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition  SalePrice
0    2008       WD        Normal     208500
1    2007       WD        Normal     181500
2    2008       WD        Normal     223500
3    2006       WD       Abnorml     140000
4    2008       WD        Normal     250000

[5 rows x 81 columns]
# Analyze the distribution of sale prices
price_stats = df['SalePrice'].describe()
print("Sale Price Statistics:")
print(price_stats)

# Create a histogram of sale prices
fig = px.histogram(df, x='SalePrice', nbins=50, title='Distribution of House Sale Prices')
fig.update_layout(xaxis_title='Sale Price', yaxis_title='Count')
fig.show()

# Analyze the correlation between overall quality and sale price
fig = px.scatter(df, x='OverallQual', y='SalePrice', title='Overall Quality vs Sale Price')
fig.update_layout(xaxis_title='Overall Quality', yaxis_title='Sale Price')
fig.show()

# Analyze the average sale price by neighborhood
neighborhood_avg_price = df.groupby('Neighborhood')['SalePrice'].mean().sort_values(ascending=False)
top_10_neighborhoods = neighborhood_avg_price.head(10)

fig = px.bar(x=top_10_neighborhoods.index, y=top_10_neighborhoods.values,
             title='Top 10 Neighborhoods by Average Sale Price')
fig.update_layout(xaxis_title='Neighborhood', yaxis_title='Average Sale Price')
fig.show()

# Analyze the relationship between living area and sale price
fig = px.scatter(df, x='GrLivArea', y='SalePrice', title='Living Area vs Sale Price')
fig.update_layout(xaxis_title='Above Ground Living Area (sq ft)', yaxis_title='Sale Price')
fig.show()
Based on the analysis of the House Prices dataset, here are some key insights:
  1. Price Distribution: The average sale price is $180,921, with a standard deviation of $79,442. The minimum price is $34,900, and the maximum is $755,000. The median price (50th percentile) is $163,000.
  2. Price Distribution Visualization: The histogram shows a right-skewed distribution: most houses sell toward the lower end of the price range, with a long tail of a few high-priced houses. The mean ($180,921) exceeding the median ($163,000) is consistent with this skew.
  3. Overall Quality vs. Price: The scatter plot shows a clear upward trend between the overall quality rating and the sale price. Higher-quality houses tend to sell for higher prices.
  4. Neighborhood Impact: The bar chart ranks the top 10 neighborhoods by average sale price. Differences in average price across neighborhoods suggest that location is an important factor in house prices.
  5. Living Area vs. Price: There's a positive correlation between the above-ground living area and the sale price. Larger houses tend to be more expensive.
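The right-skew reading in point 2 can also be checked numerically rather than only by eye. The following is a minimal sketch, assuming a small invented sample in place of df['SalePrice']; the variable names and values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df['SalePrice']; these values are invented for illustration.
sale_price = pd.Series(
    [95000, 120000, 135000, 150000, 163000, 175000, 210000, 320000, 755000],
    name='SalePrice',
)

mean_price = sale_price.mean()
median_price = sale_price.median()
skewness = sale_price.skew()  # sample skewness; a value above 0 indicates a longer right tail

print(f"mean={mean_price:,.0f}  median={median_price:,.0f}  skew={skewness:.2f}")

# A right-skewed distribution typically has positive skewness and mean > median
is_right_skewed = bool(skewness > 0 and mean_price > median_price)
print("right-skewed:", is_right_skewed)
```

On the real data, replacing the sample with df['SalePrice'] should give the same kind of summary.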
Further analysis could include:
  1. Investigating the impact of other features like the year built, number of bedrooms, or presence of amenities on the sale price.
  2. Analyzing the seasonal trends in house prices using the MoSold (Month Sold) feature.
  3. Exploring the relationship between lot size and price.
  4. Investigating the impact of recent remodeling (YearRemodAdd) on house prices.
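As one possible starting point for the seasonal analysis in point 2, sales could be grouped by MoSold to compare sale counts and median prices per month. This is only a sketch under the assumption of a tiny invented table with the same two column names; it is not output from the actual dataset.

```python
import pandas as pd

# Hypothetical miniature of df with only the two columns needed; values are invented.
sales = pd.DataFrame({
    'MoSold':    [2, 2, 5, 5, 5, 9, 12],
    'SalePrice': [208500, 140000, 181500, 250000, 143000, 223500, 129900],
})

# Number of sales and median sale price for each month, ordered by month number
monthly = (
    sales.groupby('MoSold')['SalePrice']
         .agg(n_sales='count', median_price='median')
         .sort_index()
)
print(monthly)
```

The median is used here instead of the mean because a few very expensive houses would otherwise dominate months with few sales.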
Would you like me to perform any of these additional analyses or explore any other aspects of the dataset?

Can you provide a correlation matrix for the numerical features?

import pandas as pd
import numpy as np
import plotly.express as px

# Select numerical columns
numerical_columns = df.select_dtypes(include=[np.number]).columns

# Calculate the correlation matrix
corr_matrix = df[numerical_columns].corr()

# Create a heatmap using Plotly
fig = px.imshow(corr_matrix,
                x=corr_matrix.columns,
                y=corr_matrix.columns,
                color_continuous_scale='RdBu_r',
                aspect='auto',
                title='Correlation Matrix of Numerical Features')

# Update layout for better readability
fig.update_layout(
    xaxis=dict(tickangle=45),
    yaxis=dict(tickangle=0),
    width=1000,
    height=800
)

# Show the plot
fig.show()

# Display top 10 correlations with SalePrice
sale_price_corr = corr_matrix['SalePrice'].sort_values(ascending=False)
print("Top 10 features correlated with SalePrice:")
print(sale_price_corr.head(11)) # 11 because SalePrice itself will be included
I have provided the correlation matrix for the numerical features as requested. Here's a summary of the findings:
  1. The correlation matrix heatmap visualizes the relationships between all numerical features in the dataset.
  2. The 10 features most strongly correlated with SalePrice (Pearson correlation coefficient) are:
    • OverallQual (0.791)
    • GrLivArea (0.709)
    • GarageCars (0.640)
    • GarageArea (0.623)
    • TotalBsmtSF (0.614)
    • 1stFlrSF (0.606)
    • FullBath (0.561)
    • TotRmsAbvGrd (0.534)
    • YearBuilt (0.523)
    • YearRemodAdd (0.507)
  3. OverallQual (overall quality) has the strongest positive correlation with SalePrice, which confirms our earlier observation from the scatter plot.
  4. GrLivArea (above ground living area) also shows a strong positive correlation, supporting our previous analysis.
  5. Features related to the garage (GarageCars and GarageArea) have fairly strong positive correlations with the sale price (0.640 and 0.623).
  6. The year the house was built (YearBuilt) and the year of remodeling (YearRemodAdd) both show moderate positive correlations with the sale price.
This correlation analysis shows which features have the strongest linear relationships with the sale price. That can inform feature selection for predictive modeling and help in understanding the key factors associated with house prices in this dataset.
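One refinement to the ranking above: sorting correlations in descending order only surfaces positive relationships, and the Id column (a row identifier) is included among the numerical features. A possible variant, sketched here on a small invented table rather than the real df, drops Id and the target itself and ranks by absolute correlation so that strong negative relationships also appear. The sample values and the names numeric, target_corr, and ranked are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df; values are illustrative only.
sample = pd.DataFrame({
    'Id':            [1, 2, 3, 4, 5, 6],
    'OverallQual':   [7, 6, 7, 7, 8, 5],
    'GrLivArea':     [1710, 1262, 1786, 1717, 2198, 1362],
    'EnclosedPorch': [0, 0, 0, 272, 0, 112],
    'Neighborhood':  ['CollgCr', 'Veenker', 'CollgCr', 'Crawfor', 'NoRidge', 'Mitchel'],
    'SalePrice':     [208500, 181500, 223500, 140000, 250000, 143000],
})

# Keep numerical columns only and drop the row identifier
numeric = sample.select_dtypes(include=[np.number]).drop(columns=['Id'])

# Pearson correlation of each feature with the target, excluding the target itself
target_corr = numeric.corr()['SalePrice'].drop('SalePrice')

# Rank by absolute value so strong negative correlations are not pushed to the bottom
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

The signed values are kept in the result, so the direction of each relationship remains visible after ranking.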
Would you like to explore any specific relationships further or analyze any other aspects of the data?