Data Cleaning: Handling Missing Values in the Dataset

Reviews.csv


Look at the text data to understand its structure and any pre-processing needed.

Pre-processing:
Clean the text data (remove HTML tags, stop words, etc.).
Create sentiment labels based on the Score. Clean the dataset of missing values, handle outliers, and prepare the data for modeling. Handle categorical data and convert the necessary columns to numeric. Then give a complete analysis of the data: what are the top-rated products? Show review trends over time and provide meaningful insights.
It seems there is an issue with the CSV file, possibly due to an inconsistent number of fields in some rows. Let's try to load the file with a different approach to handle this issue.
It seems the error_bad_lines argument is no longer supported in the latest version of pandas. Let's try using the on_bad_lines argument instead to skip problematic lines.
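A minimal sketch of that loading step, assuming the file is named Reviews.csv and pandas 1.3 or later (where on_bad_lines replaces the removed error_bad_lines argument):

```python
import pandas as pd

# Skip rows whose field count does not match the header instead of raising an error.
# on_bad_lines="skip" is the pandas >= 1.3 replacement for error_bad_lines=False.
df = pd.read_csv("Reviews.csv", on_bad_lines="skip")

print(df.shape)
df.head()
```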
The dataset has been successfully loaded. Here is a preview of the first few rows:
| Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|----|-----------|--------|-------------|----------------------|------------------------|-------|------|---------|------|
| 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |
Next, we will proceed with the following steps:
  1. Clean the text data (remove HTML tags, stop words, etc.).
  2. Create sentiment labels based on the Score.
  3. Clean the dataset from missing values.
  4. Handle outliers.
  5. Prepare the data for modeling.
  6. Handle categorical data and convert necessary columns to numeric.
  7. Analyze the data to find the top-rated products.
  8. Show review trends over time and provide meaningful insights.
The text data has been successfully cleaned. Here is a preview of the original and cleaned text:
| Original Text | Cleaned Text |
|---------------|--------------|
| I have bought several of the Vitality canned d... | bought several vitality canned dog food produc... |
| Product arrived labeled as Jumbo Salted Peanut... | product arrived labeled jumbo salted peanuts p... |
| This is a confection that has been around a fe... | confection around centuries light pillowy citr... |
| If you are looking for the secret ingredient i... | looking secret ingredient robitussin believe f... |
| Great taffy at a great price. There was a wid... | great taffy great price wide assortment yummy ... |
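A minimal sketch of the kind of cleaning applied here, assuming BeautifulSoup for HTML stripping and NLTK's English stop-word list; the clean_text helper and the Cleaned_Text column name are illustrative assumptions:

```python
import re

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Strip HTML tags, keep only letters, lowercase, and drop stop words."""
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r"[^a-zA-Z]+", " ", text).lower()
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

df["Cleaned_Text"] = df["Text"].astype(str).apply(clean_text)
df[["Text", "Cleaned_Text"]].head()
```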
Next, we will create sentiment labels based on the Score column.
The sentiment labels have been created based on the Score column. Here is the distribution of sentiment labels:
  • Positive: 90,198 reviews
  • Negative: 17,190 reviews
  • Neutral: 9,183 reviews
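A plausible mapping that produces a three-way split like the one above; the exact thresholds are an assumption (4–5 → Positive, 3 → Neutral, 1–2 → Negative):

```python
def score_to_sentiment(score: int) -> str:
    # Assumed thresholds: 4-5 -> Positive, 3 -> Neutral, 1-2 -> Negative.
    if score >= 4:
        return "Positive"
    if score == 3:
        return "Neutral"
    return "Negative"

df["Sentiment"] = df["Score"].apply(score_to_sentiment)
print(df["Sentiment"].value_counts())
```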
Next, we will clean the dataset from missing values.
The dataset contains a few missing values in the ProfileName and Summary columns. Here are the counts of missing values:
  • ProfileName: 7 missing values
  • Summary: 5 missing values
Next, we will remove the rows with missing values.
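A sketch of that check and removal, dropping only the rows with gaps in ProfileName or Summary:

```python
# Count missing values per column, then drop the affected rows.
print(df.isnull().sum())
df = df.dropna(subset=["ProfileName", "Summary"]).reset_index(drop=True)
print(df.isnull().sum().sum())  # should now be 0
```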

Continue to the next process.

The dataset has been successfully cleaned of missing values. There are no more missing values in any of the columns.
Next, we will handle outliers in the dataset. We will focus on the HelpfulnessNumerator and HelpfulnessDenominator columns to identify and handle any outliers.
The box plots for HelpfulnessNumerator and HelpfulnessDenominator have been displayed. These plots help us identify the presence of outliers in these columns.
Next, we will handle the outliers by removing rows where the values in these columns are significantly higher than the majority of the data.
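One simple way to do this is a percentile-based filter; the 99th-percentile cutoff below is an assumed choice rather than the exact rule used here:

```python
# Keep rows whose helpfulness counts are at or below the 99th percentile.
# The percentile cutoff is an assumed choice, not necessarily the rule used above.
cols = ["HelpfulnessNumerator", "HelpfulnessDenominator"]
limits = df[cols].quantile(0.99)
mask = (df[cols] <= limits).all(axis=1)

print("Original shape:", df.shape)
df = df[mask].reset_index(drop=True)
print("Filtered shape:", df.shape)
```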
The outliers have been successfully removed from the dataset. Here are the shapes of the dataframe before and after removing outliers:
  • Original Shape: (116,559, 12)
  • Filtered Shape: (115,212, 12)
Next, we will prepare the data for modeling. This includes handling categorical data and converting necessary columns to numeric.
The Time column has been successfully converted to datetime format. Here is a preview of the converted Time column:
Time
2011-04-27
2012-09-07
2008-08-18
2011-06-13
2012-10-21
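A sketch of that conversion, assuming the raw Time values are Unix timestamps in seconds (as the integers in the earlier preview suggest):

```python
# Convert Unix epoch seconds to pandas datetimes.
df["Time"] = pd.to_datetime(df["Time"], unit="s")
print(df["Time"].dt.date.head())
```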
Next, we will handle categorical data and convert necessary columns to numeric.
The categorical columns have been successfully converted to numeric using one-hot encoding. The dataset is now ready for modeling.
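A sketch of that encoding with pandas get_dummies; which columns get encoded is an assumption (here only the low-cardinality Sentiment label), since one-hot encoding high-cardinality identifiers such as ProductId or UserId would create an impractically wide frame:

```python
# One-hot encode the Sentiment label; drop_first avoids a redundant dummy column.
# Encoding only Sentiment is an assumption -- ID-like columns are left as-is here.
df_model = pd.get_dummies(df, columns=["Sentiment"], drop_first=True)
print(df_model.dtypes)
```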
Next, we will analyze the data to find the top-rated products and show review trends over time.
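A sketch of the top-rated-products query; the minimum of 50 reviews per product is an assumed threshold to keep items with only a handful of ratings from dominating:

```python
# Average score per product, restricted to products with enough reviews.
product_stats = (
    df.groupby("ProductId")["Score"]
      .agg(avg_score="mean", n_reviews="count")
      .query("n_reviews >= 50")          # assumed threshold
      .sort_values("avg_score", ascending=False)
)
print(product_stats.head(10))
```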

Can you visualize the review trends over time?
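A sketch of one way to do this, assuming matplotlib and the datetime Time column from above; it plots monthly review volume and the monthly average score:

```python
import matplotlib.pyplot as plt

# Aggregate by calendar month: number of reviews and mean score.
scores_by_time = df.set_index("Time")["Score"]
monthly_counts = scores_by_time.resample("M").count()
monthly_avg = scores_by_time.resample("M").mean()

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
monthly_counts.plot(ax=ax1, title="Reviews per month")
monthly_avg.plot(ax=ax2, title="Average score per month")
ax2.set_xlabel("Month")
plt.tight_layout()
plt.show()
```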
