Data Cleaning: Handling Missing Values in the Dataset

Aug 25, 2024, 11:34 AM

Reviews.csv


Look at the text data to understand its structure and any pre-processing needed.

Pre-processing:
Clean the text data (remove HTML tags, stop words, etc.).
Create sentiment labels based on the Score. Clean this dataset of missing values, handle outliers, and prepare the data for modeling. Handle categorical data and convert any columns that need to be numeric. Give me a complete analysis of this data: what are the top-rated products? Show review trends over time and give meaningful insights.
It seems there is an issue with the CSV file, possibly due to an inconsistent number of fields in some rows. Let's try loading the file with a different approach to handle this issue.
It seems the error_bad_lines argument is no longer supported in the latest version of pandas. Let's try using the on_bad_lines argument instead to skip problematic lines.
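A minimal sketch of this loading step, assuming the file is named Reviews.csv and pandas 1.3 or newer:

```python
import pandas as pd

# on_bad_lines replaces the removed error_bad_lines argument (pandas >= 1.3);
# "skip" silently drops rows with an inconsistent number of fields.
df = pd.read_csv("Reviews.csv", on_bad_lines="skip")

print(df.shape)
print(df.head())
```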
The dataset has been successfully loaded. Here is a preview of the first few rows:
| Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|----|-----------|--------|-------------|----------------------|------------------------|-------|------|---------|------|
| 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |
Next, we will proceed with the following steps:
  1. Clean the text data (remove HTML tags, stop words, etc.).
  2. Create sentiment labels based on the Score.
  3. Remove missing values from the dataset.
  4. Handle outliers.
  5. Prepare the data for modeling.
  6. Handle categorical data and convert necessary columns to numeric.
  7. Analyze the data to find the top-rated products.
  8. Show review trends over time and provide meaningful insights.
The text data has been successfully cleaned. Here is a preview of the original and cleaned text:
| Original Text | Cleaned Text |
|---------------|--------------|
| I have bought several of the Vitality canned d... | bought several vitality canned dog food produc... |
| Product arrived labeled as Jumbo Salted Peanut... | product arrived labeled jumbo salted peanuts p... |
| This is a confection that has been around a fe... | confection around centuries light pillowy citr... |
| If you are looking for the secret ingredient i... | looking secret ingredient robitussin believe f... |
| Great taffy at a great price. There was a wid... | great taffy great price wide assortment yummy ... |
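A minimal sketch of cleaning along these lines, using regular expressions for the HTML tags and NLTK's English stop-word list; the Cleaned_Text column name is an assumption:

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Strip HTML tags and punctuation, lowercase, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", str(text))   # remove HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # keep letters and whitespace only
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

df["Cleaned_Text"] = df["Text"].apply(clean_text)
print(df[["Text", "Cleaned_Text"]].head())
```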
Next, we will create sentiment labels based on the Score column.
The sentiment labels have been created based on the Score column. Here is the distribution of sentiment labels:
  • Positive: 90,198 reviews
  • Negative: 17,190 reviews
  • Neutral: 9,183 reviews
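A sketch of one common labeling rule that produces a three-way split like the one above; the exact thresholds (4–5 positive, 3 neutral, 1–2 negative) are an assumption:

```python
def score_to_sentiment(score: int) -> str:
    # Assumed thresholds: 4-5 -> Positive, 3 -> Neutral, 1-2 -> Negative
    if score >= 4:
        return "Positive"
    if score == 3:
        return "Neutral"
    return "Negative"

df["Sentiment"] = df["Score"].apply(score_to_sentiment)
print(df["Sentiment"].value_counts())
```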
Next, we will remove missing values from the dataset.
The dataset contains a few missing values in the ProfileName and Summary columns. Here are the counts of missing values:
  • ProfileName: 7 missing values
  • Summary: 5 missing values
Next, we will remove the rows with missing values.
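A minimal pandas sketch of that check-and-drop step:

```python
# Count missing values per column, then drop the affected rows.
print(df.isnull().sum())
df = df.dropna(subset=["ProfileName", "Summary"])
print("Remaining missing values:", df.isnull().sum().sum())
```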

continue to the next process

The dataset has been successfully cleaned of missing values. There are no more missing values in any of the columns.
Next, we will handle outliers in the dataset. We will focus on the HelpfulnessNumerator and HelpfulnessDenominator columns to identify and handle any outliers.
The box plots for HelpfulnessNumerator and HelpfulnessDenominator have been displayed. These plots help us identify the presence of outliers in these columns.
Next, we will handle the outliers by removing rows where the values in these columns are significantly higher than the majority of the data.
The outliers have been successfully removed from the dataset. Here are the shapes of the dataframe before and after removing outliers:
  • Original Shape: (116,559, 12)
  • Filtered Shape: (115,212, 12)
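A sketch of the inspection and filtering steps; the transcript does not state the exact cutoff used, so a high-quantile cap (99th percentile) is assumed here:

```python
import matplotlib.pyplot as plt

# Box plots to inspect the spread of the helpfulness columns.
df[["HelpfulnessNumerator", "HelpfulnessDenominator"]].plot(
    kind="box", subplots=True, figsize=(10, 4)
)
plt.show()

# Keep rows at or below an assumed 99th-percentile cap in both columns.
num_cap = df["HelpfulnessNumerator"].quantile(0.99)
den_cap = df["HelpfulnessDenominator"].quantile(0.99)
filtered = df[(df["HelpfulnessNumerator"] <= num_cap) &
              (df["HelpfulnessDenominator"] <= den_cap)]

print("Original shape:", df.shape)
print("Filtered shape:", filtered.shape)
df = filtered
```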
Next, we will prepare the data for modeling. This includes handling categorical data and converting necessary columns to numeric.
The Time column has been successfully converted to datetime format. Here is a preview of the converted Time column:
Time
2011-04-27
2012-09-07
2008-08-18
2011-06-13
2012-10-21
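A sketch of the conversion, assuming the raw Time values are Unix timestamps in seconds (e.g. 1303862400 corresponds to 2011-04-27):

```python
# Convert Unix epoch seconds to pandas datetimes.
df["Time"] = pd.to_datetime(df["Time"], unit="s")
print(df["Time"].dt.date.head())
```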
Next, we will handle categorical data and convert necessary columns to numeric.
The categorical columns have been successfully converted to numeric using one-hot encoding. The dataset is now ready for modeling.
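A sketch of the encoding step with pd.get_dummies; the transcript does not say which columns were encoded, so the Sentiment label is used as an example (one-hot encoding high-cardinality IDs such as ProductId would blow up the column count):

```python
# One-hot encode an example categorical column (Sentiment is an assumption).
df_model = pd.get_dummies(df, columns=["Sentiment"], drop_first=True)
print(df_model.filter(like="Sentiment").head())
```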
Next, we will analyze the data to find the top-rated products and show review trends over time.
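A sketch of both analyses; the minimum-review threshold of 50 and the monthly resampling window are assumptions:

```python
import matplotlib.pyplot as plt

# Top-rated products: average score per product, ignoring products with
# too few reviews for the average to be meaningful (threshold is assumed).
product_stats = df.groupby("ProductId")["Score"].agg(["mean", "count"])
top_rated = (product_stats[product_stats["count"] >= 50]
             .sort_values("mean", ascending=False))
print(top_rated.head(10))

# Review trends over time: review volume and average score per month.
monthly = df.set_index("Time").resample("M")["Score"].agg(["count", "mean"])
monthly.plot(subplots=True, figsize=(10, 6))
plt.tight_layout()
plt.show()
```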

Can you visualize the review trends over time?
