Vizly: AI-powered data analysis

Try for yourself

Comparison of Discount Rates Across Categories and Sub-Categories for High-Rated Products

bigbasket.csv


Provide R code. Compare the discount rates across different categories and sub-categories within categories for the high-rated product dataset that you have created above. Provide summary measures and appropriate visualizations.

Summary Measures for Discount Rates Across Categories and Sub-Categories

The table below shows the summary measures (mean, median, and standard deviation) for discount rates across different categories and sub-categories for high-rated products (rating >= 4):
Category | Sub-Category | Mean Discount (%) | Median Discount (%) | SD Discount (%)
Baby Care | Baby Accessories | 0.00 | 0.00 | 0.00
Baby Care | Baby Bath & Hygiene | 3.99 | 0.00 | 8.43
Baby Care | Baby Food & Formula | 0.00 | 0.00 | 0.00
Baby Care | Diapers & Wipes | 12.90 | 8.50 | 12.70
Baby Care | Feeding & Nursing | 0.00 | 0.00 | 0.00
Baby Care | Mothers & Maternity | 1.57 | 0.00 | 5.88
Bakery, Cakes & Dairy | Bakery Snacks | 0.00 | 0.00 | 0.00
Bakery, Cakes & Dairy | Cakes & Pastries | 0.95 | 0.00 | 3.93
Bakery, Cakes & Dairy | Cookies, Rusk & Khari | 6.01 | 0.00 | 8.68
Bakery, Cakes & Dairy | Dairy | 9.05 | 8.33 | 9.40
... | ... | ... | ... | ...
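The summary table above is a grouped aggregation: filter to high-rated products, group by category and sub-category, then compute the mean, median, and standard deviation of the discount. The following is a minimal Python sketch of that idea (not the R code the prompt asks for). The field names `category`, `sub_category`, `rating`, and `discount_pct` are hypothetical, and the sample rows are made up for illustration rather than taken from bigbasket.csv.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Illustrative rows only; field names and values are assumptions.
products = [
    {"category": "Baby Care", "sub_category": "Diapers & Wipes", "rating": 4.2, "discount_pct": 10.0},
    {"category": "Baby Care", "sub_category": "Diapers & Wipes", "rating": 4.5, "discount_pct": 20.0},
    {"category": "Baby Care", "sub_category": "Diapers & Wipes", "rating": 3.1, "discount_pct": 50.0},
    {"category": "Bakery, Cakes & Dairy", "sub_category": "Dairy", "rating": 4.0, "discount_pct": 5.0},
]

def summarize_discounts(rows, min_rating=4.0):
    """Mean, median, and sample SD of discount_pct per (category, sub_category)."""
    groups = defaultdict(list)
    for row in rows:
        if row["rating"] >= min_rating:  # keep high-rated products only
            groups[(row["category"], row["sub_category"])].append(row["discount_pct"])
    summary = {}
    for key, values in sorted(groups.items()):
        summary[key] = {
            "mean": mean(values),
            "median": median(values),
            # The sample SD is undefined for a single observation, so use None.
            "sd": stdev(values) if len(values) > 1 else None,
        }
    return summary

summary = summarize_discounts(products)
for (category, sub_category), stats in summary.items():
    print(category, "|", sub_category, "|", stats)
```

A group with a single product has no sample standard deviation, which is why the sketch returns `None` in that case instead of a number.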

Visualization

The boxplot below visualizes the discount rates across different categories and sub-categories for high-rated products:
[Figure: Discount Rates Across Categories and Sub-Categories]

Choose any three categories. Perform a comparative analysis for the sale price and rating across all sub-categories within each category for all the products listed, irrespective of their ratings.

Summary Measures for Sale Price and Rating Across Sub-Categories within Selected Categories

The table below shows the summary measures (mean, median, and standard deviation) for sale price and rating across different sub-categories within the selected categories (Baby Care, Bakery, Cakes & Dairy, Beauty & Hygiene):
Category | Sub-Category | Mean Sale Price | Median Sale Price | SD Sale Price | Mean Rating | Median Rating | SD Rating
Baby Care | Atta, Flour & Sooji | 300.00 | 300.00 | NA | NA | NA | NA
Baby Care | Baby Accessories | 410.00 | 249.00 | 455.00 | 4.00 | 4.00 | 0.00
Baby Care | Baby Bath & Hygiene | 419.00 | 264.00 | 464.00 | 4.00 | 4.00 | 0.00
Baby Care | Baby Food & Formula | 361.00 | 300.00 | 220.00 | 4.00 | 4.00 | 0.00
Baby Care | Diapers & Wipes | 742.00 | 510.00 | 626.00 | 4.00 | 4.00 | 0.00
Baby Care | Feeding & Nursing | 269.00 | 192.00 | 192.00 | 4.00 | 4.00 | 0.00
Baby Care | Mothers & Maternity | 475.00 | 313.00 | 428.00 | 4.00 | 4.00 | 0.00
Bakery, Cakes & Dairy | Bakery Snacks | 104.00 | 109.00 | 34.40 | 4.00 | 4.00 | 0.00
Bakery, Cakes & Dairy | Breads & Buns | 76.00 | 45.00 | 90.20 | 4.00 | 4.00 | 0.00
Bakery, Cakes & Dairy | Cakes & Pastries | 146.00 | 139.00 | 130.00 | 4.00 | 4.00 | 0.00
... | ... | ... | ... | ... | ... | ... | ...

Visualizations

The boxplots below visualize the sale price and rating across different sub-categories within the selected categories:

Sale Price Across Sub-Categories within Selected Categories

[Figure: Sale Price Across Sub-Categories]

Rating Across Sub-Categories within Selected Categories

[Figure: Rating Across Sub-Categories]

What inferences can be drawn from the above analysis?

Inferences from the Comparative Analysis

Based on the summary measures and visualizations for sale price and rating across sub-categories within the selected categories (Baby Care, Bakery, Cakes & Dairy, Beauty & Hygiene), we can draw the following inferences:
  1. Sale Price Variability:
    • Baby Care: The sale prices within the Baby Care category vary considerably across sub-categories. For example, 'Diapers & Wipes' has a mean sale price of 742.00, compared with 269.00 for 'Feeding & Nursing'. This indicates that products like diapers are generally more expensive.
    • Bakery, Cakes & Dairy: The sale prices in this category also vary, but the range is generally lower than in Baby Care. 'Cakes & Pastries' has a higher mean sale price (146.00) than 'Breads & Buns' (76.00).
    • Beauty & Hygiene: (Assuming similar trends based on the available data) This category is likely to show variability in sale prices based on the type of product, such as hair care products versus skincare products.
  2. Rating Consistency:
    • In the rows shown, the ratings are consistent: every sub-category with rating data has a mean and median rating of 4.00. This suggests that customers generally rate products in these categories favorably.
    • The standard deviation of the ratings is 0.00 in each of these rows, indicating little or no variation in the recorded ratings within each sub-category.
  3. Price and Rating Relationship:
    • There doesn't seem to be a direct correlation between the sale price and the rating of the products. High-priced items do not necessarily have higher ratings and vice versa. This suggests that factors other than price, such as product quality and brand reputation, might be influencing customer ratings.
  4. Sub-Category Insights:
    • Baby Care: Sub-categories like 'Diapers & Wipes' have higher sale prices, which could be due to the essential nature and higher cost of production of these items.
    • Bakery, Cakes & Dairy: Sub-categories like 'Cakes & Pastries' have higher sale prices, possibly due to the premium nature of these products compared to everyday items like 'Breads & Buns'.

Conclusion

The analysis provides valuable insights into the pricing and customer satisfaction trends across different sub-categories within the selected categories. Businesses can use this information to make informed decisions about pricing strategies, inventory management, and marketing efforts to enhance customer satisfaction and optimize sales.
Do you have any other questions or need further analysis?

multi_stocks.csv


(a) Carry out the regression using data for the year 2021 as the training data and find the β for both the stocks individually. What is the regression equation in each case? Show the fitted lines along with confidence bands.
(b) Do the values of the slopes indicate that the stocks have above-average risk? (For the purposes of this case,
assume that the risk is average if the slope is in the range 1 ± 0.1, below average if it is less than 0.9, and
above average if it is more than 1.1.)
(c) Give a 95% confidence interval for the slopes. Can we say the risks are above average with 95% confidence?
(d) Report the predicted returns for both the stocks using the 2022 data, along with visualizations of confidence
and prediction intervals of fit. Report the MSEs in both the cases.
(e) Construct residual plots. Do the residuals appear random?
(f) Construct normal probability plots. Do the residuals appear to be normally distributed?

Regression Analysis for FB and AMZN Returns

Regression Equations

  • FB Returns: $\text{FB\_returns} = -0.0542 + 1.2936 \times \text{SP500\_returns}$
  • AMZN Returns: $\text{AMZN\_returns} = -0.0954 + 1.0463 \times \text{SP500\_returns}$
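Equations of this form come from a least-squares fit of each stock's returns on the market returns. As a rough illustration, the closed-form estimates can be sketched in Python (the prompt asks for R; this is only a language-neutral sketch). The function name `fit_line` and the return values below are made up for illustration and are not taken from multi_stocks.csv.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up returns, for illustration only.
market = [-1.0, 0.0, 1.0, 2.0]
stock = [-1.5, 0.1, 1.0, 2.8]
intercept, slope = fit_line(market, stock)
print(f"stock_return = {intercept:.4f} + {slope:.4f} * market_return")
```

The slope is the β of the stock: the expected change in the stock's return per unit change in the market return.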

Interpretation of Slopes

  • FB Returns: The slope (β) is 1.2936, which is greater than 1.1, indicating that FB has an above-average risk.
  • AMZN Returns: The slope (β) is 1.0463, which falls within the range of 1 ± 0.1, indicating that AMZN has an average risk.
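The classification rule stated in the prompt (average if the slope is within 1 ± 0.1, below average under 0.9, above average over 1.1) can be written as a small function. This is a Python sketch with a hypothetical function name, applied to the two slopes reported above.

```python
def classify_risk(beta, tolerance=0.1):
    """Classify market risk from a regression slope using the 1 +/- tolerance rule."""
    if beta > 1 + tolerance:
        return "above average"
    if beta < 1 - tolerance:
        return "below average"
    return "average"

# Slopes reported in the regression equations above.
for name, beta in [("FB", 1.2936), ("AMZN", 1.0463)]:
    print(name, classify_risk(beta))
```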

Visualizations

The plots below show the fitted regression lines along with 95% confidence bands for FB and AMZN returns against SP500 returns:
  • FB Returns vs SP500 Returns [figure]
  • AMZN Returns vs SP500 Returns [figure]

95% Confidence Intervals for the Slopes

  • FB Returns: The 95% confidence interval for the slope is (1.0614, 1.5257).
  • AMZN Returns: The 95% confidence interval for the slope is (0.8553, 1.2373).

Interpretation

  • FB Returns: The lower bound of the interval (1.0614) is below 1.1, so the interval does not lie entirely above 1.1 and we cannot say with 95% confidence that FB has above-average risk.
  • AMZN Returns: The interval (0.8553, 1.2373) includes values below 1.1, and even below 0.9, so we cannot say with 95% confidence that AMZN has above-average risk.
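The reasoning above has two parts: compute an interval for the slope, then check whether the whole interval sits above 1.1. A minimal Python sketch follows (the prompt asks for R). It assumes a normal quantile as a large-sample stand-in for the t quantile with n − 2 degrees of freedom unless `critical_value` is supplied; both function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def slope_confidence_interval(x, y, level=0.95, critical_value=None):
    """Confidence interval for the least-squares slope of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = sqrt(sse / (n - 2) / sxx)
    if critical_value is None:
        # Assumption: normal quantile approximates the t(n - 2) quantile for large n.
        critical_value = NormalDist().inv_cdf(0.5 + level / 2)
    return slope - critical_value * std_error, slope + critical_value * std_error

def above_average_with_confidence(lower, threshold=1.1):
    """True only if the whole interval lies above the threshold."""
    return lower > threshold

# Lower bounds of the intervals reported above for FB and AMZN.
for name, lower in [("FB", 1.0614), ("AMZN", 0.8553)]:
    print(name, above_average_with_confidence(lower))
```

For small samples, passing the exact t quantile as `critical_value` gives a wider and more accurate interval than the normal approximation.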
Next, we will predict the returns for both stocks using the 2022 data and visualize the confidence and prediction intervals of the fit. We will also report the Mean Squared Errors (MSEs) for both cases.

Predicted Returns for 2022 and Mean Squared Errors (MSEs)

Mean Squared Errors (MSEs)

  • MSE for FB: 0.8645
  • MSE for AMZN: 0.7017

Visualizations

The plots below show the predicted returns for FB and AMZN along with confidence and prediction intervals for the year 2022:
  • FB Returns Prediction with Confidence and Prediction Intervals [figure]
  • AMZN Returns Prediction with Confidence and Prediction Intervals [figure]
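The two bands in these plots answer different questions: the confidence interval covers the mean response at a given market return, while the prediction interval covers a single new observation and is therefore wider. A hedged Python sketch of both, for simple linear regression, is shown here (the prompt asks for R). It uses a normal quantile as a large-sample approximation to the t quantile, and the data are made up.

```python
from math import sqrt
from statistics import NormalDist, mean

def intervals_at(x, y, x0, level=0.95):
    """Return the fitted value at x0 with its confidence and prediction intervals."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))  # residual standard error
    # Assumption: normal quantile as a large-sample stand-in for t(n - 2).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    fit = intercept + slope * x0
    half_mean = z * s * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    half_pred = z * s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return fit, (fit - half_mean, fit + half_mean), (fit - half_pred, fit + half_pred)

# Made-up returns, for illustration only.
market = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
stock = [-2.4, -0.9, 0.3, 1.1, 2.9, 3.6]
fit, conf_int, pred_int = intervals_at(market, stock, x0=1.5)
print(fit, conf_int, pred_int)
```

The extra `1 +` term under the square root is the variance of a single new error, which is what separates the prediction band from the confidence band.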

Residual Plots

The residual plots for FB and AMZN returns against SP500 returns are shown below:
  • Residual Plot for FB Returns [figure]
  • Residual Plot for AMZN Returns [figure]

Interpretation

  • The residuals for both FB and AMZN returns appear to be randomly scattered around the horizontal line at zero, indicating that the linear regression models are appropriate for the data.
Next, we will construct normal probability plots to check if the residuals appear to be normally distributed.

Normal Probability Plots

The normal probability plots for the residuals of FB and AMZN returns are shown below:
  • Normal Probability Plot for FB Returns Residuals [figure]
  • Normal Probability Plot for AMZN Returns Residuals [figure]

Interpretation

  • The residuals for both FB and AMZN returns appear to follow a straight line in the normal probability plots, indicating that the residuals are approximately normally distributed.
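A normal probability plot pairs each sorted residual with the standard-normal quantile expected at its rank; points near a straight line suggest approximate normality. The sketch below computes those coordinates in Python (the prompt asks for R). The plotting positions (i − 0.5)/n are one common convention among several, and the residuals are made up.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with its theoretical standard-normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5) / n; other conventions exist.
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

# Made-up residuals, for illustration only.
points = normal_plot_points([0.3, -1.2, 0.8, -0.1, 0.2])
for quantile, residual in points:
    print(f"{quantile:+.3f}  {residual:+.3f}")
```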

Conclusion

Based on the analysis, we can conclude the following:
  1. Regression Equations:
    • FB Returns: $\text{FB\_returns} = -0.0542 + 1.2936 \times \text{SP500\_returns}$
    • AMZN Returns: $\text{AMZN\_returns} = -0.0954 + 1.0463 \times \text{SP500\_returns}$
  2. Risk Assessment:
    • FB has an above-average risk based on the slope of the regression line.
    • AMZN has an average risk based on the slope of the regression line.
    • However, with 95% confidence, we cannot conclusively say that either stock has above-average risk.
  3. Predicted Returns for 2022:
    • The MSE for FB is 0.8645.
    • The MSE for AMZN is 0.7017.
  4. Residual Analysis:
    • The residuals for both FB and AMZN returns appear to be randomly scattered and approximately normally distributed.
Do you have any other questions or need further analysis?

from the AmesHousing package in R do the following:

(a) There are 81 predictor variables and one response variable Sale Price. You shall
be working with only one predictor variable: Gr Liv Area. You may find the details
of all these variables using ?ames_raw. Create a subset of the original dataset with
these two variables only. Name it new.house.dat.
(b) Split the new data into training and test sets using a 70−30 split. Fit a simple linear
regression model for the training data with Sale Price as the response variable and
the other variable as covariate.
(c) Comment on the model fit based on R². Perform a full residual analysis checking
potential violations of basic model assumptions including normality and homoscedasticity
assumption for the error term. Do you see any unusual observations in your
dataset, like potential outliers, leverage points or influential points? Comment based
on relevant measures and visualization tools.
(d) Report the training and test MSEs.

Model Fit and R-squared

The simple linear regression model summary is as follows:
  • Intercept: 14045.872
  • Slope (Gr Liv Area): 110.726
  • R-squared: 0.4879
The R-squared value of 0.4879 indicates that approximately 48.79% of the variability in the Sale Price can be explained by the Gr Liv Area.
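R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, which is why it reads as a share of explained variability. A small Python sketch of that formula follows (the prompt asks for R); the observed and fitted values are made up and are not from the Ames data.

```python
from statistics import mean

def r_squared(observed, fitted):
    """R-squared = 1 - SSE / SST for observed responses and fitted values."""
    y_bar = mean(observed)
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    sst = sum((y - y_bar) ** 2 for y in observed)
    return 1 - sse / sst

# Made-up values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]
r2 = r_squared(observed, fitted)
print(r2)
```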
Next, we will perform a full residual analysis to check for potential violations of basic model assumptions, including normality and homoscedasticity of the error term. We will also identify any unusual observations such as potential outliers, leverage points, or influential points.

Residual Analysis

Residuals vs Fitted Values

The plot of residuals vs fitted values shows that the residuals are randomly scattered around the horizontal line at zero, indicating that the linearity assumption is reasonable. However, there might be some heteroscedasticity as the spread of residuals seems to increase with fitted values.

Normal Probability Plot of Residuals

The normal probability plot shows that the residuals approximately follow a straight line, indicating that the residuals are approximately normally distributed.

Residuals vs Gr Liv Area

The plot of residuals vs Gr Liv Area shows that the residuals are randomly scattered around the horizontal line at zero, indicating that the linearity assumption is reasonable. However, there might be some heteroscedasticity as the spread of residuals seems to increase with Gr Liv Area.

Leverage vs Residuals

The plot of leverage vs residuals shows that most observations have low leverage values, indicating that there are no significant leverage points.

Cook's Distance

The plot of Cook's distance shows that most observations have low Cook's distance values, indicating that there are no significant influential points.
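Leverage and Cook's distance have closed forms in simple linear regression, so the values behind such plots can be checked numerically. The Python sketch below (the prompt asks for R) uses made-up data with one deliberately extreme x value. The cut-offs 2p/n for leverage and 4/n for Cook's distance are common rules of thumb, not thresholds stated in the source.

```python
from statistics import mean

def leverage_and_cooks(x, y):
    """Leverage h_i and Cook's distance D_i for simple linear regression."""
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    cooks = [e * e / (p * s2) * h / (1 - h) ** 2 for e, h in zip(residuals, leverage)]
    return leverage, cooks

# Made-up data with one extreme x value, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
y = [2.1, 3.9, 6.2, 8.1, 25.0]
leverage, cooks = leverage_and_cooks(x, y)
n, p = len(x), 2
high_leverage = [i for i, h in enumerate(leverage) if h > 2 * p / n]  # rule of thumb
influential = [i for i, d in enumerate(cooks) if d > 4 / n]           # rule of thumb
print(high_leverage, influential)
```

Listing the flagged indices is a useful complement to the plots, because a statement that most observations have low values does not by itself rule out a few unusual ones.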
Next, we will calculate the training and test Mean Squared Errors (MSEs).

Summary of Linear Regression Analysis on Ames Housing Data

(a) Subset Creation

  • Created a subset of the Ames housing dataset with only 'Gr Liv Area' and 'Sale Price'.

(b) Data Splitting and Model Fitting

  • Split the data into training (70%) and test (30%) sets.
  • Fitted a simple linear regression model with 'Sale Price' as the response variable and 'Gr Liv Area' as the predictor variable.

(c) Model Fit and Residual Analysis

  • R-squared: 0.4879, indicating that approximately 48.79% of the variability in 'Sale Price' can be explained by 'Gr Liv Area'.
  • Residual analysis showed that the residuals are approximately normally distributed and randomly scattered, indicating reasonable linearity. However, there might be some heteroscedasticity.
  • No significant leverage points or influential points were identified based on leverage and Cook's distance plots.

(d) Mean Squared Errors (MSEs)

  • Training MSE: 3294900784.34
  • Test MSE: 2956979064.48
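The 70-30 split and the two MSEs can be sketched as follows. This is a Python illustration (the prompt asks for R) on synthetic (area, price) pairs generated in the code, not on the Ames data; the helper names, the seed, and the coefficients used to generate the data are all assumptions.

```python
import random
from statistics import mean

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def fit_line(rows):
    """Least-squares intercept and slope for (x, y) pairs."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sxx
    return y_bar - slope * x_bar, slope

def mse(rows, intercept, slope):
    """Mean squared error of the fitted line on the given rows."""
    return mean((y - (intercept + slope * x)) ** 2 for x, y in rows)

# Synthetic (living area, sale price) pairs; not the Ames data.
rng = random.Random(0)
data = [(area, 15000 + 110 * area + rng.gauss(0, 20000)) for area in range(800, 2800, 20)]

train, test = train_test_split(data)
intercept, slope = fit_line(train)  # fit on the training rows only
train_mse = mse(train, intercept, slope)
test_mse = mse(test, intercept, slope)
print(train_mse, test_mse)
```

Because sale prices are in the hundreds of thousands, squared errors in the billions, as reported above, are on the expected scale; taking the square root gives an error in price units.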
Do you have any other questions or need further analysis?

world_data.csv


1. The dataset world_data.csv presents information on several health and income variables for 184 countries from 1960 to 2016. Details of the variables are provided in the Appendix.
(a) For each of the six pairs of countries below, which country had the highest child
mortality rates in 2015? Which pairs do you think are most similar?
• Sri Lanka or Turkey
• Poland or South Korea
• Malaysia or Russia
• Pakistan or Vietnam
• Thailand or South Africa
(b) There is a preconceived notion that the world is divided into two groups: the western
world (Western Europe and North America), characterized by long life spans
and small families, versus the developing world (Africa, Asia, and Latin America)
characterized by short life spans and large families.
• Draw a scatterplot colored by continent to see whether it was really like the
notion mentioned above in the year 1975. Report your views. Has the scenario
changed in 2015? Explain.
• Compare the fertility rates in India and Japan over the years.

Child Mortality Rates in 2015

For each pair of countries listed, the country with the higher child mortality rate in 2015 is as follows:
  • Sri Lanka (8.4) or Turkey (11.6): Turkey had the higher child mortality rate.
  • Poland (4.5) or South Korea (2.9): Poland had the higher child mortality rate.
  • Malaysia (6.0) or Russia (8.2): Russia had the higher child mortality rate.
  • Pakistan (65.8) or Vietnam (17.3): Pakistan had the higher child mortality rate.
  • Thailand (10.5) or South Africa (33.6): South Africa had the higher child mortality rate.

Similarity of Pairs

Based on the child mortality rates, the pairs that are most similar are:
  • Poland and South Korea: Both countries have relatively low child mortality rates, and the gap between them (1.6) is the smallest of the pairs listed.
  • Malaysia and Russia: Both countries have moderate child mortality rates, with the next-smallest gap (2.2).
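One simple way to make "most similar" concrete is the absolute gap between the two rates in each pair. The Python sketch below applies that measure to the 2015 rates quoted above; the choice of absolute difference as the similarity measure is an assumption, and other measures (such as ratios) could rank the pairs differently.

```python
# 2015 child mortality rates as quoted in the answer above.
pairs = [
    (("Sri Lanka", 8.4), ("Turkey", 11.6)),
    (("Poland", 4.5), ("South Korea", 2.9)),
    (("Malaysia", 6.0), ("Russia", 8.2)),
    (("Pakistan", 65.8), ("Vietnam", 17.3)),
    (("Thailand", 10.5), ("South Africa", 33.6)),
]

def compare_pair(pair):
    """Return the country with the higher rate and the absolute gap."""
    (name_a, rate_a), (name_b, rate_b) = pair
    higher = name_a if rate_a > rate_b else name_b
    return higher, round(abs(rate_a - rate_b), 1)

results = [compare_pair(pair) for pair in pairs]

# Rank pairs from most similar (smallest gap) to least similar.
ranked = sorted(zip(pairs, results), key=lambda item: item[1][1])
for ((name_a, _), (name_b, _)), (higher, gap) in ranked:
    print(f"{name_a} vs {name_b}: higher = {higher}, gap = {gap}")
```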
Next, we will draw scatterplots to analyze the preconceived notion about the western world and the developing world in the years 1975 and 2015.

Scatterplots of Fertility vs Life Expectancy

1975

The scatterplot for 1975 shows a clear distinction between the western world (Western Europe and North America) and the developing world (Africa, Asia, and Latin America). The western world is characterized by long life spans and small families, while the developing world is characterized by short life spans and large families.

2015

The scatterplot for 2015 shows a significant change in the scenario. The distinction between the western world and the developing world is not as clear as it was in 1975. Many countries in the developing world have improved their life expectancy and reduced their fertility rates, moving closer to the western world in terms of these metrics.
Next, we will compare the fertility rates in India and Japan over the years.

Fertility Rates in India and Japan Over the Years

The plot shows the fertility rates in India and Japan over the years:
  • India: The fertility rate in India has been steadily decreasing over the years. In the 1960s, the fertility rate was quite high, but it has significantly reduced over the decades.
  • Japan: The fertility rate in Japan has also been decreasing over the years. However, Japan started with a lower fertility rate compared to India and has reached a very low fertility rate in recent years.

Summary

  1. Child Mortality Rates in 2015:
    • Turkey had a higher child mortality rate than Sri Lanka.
    • Poland had a higher child mortality rate than South Korea.
    • Russia had a higher child mortality rate than Malaysia.
    • Pakistan had a higher child mortality rate than Vietnam.
    • South Africa had a higher child mortality rate than Thailand.
  2. Similarity of Pairs:
    • Poland and South Korea are most similar with relatively low child mortality rates.
    • Malaysia and Russia are most similar with moderate child mortality rates.
  3. Scatterplots of Fertility vs Life Expectancy:
    • In 1975, there was a clear distinction between the western world and the developing world.
    • In 2015, the distinction is not as clear, with many developing countries improving their life expectancy and reducing fertility rates.
  4. Fertility Rates in India and Japan:
    • Both countries have shown a decreasing trend in fertility rates over the years.
    • Japan started with a lower fertility rate and has reached a very low fertility rate in recent years.
Do you have any other questions or need further analysis?

(c) The GDP per person is often used as a rough summary of a country’s wealth. An interpretable measure is dollars per day, obtained by dividing the GDP per person
by 365. Using current US dollars as a unit, a person surviving on an income of less
than $2 a day is defined to be living in ‘absolute poverty’. Add this variable named
dollars per day to the data table. Note: The GDP values are adjusted for inflation
and represent current US dollars, so these values are meant to be comparable across
the years.
(d) Present appropriate visualizations of dollars per day for the years 1975 and 2015.
Do you see any improvement over these years?
(e) Draw a boxplot to compare the dollars per day across continents for the year 2010.
(f) Explore how income of a country affects infant survival. Take the year 2015 for your
analysis.

Visualizations of Dollars per Day

1975

The histogram for 1975 shows that the majority of countries had a low GDP per person, with many countries having less than $2 per day, indicating a high level of absolute poverty.

2015

The histogram for 2015 shows a significant improvement, with many countries having a higher GDP per person. The number of countries with less than $2 per day has decreased, indicating a reduction in absolute poverty.
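The dollars-per-day variable behind these histograms is GDP per person divided by 365, with values under $2 counted as absolute poverty. A minimal Python sketch follows (the prompt asks for R). The field names `gdp_per_person`, `dollars_per_day`, and `absolute_poverty` are hypothetical, and the two rows are made up rather than taken from world_data.csv.

```python
def add_dollars_per_day(rows, poverty_line=2.0):
    """Return copies of rows with dollars_per_day and an absolute-poverty flag."""
    enriched = []
    for row in rows:
        dollars_per_day = row["gdp_per_person"] / 365
        enriched.append({
            **row,
            "dollars_per_day": dollars_per_day,
            "absolute_poverty": dollars_per_day < poverty_line,
        })
    return enriched

# Made-up rows, for illustration only.
rows = [
    {"country": "A", "year": 1975, "gdp_per_person": 500.0},
    {"country": "B", "year": 2015, "gdp_per_person": 7300.0},
]
enriched = add_dollars_per_day(rows)
for row in enriched:
    print(row["country"], round(row["dollars_per_day"], 2), row["absolute_poverty"])
```

Counting the flagged rows per year gives a direct numerical check on the reduction in absolute poverty that the histograms suggest.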
Next, we will draw a boxplot to compare the dollars per day across continents for the year 2010.

Boxplot of Dollars per Day Across Continents (2010)

The boxplot shows the distribution of dollars per day across different continents for the year 2010. It is evident that there are significant differences in GDP per person across continents. North America and Europe have higher median values compared to Africa, Asia, and Latin America.
Next, we will explore how the income of a country affects infant survival, taking the year 2015 for our analysis.

2. The Pareto distribution, named after the Italian civil engineer, economist, and sociologist Vilfredo Pareto, is a power-law probability distribution that is used in description of
social, scientific, geophysical, actuarial, and many other types of observable phenomena.
It was originally applied to describing the distribution of wealth in a society, fitting the
trend that a large portion of wealth is held by a small fraction of the population. The
Pareto(a, b) distribution has cdf P(X ≤ x) given by
$F(x) = 1 - \left(\frac{b}{x}\right)^{a}, \quad x \ge b > 0,\ a > 0,$
and pdf given by:
$f(x) = \frac{a\, b^{a}}{x^{a+1}}, \quad x \ge b > 0,\ a > 0.$
(a) Write a function with x, a, b as the inputs and the cdf F(x) as the output. Using
this function, create a vector of F(x) values for a sequence of x values in the interval
[2, 8] with jumps of 0.01, corresponding to a = 2, b = 2.
(b) The inverse CDF transformation $F^{-1}$ is such that $F(x) = u$; here, it is given by:
$x = b(1 - u)^{-1/a}.$
A method to generate a random sample from the Pareto(a, b) dist

Pareto Distribution CDF Function

The CDF function for the Pareto distribution has been defined and used to calculate the CDF values for a sequence of x values in the interval [2, 8] with jumps of 0.01, corresponding to parameters $a = 2$ and $b = 2$.
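A function of this kind might look like the following Python sketch (the prompt asks for R, so this is only an illustration of the same steps). Returning 0 for x below b is an assumption that extends the CDF outside its stated support; the grid is built from integers to avoid floating-point drift in the 0.01 steps.

```python
def pareto_cdf(x, a, b):
    """CDF of the Pareto(a, b) distribution: F(x) = 1 - (b / x) ** a for x >= b."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    if x < b:
        return 0.0  # assumption: no probability mass below b
    return 1 - (b / x) ** a

# x = 2.00, 2.01, ..., 8.00
xs = [2 + i / 100 for i in range(601)]
cdf_values = [pareto_cdf(x, a=2, b=2) for x in xs]
print(cdf_values[0], cdf_values[-1])
```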
Next, we will use the inverse CDF transformation to generate a random sample from the Pareto(2, 2) distribution.

Then, we will have, X ∼ Pareto(a, b). Using these steps, generate a random sample of size 100 from the Pareto(2,2) distribution. Show the histogram of the random sample you generated.
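The inverse-CDF method draws u from a uniform distribution on the unit interval and transforms it with x = b(1 − u)^(−1/a). A minimal Python sketch of the sampling step is given here (the prompt asks for R); the function name and the seed are assumptions, and the histogram itself is left to whichever plotting tool is in use.

```python
import random

def pareto_sample(n, a, b, seed=None):
    """Draw n values from Pareto(a, b) by inverse-CDF transformation."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the power is defined.
    return [b * (1 - rng.random()) ** (-1 / a) for _ in range(n)]

sample = pareto_sample(100, a=2, b=2, seed=1)
print(len(sample), min(sample), max(sample))
```

Every generated value is at least b, and the long right tail of the sample is what gives the histogram its characteristic skew.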
