Data Visualization With Sweetviz: Part 2

In the last blog, we looked at data visualization for the Stroke dataset. I decided to write another post on Sweetviz to make the tool more straightforward for readers.

This time I do a similar walkthrough for the Happiness score dataset, which you can download from Kaggle.

When we unzip the file, we get five CSV files, one for each year from 2015 to 2019. For simplicity, I will only analyze the 2019 data.

import pyforest
import sweetviz
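To get the data into a DataFrame called df, something like the following should work. Treat it as a minimal sketch: the file name 2019.csv and the helper names load_year and describe_columns are my assumptions, so adjust them to match your unzipped folder.

```python
import pandas as pd


def load_year(csv_path):
    """Read one year's happiness CSV into a DataFrame."""
    return pd.read_csv(csv_path)


def describe_columns(df):
    """Return (column name, dtype) pairs so the features are easy to scan."""
    return [(name, str(dtype)) for name, dtype in df.dtypes.items()]


# Usage (the file name is an assumption, not taken from the dataset page):
# df = load_year("2019.csv")
# print(df.head())
# print(describe_columns(df))
```

Calling df.head() after loading prints the first five rows, which is a quick way to see which features the file contains.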

The output shows the different features. Overall rank and Score are what we would use for classification, so we will try to figure out which of the remaining factors have the most impact on the happiness score.

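# Full report for the 2019 data
report = sweetviz.analyze(df)
# Split the same data in two: rows where the condition is True get the first name
intra_report = sweetviz.compare_intra(df, df["Score"] >= 5, ["happy_countries", "sad_countries"])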

Treating “Score” as the classification feature, I have divided the dataset into “happy countries” and “sad countries”. Countries with a score greater than or equal to 5 are considered happy countries, while the others are labelled as sad countries.
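If you want to check the size of the two groups without opening the report, the same threshold can be applied directly in pandas. This is only a sketch: the helper names label_countries and group_sizes are mine, and the toy scores are made up for illustration, not taken from the dataset.

```python
import pandas as pd


def label_countries(df, threshold=5.0, score_col="Score"):
    """Return one 'happy_countries' / 'sad_countries' label per row."""
    is_happy = df[score_col] >= threshold  # same condition passed to compare_intra
    return is_happy.map({True: "happy_countries", False: "sad_countries"})


def group_sizes(df, threshold=5.0, score_col="Score"):
    """Count how many rows fall into each group."""
    return label_countries(df, threshold, score_col).value_counts().to_dict()


# Toy scores, invented purely for illustration (not real data)
toy = pd.DataFrame({"Score": [7.2, 5.0, 4.9, 3.1, 6.0]})
print(group_sizes(toy))
```

Note that a score of exactly 5 counts as happy, because the comparison is greater than or equal to.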

Thank god the number of happy countries is greater than the sad ones. Sigh!

The “Country or region” feature is completely redundant, because a country’s name does not affect its happiness in any way. We can ignore “Score” too, because that is our classification feature.

Interestingly, the top 3 features clearly show that happiness goes up with the value of the feature. How do we know that? Look at those blue sticks (the bars for the happy countries): don’t they get bigger as the value of the feature increases? Take “Social support” as an example: most of the blue sticks appear once social support passes 1.30. Generosity, however, seems to have little impact on the happiness score.

Which feature, according to you, impacts happiness the most?

For me it’s “Healthy life expectancy”: the longer the life expectancy, the happier the country.
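One rough way to cross-check a guess like this is to rank the numeric features by their correlation with Score. This is my own addition rather than something the Sweetviz report does for you in this form. The helper name rank_features_by_correlation is hypothetical, it uses plain Pearson correlation, and it assumes the rank column is called “Overall rank”.

```python
import pandas as pd


def rank_features_by_correlation(df, target="Score", ignore=("Overall rank",)):
    """Sort numeric features by Pearson correlation with the target column."""
    # Keep numeric columns only, and drop columns that just restate the target
    numeric = df.select_dtypes(include="number").drop(
        columns=list(ignore), errors="ignore"
    )
    correlations = numeric.corr()[target].drop(target)
    return correlations.sort_values(ascending=False)


# Usage, assuming df already holds the 2019 data:
# print(rank_features_by_correlation(df))
```

Correlation only captures linear relationships and says nothing about cause, so treat the ranking as a hint to compare against the plots, not as a verdict.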

I thought I should end here, but before that, let’s also compare two different datasets (two CSV files with the same features).

import sweetviz

We are comparing happiness data for the years 2018 and 2019.

my_report = sweetviz.compare([df, "2019"], [df2, "2018"])
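Since the comparison only makes sense when both files share the same features, it can help to check the column names first. Below is a small sketch under a few assumptions: df2 holds the 2018 data loaded the same way as df, the helper names same_features and compare_years are mine, and the report is written to Sweetviz’s default output file.

```python
def same_features(df_a, df_b):
    """Return True when both frames have the same column names (order ignored)."""
    return set(df_a.columns) == set(df_b.columns)


def compare_years(df_new, df_old, new_label="2019", old_label="2018"):
    """Build and open a Sweetviz comparison report for two years."""
    # Imported here so the column check above works without Sweetviz installed
    import sweetviz

    if not same_features(df_new, df_old):
        raise ValueError("The two files do not have the same column names")
    report = sweetviz.compare([df_new, new_label], [df_old, old_label])
    report.show_html()  # writes SWEETVIZ_REPORT.html by default and opens it
    return report


# Usage, assuming df (2019) and df2 (2018) are already loaded:
# my_report = compare_years(df, df2)
```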

I’m not going to paste the visualization screenshot here. Generate it yourself and figure out what changed between 2018 and 2019.
