In the last story, I ended without any visualization code. I wanted to write a complete visualization story for the same “vaccination adverse reactions” database. However, this morning I stumbled upon a new open-source Python library that I found riveting, so I decided to give it a separate story.
We need to install the Python library and then import it.
pip install sweetviz
import sweetviz  # run this in the Python script or notebook, after the install
While my work is mostly related to geopolitics and current affairs, I have to work on some random data to explain the importance of visualization. I found stroke prediction data on Kaggle and downloaded it.
import pandas as pd
df = pd.read_csv("stroke.csv")  # I renamed the downloaded CSV file to stroke.csv
Now that we have read the CSV file, we need to write a few lines for the visualization.
report = sweetviz.analyze(df)
report.show_html()  # writes the HTML report and opens it in the browser
Yes, that’s it. You will have a report with a data analysis and a visualization of each column.
This is what the report looks like: each column is analyzed and visualized. If you click on the “Associations” button, you get the following table.
The “Associations” button unlocks a very powerful analysis of associations and correlations.
Basically, in addition to showing the traditional numerical correlations, it unifies in a single graph both numerical correlation, the uncertainty coefficient (for categorical-categorical), and correlation ratio (for categorical-numerical).
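To make the two less familiar measures concrete, here is a minimal sketch of the textbook definitions of the correlation ratio and the uncertainty coefficient (Theil's U) in plain Python. It is not Sweetviz's internal code; the function names and the convention of returning 1.0 when the first variable has no uncertainty at all are my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical list, in [0, 1]."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    overall_mean = sum(values) / len(values)
    # Spread of the per-category means around the overall mean, weighted by group size.
    between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values())
    # Total spread of all values around the overall mean.
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a list of categorical labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def uncertainty_coefficient(x, y):
    """Theil's U(x|y): the fraction of the uncertainty in x that knowing y removes."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # assumed convention: x is constant, so nothing is left to explain
    # Conditional entropy H(x|y): entropy of x inside each y group, weighted by group size.
    by_y = defaultdict(list)
    for xi, yi in zip(x, y):
        by_y[yi].append(xi)
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for g in by_y.values())
    return (h_x - h_x_given_y) / h_x
```

Note that the uncertainty coefficient is asymmetric: how much y tells you about x is generally not the same as how much x tells you about y, which is why the association table is not mirrored across its diagonal for categorical pairs.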
From this correlation matrix, it is evident that age, BMI, and average glucose level are the features most strongly associated with stroke, with age being the most important one.
Let’s explore a little more. The feature I loved the most is that we can study the characteristics of different subpopulations within the dataset. For example, we could split the complete dataset into people with “age” above 50 and those below 50. In my example, I split on the “stroke” column, into people who had a stroke and those who did not.
report = sweetviz.compare_intra(df, df["stroke"] == 0, ["safe", "risky"])  # first name = rows where the condition is True
report.show_html()
The “stroke” column has divided the entire dataset into “safe” and “risky” groups. To see how significant that is, just look at the visualization for “age”: notice how many people above 50 are in the risky group. Isn’t that cool?
Look at “avg_glucose_level” and you can easily see how the risk increases as the glucose level goes up.
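The age-based split mentioned earlier can be sketched the same way. This is only a sketch under a few assumptions: the DataFrame has a numeric “age” column, and the helper name compare_by_age and the group labels are made up for illustration. The import sits inside the function so that the snippet can be defined even where sweetviz is not installed.

```python
def compare_by_age(df, threshold=50):
    """Hypothetical helper: compare people older than `threshold` with everyone else."""
    # Imported here so that defining the helper does not itself require sweetviz.
    import sweetviz

    # Boolean Series: True for rows where age is above the threshold.
    is_older = df["age"] > threshold

    # The first name labels the rows where the condition is True, the second the rest.
    names = [f"over {threshold}", f"{threshold} and under"]
    return sweetviz.compare_intra(df, is_older, names)


# Usage, assuming df was loaded from stroke.csv as shown earlier:
# report = compare_by_age(df)
# report.show_html()
```

Changing the threshold argument lets you try other cut-offs without touching the rest of the code.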
Data analysis and visualization are mighty tools in our arsenal. Knowing how to visualize data makes it much easier to understand and solve complex real-world problems. If you have read this to the end, you now know enough to call yourself a data analyst.