Creating visualizations using Seaborn for univariate and bivariate analysis – Exploring Data for Machine Learning

In this section, we are going to explore each variable separately. We are going to summarize the data for each feature and analyze the pattern present in it.

Univariate analysis is an analysis using individual features. We will also perform a bivariate analysis later in this section.

Univariate analysis

Now, let us do a univariate analysis for the age, education, work class, hours per week, and occupation features.

First, let’s get the counts of unique values for each column using the following code snippet:
df.nunique()

Figure 1.14 – Unique values for each column

As shown in the results, there are 73 unique values for age, 9 unique values for workclass, 16 unique values for education, 15 unique values for occupation, and so on.

Now, let us see the unique values count for age in the DataFrame:
df[“age”].value_counts()

The result is as follows:

Figure 1.15 – Value counts for age

We can see in the results that there are 898 observations (rows) with the age of 36. Similarly, there are 6 observations with the age of 83.

Histogram of age

Histograms are used to visualize the distribution of continuous data. Continuous data is data that can take on any value within a range (e.g., age, height, weight, temperature, etc.).

Let us plot a histogram using Seaborn to see the distribution of age in the dataset:
#univariate analysis
sns.histplot(data=df[‘age’],kde=True)

We get the following results:

Figure 1.16 – The histogram of age

As we can see in the age histogram, there are many people in the age range of 23 to 45 in the given observations in the dataset.

Bar plot of education

Now, let us check the distribution of education in the given dataset:
df[‘education’].value_counts()
Let us plot the bar chart for education.
colors = [“white”,”red”, “green”, “blue”, “orange”, “yellow”, “purple”]
df.education.value_counts().plot.bar(color=colors,legend=True)

Figure 1.17 – The bar chart of education

As we see, the HS.grad count is higher than that for the Bachelors degree holders. Similarly, the Masters degree holders count is lower than the Bachelors degree holders count.

Bar chart of workclass

Now, let’s see the distribution of workclass in the dataset:
df[‘workclass’].value_counts()

Let’s plot the bar chart to visualize the distribution of different values of workclass:

Figure 1.18 – Bar chart of workclass

As shown in the workclass bar chart, there are more private employees than other kinds.

No Responses

Leave a Reply

Your email address will not be published. Required fields are marked *



Terms of Use | About yeagerback | Privacy Policy | Cookies | Accessibility Help | Contact yeagerback