Prerequisites

This tutorial assumes you have a basic understanding of:

  1. Python programming
  2. Jupyter notebooks
  3. Pandas

Matplotlib

Matplotlib is a Python plotting library that is widely used to visualize data in machine learning.

We can use it to plot histograms, line graphs, pie charts, scatter plots, and more.

Let's dive right into using matplotlib.

Importing the libraries

We'll start by importing the libraries we need: pandas, numpy and matplotlib.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Simple Plotting

Let's start off with some basic examples.

Line graph

Let's say that we want to plot the curve of x-squared.

We can start off by creating a list of x values and a list of their squares.

In [2]:
x = list(range(-5, 6))
y = [j ** 2 for j in x]

We can then plot a graph of y against x.

In [3]:
plt.title('A graph of x-squared') # This gives a title to our graph
plt.xlabel('x-values') # This gives a label to the horizontal axis of our graph
plt.ylabel('y-values') # This gives a label to the vertical axis of our graph
plt.plot(x, y)
Out[3]:
[<matplotlib.lines.Line2D at 0x1f893200b48>]
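
The curve above is drawn from only 11 points, so it can look slightly angular. As a minimal sketch, assuming a smoother curve is wanted, numpy can generate more closely spaced x values. The names x_smooth and y_smooth and the choice of 200 points are illustrative assumptions, not part of the original example.

```python
import numpy as np
import matplotlib.pyplot as plt

# 200 evenly spaced x values from -5 to 5 (the count is an arbitrary choice)
x_smooth = np.linspace(-5, 5, 200)
y_smooth = x_smooth ** 2  # numpy squares every element of the array at once

plt.title('A graph of x-squared')
plt.xlabel('x-values')
plt.ylabel('y-values')
lines = plt.plot(x_smooth, y_smooth)  # plot() returns a list of Line2D objects
```

More points give a smoother line at the cost of a little extra computation, which is negligible at this size.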

Stack Plot

Let's say we want to gain some insight into how a student spends their day.

We can begin by collecting some data, say for a week. We record the hours the student spends sleeping, eating, working and playing.

We can create a dataframe from such data.

In [4]:
df = pd.DataFrame({
    'sleeping': [7,8,5,9,7,10,9],
    'eating': [3,3,4,3,2,1,1],
    'working': [10,8,7,10,9,2,2],
    'playing': [4,1,2,3,4,6,6]
}, index = [1,2,3,4,5,6,7]
)

# We can then give our index a name

df.index.name = 'days'
In [5]:
df
Out[5]:
      sleeping  eating  working  playing
days
1            7       3       10        4
2            8       3        8        1
3            5       4        7        2
4            9       3       10        3
5            7       2        9        4
6           10       1        2        6
7            9       1        2        6

Raw data like this is hard to interpret. We can, however, plot a stack plot to visualize it.

To do this, we can use the stackplot() function, passing the x values first, followed by each series of y values.

In [6]:
plt.figure(figsize=(10,5))   # This creates the figure on which the stackplot is drawn

plt.stackplot(
    df.index, df['sleeping'],
    df['eating'], df['working'], df['playing'],
    labels=['Sleeping', 'Eating', 'Working', 'Playing']    # The labels help us show the legend on the stack plot
)

plt.title('How the student spends each day')
plt.ylabel('Hours')
plt.xlabel('Days')
plt.legend(loc='upper right')  # This shows the legend on the top-right corner of the graph.
Out[6]:
<matplotlib.legend.Legend at 0x1f893335988>
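
Listing every column by hand becomes tedious as the number of activities grows. As a hedged sketch of one alternative, stackplot() also accepts a single 2D array with one row per series, so the whole DataFrame can be passed after transposing it. The variable name layers is an assumption used only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'sleeping': [7, 8, 5, 9, 7, 10, 9],
    'eating': [3, 3, 4, 3, 2, 1, 1],
    'working': [10, 8, 7, 10, 9, 2, 2],
    'playing': [4, 1, 2, 3, 4, 6, 6]
}, index=[1, 2, 3, 4, 5, 6, 7])
df.index.name = 'days'

plt.figure(figsize=(10, 5))

# Transposing gives one row per activity, the 2D layout that stackplot accepts
layers = plt.stackplot(df.index, df.to_numpy().T, labels=list(df.columns))

plt.title('How the student spends each day')
plt.ylabel('Hours')
plt.xlabel('Days')
plt.legend(loc='upper right')
```

Here the legend labels come straight from the column names, so they appear in lowercase unless they are renamed first.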

Pie Chart

We can go ahead and visualize how the student spends a single day using a pie chart.

Here each activity recorded on a given day becomes one slice of the pie, sized by the hours spent on it.

So we shall select data for a given day and use the pie() function.

In [7]:
plt.figure(figsize=(5,5))

plt.pie(
    x=df.loc[3],   # This sets the x values to the values of day 3

    labels=df.columns,  # This sets the column headers as the labels of the pie chart

    startangle=90,   # This sets the pie chart to start at angle 90. But it is optional.

    shadow=True,   # This creates a shadow around the slices of the pie chart. But it is optional.

    explode=(0.1,0,0.1,0),  # This pulls the first and third slices slightly away from the center. But it is optional.

    autopct='%1.1f%%'    # This adds the percentages in the slices.
)

plt.title('How the student spends a day')
Out[7]:
Text(0.5, 1.0, 'How the student spends a day')
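
The percentages that autopct prints are each slice's share of the day's total. As a small worked check, assuming the day 3 values from the table above, the same figures can be computed by hand. The names day3 and percentages are illustrative only.

```python
import pandas as pd

# Hours recorded for day 3 in the DataFrame above
day3 = pd.Series({'sleeping': 5, 'eating': 4, 'working': 7, 'playing': 2})

# Each slice is its value divided by the total, expressed as a percentage
percentages = (day3 / day3.sum() * 100).round(1)
print(percentages)
```

The recorded hours for day 3 sum to 18, so sleeping, for example, takes up 5/18 of the pie, or about 27.8%.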

Scatter Plot

A scatter plot is useful for showing how two variables relate to each other. For the scatter plot, let us import a dataset containing a company's salaries for employees at different position levels.

Importing the dataset to work on

Download the dataset we are going to use (Position_Salaries.csv) and place it in your working directory.

In [8]:
df = pd.read_csv(
    'Position_Salaries.csv'
)

Let's take a glance at how our data looks.

In [9]:
# This time, let us look at the first 3 rows

df.head(3)
Out[9]:
            Position  Level  Salary
0   Business Analyst      1   45000
1  Junior Consultant      2   50000
2  Senior Consultant      3   60000

We can get interesting insights by visualizing this dataset.

Let's find out how the salaries relate to the position levels.

This can be achieved by plotting the Salary against the Level.

We use the scatter(x, y) function and provide the relevant x and y values. Note that we call it through matplotlib.pyplot, which we imported as plt.

Plotting the scatter plot

In [10]:
plt.figure(figsize=(10, 5))

plt.scatter(df['Level'], df['Salary'])
Out[10]:
<matplotlib.collections.PathCollection at 0x1f893467348>

This is a very interesting insight. We can easily see that the salary increases as the position level increases in that company.
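
The trend seen in the scatter plot can also be expressed as a number. As a rough sketch, the pandas corr() method computes the Pearson correlation coefficient between two columns. The example below uses only the three rows displayed earlier, so the name sample and the resulting value are illustrative and say nothing definite about the full dataset.

```python
import pandas as pd

# Only the three rows shown by df.head(3); the full CSV may contain more rows
sample = pd.DataFrame({
    'Level': [1, 2, 3],
    'Salary': [45000, 50000, 60000]
})

# corr() uses the Pearson method by default; values near 1 mean a strong
# positive linear relationship
r = sample['Level'].corr(sample['Salary'])
print(round(r, 3))
```

With the full dataset loaded, the same call on df['Level'] and df['Salary'] would give the coefficient for all positions.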

Bar Graph

We can also plot bar graphs.

In [11]:
plt.figure(figsize=(10, 5))

# Let's create the first set of bars
x = [2,4,6,8,10]
y = [2,3,1,4,5]

# Let's create the second set of bars
x2 = [1,3,5,9,7]
y2 = [7,8,2,5,2]

plt.bar(x,y,label='Bars1')   # This plots the first bar graph
plt.bar(x2,y2,label='Bars2') # This plots the second bar graph

plt.xlabel('x-values')
plt.ylabel('y-values')
plt.title('Interesting bar graph\nCheck it out')
plt.legend()
Out[11]:
<matplotlib.legend.Legend at 0x1f8934c8348>
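
The two sets of bars above do not collide because their x values never coincide. If both sets shared the same x positions, the second set would be drawn on top of the first. One common way around this, sketched below under the assumption of hypothetical shared positions 1 to 5, is to narrow the bars and shift each set half a bar width to either side.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical case: both series are measured at the same five x positions
positions = np.arange(1, 6)
y = [2, 3, 1, 4, 5]
y2 = [7, 8, 2, 5, 2]
width = 0.4  # each bar's width; two bars together fill 0.8 of each unit slot

plt.figure(figsize=(10, 5))

# Shift the first set left and the second set right so they sit side by side
bars1 = plt.bar(positions - width / 2, y, width=width, label='Bars1')
bars2 = plt.bar(positions + width / 2, y2, width=width, label='Bars2')

plt.xticks(positions)  # keep one tick per group, centered between the two bars
plt.xlabel('x-values')
plt.ylabel('y-values')
plt.legend()
```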

Conclusion

Congratulations on completing the Matplotlib tutorial.

Keep in mind that visualizing data is key in machine learning, because a visualization can hint at which model is likely to suit the problem best.

Enjoy Machine Learning!