Prerequisites¶

This tutorial assumes you have basic knowledge and understanding of;

Python programming
Jupyter notebooks

Pandas¶

Pandas generally has two structures:

Series
DataFrames

Pandas Series¶

A series is an ordered sequence of elements. It looks a lot like a list but the two have a lot of differences. Series can have a name, can have strings as the index. Let's look at one example where we can use Pandas Series.

Importing the pandas library¶

In order to use Pandas, we will need to import it using the code below.

import pandas as pd

Example 1¶

Let's create a series of the Population of East Africa.

east_africa = pd.Series([10, 20.3, 50, 125, 60, 7.5])

We can view the series by just typing its name.

east_africa

0     10.0
1     20.3
2     50.0
3    125.0
4     60.0
5      7.5
dtype: float64

We can see that it is similar to a list in python.

And we can index it just the same way we index list.

For example:

# this returns the value at index 2

east_africa[2]

50.0

However unlike lists, we can;

name the series
give our index different names.

Changing the index can be important in that we can reference the values not by sequence, but by name.

east_africa.name = "Population of East Africa"

east_africa.index = [
    'Uganda',
    'Kenya',
    'Tanzania',
    'Rwanda',
    'S. Sudan',
    'Burundi'
]

east_africa

Uganda       10.0
Kenya        20.3
Tanzania     50.0
Rwanda      125.0
S. Sudan     60.0
Burundi       7.5
Name: Population of East Africa, dtype: float64

And we see that the series index changed to our countries. The name of the series can also be seen at the bottom.

Indexing Series¶

We can index series in a variety of ways;

Sequential Position
Label (index name)
Boolean indexing
Using loc and iloc

Sequential Position¶

east_africa[2]

50.0

Label (index name)¶

We could achieve a similar result by simply using

east_africa['Tanzania']

50.0

Boolean¶

There is a new way in which we can index our series by simply passing a logical expression as our index.

east_africa[east_africa > 25]

Tanzania     50.0
Rwanda      125.0
S. Sudan     60.0
Name: Population of East Africa, dtype: float64

This returns the same series but without all countries whose population exceeds 25

.loc and .iloc¶

.loc uses the label of each row.

.iloc uses the numerical position of the rows.

These can be recalled easily as:

loc - locate
iloc - index locate

# Returns the value of Uganda 

east_africa.loc['Uganda']

10.0

# Returns the value of the last country

east_africa.iloc[-1]

7.5

Slicing Series¶

We can also slice the pandas series just as we slice lists.

# Returns data for all countries starting with Kenya

east_africa['Kenya': ]

Kenya        20.3
Tanzania     50.0
Rwanda      125.0
S. Sudan     60.0
Burundi       7.5
Name: Population of East Africa, dtype: float64

However, unlike in lists, slicing in a pandas series includes the value of the last item.

# Includes the value for S. Sudan

east_africa['Kenya': 'S. Sudan']

Kenya        20.3
Tanzania     50.0
Rwanda      125.0
S. Sudan     60.0
Name: Population of East Africa, dtype: float64

Other functions that can be applied to a pandas series¶

describe() : Returns different attributes of the series
mean() : returns the mean of the series
std() : returns the standard deviation of the series
min() : returns the minimun value in the series

You can be certain that you will meet very many different functions and most probably the function you want to perform already exists. All you need to do is to make use of Google.com

east_africa.describe()

count      6.000000
mean      45.466667
std       44.474787
min        7.500000
25%       12.575000
50%       35.150000
75%       57.500000
max      125.000000
Name: Population of East Africa, dtype: float64

east_africa.mean()

45.46666666666667

east_africa.min()

7.5

Pandas Data frames¶

A DataFrame is the most important structure of pandas. It is a tabular structure integrated with with Series.

Let us create our first dataframe called df. And it will contain the data about our East African countries.

df = pd.DataFrame({
    'Population': [10, 20.3, 50, 125, 60, 7.5],
    'GDP': [210, 820.3, 750, 2125, 860, 87.5],
    'Surface Area': [140, 206.33, 502, 1257, 680, 67.5],
    'HDI': [140, 220.3, 570, 1225, 670, 72.5],
    'Continent': ['Africa', 'Europe', 'Asia', 'North America', 'S. America', 'Australia']
}, columns=['Population', 'GDP', 'Continent', 'Surface Area', 'HDI']
)

# columns indicate the order in which the columns should appear but it is optional to include it

df

But we realize that the best index of the dataframe would be the countries to which the data belongs.

We can easily set a new index by recalling our knowledge of pandas Series.

df.index = [
    'Uganda',
    'Kenya',
    'Tanzaia',
    'Rwanda',
    'S. Sudan',
    'Burundi'
]

# We can even give a name to our index column for easier future reference

df.index.name = 'Countries'

df

We can check out a summary of the dataframe using the describe() method.

df.describe()

Indexing¶

We can index a given column by passing in the column header.

df['Continent']

Countries
Uganda             Africa
Kenya              Europe
Tanzaia              Asia
Rwanda      North America
S. Sudan       S. America
Burundi         Australia
Name: Continent, dtype: object

We can index a given row using;

.loc and passing in the index name of the row
.iloc and passing in the sequential position of the given row.

df.loc['Rwanda']

Population                125
GDP                      2125
Continent       North America
Surface Area             1257
HDI                      1225
Name: Rwanda, dtype: object

df.iloc[2:]

df.loc['Kenya': 'S. Sudan']

df.loc['Kenya', ['GDP', 'Population']]

GDP           820.3
Population     20.3
Name: Kenya, dtype: object

info() can even help us inspect the data for missing values.

In our examples, all columns have no null values i.e. all values are non-null.

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Uganda to Burundi
Data columns (total 5 columns):
Population      6 non-null float64
GDP             6 non-null float64
Continent       6 non-null object
Surface Area    6 non-null float64
HDI             6 non-null float64
dtypes: float64(4), object(1)
memory usage: 448.0+ bytes

Broadcasting¶

Broadcasting enables us to apply vectors to the dataframe.

Taking for example our dataframe.

df

Let's create a series named crisis. It is going to have only one column and an index column.

crisis = pd.Series ({
    'GDP': 20,
    'Population': 10
})

crisis

GDP           20
Population    10
dtype: int64

We can use this series to affect give or all data in the dataframe.

For example we can add crisis to the respective columns with similar indices as the crisis series.

df[['GDP', 'Population']] + crisis

Dropping Stuff¶

We can drop any given columns that we don't want using the drop() function.

df.drop('Kenya')

Note¶

All the above operations on the dataframe and series don't affect the original series and dataframes.

For example, if we wanted to permanently drop a column, we need to add the attribtue inplace=True

df.drop(columns='Population', inplace=True)

df

df.drop('Kenya', inplace=True)

df

Reading external data¶

So far, we've looked at how pandas can be used on data we have generated. This will rarely be the case.

Most of the times we need to work on external data which may be in form of .csv, .xls, .sql.

That is where pandas comes in even really handy. It enables us to easily read data of different formats.

First we need to import the numpy library since it works hand in hand with the pandas library.

import numpy as np

To read an external file in .csv format, we use _.readcsv() function.

You can download the dataset used here. Download Dataset

df = pd.read_csv(
    'Social_Network_Ads.csv',
    index_col = 'User_ID'
)

df.head()

Inspecting for missing data¶

We can do this two way as we have seen before;

using isnull() function
using info() function

df.isnull().sum()

Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 15624510 to 15594041
Data columns (total 4 columns):
Gender             400 non-null object
Age                400 non-null int64
EstimatedSalary    400 non-null int64
Purchased          400 non-null int64
dtypes: int64(3), object(1)
memory usage: 15.6+ KB

We can see that there are no null values in our dataset since all columns have 400 non-null values.

Handling with missing data¶

You can download the dataset used here. Download Dataset

df = pd.read_csv(
    'Social_Network_Ads-Copy1.csv',
    index_col = 0
)

Checking our data, we can see that there are indeed missing values.

df

Inspecting our data even more for missing values¶

We can use our isnull() function.

Since isnull() returns a 1 for any null value, we can sum them up to get the total number of null values in each column of the dataframe.

# This returns the number of missing values in each column

df.isnull().sum()

Gender             0
Age                2
EstimatedSalary    1
Purchased          3
dtype: int64

Dropping null values¶

You can drop all null values by simply using the dropna() function.

This function is harsh in that it drops all rows with any NaN value.

df.dropna()

We can however decide whether to drop a row or not by specifying the number of NaN values that qualify a row to be dropped.

# This drops rows with three or more valid values

df.dropna(thresh=3)

df.dropna(thresh=3).shape

(5, 4)

We can see that our dataframe has less rows from 7 to 5.

Remember¶

The dropna() doesn't permanetly change the original dataframe. And we can check and see that our dataframe hasn't really changeed.

df.shape

(7, 4)

Replacing null values with the mean¶

This can be achieved by assigning values to all the null values using the following command.

df.fillna(df.mean())

Filling null values with nearby values¶

Null values can also be replaced by;

values before the NaN value and this is called forward fill
values after the NaN value and this is called backward fill.

Care should be taken when using the forward and backward fill. The values above or below should not be NaN values.

# Forward fill

df.fillna(method='ffill')

# Backward fill

df.fillna(method='bfill')

We can change the axis in which we fill our NaN values. But it can be seen that in this particular example it doesn't make logical sense. But you get the idea.

df.fillna(method='ffill', axis=1)

Conclusion¶

Congratulations for completing this introductory tutorial on the use of Pandas.

You can be confident to deal with data during your pursuit for machine learning.

However, there are plenty of operations that we have not explored. This was just the tip of the iceberg in Pandas but it is fundamentally enough for machine learning. More will be learnt as you move on.

	GDP	Population
Countries
Uganda	230.0	20.0
Kenya	840.3	30.3
Tanzaia	770.0	60.0
Rwanda	2145.0	135.0
S. Sudan	880.0	70.0
Burundi	107.5	17.5

	Gender	Age	EstimatedSalary	Purchased
User_ID
15624510	Male	19	19000	0
15810944	Male	35	20000	0
15668575	Female	26	43000	0
15603246	Female	27	57000	0
15804002	Male	19	76000	0

	Gender	Age	EstimatedSalary	Purchased
User_ID
15624510	Female	19.0	19000.0	0.0
15810944	Male	35.0	20000.0	0.0
15668575	Female	NaN	43000.0	NaN
15603246	Male	NaN	NaN	NaN
15804002	Male	19.0	76000.0	0.0
15728773	Male	27.0	58000.0	0.0
15598044	Female	27.0	84000.0	NaN

	Gender	Age	EstimatedSalary	Purchased
User_ID
15624510	Female	19.0	19000.0	0.0
15810944	Male	35.0	20000.0	0.0
15804002	Male	19.0	76000.0	0.0
15728773	Male	27.0	58000.0	0.0

	Gender	Age	EstimatedSalary	Purchased
User_ID
15624510	Female	19.0	19000.0	0.0
15810944	Male	35.0	20000.0	0.0
15804002	Male	19.0	76000.0	0.0
15728773	Male	27.0	58000.0	0.0
15598044	Female	27.0	84000.0	NaN

	Population	GDP	Continent	Surface Area	HDI
0	10.0	210.0	Africa	140.00	140.0
1	20.3	820.3	Europe	206.33	220.3
2	50.0	750.0	Asia	502.00	570.0
3	125.0	2125.0	North America	1257.00	1225.0
4	60.0	860.0	S. America	680.00	670.0
5	7.5	87.5	Australia	67.50	72.5

	Population	GDP	Surface Area	HDI
count	6.000000	6.000000	6.000000	6.000000
mean	45.466667	808.800000	475.471667	482.966667
std	44.474787	723.523144	448.272570	435.493406
min	7.500000	87.500000	67.500000	72.500000
25%	12.575000	345.000000	156.582500	160.075000
50%	35.150000	785.150000	354.165000	395.150000
75%	57.500000	850.075000	635.500000	645.000000
max	125.000000	2125.000000	1257.000000	1225.000000