Prerequisites

This tutorial assumes you have basic knowledge and understanding of;

  1. Python programming
  2. Jupyter notebooks

Pandas

Pandas generally has two structures:

  1. Series
  2. DataFrames

Pandas Series

A series is an ordered sequence of elements. It looks a lot like a list but the two have a lot of differences. Series can have a name, can have strings as the index. Let's look at one example where we can use Pandas Series.

Importing the pandas library

In order to use Pandas, we will need to import it using the code below.

In [1]:
import pandas as pd

Example 1

Let's create a series of the Population of East Africa.

In [2]:
east_africa = pd.Series([10, 20.3, 50, 125, 60, 7.5])

We can view the series by just typing its name.

In [3]:
east_africa
Out[3]:
0     10.0
1     20.3
2     50.0
3    125.0
4     60.0
5      7.5
dtype: float64

We can see that it is similar to a list in python.

And we can index it just the same way we index list.

For example:

In [4]:
# this returns the value at index 2

east_africa[2]
Out[4]:
50.0

However unlike lists, we can;

  1. name the series
  2. give our index different names.

Changing the index can be important in that we can reference the values not by sequence, but by name.

In [5]:
east_africa.name = "Population of East Africa"

east_africa.index = [
    'Uganda',
    'Kenya',
    'Tanzania',
    'Rwanda',
    'S. Sudan',
    'Burundi'
]
In [6]:
east_africa
Out[6]:
Uganda       10.0
Kenya        20.3
Tanzania     50.0
Rwanda      125.0
S. Sudan     60.0
Burundi       7.5
Name: Population of East Africa, dtype: float64

And we see that the series index changed to our countries. The name of the series can also be seen at the bottom.

Indexing Series

We can index series in a variety of ways;

  1. Sequential Position
  2. Label (index name)
  3. Boolean indexing
  4. Using loc and iloc

Sequential Position

In [7]:
east_africa[2]
Out[7]:
50.0

Label (index name)

We could achieve a similar result by simply using

In [8]:
east_africa['Tanzania']
Out[8]:
50.0

Boolean

There is a new way in which we can index our series by simply passing a logical expression as our index.

In [9]:
east_africa[east_africa > 25]
Out[9]:
Tanzania     50.0
Rwanda      125.0
S. Sudan     60.0
Name: Population of East Africa, dtype: float64

This returns the same series but without all countries whose population exceeds 25

.loc and .iloc

.loc uses the label of each row.

.iloc uses the numerical position of the rows.

These can be recalled easily as:

  1. loc - locate
  2. iloc - index locate
In [10]:
# Returns the value of Uganda 

east_africa.loc['Uganda']
Out[10]:
10.0
In [11]:
# Returns the value of the last country

east_africa.iloc[-1]
Out[11]:
7.5

Slicing Series

We can also slice the pandas series just as we slice lists.

In [12]:
# Returns data for all countries starting with Kenya

east_africa['Kenya': ]
Out[12]:
Kenya        20.3
Tanzania     50.0
Rwanda      125.0
S. Sudan     60.0
Burundi       7.5
Name: Population of East Africa, dtype: float64

However, unlike in lists, slicing in a pandas series includes the value of the last item.

In [13]:
# Includes the value for S. Sudan

east_africa['Kenya': 'S. Sudan']
Out[13]:
Kenya        20.3
Tanzania     50.0
Rwanda      125.0
S. Sudan     60.0
Name: Population of East Africa, dtype: float64

Other functions that can be applied to a pandas series

  1. describe() : Returns different attributes of the series
  2. mean() : returns the mean of the series
  3. std() : returns the standard deviation of the series
  4. min() : returns the minimun value in the series

You can be certain that you will meet very many different functions and most probably the function you want to perform already exists. All you need to do is to make use of Google.com

In [14]:
east_africa.describe()
Out[14]:
count      6.000000
mean      45.466667
std       44.474787
min        7.500000
25%       12.575000
50%       35.150000
75%       57.500000
max      125.000000
Name: Population of East Africa, dtype: float64
In [15]:
east_africa.mean()
Out[15]:
45.46666666666667
In [16]:
east_africa.min()
Out[16]:
7.5

Pandas Data frames

A DataFrame is the most important structure of pandas. It is a tabular structure integrated with with Series.

Let us create our first dataframe called df. And it will contain the data about our East African countries.

In [17]:
df = pd.DataFrame({
    'Population': [10, 20.3, 50, 125, 60, 7.5],
    'GDP': [210, 820.3, 750, 2125, 860, 87.5],
    'Surface Area': [140, 206.33, 502, 1257, 680, 67.5],
    'HDI': [140, 220.3, 570, 1225, 670, 72.5],
    'Continent': ['Africa', 'Europe', 'Asia', 'North America', 'S. America', 'Australia']
}, columns=['Population', 'GDP', 'Continent', 'Surface Area', 'HDI']
)

# columns indicate the order in which the columns should appear but it is optional to include it
In [18]:
df
Out[18]:
Population GDP Continent Surface Area HDI
0 10.0 210.0 Africa 140.00 140.0
1 20.3 820.3 Europe 206.33 220.3
2 50.0 750.0 Asia 502.00 570.0
3 125.0 2125.0 North America 1257.00 1225.0
4 60.0 860.0 S. America 680.00 670.0
5 7.5 87.5 Australia 67.50 72.5

But we realize that the best index of the dataframe would be the countries to which the data belongs.

We can easily set a new index by recalling our knowledge of pandas Series.

In [19]:
df.index = [
    'Uganda',
    'Kenya',
    'Tanzaia',
    'Rwanda',
    'S. Sudan',
    'Burundi'
]

# We can even give a name to our index column for easier future reference

df.index.name = 'Countries'
In [20]:
df
Out[20]:
Population GDP Continent Surface Area HDI
Countries
Uganda 10.0 210.0 Africa 140.00 140.0
Kenya 20.3 820.3 Europe 206.33 220.3
Tanzaia 50.0 750.0 Asia 502.00 570.0
Rwanda 125.0 2125.0 North America 1257.00 1225.0
S. Sudan 60.0 860.0 S. America 680.00 670.0
Burundi 7.5 87.5 Australia 67.50 72.5

We can check out a summary of the dataframe using the describe() method.

In [21]:
df.describe()
Out[21]:
Population GDP Surface Area HDI
count 6.000000 6.000000 6.000000 6.000000
mean 45.466667 808.800000 475.471667 482.966667
std 44.474787 723.523144 448.272570 435.493406
min 7.500000 87.500000 67.500000 72.500000
25% 12.575000 345.000000 156.582500 160.075000
50% 35.150000 785.150000 354.165000 395.150000
75% 57.500000 850.075000 635.500000 645.000000
max 125.000000 2125.000000 1257.000000 1225.000000

Indexing

We can index a given column by passing in the column header.

In [22]:
df['Continent']
Out[22]:
Countries
Uganda             Africa
Kenya              Europe
Tanzaia              Asia
Rwanda      North America
S. Sudan       S. America
Burundi         Australia
Name: Continent, dtype: object

We can index a given row using;

  1. .loc and passing in the index name of the row
  2. .iloc and passing in the sequential position of the given row.
In [23]:
df.loc['Rwanda']
Out[23]:
Population                125
GDP                      2125
Continent       North America
Surface Area             1257
HDI                      1225
Name: Rwanda, dtype: object
In [24]:
df.iloc[2:]
Out[24]:
Population GDP Continent Surface Area HDI
Countries
Tanzaia 50.0 750.0 Asia 502.0 570.0
Rwanda 125.0 2125.0 North America 1257.0 1225.0
S. Sudan 60.0 860.0 S. America 680.0 670.0
Burundi 7.5 87.5 Australia 67.5 72.5
In [25]:
df.loc['Kenya': 'S. Sudan']
Out[25]:
Population GDP Continent Surface Area HDI
Countries
Kenya 20.3 820.3 Europe 206.33 220.3
Tanzaia 50.0 750.0 Asia 502.00 570.0
Rwanda 125.0 2125.0 North America 1257.00 1225.0
S. Sudan 60.0 860.0 S. America 680.00 670.0
In [26]:
df.loc['Kenya', ['GDP', 'Population']]
Out[26]:
GDP           820.3
Population     20.3
Name: Kenya, dtype: object

info() can even help us inspect the data for missing values.

In our examples, all columns have no null values i.e. all values are non-null.

In [27]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Uganda to Burundi
Data columns (total 5 columns):
Population      6 non-null float64
GDP             6 non-null float64
Continent       6 non-null object
Surface Area    6 non-null float64
HDI             6 non-null float64
dtypes: float64(4), object(1)
memory usage: 448.0+ bytes

Broadcasting

Broadcasting enables us to apply vectors to the dataframe.

Taking for example our dataframe.

In [28]:
df
Out[28]:
Population GDP Continent Surface Area HDI
Countries
Uganda 10.0 210.0 Africa 140.00 140.0
Kenya 20.3 820.3 Europe 206.33 220.3
Tanzaia 50.0 750.0 Asia 502.00 570.0
Rwanda 125.0 2125.0 North America 1257.00 1225.0
S. Sudan 60.0 860.0 S. America 680.00 670.0
Burundi 7.5 87.5 Australia 67.50 72.5

Let's create a series named crisis. It is going to have only one column and an index column.

In [29]:
crisis = pd.Series ({
    'GDP': 20,
    'Population': 10
})
In [30]:
crisis
Out[30]:
GDP           20
Population    10
dtype: int64

We can use this series to affect give or all data in the dataframe.

For example we can add crisis to the respective columns with similar indices as the crisis series.

In [31]:
df[['GDP', 'Population']] + crisis
Out[31]:
GDP Population
Countries
Uganda 230.0 20.0
Kenya 840.3 30.3
Tanzaia 770.0 60.0
Rwanda 2145.0 135.0
S. Sudan 880.0 70.0
Burundi 107.5 17.5

Dropping Stuff

We can drop any given columns that we don't want using the drop() function.

In [32]:
df.drop('Kenya')
Out[32]:
Population GDP Continent Surface Area HDI
Countries
Uganda 10.0 210.0 Africa 140.0 140.0
Tanzaia 50.0 750.0 Asia 502.0 570.0
Rwanda 125.0 2125.0 North America 1257.0 1225.0
S. Sudan 60.0 860.0 S. America 680.0 670.0
Burundi 7.5 87.5 Australia 67.5 72.5

Note

All the above operations on the dataframe and series don't affect the original series and dataframes.

For example, if we wanted to permanently drop a column, we need to add the attribtue inplace=True

In [33]:
df.drop(columns='Population', inplace=True)
In [34]:
df
Out[34]:
GDP Continent Surface Area HDI
Countries
Uganda 210.0 Africa 140.00 140.0
Kenya 820.3 Europe 206.33 220.3
Tanzaia 750.0 Asia 502.00 570.0
Rwanda 2125.0 North America 1257.00 1225.0
S. Sudan 860.0 S. America 680.00 670.0
Burundi 87.5 Australia 67.50 72.5
In [35]:
df.drop('Kenya', inplace=True)
In [36]:
df
Out[36]:
GDP Continent Surface Area HDI
Countries
Uganda 210.0 Africa 140.0 140.0
Tanzaia 750.0 Asia 502.0 570.0
Rwanda 2125.0 North America 1257.0 1225.0
S. Sudan 860.0 S. America 680.0 670.0
Burundi 87.5 Australia 67.5 72.5

Reading external data

So far, we've looked at how pandas can be used on data we have generated. This will rarely be the case.

Most of the times we need to work on external data which may be in form of .csv, .xls, .sql.

That is where pandas comes in even really handy. It enables us to easily read data of different formats.

First we need to import the numpy library since it works hand in hand with the pandas library.

In [37]:
import numpy as np

To read an external file in .csv format, we use _.readcsv() function.

You can download the dataset used here. Download Dataset

In [38]:
df = pd.read_csv(
    'Social_Network_Ads.csv',
    index_col = 'User_ID'
)
In [39]:
df.head()
Out[39]:
Gender Age EstimatedSalary Purchased
User_ID
15624510 Male 19 19000 0
15810944 Male 35 20000 0
15668575 Female 26 43000 0
15603246 Female 27 57000 0
15804002 Male 19 76000 0

Inspecting for missing data

We can do this two way as we have seen before;

  1. using isnull() function
  2. using info() function
In [40]:
df.isnull().sum()
Out[40]:
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64
In [41]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 15624510 to 15594041
Data columns (total 4 columns):
Gender             400 non-null object
Age                400 non-null int64
EstimatedSalary    400 non-null int64
Purchased          400 non-null int64
dtypes: int64(3), object(1)
memory usage: 15.6+ KB

We can see that there are no null values in our dataset since all columns have 400 non-null values.

Handling with missing data

You can download the dataset used here. Download Dataset

In [42]:
df = pd.read_csv(
    'Social_Network_Ads-Copy1.csv',
    index_col = 0
)

Checking our data, we can see that there are indeed missing values.

In [43]:
df
Out[43]:
Gender Age EstimatedSalary Purchased
User_ID
15624510 Female 19.0 19000.0 0.0
15810944 Male 35.0 20000.0 0.0
15668575 Female NaN 43000.0 NaN
15603246 Male NaN NaN NaN
15804002 Male 19.0 76000.0 0.0
15728773 Male 27.0 58000.0 0.0
15598044 Female 27.0 84000.0 NaN

Inspecting our data even more for missing values

We can use our isnull() function.

Since isnull() returns a 1 for any null value, we can sum them up to get the total number of null values in each column of the dataframe.

In [44]:
# This returns the number of missing values in each column

df.isnull().sum()
Out[44]:
Gender             0
Age                2
EstimatedSalary    1
Purchased          3
dtype: int64

Dropping null values

You can drop all null values by simply using the dropna() function.

This function is harsh in that it drops all rows with any NaN value.

In [45]:
df.dropna()
Out[45]:
Gender Age EstimatedSalary Purchased
User_ID
15624510 Female 19.0 19000.0 0.0
15810944 Male 35.0 20000.0 0.0
15804002 Male 19.0 76000.0 0.0
15728773 Male 27.0 58000.0 0.0

We can however decide whether to drop a row or not by specifying the number of NaN values that qualify a row to be dropped.

In [46]:
# This drops rows with three or more valid values

df.dropna(thresh=3)
Out[46]:
Gender Age EstimatedSalary Purchased
User_ID
15624510 Female 19.0 19000.0 0.0
15810944 Male 35.0 20000.0 0.0
15804002 Male 19.0 76000.0 0.0
15728773 Male 27.0 58000.0 0.0
15598044 Female 27.0 84000.0 NaN
In [47]:
df.dropna(thresh=3).shape
Out[47]:
(5, 4)

We can see that our dataframe has less rows from 7 to 5.

Remember

The dropna() doesn't permanetly change the original dataframe. And we can check and see that our dataframe hasn't really changeed.

In [48]:
df.shape
Out[48]:
(7, 4)

Replacing null values with the mean

This can be achieved by assigning values to all the null values using the following command.

In [49]:
df.fillna(df.mean())
Out[49]:
Gender Age EstimatedSalary Purchased
User_ID
15624510 Female 19.0 19000.0 0.0
15810944 Male 35.0 20000.0 0.0
15668575 Female 25.4 43000.0 0.0
15603246 Male 25.4 50000.0 0.0
15804002 Male 19.0 76000.0 0.0
15728773 Male 27.0 58000.0 0.0
15598044 Female 27.0 84000.0 0.0

Filling null values with nearby values

Null values can also be replaced by;

  1. values before the NaN value and this is called forward fill
  2. values after the NaN value and this is called backward fill.

Care should be taken when using the forward and backward fill. The values above or below should not be NaN values.

In [50]:
# Forward fill

df.fillna(method='ffill')
Out[50]:
Gender Age EstimatedSalary Purchased
User_ID
15624510 Female 19.0 19000.0 0.0
15810944 Male 35.0 20000.0 0.0
15668575 Female 35.0 43000.0 0.0
15603246 Male 35.0 43000.0 0.0
15804002 Male 19.0 76000.0 0.0
15728773 Male 27.0 58000.0 0.0
15598044 Female 27.0 84000.0 0.0
In [51]:
# Backward fill

df.fillna(method='bfill')
Out[51]:
Gender Age EstimatedSalary Purchased
User_ID
15624510 Female 19.0 19000.0 0.0
15810944 Male 35.0 20000.0 0.0
15668575 Female 19.0 43000.0 0.0
15603246 Male 19.0 76000.0 0.0
15804002 Male 19.0 76000.0 0.0
15728773 Male 27.0 58000.0 0.0
15598044 Female 27.0 84000.0 NaN

We can change the axis in which we fill our NaN values. But it can be seen that in this particular example it doesn't make logical sense. But you get the idea.

In [52]:
df.fillna(method='ffill', axis=1)
Out[52]:
Gender Age EstimatedSalary Purchased
User_ID
15624510 Female 19 19000 0
15810944 Male 35 20000 0
15668575 Female Female 43000 43000
15603246 Male Male Male Male
15804002 Male 19 76000 0
15728773 Male 27 58000 0
15598044 Female 27 84000 84000

Conclusion

Congratulations for completing this introductory tutorial on the use of Pandas.

You can be confident to deal with data during your pursuit for machine learning.

However, there are plenty of operations that we have not explored. This was just the tip of the iceberg in Pandas but it is fundamentally enough for machine learning. More will be learnt as you move on.

Enjoy Machine Learning!