This tutorial assumes you have basic knowledge and understanding of;
A series is an ordered sequence of elements. It looks a lot like a list, but the two differ in important ways: a series can have a name, and its index can hold labels such as strings instead of just positions. Let's look at an example where we can use a pandas Series.
In order to use Pandas, we will need to import it using the code below.
import pandas as pd
Let's create a series of the Population of East Africa.
east_africa = pd.Series([10, 20.3, 50, 125, 60, 7.5])
We can view the series by just typing its name.
east_africa
We can see that it is similar to a list in Python.
And we can index it the same way we index a list.
For example:
# this returns the value at index 2
east_africa[2]
However unlike lists, we can;
Changing the index can be important in that we can reference the values not by sequence, but by name.
east_africa.name = "Population of East Africa"
east_africa.index = [
'Uganda',
'Kenya',
'Tanzania',
'Rwanda',
'S. Sudan',
'Burundi'
]
east_africa
And we see that the series index changed to our countries. The name of the series can also be seen at the bottom.
We can index series in a variety of ways;
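# By position. With a label index, use .iloc; plain east_africa[2] is deprecated in recent pandas versions
east_africa.iloc[2]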
We could achieve a similar result by simply using
east_africa['Tanzania']
Another way to index a series is to pass a logical expression as the index.
east_africa[east_africa > 25]
This returns a series containing only the countries whose population exceeds 25.
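To see why this works, it helps to look at the boolean mask itself. The sketch below rebuilds the series from the values used above; the names `mask`, `large` and `mid_sized` are only illustrative.

```python
import pandas as pd

east_africa = pd.Series(
    [10, 20.3, 50, 125, 60, 7.5],
    index=['Uganda', 'Kenya', 'Tanzania', 'Rwanda', 'S. Sudan', 'Burundi'],
    name="Population of East Africa",
)

# The comparison produces a boolean series with the same index
mask = east_africa > 25
print(mask)

# Indexing with the mask keeps only the entries where the mask is True
large = east_africa[mask]
print(large)

# Conditions can be combined with & (and) or | (or); each one needs parentheses
mid_sized = east_africa[(east_africa > 25) & (east_africa < 100)]
print(mid_sized)
```

The mask is an ordinary series, so it can be stored, inspected, or reused on another series that shares the same index.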
.loc uses the label of each row.
.iloc uses the numerical position of the rows.
They can be used as follows:
# Returns the value of Uganda
east_africa.loc['Uganda']
# Returns the value of the last country
east_africa.iloc[-1]
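The difference between the two matters most when the labels are themselves integers. The following is a minimal sketch on a hypothetical series called `scores`, whose labels deliberately do not match the positions.

```python
import pandas as pd

# Hypothetical series: the labels are 10, 20, 30 but the positions are 0, 1, 2
scores = pd.Series([0.5, 0.7, 0.9], index=[10, 20, 30])

by_label = scores.loc[10]      # looks up the label 10
by_position = scores.iloc[0]   # looks up the first position
last = scores.iloc[-1]         # negative positions count from the end

print(by_label, by_position, last)
```

Here `scores.loc[0]` would raise a `KeyError`, because no entry carries the label 0.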
We can also slice the pandas series just as we slice lists.
# Returns data for all countries starting with Kenya
east_africa['Kenya': ]
However, unlike in lists, slicing a pandas series by label includes the last item. Slicing by position with .iloc still excludes it, just as in lists.
# Includes the value for S. Sudan
east_africa['Kenya': 'S. Sudan']
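Label-based and position-based slicing can be compared side by side. This sketch rebuilds the series with the values used above; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

east_africa = pd.Series(
    [10, 20.3, 50, 125, 60, 7.5],
    index=['Uganda', 'Kenya', 'Tanzania', 'Rwanda', 'S. Sudan', 'Burundi'],
)

# Label slice: the end label 'S. Sudan' is included
by_label = east_africa.loc['Kenya':'S. Sudan']

# Position slice: the end position 4 ('S. Sudan') is excluded, as in a list
by_position = east_africa.iloc[1:4]

print(by_label)
print(by_position)
```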
pandas provides a very large number of built-in functions, so the operation you want most probably already exists. A quick search of the documentation or Google will usually find it. A few examples:
east_africa.describe()
east_africa.mean()
east_africa.min()
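A few more built-in methods follow the same pattern. The sketch below reuses the values from above; the variable names are only illustrative.

```python
import pandas as pd

east_africa = pd.Series(
    [10, 20.3, 50, 125, 60, 7.5],
    index=['Uganda', 'Kenya', 'Tanzania', 'Rwanda', 'S. Sudan', 'Burundi'],
)

total = east_africa.sum()                         # sum of all values
largest = east_africa.idxmax()                    # label of the largest value
ranked = east_africa.sort_values(ascending=False) # largest first

print(total)
print(largest)
print(ranked)
```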
A DataFrame is the most important structure in pandas. It is a tabular structure built on top of Series: each column is a Series.
Let us create our first dataframe, called df. It will contain data about our East African countries.
df = pd.DataFrame({
'Population': [10, 20.3, 50, 125, 60, 7.5],
'GDP': [210, 820.3, 750, 2125, 860, 87.5],
'Surface Area': [140, 206.33, 502, 1257, 680, 67.5],
'HDI': [140, 220.3, 570, 1225, 670, 72.5],
'Continent': ['Africa', 'Europe', 'Asia', 'North America', 'S. America', 'Australia']
}, columns=['Population', 'GDP', 'Continent', 'Surface Area', 'HDI']
)
# columns sets the order in which the columns appear; including it is optional
df
But we realize that the best index of the dataframe would be the countries to which the data belongs.
We can easily set a new index by recalling our knowledge of pandas Series.
df.index = [
'Uganda',
'Kenya',
'Tanzania',
'Rwanda',
'S. Sudan',
'Burundi'
]
# We can even give a name to our index column for easier future reference
df.index.name = 'Countries'
df
We can check out a summary of the dataframe using the describe() method.
df.describe()
We can index a given column by passing in the column header.
df['Continent']
We can index a given row using;
df.loc['Rwanda']
df.iloc[2:]
df.loc['Kenya': 'S. Sudan']
df.loc['Kenya', ['GDP', 'Population']]
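What comes back depends on how the selection is written. The sketch below uses a small hypothetical dataframe named `small`, not the full one above, to show the type returned in each case.

```python
import pandas as pd

# Hypothetical three-row dataframe, used only for illustration
small = pd.DataFrame(
    {'Population': [10, 20.3, 50], 'GDP': [210, 820.3, 750]},
    index=['Uganda', 'Kenya', 'Tanzania'],
)

column = small['GDP']                          # one column: a Series indexed by country
row = small.loc['Kenya']                       # one row: a Series indexed by column name
cell = small.loc['Kenya', 'GDP']               # one row and one column: a single value
block = small.loc['Uganda':'Kenya', ['GDP']]   # a slice plus a list: a DataFrame

print(type(column).__name__, type(row).__name__, type(block).__name__)
print(cell)
```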
The info() method can also help us inspect the data for missing values.
In our example, no column has null values, i.e. all values are non-null.
df.info()
Broadcasting lets us apply a vector, such as a series, across a whole dataframe.
Take our dataframe, for example.
df
Let's create a series named crisis. It has a single column of values plus an index.
crisis = pd.Series({
'GDP': 20,
'Population': 10
})
crisis
We can use this series to change some or all of the data in the dataframe.
For example, we can add crisis to the columns whose names match the index of the crisis series.
df[['GDP', 'Population']] + crisis
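The addition is matched on labels, not on order. The following sketch, on a hypothetical two-row dataframe named `small`, shows both the matching case and what happens to a column that has no counterpart in the series.

```python
import pandas as pd

# Hypothetical two-row dataframe, used only for illustration
small = pd.DataFrame(
    {'Population': [10, 20.3], 'GDP': [210, 820.3], 'HDI': [140, 220.3]},
    index=['Uganda', 'Kenya'],
)
crisis = pd.Series({'GDP': 20, 'Population': 10})

# Each series value is added to every row of the column with the same label
adjusted = small[['GDP', 'Population']] + crisis
print(adjusted)

# A column with no matching label in the series ends up as NaN
with_gap = small + crisis
print(with_gap)
```

Selecting only the matching columns first, as in the example above, avoids the NaN column.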
We can drop any rows or columns that we don't want using the drop() function. Passing a label on its own drops the row with that label.
df.drop('Kenya')
None of the above operations affect the original series and dataframes; they return new objects.
For example, if we wanted to permanently drop a column, we would need to add the argument inplace=True.
df.drop(columns='Population', inplace=True)
df
df.drop('Kenya', inplace=True)
df
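Reassigning the result is a common alternative to inplace=True. The sketch below uses a hypothetical dataframe named `countries` to show that drop() leaves the original untouched until the result is assigned back.

```python
import pandas as pd

# Hypothetical three-row dataframe, used only for illustration
countries = pd.DataFrame(
    {'Population': [10, 20.3, 50], 'GDP': [210, 820.3, 750]},
    index=['Uganda', 'Kenya', 'Tanzania'],
)

without_kenya = countries.drop('Kenya')       # returns a new dataframe
rows_after_drop = list(countries.index)       # the original still has Kenya
no_gdp = countries.drop(columns='GDP')        # dropping a column works the same way

# Assigning the result back makes the change stick without inplace=True
countries = countries.drop('Kenya')

print(rows_after_drop)
print(list(countries.index))
```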
So far, we've looked at how pandas can be used on data we have generated. This will rarely be the case.
Most of the time we need to work on external data, which may come in formats such as .csv, .xls or .sql.
That is where pandas comes in really handy. It enables us to easily read data in different formats.
First we need to import the numpy library since it works hand in hand with the pandas library.
import numpy as np
To read an external file in .csv format, we use the pd.read_csv() function.
You can download the dataset used here. Download Dataset
df = pd.read_csv(
'Social_Network_Ads.csv',
index_col = 'User_ID'
)
df.head()
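The same call can be tried without downloading anything. This is a minimal sketch that reads CSV text from memory with io.StringIO; the column names other than User_ID and all the values are made up for illustration and are not taken from the dataset above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """User_ID,Age,Salary
101,25,30000
102,31,45000
103,47,52000
"""

# index_col accepts a column name or a column position
users = pd.read_csv(io.StringIO(csv_text), index_col='User_ID')

print(users.head(2))
```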
We can check for missing values in two ways, as we have seen before:
df.isnull().sum()
df.info()
We can see that there are no null values in our dataset since all columns have 400 non-null values.
You can download the dataset used here. Download Dataset
df = pd.read_csv(
'Social_Network_Ads-Copy1.csv',
index_col = 0
)
Checking our data, we can see that there are indeed missing values.
df
We can use our isnull() function.
Since isnull() returns True for any null value, and True counts as 1 when summed, we can add the values up to get the total number of null values in each column of the dataframe.
# This returns the number of missing values in each column
df.isnull().sum()
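The counting can be followed step by step on a small hypothetical dataframe. The name `people` and all its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries
people = pd.DataFrame({
    'Age': [25, np.nan, 47, np.nan],
    'Salary': [30000, 45000, np.nan, 52000],
    'City': ['Kampala', 'Nairobi', None, 'Kigali'],
})

flags = people.isnull()            # True wherever a value is missing
per_column = flags.sum()           # missing values in each column
per_row = flags.sum(axis=1)        # missing values in each row
total = int(flags.sum().sum())     # missing values in the whole dataframe

print(per_column)
print(per_row)
print(total)
```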
You can drop rows that contain null values using the dropna() function.
By default this function is harsh: it drops every row that has any NaN value.
df.dropna()
We can, however, control which rows are dropped by using thresh to specify the minimum number of non-NaN values a row must have in order to be kept.
# This keeps rows with three or more valid values and drops the rest
df.dropna(thresh=3)
df.dropna(thresh=3).shape
We can see that the result has fewer rows: 5 instead of 7.
The dropna() function doesn't permanently change the original dataframe. We can check and see that our dataframe hasn't really changed.
df.shape
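The effect of thresh is easier to see on a tiny example. In this sketch the dataframe `people` is hypothetical; thresh is the minimum number of non-NaN values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the rows have 3, 2, 0 and 2 valid values respectively
people = pd.DataFrame({
    'Age': [25, np.nan, np.nan, 33],
    'Salary': [30000, 45000, np.nan, 52000],
    'Purchased': [0, 1, np.nan, np.nan],
})

strict = people.dropna()            # keeps only rows with no NaN at all
lenient = people.dropna(thresh=2)   # keeps rows with at least 2 valid values

print(strict.shape, lenient.shape, people.shape)
```

Neither call changes `people`; both return new dataframes.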
Instead of dropping rows, we can fill the null values, for example with the mean of each numeric column, using the following command.
df.fillna(df.mean(numeric_only=True))
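Filling with the mean works column by column. The sketch below uses a hypothetical dataframe named `people` and passes numeric_only=True so that the text column is skipped when the means are computed.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one text column and two numeric columns
people = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Age': [20.0, np.nan, 40.0],
    'Salary': [30000.0, 50000.0, np.nan],
})

# One mean per numeric column; NaN values are ignored in the calculation
column_means = people.mean(numeric_only=True)

# Each NaN is replaced by the mean of its own column
filled = people.fillna(column_means)

print(column_means)
print(filled)
```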
Null values can also be replaced by;
Care should be taken when using the forward and backward fill. The values above or below should not be NaN values.
# Forward fill (fillna(method='ffill') is deprecated in recent pandas versions)
df.ffill()
# Backward fill
df.bfill()
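The caution about neighbouring values can be seen on a small hypothetical series named `ages`. It uses the ffill() and bfill() methods, which do the same job as the fill calls above.

```python
import numpy as np
import pandas as pd

# Hypothetical series that starts and ends with a missing value
ages = pd.Series([np.nan, 25.0, np.nan, 40.0, np.nan])

forward = ages.ffill()        # the first value has nothing before it, so it stays NaN
backward = ages.bfill()       # the last value has nothing after it, so it stays NaN
both = ages.ffill().bfill()   # chaining the two fills every gap in this series

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```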
We can also change the axis along which we fill our NaN values. In this particular example, filling across columns doesn't make logical sense, but you get the idea.
df.ffill(axis=1)
Congratulations on completing this introductory tutorial on the use of pandas.
You can now be confident dealing with data during your pursuit of machine learning.
However, there are plenty of operations that we have not explored. This was just the tip of the iceberg, but it is a sufficient foundation for machine learning. You will learn more as you move on.