Table of Contents
Assemble Data sets together for analysis using Pandas
Understand the needs of concatenating data sets and performing the operations on them
Understand the needs of merging data sets and performing the operations on them
In this blog, we will learn ways to combine different sets of data into a single dataset and analyse it correctly. Why do we need to do that, you may wonder?
In many business scenarios or working environments, the data that you need for analysis does not necessarily come to you in one complete package or repository. More often than not, you will need to collect or sample data from several places and combine it.
How do we ensure that the analysis performed on the combined sources will be correct?
As the saying goes - garbage in, garbage out - one of the first steps in data analysis work is to ensure that the source data is relevant and correct. To that end, the input data usually goes through a process of tidying or “cleaning”. For example, if a column is to be calculated on (e.g. summed or multiplied), it has to be of a numeric data type - not a textual one.
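To make this concrete, here is a minimal sketch (with invented figures) showing how a textual column can be converted to a numeric type with pandas' to_numeric function before any calculation:

```python
import pandas as pd

# A hypothetical column of sales figures that arrived as text
sales = pd.Series(['100', '250', 'n/a', '75'])

# errors='coerce' turns unparseable entries into NaN instead of raising
clean = pd.to_numeric(sales, errors='coerce')

# NaN entries are skipped by sum() by default
print(clean.sum())
```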
So let’s explore the concepts of combining, cleaning and transforming now.
Assembling Data
Combining DataFrames
It is common to find information contained in various tables or dataframes. This is because practitioners organise and group information into a manageable, logical and relevant manner for processing efficiency and data integrity reasons.
Tidying data is a framework for structuring data sets so that they can be easily analysed - it is the goal to aim for when cleaning data. Tidy data meets the following criteria:
Each row is an observation
Each column is a variable
Each type of observational unit forms a table
For example, we may have one table containing information about different companies, and another table containing stock prices of companies. If we want to look at the stock prices within the tech industry, we first find all the tech companies from the first table containing company information, and then combine that data with the stock price data to get the data to answer our question. The data could have been split up into separate tables to reduce redundant information (there is no need to store company information with each stock price entry), but this arrangement means that we need to combine relevant data to answer questions.
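To make the criteria concrete, here is a small sketch (with invented companies and prices) that reshapes a "wide" table - one column per year - into tidy form, where each row is a single company-year observation:

```python
import pandas as pd

# Wide format: the year is spread across the column headers
wide = pd.DataFrame({'company': ['Acme', 'Globex'],
                     '2017': [10.0, 20.0],
                     '2018': [12.0, 25.0]})

# melt gathers the year columns into (variable, value) pairs,
# giving one row per company-year observation
tidy = wide.melt(id_vars='company', var_name='year', value_name='price')
print(tidy)
```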
Concatenation
An easy way to combine data is using concatenation. This can be thought of as stacking dataframes either vertically or horizontally. We use the concat function for this purpose.
Stacking DataFrames
Let’s create some example data.
For this exercise, we will be introducing numpy - a Python library for processing large, multi-dimensional arrays. It also comes with a large collection of mathematical functions that operate on the data elements in those arrays.
Here, we will import the numpy library so that we can use the np.zeros method to create arrays in this example.
import numpy as np
import pandas as pd
# We use the np.zeros function to create a 4 x 4 array
# Each datapoint is zero
data = np.zeros((4,4))
# Create a first array set, each value is 1
data1 = data + 1
# Create second array set, each value is 2
data2 = data + 2
# Create third array set, each value is 3
data3 = data + 3
# Set the first array as the data for the first dataframe
df1 = pd.DataFrame(data=data1)
# Set the second array as the data for the second dataframe
df2 = pd.DataFrame(data=data2)
# Set the third array as the data for the third dataframe
df3 = pd.DataFrame(data=data3)
We can print out the dataframe to see the sample data set:
print(df1)
out:
0 1 2 3
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
print(df2)
out:
0 1 2 3
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0
print(df3)
out:
0 1 2 3
0 3.0 3.0 3.0 3.0
1 3.0 3.0 3.0 3.0
2 3.0 3.0 3.0 3.0
3 3.0 3.0 3.0 3.0
To stack the dataframes on top of each other, we use the concat function, and pass in the dataframes to be concatenated in a list:
vertical_stacked = pd.concat([df1,df2,df3])
print(vertical_stacked)
out:
0 1 2 3
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0
0 3.0 3.0 3.0 3.0
1 3.0 3.0 3.0 3.0
2 3.0 3.0 3.0 3.0
3 3.0 3.0 3.0 3.0
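If the repeated row labels above are not wanted, concat accepts an ignore_index=True parameter that renumbers the rows from 0. A minimal sketch, rebuilding the same three dataframes:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((4, 4)))
df2 = pd.DataFrame(np.full((4, 4), 2.0))
df3 = pd.DataFrame(np.full((4, 4), 3.0))

# ignore_index=True discards the original row labels
# and renumbers the combined rows 0..11
stacked = pd.concat([df1, df2, df3], ignore_index=True)
print(stacked.index.tolist())
```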
Stacking the dataframes horizontally uses the same concat function, with the additional parameter axis=1 (note that the default is axis=0).
horizontal_stacked = pd.concat([df1,df2,df3], axis=1)
print(horizontal_stacked)
out:
0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
1 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
2 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
3 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
If the dataframes to be stacked have different dimensions (i.e. the numbers of rows or columns are not the same), the missing parts will be filled with NaN values. This is the pandas way of representing a missing value.
For example, let’s concatenate using a subset of df2, taken with the iloc method: df2.iloc[:2,:] (select the first two rows):
horizontal_stacked = pd.concat([df1, df2.iloc[:2,:], df3],
                               axis=1)
print(horizontal_stacked)
out:
0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
1 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
2 1.0 1.0 1.0 1.0 NaN NaN NaN NaN 3.0 3.0 3.0 3.0
3 1.0 1.0 1.0 1.0 NaN NaN NaN NaN 3.0 3.0 3.0 3.0
We see that the combined dataframe has missing values in the 3rd and 4th rows.
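Once missing values appear, they can be dropped or filled before further analysis. A minimal sketch, rebuilding a similar mismatched concatenation:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((4, 4)))
df2 = pd.DataFrame(np.full((4, 4), 2.0))

# Concatenating a 4-row frame with a 2-row subset leaves NaN gaps
combined = pd.concat([df1, df2.iloc[:2, :]], axis=1)

# dropna() keeps only the rows with no missing values;
# fillna(0) instead replaces every NaN with a chosen value
complete_rows = combined.dropna()
filled = combined.fillna(0)

print(complete_rows.shape)
print(filled.isna().sum().sum())
```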
Concatenation with Different Indices
What happens when we combine dataframes which have different column names? Let’s consider the following example:
We modify the earlier dataframes so that each has a different set of column names:
df1.columns = ['A','B','C','D']
print(df1)
df2.columns = ['C','F','G','H']
print(df2)
df3.columns = ['A','C','F','H']
print(df3)
out:
A B C D
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
C F G H
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0
A C F H
0 3.0 3.0 3.0 3.0
1 3.0 3.0 3.0 3.0
2 3.0 3.0 3.0 3.0
3 3.0 3.0 3.0 3.0
If we simply stack the dataframes together, you can see that the columns are automatically aligned:
stacked = pd.concat([df1,df2,df3])
print(stacked)
out:
A B C D F G H
0 1.0 1.0 1.0 1.0 NaN NaN NaN
1 1.0 1.0 1.0 1.0 NaN NaN NaN
2 1.0 1.0 1.0 1.0 NaN NaN NaN
3 1.0 1.0 1.0 1.0 NaN NaN NaN
0 NaN NaN 2.0 NaN 2.0 2.0 2.0
1 NaN NaN 2.0 NaN 2.0 2.0 2.0
2 NaN NaN 2.0 NaN 2.0 2.0 2.0
3 NaN NaN 2.0 NaN 2.0 2.0 2.0
0 3.0 NaN 3.0 NaN 3.0 NaN 3.0
1 3.0 NaN 3.0 NaN 3.0 NaN 3.0
2 3.0 NaN 3.0 NaN 3.0 NaN 3.0
3 3.0 NaN 3.0 NaN 3.0 NaN 3.0
Any missing values are represented by NaN in the combined dataframe.
We can keep only the columns that the concatenated dataframes have in common by setting the parameter join='inner'. In this example, only the shared column C should remain:
print(pd.concat([df1,df2,df3], join='inner', ignore_index=True))
out:
C
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
7 2.0
8 3.0
9 3.0
10 3.0
11 3.0
In the example above, we also passed in the parameter ignore_index=True so that the row index is reset to 0..n-1. Do this when you want the index column (the unnamed first column) of the combined dataframe to be free of duplicate values.
Merging DataFrames
Merging Multiple Data Sets
The previous example alluded to some database concepts. The join='inner' parameter joins the tables and keeps only the columns they have in common, while the default, join='outer', keeps all columns.
To see an example of how a database like SQL performs joins, you can refer to:
https://www.w3schools.com/sql/sql_join_inner.asp
Instead of concatenating dataframes by referencing their row or column index, sometimes we want to combine based on common data values. This task is known in the database world as performing a ‘join’.
Pandas has a join method on DataFrame, although the same task can be performed more explicitly and flexibly using the pd.merge function.
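As an aside, join matches against the other dataframe's index rather than a named column, which is why merge is often the more direct tool. A hedged sketch with two invented frames, showing that join plus set_index can reproduce a simple merge:

```python
import pandas as pd

left = pd.DataFrame({'name': ['DR-1', 'DR-2'],
                     'location': ['location 1', 'location 2']})
right = pd.DataFrame({'site': ['DR-1', 'DR-2'],
                      'date': ['2018-02-20', '2018-02-25']})

# join matches left['name'] against right's index,
# so the key column must be moved into the index first
joined = left.join(right.set_index('site'), on='name')

# merge expresses the same match more explicitly
merged = left.merge(right, left_on='name', right_on='site')
print(joined)
```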
Let’s see how the pd.merge function works. We will create some sample datasets for a survey.
First, we create the data for surveyors.
import pandas as pd
persons = [['dyer','William','Dyer'],
['pb','Frank','Pabodie'],
['lake','Anderson','Lake']]
person = pd.DataFrame(data=persons,
columns=['id','personal','family'])
We create the data for survey sites.
sites = [['DR-1','location 1'],
['DR-2','location 2']]
site = pd.DataFrame(data=sites,
columns=['name','location'])
Next, the data for survey conducted at various sites:
visits = [['1','DR-1','2018-02-20'],
['2','DR-1','2018-02-22'],
['3','DR-2','2018-02-25'],
['4','DR-3','2018-03-01'],
['5','DR-3','2018-03-05'],
['6','DR-3','2018-04-15']
]
visit = pd.DataFrame(data=visits,
columns=['id','site','date'])
Finally, the readings collected in each survey. Notice that a survey could be conducted by one or more persons, and more than one type of reading is collected in each survey:
surveys = [['1','lake','rad', '9.7'],
['1','lake','sal', '0.21'],
['2','dyer','rad', '9.82'],
['2','dyer','sal', '0.13'],
['3','dyer','rad', '7.7'],
['3','dyer','sal', '0.09'],
['4','pb','rad', '6.9'],
['4','dyer','temp', '-16.00'],
['5','pb','rad', '8.41'],
['5','lake','sal', '0.08'],
['6','pb','rad', '10.7'],
['6','pb','temp', '-18.95'],
]
survey = pd.DataFrame(data=surveys,
columns=['taken','person','quant','reading'])
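One cleaning note on this sample data: the reading values above were entered as strings, so they would need converting (for example with pd.to_numeric) before computing any statistics. A small sketch with a few of the rows:

```python
import pandas as pd

# A few rows of the survey data, with readings stored as text
survey = pd.DataFrame({'taken': ['1', '1', '2'],
                       'quant': ['rad', 'sal', 'rad'],
                       'reading': ['9.7', '0.21', '9.82']})

# Convert the textual readings to floats before doing arithmetic
survey['reading'] = pd.to_numeric(survey['reading'])
print(survey['reading'].mean())
```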
Our data is split into multiple parts, where each part is an observational unit. Say we want to look at survey dates together with the location of each survey site; then we need to merge (combine) multiple dataframes using the pd.merge function.
When we call the method, the dataframe it is called on is referred to as the one on the “left”. Within the merge function, the first parameter is the “right” dataframe, while the next parameter, how, specifies what the final merged result looks like: the values 'left', 'right', 'inner' (keep only matching rows) and 'outer' (keep all rows) correspond to SQL’s LEFT OUTER JOIN, RIGHT OUTER JOIN, INNER JOIN and FULL OUTER JOIN respectively. Next, we use the on parameter to state which columns to match on. If the left and right columns do not have the same name, we use the left_on and right_on parameters instead.
One-to-one Merge
The simplest type of merge joins two dataframes where the columns to be matched do not contain duplicate values.
We take a subset of the visit dataframe so that there are no duplicated site values.
visit_subset = visit.iloc[[0,2]]
We can perform the one-to-one merge as follows. The column to be matched for the left dataframe, site, is name, while the column for the right dataframe, visit_subset, to be matched is site:
print(site.merge(visit_subset, left_on='name', right_on='site'))
out:
   name    location id  site        date
0  DR-1  location 1  1  DR-1  2018-02-20
1  DR-2  location 2  3  DR-2  2018-02-25
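As a safeguard, newer pandas versions let merge assert this one-to-one assumption via the validate parameter, which raises a MergeError if either key column contains duplicates. A sketch with small stand-ins for the two frames:

```python
import pandas as pd

site = pd.DataFrame({'name': ['DR-1', 'DR-2'],
                     'location': ['location 1', 'location 2']})
visit_subset = pd.DataFrame({'id': ['1', '3'],
                             'site': ['DR-1', 'DR-2'],
                             'date': ['2018-02-20', '2018-02-25']})

# validate='one_to_one' raises a MergeError if either key column
# contains duplicates, documenting the merge's assumption in code
result = site.merge(visit_subset, left_on='name', right_on='site',
                    validate='one_to_one')
print(result.shape)
```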
Many-to-One Merge
If we do the same merge, but this time with the full visit dataframe instead of the subset, we are performing a many-to-one merge. In this kind of merge, one of the dataframes has key values that repeat; the rows of the dataframe containing the single observations will be duplicated to match.
merged_df = site.merge(visit, left_on='name', right_on='site')
print(merged_df)
Out:
name location id site date
0 DR-1 location 1 1 DR-1 2018-02-20
1 DR-1 location 1 2 DR-1 2018-02-22
2 DR-2 location 2 3 DR-2 2018-02-25
You will notice that the “right” dataframe, visit, contains more values in its site column than the “left” dataframe, site, has in its name column. If we specify the parameter how='right', we can see that information missing from the left dataframe shows up as NaN:
print(site.merge(visit, how='right', left_on='name', right_on='site'))
out:
name location id site date
0 DR-1 location 1 1 DR-1 2018-02-20
1 DR-1 location 1 2 DR-1 2018-02-22
2 DR-2 location 2 3 DR-2 2018-02-25
3 NaN NaN 4 DR-3 2018-03-01
4 NaN NaN 5 DR-3 2018-03-05
5 NaN NaN 6 DR-3 2018-04-15
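To audit which rows failed to match, merge also accepts indicator=True, which adds a _merge column labelling each row 'both', 'left_only' or 'right_only'. A sketch with trimmed-down versions of the site and visit frames:

```python
import pandas as pd

site = pd.DataFrame({'name': ['DR-1', 'DR-2'],
                     'location': ['location 1', 'location 2']})
visit = pd.DataFrame({'id': ['1', '4'],
                      'site': ['DR-1', 'DR-3'],
                      'date': ['2018-02-20', '2018-03-01']})

# indicator=True adds a _merge column recording which side
# each row of the outer join came from
audit = site.merge(visit, how='outer', left_on='name', right_on='site',
                   indicator=True)
print(audit[['name', 'site', '_merge']])
```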
Many-to-Many Merge
There will be times when we want to perform a match based on multiple columns. As an example, suppose we have two dataframes, one comes from person merged with survey, and another dataframe comes from visit merged with survey.
ps = person.merge(survey, left_on='id', right_on='person')
print(ps)
Out:
id personal family taken person quant reading
0 dyer William Dyer 2 dyer rad 9.82
1 dyer William Dyer 2 dyer sal 0.13
2 dyer William Dyer 3 dyer rad 7.7
3 dyer William Dyer 3 dyer sal 0.09
4 dyer William Dyer 4 dyer temp -16.00
5 pb Frank Pabodie 4 pb rad 6.9
6 pb Frank Pabodie 5 pb rad 8.41
7 pb Frank Pabodie 6 pb rad 10.7
8 pb Frank Pabodie 6 pb temp -18.95
9 lake Anderson Lake 1 lake rad 9.7
10 lake Anderson Lake 1 lake sal 0.21
11 lake Anderson Lake 5 lake sal 0.08
vs = visit.merge(survey, left_on='id', right_on='taken')
print(vs)
Out:
id site date taken person quant reading
0 1 DR-1 2018-02-20 1 lake rad 9.7
1 1 DR-1 2018-02-20 1 lake sal 0.21
2 2 DR-1 2018-02-22 2 dyer rad 9.82
3 2 DR-1 2018-02-22 2 dyer sal 0.13
4 3 DR-2 2018-02-25 3 dyer rad 7.7
5 3 DR-2 2018-02-25 3 dyer sal 0.09
6 4 DR-3 2018-03-01 4 pb rad 6.9
7 4 DR-3 2018-03-01 4 dyer temp -16.00
8 5 DR-3 2018-03-05 5 pb rad 8.41
9 5 DR-3 2018-03-05 5 lake sal 0.08
10 6 DR-3 2018-04-15 6 pb rad 10.7
11 6 DR-3 2018-04-15 6 pb temp -18.95
We can perform a many-to-many merge of these two dataframes by passing the multiple columns to match on as Python lists.
ps_vs = ps.merge(vs,
                 left_on=['id', 'taken', 'quant', 'reading'],
                 right_on=['person', 'id', 'quant', 'reading'])
print(ps_vs)
You need to ensure that the order of the column names in the two lists corresponds properly: each column in left_on is matched against the column at the same position in right_on. If everything is okay, you should get:
    id_x  personal   family taken_x person_x quant reading id_y  site  \
0   dyer   William     Dyer       2     dyer   rad    9.82    2  DR-1
1   dyer   William     Dyer       2     dyer   sal    0.13    2  DR-1
2   dyer   William     Dyer       3     dyer   rad     7.7    3  DR-2
3   dyer   William     Dyer       3     dyer   sal    0.09    3  DR-2
4   dyer   William     Dyer       4     dyer  temp  -16.00    4  DR-3
5     pb     Frank  Pabodie       4       pb   rad     6.9    4  DR-3
6     pb     Frank  Pabodie       5       pb   rad    8.41    5  DR-3
7     pb     Frank  Pabodie       6       pb   rad    10.7    6  DR-3
8     pb     Frank  Pabodie       6       pb  temp  -18.95    6  DR-3
9   lake  Anderson     Lake       1     lake   rad     9.7    1  DR-1
10  lake  Anderson     Lake       1     lake   sal    0.21    1  DR-1
11  lake  Anderson     Lake       5     lake   sal    0.08    5  DR-3

          date taken_y person_y
0   2018-02-22       2     dyer
1   2018-02-22       2     dyer
2   2018-02-25       3     dyer
3   2018-02-25       3     dyer
4   2018-03-01       4     dyer
5   2018-03-01       4       pb
6   2018-03-05       5       pb
7   2018-04-15       6       pb
8   2018-04-15       6       pb
9   2018-02-20       1     lake
10  2018-02-20       1     lake
11  2018-03-05       5     lake