Data Manipulation¶
Note: This is an edited version of Cliburn Chan’s original tutorial. All changes remain licensed as the original, under the terms of the MIT license.
In [1]:
import numpy as np
import pandas as pd
(1) Combining data sets¶
Combining rows¶
In [2]:
df_versicolor = pd.read_csv('data/versicolor.csv')
df_virginica = pd.read_csv('data/virginica.csv')
df_setosa = pd.read_csv('data/setosa.csv')
dfs = [df_versicolor, df_virginica, df_setosa]
In [3]:
df_versicolor.head(3)
Out[3]:
  | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
0 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
1 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
2 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
In [4]:
df_virginica.head(3)
Out[4]:
  | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
0 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
1 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
2 | 7.1 | 3.0 | 5.9 | 2.1 | virginica |
In [5]:
df_setosa.head(3)
Out[5]:
  | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
In [6]:
for df in dfs:
    print(df.shape)
(50, 5)
(50, 5)
(50, 5)
Each DataFrame only contains data about one species of iris¶
In [7]:
for df in dfs:
    print(df.Species.unique())
['versicolor']
['virginica']
['setosa']
Combine with concat¶
We can append the rows of one data frame to another using concat. Here axis=0 means combining by rows, in contrast to axis=1, which means combining by columns.
In [8]:
df = pd.concat(dfs, axis=0)
df.shape
Out[8]:
(150, 5)
Note that the indices are maintained¶
Here we have to use iloc because loc (and the now-removed ix) works with the integer index labels, not the positions, and there are no labels equal to 50 or 51.
In [9]:
df.iloc[48:52]
Out[9]:
  | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
48 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor |
49 | 5.7 | 2.8 | 4.1 | 1.3 | versicolor |
0 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
1 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
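Because the original labels are kept, label-based selection can now match several rows. A minimal sketch, continuing with the concatenated df above:
df.index.is_unique   # False: labels 0-49 each appear three times
df.loc[0]            # loc selects by label, so this returns three rows, one per species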
We can ask for a new consecutive index¶
In [10]:
df = df.reset_index(drop=True)
In [11]:
df.iloc[48:52]
Out[11]:
  | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
48 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor |
49 | 5.7 | 2.8 | 4.1 | 1.3 | versicolor |
50 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
51 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
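Alternatively, concat can build the consecutive index in one step; a minimal sketch equivalent to concat followed by reset_index(drop=True):
# ignore_index=True discards the per-frame indices and assigns a fresh RangeIndex 0..149
df = pd.concat(dfs, axis=0, ignore_index=True)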
Combined DataFrame contains all 3 species¶
In [12]:
df.Species.unique()
Out[12]:
array(['versicolor', 'virginica', 'setosa'], dtype=object)
Combining columns¶
When combining rows, we are usually just adding new observations with the same variables, and there is little ambiguity about what we are trying to do.
When combining columns, we are usually trying to merge information based on some unique identifier, for example, to combine clinical and laboratory information for specific patients with unique PIDs. Often the ordering of the PIDs in the two data frames is not exactly lined up, so “horizontal stacking” will not work. Usually we need to merge the data on a unique identifier from one or more columns.
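As a sketch of that pattern, with clinical and lab standing in as hypothetical data frames that share a pid column:
# hypothetical frames: merge aligns rows on matching pid values,
# regardless of row order in either frame
combined = pd.merge(clinical, lab, on='pid')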
Simplest case - rows line up perfectly across data frames¶
In [13]:
df_sepal = pd.read_csv('data/sepal.csv')
df_petal = pd.read_csv('data/petal.csv')
In [14]:
df_sepal.head(3)
Out[14]:
  | Species | Sepal.Length | Sepal.Width |
---|---|---|---|
0 | setosa | 5.1 | 3.5 |
1 | setosa | 4.9 | 3.0 |
2 | setosa | 4.7 | 3.2 |
In [15]:
df_petal.head(3)
Out[15]:
  | Species | Petal.Length | Petal.Width |
---|---|---|---|
0 | setosa | 1.4 | 0.2 |
1 | setosa | 1.4 | 0.2 |
2 | setosa | 1.3 | 0.2 |
In [16]:
df_sepal.shape, df_petal.shape
Out[16]:
((150, 3), (150, 3))
Using concat¶
In [17]:
pd.concat([df_sepal, df_petal], axis=1).head()
Out[17]:
  | Species | Sepal.Length | Sepal.Width | Species | Petal.Length | Petal.Width |
---|---|---|---|---|---|---|
0 | setosa | 5.1 | 3.5 | setosa | 1.4 | 0.2 |
1 | setosa | 4.9 | 3.0 | setosa | 1.4 | 0.2 |
2 | setosa | 4.7 | 3.2 | setosa | 1.3 | 0.2 |
3 | setosa | 4.6 | 3.1 | setosa | 1.5 | 0.2 |
4 | setosa | 5.0 | 3.6 | setosa | 1.4 | 0.2 |
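Note that the horizontally stacked result carries two identical Species columns. One possible cleanup, a sketch that keeps only the first occurrence of each duplicated column name:
combined = pd.concat([df_sepal, df_petal], axis=1)
# boolean mask over the columns: True only for the first occurrence of each name
combined = combined.loc[:, ~combined.columns.duplicated()]
combined.head()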
Joining on a single unique column¶
Combining values for the same subject across different variables.
Make up dummy data sets for illustration¶
In [18]:
np.random.seed(123456)
pid1 = np.random.choice(100, 6, replace=False)
val1 = np.random.normal(10, 1, 6)
df1 = pd.DataFrame({'pid': pid1, 'val': val1})
df1
Out[18]:
  | pid | val |
---|---|---|
0 | 22 | 10.044553 |
1 | 11 | 12.455353 |
2 | 59 | 9.895324 |
3 | 62 | 11.379907 |
4 | 33 | 11.083937 |
5 | 41 | 11.451440 |
In [19]:
pid2 = np.random.permutation(pid1)
val2 = np.random.normal(15, 1, 6)
df2 = pd.DataFrame({'pid': pid2, 'val': val2})
df2
Out[19]:
  | pid | val |
---|---|---|
0 | 41 | 16.024180 |
1 | 11 | 15.569605 |
2 | 22 | 15.875906 |
3 | 62 | 12.788628 |
4 | 59 | 15.974466 |
5 | 33 | 12.993253 |
Using merge to join on columns containing unique information¶
Note that the pid order is not the same, so we cannot simply stack horizontally.
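To see why, here is a quick sketch of what naive horizontal stacking would give with these frames:
# concat pairs rows by position, so pid 22 from df1 ends up
# on the same row as pid 41 from df2
pd.concat([df1, df2], axis=1)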
In [20]:
pd.merge(df1, df2, on='pid', suffixes=['_visit_1', '_visit_2'])
Out[20]:
  | pid | val_visit_1 | val_visit_2 |
---|---|---|---|
0 | 22 | 10.044553 | 15.875906 |
1 | 11 | 12.455353 | 15.569605 |
2 | 59 | 9.895324 | 15.974466 |
3 | 62 | 11.379907 | 12.788628 |
4 | 33 | 11.083937 | 12.993253 |
5 | 41 | 11.451440 | 16.024180 |
In [21]:
df1a = df1.set_index('pid') # we'll use these later
df2a = df2.set_index('pid')
Using merge on multiple columns¶
Sometimes we need multiple columns to define a unique identifier.
Note: Not every unique identifier needs to appear in both of the data frames being merged. For example, when merging laboratory and clinical data, a patient may have laboratory results but no clinical results (perhaps because of a typo in the identifier), or clinical data but no laboratory data (because results are still being processed by the lab). The examples below illustrate different ways to resolve the merge in this situation.
In [22]:
df1['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df1 = df1[['pid', 'stim', 'val']]
df1
Out[22]:
  | pid | stim | val |
---|---|---|---|
0 | 22 | cmv | 10.044553 |
1 | 11 | flu | 12.455353 |
2 | 59 | cmv | 9.895324 |
3 | 62 | flu | 11.379907 |
4 | 33 | flu | 11.083937 |
5 | 41 | flu | 11.451440 |
In [23]:
df2['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df2 = df2[['pid', 'stim', 'val']]
df2
Out[23]:
  | pid | stim | val |
---|---|---|---|
0 | 41 | flu | 16.024180 |
1 | 11 | cmv | 15.569605 |
2 | 22 | flu | 15.875906 |
3 | 62 | flu | 12.788628 |
4 | 59 | flu | 15.974466 |
5 | 33 | flu | 12.993253 |
In [24]:
pd.merge(df1, df2, on=['pid', 'stim'], suffixes=['_visit_1', '_visit_2'])
Out[24]:
  | pid | stim | val_visit_1 | val_visit_2 |
---|---|---|---|---|
0 | 62 | flu | 11.379907 | 12.788628 |
1 | 33 | flu | 11.083937 | 12.993253 |
2 | 41 | flu | 11.451440 | 16.024180 |
In [25]:
pd.merge(df1, df2, on=['pid', 'stim'], how='left', suffixes=['_visit_1', '_visit_2'])
Out[25]:
  | pid | stim | val_visit_1 | val_visit_2 |
---|---|---|---|---|
0 | 22 | cmv | 10.044553 | NaN |
1 | 11 | flu | 12.455353 | NaN |
2 | 59 | cmv | 9.895324 | NaN |
3 | 62 | flu | 11.379907 | 12.788628 |
4 | 33 | flu | 11.083937 | 12.993253 |
5 | 41 | flu | 11.451440 | 16.024180 |
In [26]:
pd.merge(df1, df2, on=['pid', 'stim'], how='right', suffixes=['_visit_1', '_visit_2'])
Out[26]:
  | pid | stim | val_visit_1 | val_visit_2 |
---|---|---|---|---|
0 | 62 | flu | 11.379907 | 12.788628 |
1 | 33 | flu | 11.083937 | 12.993253 |
2 | 41 | flu | 11.451440 | 16.024180 |
3 | 11 | cmv | NaN | 15.569605 |
4 | 22 | flu | NaN | 15.875906 |
5 | 59 | flu | NaN | 15.974466 |
In [27]:
pd.merge(df1, df2, on=['pid', 'stim'], how='outer', suffixes=['_visit_1', '_visit_2'])
Out[27]:
  | pid | stim | val_visit_1 | val_visit_2 |
---|---|---|---|---|
0 | 22 | cmv | 10.044553 | NaN |
1 | 11 | flu | 12.455353 | NaN |
2 | 59 | cmv | 9.895324 | NaN |
3 | 62 | flu | 11.379907 | 12.788628 |
4 | 33 | flu | 11.083937 | 12.993253 |
5 | 41 | flu | 11.451440 | 16.024180 |
6 | 11 | cmv | NaN | 15.569605 |
7 | 22 | flu | NaN | 15.875906 |
8 | 59 | flu | NaN | 15.974466 |
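When it is unclear which side each row came from, merge can record this; a sketch with the same frames:
# indicator=True adds a _merge column with values 'left_only', 'right_only', or 'both'
pd.merge(df1, df2, on=['pid', 'stim'], how='outer',
         suffixes=['_visit_1', '_visit_2'], indicator=True)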
Using join when indexes are the keys¶
Note that merge and join have a rich and powerful (but complex!) API, and they are deeply interconnected. Many more details can be found in the official documentation.
Note in particular this section:
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join. The related DataFrame.join method uses merge internally for the index-on-index (by default) and column(s)-on-index join. If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing.
In [28]:
df1a
Out[28]:
  | val |
---|---|
pid | |
22 | 10.044553 |
11 | 12.455353 |
59 | 9.895324 |
62 | 11.379907 |
33 | 11.083937 |
41 | 11.451440 |
In [29]:
df2a
Out[29]:
  | val |
---|---|
pid | |
41 | 16.024180 |
11 | 15.569605 |
22 | 15.875906 |
62 | 12.788628 |
59 | 15.974466 |
33 | 12.993253 |
In [30]:
df1a.join(df2a, lsuffix='_1', rsuffix='_2')
Out[30]:
  | val_1 | val_2 |
---|---|---|
pid |  |  |
22 | 10.044553 | 15.875906 |
11 | 12.455353 | 15.569605 |
59 | 9.895324 | 15.974466 |
62 | 11.379907 | 12.788628 |
33 | 11.083937 | 12.993253 |
41 | 11.451440 | 16.024180 |
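The same result can also be spelled with merge by joining on both indexes; a sketch:
# equivalent to df1a.join(df2a, ...): join on the index of both frames,
# keeping the row order of the left frame (how='left', join's default)
pd.merge(df1a, df2a, how='left', left_index=True, right_index=True,
         suffixes=['_1', '_2'])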
(2) Separate multiple values in a single column¶
Remember that for a tidy data frame, each column must contain a single variable.
In [31]:
from collections import OrderedDict
In [32]:
d = OrderedDict()
d['pid-visit-stim'] = ['1-1-cmv', '1-1-hiv', '1-2-cmv', '1-2-hiv', '1-3-cmv', '1-3-hiv', '2-1-cmv', '2-1-hiv', '2-2-cmv', '2-2-hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)
In [33]:
df.head()
Out[33]:
  | pid-visit-stim | tnf | ifn | il2 |
---|---|---|---|---|
0 | 1-1-cmv | 1.0 | 11.0 | 0.0 |
1 | 1-1-hiv | 2.0 | 12.0 | 0.0 |
2 | 1-2-cmv | 1.1 | 11.1 | 0.1 |
3 | 1-2-hiv | 2.1 | 12.1 | 0.1 |
4 | 1-3-cmv | 1.2 | 11.2 | 0.2 |
In [34]:
df1 = pd.DataFrame(df['pid-visit-stim'].str.split('-').tolist(),
                   columns=['pid', 'visit', 'stim'])
df1.head()
Out[34]:
  | pid | visit | stim |
---|---|---|---|
0 | 1 | 1 | cmv |
1 | 1 | 1 | hiv |
2 | 1 | 2 | cmv |
3 | 1 | 2 | hiv |
4 | 1 | 3 | cmv |
In [35]:
df1 = pd.concat([df1, df], axis=1)
df1.head()
Out[35]:
  | pid | visit | stim | pid-visit-stim | tnf | ifn | il2 |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | cmv | 1-1-cmv | 1.0 | 11.0 | 0.0 |
1 | 1 | 1 | hiv | 1-1-hiv | 2.0 | 12.0 | 0.0 |
2 | 1 | 2 | cmv | 1-2-cmv | 1.1 | 11.1 | 0.1 |
3 | 1 | 2 | hiv | 1-2-hiv | 2.1 | 12.1 | 0.1 |
4 | 1 | 3 | cmv | 1-3-cmv | 1.2 | 11.2 | 0.2 |
In [36]:
df1.drop('pid-visit-stim', axis=1, inplace=True)
df1.head()
Out[36]:
  | pid | visit | stim | tnf | ifn | il2 |
---|---|---|---|---|---|---|
0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 |
1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 |
2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 |
3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 |
4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 |
Wrap into a convenient function¶
In [37]:
def separate(df, column, sep):
    df1 = pd.DataFrame(df[column].str.split(sep).tolist(), columns=column.split(sep))
    df1 = pd.concat([df1, df], axis=1)
    return df1.drop(column, axis=1)
In [38]:
separate(df, 'pid-visit-stim', '-').head()
Out[38]:
  | pid | visit | stim | tnf | ifn | il2 |
---|---|---|---|---|---|---|
0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 |
1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 |
2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 |
3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 |
4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 |
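For comparison, str.split can expand directly into separate columns; a sketch that reproduces the result above without the helper function:
# expand=True returns a DataFrame with one column per split piece
parts = df['pid-visit-stim'].str.split('-', expand=True)
parts.columns = ['pid', 'visit', 'stim']
pd.concat([parts, df.drop('pid-visit-stim', axis=1)], axis=1).head()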
Exercises¶
You are given three DataFrames d1, d2, and d3. Answer the questions below.
In [39]:
d1 = pd.DataFrame({'subject': ['a', 'b', 'c', 'd'], 'group': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
d2 = pd.DataFrame({'subject': ['w', 'x', 'y', 'z'], 'group': ['A', 'B', 'C', 'D'], 'value': [2, 1, 4, 3]})
d3 = pd.DataFrame({'subject': ['a', 'x', 'w', 'd', 'b', 'c', 'y', 'z'], 'age': [10, 20, 30, 40, 50, 60, 70, 80]})
In [40]:
d1
Out[40]:
  | group | subject | value |
---|---|---|---|
0 | A | a | 1 |
1 | B | b | 2 |
2 | C | c | 3 |
3 | D | d | 4 |
In [41]:
d2
Out[41]:
  | group | subject | value |
---|---|---|---|
0 | A | w | 2 |
1 | B | x | 1 |
2 | C | y | 4 |
3 | D | z | 3 |
In [42]:
d3
Out[42]:
  | age | subject |
---|---|---|
0 | 10 | a |
1 | 20 | x |
2 | 30 | w |
3 | 40 | d |
4 | 50 | b |
5 | 60 | c |
6 | 70 | y |
7 | 80 | z |
1. Combine d1 and d2 to create a DataFrame called df1 with 8 rows and 3 columns.
2. Combine df1 with d3 to create a DataFrame called df2 with 8 rows and 4 columns.
3. Convert the given DataFrame d from wide to tall format and save it in df.
4. Name the column in df that holds the group values group, and remove the group_ part from the group values in each cell, leaving only the number.