Data Manipulation¶
Note: This is an edited version of Cliburn Chan’s original tutorial. All changes remain licensed as the original, under the terms of the MIT license.
In [1]:
import numpy as np
import pandas as pd
(1) Combining data sets¶
Combining rows¶
In [2]:
df_versicolor = pd.read_csv('data/versicolor.csv')
df_virginica = pd.read_csv('data/virginica.csv')
df_setosa = pd.read_csv('data/setosa.csv')
dfs = [df_versicolor, df_virginica, df_setosa]
In [3]:
df_versicolor.head(3)
Out[3]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 0 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor | 
| 1 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor | 
| 2 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor | 
In [4]:
df_virginica.head(3)
Out[4]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 0 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 
| 1 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 
| 2 | 7.1 | 3.0 | 5.9 | 2.1 | virginica | 
In [5]:
df_setosa.head(3)
Out[5]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 
In [6]:
for df in dfs:
    print(df.shape)
(50, 5)
(50, 5)
(50, 5)
Each DataFrame only contains data about one species of iris¶
In [7]:
for df in dfs:
    print(df.Species.unique())
['versicolor']
['virginica']
['setosa']
Combine with concat¶
We can append the rows of one data frame to another using concat. Here
axis=0 means combining by rows, in contrast to axis=1, which
means combining by columns.
In [8]:
df = pd.concat(dfs, axis=0)
df.shape
Out[8]:
(150, 5)
Note that the indices are maintained¶
Here we have to use iloc because loc selects rows by index label rather
than by position, and after the concatenation no row carries the label
50 or 51 (each original frame was indexed 0 through 49).
In [9]:
df.iloc[48:52]
Out[9]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 48 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor | 
| 49 | 5.7 | 2.8 | 4.1 | 1.3 | versicolor | 
| 0 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 
| 1 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 
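Because the labels are duplicated, label-based selection behaves differently from positional selection. A quick illustration (a sketch, not part of the original notebook):
df.loc[48]      # returns three rows, one from each original DataFrame
# df.loc[50]    # would raise a KeyError - no row carries the label 50
df.iloc[48:52]  # positional slicing works regardless of the labels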
We can ask for a new consecutive index¶
In [10]:
df = df.reset_index(drop=True)
In [11]:
df.iloc[48:52]
Out[11]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 48 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor | 
| 49 | 5.7 | 2.8 | 4.1 | 1.3 | versicolor | 
| 50 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 
| 51 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 
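Alternatively, concat can build a fresh consecutive index directly. A minimal sketch, equivalent to the reset_index call above:
pd.concat(dfs, axis=0, ignore_index=True).iloc[48:52]  # labels run 0..149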
Combined DataFrame contains all 3 species¶
In [12]:
df.Species.unique()
Out[12]:
array(['versicolor', 'virginica', 'setosa'], dtype=object)
Combining columns¶
When combining rows, we are usually just adding new observations with the same variables, so there is little ambiguity about what we are trying to do.
When combining columns, we are usually trying to merge information based
on some unique identifier, for example, to combine clinical and
laboratory information for specific patients with unique PIDs. Often the
ordering of the PIDs in the two data frames is not exactly lined up, so
“horizontal stacking” will not work. Usually we need to merge the
data on a unique identifier from one or more columns.
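For instance, here is a minimal sketch (with made-up clinical and lab data, not part of the original tutorial) of merging on a shared pid column even though the rows are in different orders:
# hypothetical data for illustration only
clinical = pd.DataFrame({'pid': [101, 102, 103], 'age': [34, 51, 29]})
lab = pd.DataFrame({'pid': [103, 101, 102], 'wbc': [5.2, 6.8, 7.1]})
pd.merge(clinical, lab, on='pid')  # rows are aligned by pid, not by position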
Simplest case - rows line up perfectly across data frames¶
In [13]:
df_sepal = pd.read_csv('data/sepal.csv')
df_petal = pd.read_csv('data/petal.csv')
In [14]:
df_sepal.head(3)
Out[14]:
| Species | Sepal.Length | Sepal.Width | |
|---|---|---|---|
| 0 | setosa | 5.1 | 3.5 | 
| 1 | setosa | 4.9 | 3.0 | 
| 2 | setosa | 4.7 | 3.2 | 
In [15]:
df_petal.head(3)
Out[15]:
| Species | Petal.Length | Petal.Width | |
|---|---|---|---|
| 0 | setosa | 1.4 | 0.2 | 
| 1 | setosa | 1.4 | 0.2 | 
| 2 | setosa | 1.3 | 0.2 | 
In [16]:
df_sepal.shape, df_petal.shape
Out[16]:
((150, 3), (150, 3))
Using concat¶
In [17]:
pd.concat([df_sepal, df_petal], axis=1).head()
Out[17]:
| Species | Sepal.Length | Sepal.Width | Species | Petal.Length | Petal.Width | |
|---|---|---|---|---|---|---|
| 0 | setosa | 5.1 | 3.5 | setosa | 1.4 | 0.2 | 
| 1 | setosa | 4.9 | 3.0 | setosa | 1.4 | 0.2 | 
| 2 | setosa | 4.7 | 3.2 | setosa | 1.3 | 0.2 | 
| 3 | setosa | 4.6 | 3.1 | setosa | 1.5 | 0.2 | 
| 4 | setosa | 5.0 | 3.6 | setosa | 1.4 | 0.2 | 
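Note that the Species column now appears twice. One way to avoid the duplication (a small sketch, not in the original notebook) is to drop the repeated column from one frame before concatenating:
pd.concat([df_sepal, df_petal.drop('Species', axis=1)], axis=1).head()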
Joining on a single unique column¶
Combining values for the same subject that are stored in separate data frames (for example, measurements from different visits).
Make up dummy data sets for illustration¶
In [18]:
np.random.seed(123456)
pid1 = np.random.choice(100, 6, replace=False)
val1 = np.random.normal(10, 1, 6)
df1 = pd.DataFrame({'pid': pid1, 'val': val1})
df1
Out[18]:
| pid | val | |
|---|---|---|
| 0 | 22 | 10.044553 | 
| 1 | 11 | 12.455353 | 
| 2 | 59 | 9.895324 | 
| 3 | 62 | 11.379907 | 
| 4 | 33 | 11.083937 | 
| 5 | 41 | 11.451440 | 
In [19]:
pid2 = np.random.permutation(pid1)
val2 = np.random.normal(15, 1, 6)
df2 = pd.DataFrame({'pid': pid2, 'val': val2})
df2
Out[19]:
| pid | val | |
|---|---|---|
| 0 | 41 | 16.024180 | 
| 1 | 11 | 15.569605 | 
| 2 | 22 | 15.875906 | 
| 3 | 62 | 12.788628 | 
| 4 | 59 | 15.974466 | 
| 5 | 33 | 12.993253 | 
Using merge to join on columns containing unique information¶
Note that the pid order is not the same, so we cannot simply stack
horizontally.
In [20]:
pd.merge(df1, df2, on='pid', suffixes=['_visit_1', '_visit_2'])
Out[20]:
| pid | val_visit_1 | val_visit_2 | |
|---|---|---|---|
| 0 | 22 | 10.044553 | 15.875906 | 
| 1 | 11 | 12.455353 | 15.569605 | 
| 2 | 59 | 9.895324 | 15.974466 | 
| 3 | 62 | 11.379907 | 12.788628 | 
| 4 | 33 | 11.083937 | 12.993253 | 
| 5 | 41 | 11.451440 | 16.024180 | 
In [21]:
df1a = df1.set_index('pid')  # we'll use these later
df2a = df2.set_index('pid')
Using merge on multiple columns¶
Sometimes we need multiple columns to define a unique identifier.
Note: Not every unique identifier needs to appear in both of the data frames being merged. For example, when merging laboratory and clinical data, a patient may have laboratory results but no clinical results (perhaps due to a typo), or clinical data but no laboratory data (because results are still being processed by the lab). The examples below illustrate different ways to resolve the merge in this scenario.
In [22]:
df1['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df1 = df1[['pid', 'stim', 'val']]
df1
Out[22]:
| pid | stim | val | |
|---|---|---|---|
| 0 | 22 | cmv | 10.044553 | 
| 1 | 11 | flu | 12.455353 | 
| 2 | 59 | cmv | 9.895324 | 
| 3 | 62 | flu | 11.379907 | 
| 4 | 33 | flu | 11.083937 | 
| 5 | 41 | flu | 11.451440 | 
In [23]:
df2['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df2 = df2[['pid', 'stim', 'val']]
df2
Out[23]:
| pid | stim | val | |
|---|---|---|---|
| 0 | 41 | flu | 16.024180 | 
| 1 | 11 | cmv | 15.569605 | 
| 2 | 22 | flu | 15.875906 | 
| 3 | 62 | flu | 12.788628 | 
| 4 | 59 | flu | 15.974466 | 
| 5 | 33 | flu | 12.993253 | 
In [24]:
pd.merge(df1, df2, on=['pid', 'stim'], suffixes=['_visit_1', '_visit_2'])
Out[24]:
| pid | stim | val_visit_1 | val_visit_2 | |
|---|---|---|---|---|
| 0 | 62 | flu | 11.379907 | 12.788628 | 
| 1 | 33 | flu | 11.083937 | 12.993253 | 
| 2 | 41 | flu | 11.451440 | 16.024180 | 
In [25]:
pd.merge(df1, df2, on=['pid', 'stim'], how='left', suffixes=['_visit_1', '_visit_2'])
Out[25]:
| pid | stim | val_visit_1 | val_visit_2 | |
|---|---|---|---|---|
| 0 | 22 | cmv | 10.044553 | NaN | 
| 1 | 11 | flu | 12.455353 | NaN | 
| 2 | 59 | cmv | 9.895324 | NaN | 
| 3 | 62 | flu | 11.379907 | 12.788628 | 
| 4 | 33 | flu | 11.083937 | 12.993253 | 
| 5 | 41 | flu | 11.451440 | 16.024180 | 
In [26]:
pd.merge(df1, df2, on=['pid', 'stim'], how='right', suffixes=['_visit_1', '_visit_2'])
Out[26]:
| pid | stim | val_visit_1 | val_visit_2 | |
|---|---|---|---|---|
| 0 | 62 | flu | 11.379907 | 12.788628 | 
| 1 | 33 | flu | 11.083937 | 12.993253 | 
| 2 | 41 | flu | 11.451440 | 16.024180 | 
| 3 | 11 | cmv | NaN | 15.569605 | 
| 4 | 22 | flu | NaN | 15.875906 | 
| 5 | 59 | flu | NaN | 15.974466 | 
In [27]:
pd.merge(df1, df2, on=['pid', 'stim'], how='outer', suffixes=['_visit_1', '_visit_2'])
Out[27]:
| pid | stim | val_visit_1 | val_visit_2 | |
|---|---|---|---|---|
| 0 | 22 | cmv | 10.044553 | NaN | 
| 1 | 11 | flu | 12.455353 | NaN | 
| 2 | 59 | cmv | 9.895324 | NaN | 
| 3 | 62 | flu | 11.379907 | 12.788628 | 
| 4 | 33 | flu | 11.083937 | 12.993253 | 
| 5 | 41 | flu | 11.451440 | 16.024180 | 
| 6 | 11 | cmv | NaN | 15.569605 | 
| 7 | 22 | flu | NaN | 15.875906 | 
| 8 | 59 | flu | NaN | 15.974466 | 
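If it is unclear which side an unmatched row came from, merge also accepts indicator=True, which adds a _merge column labelling each row as left_only, right_only, or both. A quick sketch:
# same outer merge as above, with provenance information per row
pd.merge(df1, df2, on=['pid', 'stim'], how='outer',
         suffixes=['_visit_1', '_visit_2'], indicator=True)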
Using join when indexes are the keys¶
Note that merge and join have a rich and powerful (but complex!) API, and they are deeply interconnected. Much more detail can be found in the official documentation.
Note in particular this section:
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join. The related DataFrame.join method uses merge internally for the index-on-index (by default) and column(s)-on-index join. If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing.
In [28]:
df1a
Out[28]:
| val | |
|---|---|
| pid | |
| 22 | 10.044553 | 
| 11 | 12.455353 | 
| 59 | 9.895324 | 
| 62 | 11.379907 | 
| 33 | 11.083937 | 
| 41 | 11.451440 | 
In [29]:
df2a
Out[29]:
| val | |
|---|---|
| pid | |
| 41 | 16.024180 | 
| 11 | 15.569605 | 
| 22 | 15.875906 | 
| 62 | 12.788628 | 
| 59 | 15.974466 | 
| 33 | 12.993253 | 
In [30]:
df1a.join(df2a, lsuffix='_1', rsuffix='_2')
Out[30]:
| val_1 | val_2 | |
|---|---|---|
| pid | ||
| 22 | 10.044553 | 15.875906 | 
| 11 | 12.455353 | 15.569605 | 
| 59 | 9.895324 | 15.974466 | 
| 62 | 11.379907 | 12.788628 | 
| 33 | 11.083937 | 12.993253 | 
| 41 | 11.451440 | 16.024180 | 
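DataFrame.join defaults to a left join on the index; other join types can be requested with the how argument, just as with merge. A small sketch:
df1a.join(df2a, how='inner', lsuffix='_1', rsuffix='_2')  # keep only pids present in both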
(2) Separate multiple values in a single column¶
Remember that for a tidy data frame, each column must contain a single variable.
In [31]:
from collections import OrderedDict
In [32]:
d = OrderedDict()
d['pid-visit-stim'] = ['1-1-cmv', '1-1-hiv', '1-2-cmv', '1-2-hiv', '1-3-cmv', '1-3-hiv', '2-1-cmv', '2-1-hiv', '2-2-cmv', '2-2-hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)
In [33]:
df.head()
Out[33]:
| pid-visit-stim | tnf | ifn | il2 | |
|---|---|---|---|---|
| 0 | 1-1-cmv | 1.0 | 11.0 | 0.0 | 
| 1 | 1-1-hiv | 2.0 | 12.0 | 0.0 | 
| 2 | 1-2-cmv | 1.1 | 11.1 | 0.1 | 
| 3 | 1-2-hiv | 2.1 | 12.1 | 0.1 | 
| 4 | 1-3-cmv | 1.2 | 11.2 | 0.2 | 
In [34]:
df1 = pd.DataFrame(df['pid-visit-stim'].str.split('-').tolist(),
                   columns=['pid', 'visit', 'stim'])
df1.head()
Out[34]:
| pid | visit | stim | |
|---|---|---|---|
| 0 | 1 | 1 | cmv | 
| 1 | 1 | 1 | hiv | 
| 2 | 1 | 2 | cmv | 
| 3 | 1 | 2 | hiv | 
| 4 | 1 | 3 | cmv | 
In [35]:
df1 = pd.concat([df1, df], axis=1)
df1.head()
Out[35]:
| pid | visit | stim | pid-visit-stim | tnf | ifn | il2 | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | cmv | 1-1-cmv | 1.0 | 11.0 | 0.0 | 
| 1 | 1 | 1 | hiv | 1-1-hiv | 2.0 | 12.0 | 0.0 | 
| 2 | 1 | 2 | cmv | 1-2-cmv | 1.1 | 11.1 | 0.1 | 
| 3 | 1 | 2 | hiv | 1-2-hiv | 2.1 | 12.1 | 0.1 | 
| 4 | 1 | 3 | cmv | 1-3-cmv | 1.2 | 11.2 | 0.2 | 
In [36]:
df1.drop('pid-visit-stim', axis=1, inplace=True)
df1.head()
Out[36]:
| pid | visit | stim | tnf | ifn | il2 | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 | 
| 1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 | 
| 2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 | 
| 3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 | 
| 4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 | 
Wrap into a convenient function¶
In [37]:
def separate(df, column, sep):
    # Split the combined column into its parts; the new column names come
    # from splitting the combined column name on the same separator.
    df1 = pd.DataFrame(df[column].str.split(sep).tolist(),
                       columns=column.split(sep))
    # Put the new columns in front of the original data, then drop the
    # combined column.
    df1 = pd.concat([df1, df], axis=1)
    return df1.drop(column, axis=1)
In [38]:
separate(df, 'pid-visit-stim', '-').head()
Out[38]:
| pid | visit | stim | tnf | ifn | il2 | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 | 
| 1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 | 
| 2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 | 
| 3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 | 
| 4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 | 
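As an aside, pandas can also do the split step directly: str.split with expand=True returns a DataFrame, which avoids the tolist() round trip. A sketch that reproduces the result above with the same column and separator:
parts = df['pid-visit-stim'].str.split('-', expand=True)  # DataFrame with columns 0, 1, 2
parts.columns = ['pid', 'visit', 'stim']
pd.concat([parts, df.drop('pid-visit-stim', axis=1)], axis=1).head()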
Exercises¶
You are given three DataFrames d1, d2, and d3. Answer the
questions below.
In [39]:
d1 = pd.DataFrame({'subject': ['a', 'b', 'c', 'd'], 'group': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
d2 = pd.DataFrame({'subject': ['w', 'x', 'y', 'z'], 'group': ['A', 'B', 'C', 'D'], 'value': [2, 1, 4, 3]})
d3 = pd.DataFrame({'subject': ['a', 'x', 'w', 'd', 'b', 'c', 'y', 'z'], 'age': [10, 20, 30, 40, 50, 60, 70, 80]})
In [40]:
d1
Out[40]:
| group | subject | value | |
|---|---|---|---|
| 0 | A | a | 1 | 
| 1 | B | b | 2 | 
| 2 | C | c | 3 | 
| 3 | D | d | 4 | 
In [41]:
d2
Out[41]:
| group | subject | value | |
|---|---|---|---|
| 0 | A | w | 2 | 
| 1 | B | x | 1 | 
| 2 | C | y | 4 | 
| 3 | D | z | 3 | 
In [42]:
d3
Out[42]:
| age | subject | |
|---|---|---|
| 0 | 10 | a | 
| 1 | 20 | x | 
| 2 | 30 | w | 
| 3 | 40 | d | 
| 4 | 50 | b | 
| 5 | 60 | c | 
| 6 | 70 | y | 
| 7 | 80 | z | 
1. Combine d1 and d2 to create a DataFrame called df1 with 8 rows and 3 columns.
2. Combine df1 with d3 to create a DataFrame called df2 with 8 rows and 4 columns.
3. Convert the given DataFrame d from wide to tall format and save the result in df.
4. Name the column in df that holds the group values group, and remove the group_ prefix from the values in each cell, leaving only the number.