Data Manipulation

Note: This is an edited version of Cliburn Chan’s original tutorial. All changes remain licensed as the original, under the terms of the MIT license.

In [1]:
import numpy as np
import pandas as pd

(1) Combining data sets

Combining rows

In [2]:
df_versiocolor = pd.read_csv('data/versicolor.csv')
df_virginica = pd.read_csv('data/virginica.csv')
df_sertosa = pd.read_csv('data/setosa.csv')
dfs = [df_versiocolor, df_virginica, df_sertosa]
In [3]:
df_versiocolor.head(3)
Out[3]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 7.0 3.2 4.7 1.4 versicolor
1 6.4 3.2 4.5 1.5 versicolor
2 6.9 3.1 4.9 1.5 versicolor
In [4]:
df_virginica.head(3)
Out[4]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 6.3 3.3 6.0 2.5 virginica
1 5.8 2.7 5.1 1.9 virginica
2 7.1 3.0 5.9 2.1 virginica
In [5]:
df_sertosa.head(3)
Out[5]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
In [6]:
for df in dfs:
    print(df.shape)
(50, 5)
(50, 5)
(50, 5)

Each DataFrame only contains data about one species of iris

In [7]:
for df in dfs:
    print(df.Species.unique())
['versicolor']
['virginica']
['setosa']

Combine with concat

We can append rows of one data frame to another using concat. Here axis = 0 means combining by rows, in contrast to axis = 1 which means combining by columns.

In [8]:
df = pd.concat(dfs, axis=0)
df.shape
Out[8]:
(150, 5)

Note that the indices are maintained

Here we have to use iloc because ix and loc work with the integer index values, not the positions, and there are no values equal to 50 or 51.

In [9]:
df.iloc[48:52]
Out[9]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
48 5.1 2.5 3.0 1.1 versicolor
49 5.7 2.8 4.1 1.3 versicolor
0 6.3 3.3 6.0 2.5 virginica
1 5.8 2.7 5.1 1.9 virginica

We can ask for a new consecutive index

In [10]:
df = df.reset_index(drop=True)
In [11]:
df.iloc[48:52]
Out[11]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
48 5.1 2.5 3.0 1.1 versicolor
49 5.7 2.8 4.1 1.3 versicolor
50 6.3 3.3 6.0 2.5 virginica
51 5.8 2.7 5.1 1.9 virginica

Combined DataFrame contains all 3 species

In [12]:
df.Species.unique()
Out[12]:
array(['versicolor', 'virginica', 'setosa'], dtype=object)

Combining columns

When combining rows, we are usually just adding new observations with the same variables and there is little ambiguity about what we are trying to do.

When combining columns, we are usually trying to merge information based on some unique identifier, for example, to combine clinical and laboratory information for specific patients with unique PIDs. Often the ordering of the PIDs in the two data frames are not exactly lined up so “horizontal stacking” will not work. Usually we need to merge the data on a unique identifier from one or more columns.

Simplest case - rows line up perfectly across data frames

In [13]:
df_sepal = pd.read_csv('data/sepal.csv')
df_petal = pd.read_csv('data/petal.csv')
In [14]:
df_sepal.head(3)
Out[14]:
Species Sepal.Length Sepal.Width
0 setosa 5.1 3.5
1 setosa 4.9 3.0
2 setosa 4.7 3.2
In [15]:
df_petal.head(3)
Out[15]:
Species Petal.Length Petal.Width
0 setosa 1.4 0.2
1 setosa 1.4 0.2
2 setosa 1.3 0.2
In [16]:
df_sepal.shape, df_petal.shape
Out[16]:
((150, 3), (150, 3))

Using concat

In [17]:
pd.concat([df_sepal, df_petal], axis=1).head()
Out[17]:
Species Sepal.Length Sepal.Width Species Petal.Length Petal.Width
0 setosa 5.1 3.5 setosa 1.4 0.2
1 setosa 4.9 3.0 setosa 1.4 0.2
2 setosa 4.7 3.2 setosa 1.3 0.2
3 setosa 4.6 3.1 setosa 1.5 0.2
4 setosa 5.0 3.6 setosa 1.4 0.2

Joining on a single unique column

Combining values for the same subject across different variables.

Make up dummy data sets for illustration

In [18]:
np.random.seed(123456)
pid1 = np.random.choice(100, 6, replace=False)
val1 = np.random.normal(10, 1, 6)
df1 = pd.DataFrame({'pid': pid1, 'val': val1})
df1
Out[18]:
pid val
0 22 10.044553
1 11 12.455353
2 59 9.895324
3 62 11.379907
4 33 11.083937
5 41 11.451440
In [19]:
pid2 = np.random.permutation(pid1)
val2 = np.random.normal(15, 1, 6)
df2 = pd.DataFrame({'pid': pid2, 'val': val2})
df2
Out[19]:
pid val
0 41 16.024180
1 11 15.569605
2 22 15.875906
3 62 12.788628
4 59 15.974466
5 33 12.993253

Using merge to join on columns containing unique information

Note that the pid order is not the same, so we cannot simply stack horizontally.

In [20]:
pd.merge(df1, df2, on='pid', suffixes=['_visit_1', '_visit_2'])
Out[20]:
pid val_visit_1 val_visit_2
0 22 10.044553 15.875906
1 11 12.455353 15.569605
2 59 9.895324 15.974466
3 62 11.379907 12.788628
4 33 11.083937 12.993253
5 41 11.451440 16.024180
In [21]:
df1a = df1.set_index('pid')  # we'll use these later
df2a = df2.set_index('pid')

Using merge on multiple columns

Sometimes we need multiple columns to define a unique identifier.

Note: In the data frames being merged, not all the unique identifiers need to be shared in both data frames. For example, when merging laboratory and clinical data, a patient may have laboratory results but no clinical results (perhaps due to a typo) or have clinical data but no laboratory data (because results are still being processed by the lab). We illustrate different ways to resolve the merge in this scenario in the example below.

In [22]:
df1['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df1 = df1[['pid', 'stim', 'val']]
df1
Out[22]:
pid stim val
0 22 cmv 10.044553
1 11 flu 12.455353
2 59 cmv 9.895324
3 62 flu 11.379907
4 33 flu 11.083937
5 41 flu 11.451440
In [23]:
df2['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df2 = df2[['pid', 'stim', 'val']]
df2
Out[23]:
pid stim val
0 41 flu 16.024180
1 11 cmv 15.569605
2 22 flu 15.875906
3 62 flu 12.788628
4 59 flu 15.974466
5 33 flu 12.993253
In [24]:
pd.merge(df1, df2, on = ['pid', 'stim'], suffixes = ['_visit_1', '_visit_2'])
Out[24]:
pid stim val_visit_1 val_visit_2
0 62 flu 11.379907 12.788628
1 33 flu 11.083937 12.993253
2 41 flu 11.451440 16.024180
In [25]:
pd.merge(df1, df2, on = ['pid', 'stim'], how = 'left', suffixes = ['_visit_1', '_visit_2'])
Out[25]:
pid stim val_visit_1 val_visit_2
0 22 cmv 10.044553 NaN
1 11 flu 12.455353 NaN
2 59 cmv 9.895324 NaN
3 62 flu 11.379907 12.788628
4 33 flu 11.083937 12.993253
5 41 flu 11.451440 16.024180
In [26]:
pd.merge(df1, df2, on = ['pid', 'stim'], how = 'right', suffixes = ['_visit_1', '_visit_2'])
Out[26]:
pid stim val_visit_1 val_visit_2
0 62 flu 11.379907 12.788628
1 33 flu 11.083937 12.993253
2 41 flu 11.451440 16.024180
3 11 cmv NaN 15.569605
4 22 flu NaN 15.875906
5 59 flu NaN 15.974466
In [27]:
pd.merge(df1, df2, on = ['pid', 'stim'], how = 'outer', suffixes = ['_visit_1', '_visit_2'])
Out[27]:
pid stim val_visit_1 val_visit_2
0 22 cmv 10.044553 NaN
1 11 flu 12.455353 NaN
2 59 cmv 9.895324 NaN
3 62 flu 11.379907 12.788628
4 33 flu 11.083937 12.993253
5 41 flu 11.451440 16.024180
6 11 cmv NaN 15.569605
7 22 flu NaN 15.875906
8 59 flu NaN 15.974466

Using join when indexes are the keys

Note that merge/join have a rich and powerful (but complex!) API, and they are deeply interconnected. Much more details can be found in the official documentation.

Note in particular this section:

merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.

The related DataFrame.join method, uses merge internally for the index-on-index (by default) and column(s)-on-index join. If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing.

In [28]:
df1a
Out[28]:
val
pid
22 10.044553
11 12.455353
59 9.895324
62 11.379907
33 11.083937
41 11.451440
In [29]:
df2a
Out[29]:
val
pid
41 16.024180
11 15.569605
22 15.875906
62 12.788628
59 15.974466
33 12.993253
In [30]:
df1a.join(df2a, lsuffix='_1', rsuffix='_2')
Out[30]:
val_1 val_2
pid
22 10.044553 15.875906
11 12.455353 15.569605
59 9.895324 15.974466
62 11.379907 12.788628
33 11.083937 12.993253
41 11.451440 16.024180

(2) Separate multiple values in a single column

Remember that for a tidy data frame, each column must contain a single variable.

In [31]:
from collections import OrderedDict
In [32]:
d = OrderedDict()
d['pid-visit-stim'] = ['1-1-cmv', '1-1-hiv', '1-2-cmv', '1-2-hiv', '1-3-cmv', '1-3-hiv', '2-1-cmv', '2-1-hiv', '2-2-cmv', '2-2-hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)
In [33]:
df.head()
Out[33]:
pid-visit-stim tnf ifn il2
0 1-1-cmv 1.0 11.0 0.0
1 1-1-hiv 2.0 12.0 0.0
2 1-2-cmv 1.1 11.1 0.1
3 1-2-hiv 2.1 12.1 0.1
4 1-3-cmv 1.2 11.2 0.2
In [34]:
df1 = pd.DataFrame(df['pid-visit-stim'].str.split('-').tolist(),
                   columns = ['pid', 'visit', 'stim'])
df1.head()
Out[34]:
pid visit stim
0 1 1 cmv
1 1 1 hiv
2 1 2 cmv
3 1 2 hiv
4 1 3 cmv
In [35]:
df1 = pd.concat([df1, df], axis=1)
df1.head()
Out[35]:
pid visit stim pid-visit-stim tnf ifn il2
0 1 1 cmv 1-1-cmv 1.0 11.0 0.0
1 1 1 hiv 1-1-hiv 2.0 12.0 0.0
2 1 2 cmv 1-2-cmv 1.1 11.1 0.1
3 1 2 hiv 1-2-hiv 2.1 12.1 0.1
4 1 3 cmv 1-3-cmv 1.2 11.2 0.2
In [36]:
df1.drop('pid-visit-stim', axis=1, inplace=True)
df1.head()
Out[36]:
pid visit stim tnf ifn il2
0 1 1 cmv 1.0 11.0 0.0
1 1 1 hiv 2.0 12.0 0.0
2 1 2 cmv 1.1 11.1 0.1
3 1 2 hiv 2.1 12.1 0.1
4 1 3 cmv 1.2 11.2 0.2

Wrap into a convenient function

In [37]:
def separate(df, column, sep):
    df1 = pd.DataFrame(df[column].str.split(sep).tolist(), columns = column.split(sep))
    df1 = pd.concat([df1, df], axis=1)
    return df1.drop(column, axis = 1)
In [38]:
separate(df, 'pid-visit-stim', '-').head()
Out[38]:
pid visit stim tnf ifn il2
0 1 1 cmv 1.0 11.0 0.0
1 1 1 hiv 2.0 12.0 0.0
2 1 2 cmv 1.1 11.1 0.1
3 1 2 hiv 2.1 12.1 0.1
4 1 3 cmv 1.2 11.2 0.2

Exercises

You are given three DataFrames d1, d2, and d3. Answer the questions below.

In [39]:
d1 = pd.DataFrame({'subject': ['a', 'b', 'c', 'd'], 'group': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
d2 = pd.DataFrame({'subject': ['w', 'x', 'y', 'z'], 'group': ['A', 'B', 'C', 'D'], 'value': [2, 1, 4, 3]})
d3 = pd.DataFrame({'subject': ['a', 'x', 'w', 'd', 'b', 'c', 'y', 'z'], 'age': [10, 20, 30, 40, 50, 60, 70, 80]})
In [40]:
d1
Out[40]:
group subject value
0 A a 1
1 B b 2
2 C c 3
3 D d 4
In [41]:
d2
Out[41]:
group subject value
0 A w 2
1 B x 1
2 C y 4
3 D z 3
In [42]:
d3
Out[42]:
age subject
0 10 a
1 20 x
2 30 w
3 40 d
4 50 b
5 60 c
6 70 y
7 80 z

1. Combine d1 and d2 to create a DataFrame called df1 with 8 rows and 3 columns.

2. Combine df1 with d3 to create a DataFrame called df2 with 8 rows and 4 columns.

3.. Convert the given DataFrame d from wide to tall format and save in df.

4. Name the column in df with group values group and remove the group_ part from the group name values in each cell, leaving only the number.