Data Manipulation¶
Note: This is an edited version of Cliburn Chan’s original tutorial. All changes remain licensed as the original, under the terms of the MIT license.
In [1]:
import numpy as np
import pandas as pd
(1) Combining data sets¶
Combining rows¶
In [2]:
df_versicolor = pd.read_csv('data/versicolor.csv')
df_virginica = pd.read_csv('data/virginica.csv')
df_setosa = pd.read_csv('data/setosa.csv')
dfs = [df_versicolor, df_virginica, df_setosa]
In [3]:
df_versicolor.head(3)
Out[3]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 0 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor | 
| 1 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor | 
| 2 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor | 
In [4]:
df_virginica.head(3)
Out[4]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 0 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 
| 1 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 
| 2 | 7.1 | 3.0 | 5.9 | 2.1 | virginica | 
In [5]:
df_setosa.head(3)
Out[5]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 
In [6]:
for df in dfs:
    print(df.shape)
(50, 5)
(50, 5)
(50, 5)
Each DataFrame only contains data about one species of iris¶
In [7]:
for df in dfs:
    print(df.Species.unique())
['versicolor']
['virginica']
['setosa']
Combine with concat¶
We can append the rows of one data frame to another using concat. Here
axis=0 means combining by rows, in contrast to axis=1, which
means combining by columns.
In [8]:
df = pd.concat(dfs, axis=0)
df.shape
Out[8]:
(150, 5)
Note that the indices are maintained¶
Here we have to use iloc because loc selects rows by index label rather
than by position, and after the concatenation no row carries the label
50 or 51 (each original frame was indexed 0 through 49).
In [9]:
df.iloc[48:52]
Out[9]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 48 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor | 
| 49 | 5.7 | 2.8 | 4.1 | 1.3 | versicolor | 
| 0 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 
| 1 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 
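Because the labels are duplicated, label-based selection behaves differently from positional selection. A quick illustration (a sketch, not part of the original notebook):
df.loc[48]      # returns three rows, one from each original DataFrame
# df.loc[50]    # would raise a KeyError - no row carries the label 50
df.iloc[48:52]  # positional slicing works regardless of the labels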
We can ask for a new consecutive index¶
In [10]:
df = df.reset_index(drop=True)
In [11]:
df.iloc[48:52]
Out[11]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 48 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor | 
| 49 | 5.7 | 2.8 | 4.1 | 1.3 | versicolor | 
| 50 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 
| 51 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 
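Alternatively, concat can build a fresh consecutive index directly. A minimal sketch, equivalent to the reset_index call above:
pd.concat(dfs, axis=0, ignore_index=True).iloc[48:52]  # labels run 0..149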
Combined DataFrame contains all 3 species¶
In [12]:
df.Species.unique()
Out[12]:
array(['versicolor', 'virginica', 'setosa'], dtype=object)
Combining columns¶
When combining rows, we are usually just adding new observations with the same variables, so there is little ambiguity about what we are trying to do.
When combining columns, we are usually trying to merge information based
on some unique identifier, for example, to combine clinical and
laboratory information for specific patients with unique PIDs. Often the
ordering of the PIDs in the two data frames is not exactly lined up, so
“horizontal stacking” will not work. Usually we need to merge the
data on a unique identifier from one or more columns.
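For instance, here is a minimal sketch (with made-up clinical and lab data, not part of the original tutorial) of merging on a shared pid column even though the rows are in different orders:
# hypothetical data for illustration only
clinical = pd.DataFrame({'pid': [101, 102, 103], 'age': [34, 51, 29]})
lab = pd.DataFrame({'pid': [103, 101, 102], 'wbc': [5.2, 6.8, 7.1]})
pd.merge(clinical, lab, on='pid')  # rows are aligned by pid, not by position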
Simplest case - rows line up perfectly across data frames¶
In [13]:
df_sepal = pd.read_csv('data/sepal.csv')
df_petal = pd.read_csv('data/petal.csv')
In [14]:
df_sepal.head(3)
Out[14]:
| Species | Sepal.Length | Sepal.Width | |
|---|---|---|---|
| 0 | setosa | 5.1 | 3.5 | 
| 1 | setosa | 4.9 | 3.0 | 
| 2 | setosa | 4.7 | 3.2 | 
In [15]:
df_petal.head(3)
Out[15]:
| Species | Petal.Length | Petal.Width | |
|---|---|---|---|
| 0 | setosa | 1.4 | 0.2 | 
| 1 | setosa | 1.4 | 0.2 | 
| 2 | setosa | 1.3 | 0.2 | 
In [16]:
df_sepal.shape, df_petal.shape
Out[16]:
((150, 3), (150, 3))
Using concat¶
In [17]:
pd.concat([df_sepal, df_petal], axis=1).head()
Out[17]:
| Species | Sepal.Length | Sepal.Width | Species | Petal.Length | Petal.Width | |
|---|---|---|---|---|---|---|
| 0 | setosa | 5.1 | 3.5 | setosa | 1.4 | 0.2 | 
| 1 | setosa | 4.9 | 3.0 | setosa | 1.4 | 0.2 | 
| 2 | setosa | 4.7 | 3.2 | setosa | 1.3 | 0.2 | 
| 3 | setosa | 4.6 | 3.1 | setosa | 1.5 | 0.2 | 
| 4 | setosa | 5.0 | 3.6 | setosa | 1.4 | 0.2 | 
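Note that the Species column now appears twice. One way to avoid the duplication (a small sketch, not in the original notebook) is to drop the repeated column from one frame before concatenating:
pd.concat([df_sepal, df_petal.drop('Species', axis=1)], axis=1).head()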
Joining on a single unique column¶
Combining values for the same subject that are stored in separate data frames (for example, measurements from different visits).
Make up dummy data sets for illustration¶
In [18]:
np.random.seed(123456)
pid1 = np.random.choice(100, 6, replace=False)
val1 = np.random.normal(10, 1, 6)
df1 = pd.DataFrame({'pid': pid1, 'val': val1})
df1
Out[18]:
| pid | val | |
|---|---|---|
| 0 | 22 | 10.044553 | 
| 1 | 11 | 12.455353 | 
| 2 | 59 | 9.895324 | 
| 3 | 62 | 11.379907 | 
| 4 | 33 | 11.083937 | 
| 5 | 41 | 11.451440 | 
In [19]:
pid2 = np.random.permutation(pid1)
val2 = np.random.normal(15, 1, 6)
df2 = pd.DataFrame({'pid': pid2, 'val': val2})
df2
Out[19]:
| pid | val | |
|---|---|---|
| 0 | 41 | 16.024180 | 
| 1 | 11 | 15.569605 | 
| 2 | 22 | 15.875906 | 
| 3 | 62 | 12.788628 | 
| 4 | 59 | 15.974466 | 
| 5 | 33 | 12.993253 | 
Using merge to join on columns containing unique information¶
Note that the pid order is not the same, so we cannot simply stack
horizontally.
In [20]:
pd.merge(df1, df2, on='pid', suffixes=['_visit_1', '_visit_2'])
Out[20]:
| pid | val_visit_1 | val_visit_2 | |
|---|---|---|---|
| 0 | 22 | 10.044553 | 15.875906 | 
| 1 | 11 | 12.455353 | 15.569605 | 
| 2 | 59 | 9.895324 | 15.974466 | 
| 3 | 62 | 11.379907 | 12.788628 | 
| 4 | 33 | 11.083937 | 12.993253 | 
| 5 | 41 | 11.451440 | 16.024180 | 
In [21]:
df1a = df1.set_index('pid')  # we'll use these later
df2a = df2.set_index('pid')
Using merge on multiple columns¶
Sometimes we need multiple columns to define a unique identifier.
Note: Not every unique identifier needs to appear in both of the data frames being merged. For example, when merging laboratory and clinical data, a patient may have laboratory results but no clinical results (perhaps due to a typo), or clinical data but no laboratory data (because results are still being processed by the lab). The examples below illustrate different ways to resolve the merge in this scenario.
In [22]:
df1['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df1 = df1[['pid', 'stim', 'val']]
df1
Out[22]:
| pid | stim | val | |
|---|---|---|---|
| 0 | 22 | cmv | 10.044553 | 
| 1 | 11 | flu | 12.455353 | 
| 2 | 59 | cmv | 9.895324 | 
| 3 | 62 | flu | 11.379907 | 
| 4 | 33 | flu | 11.083937 | 
| 5 | 41 | flu | 11.451440 | 
In [23]:
df2['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df2 = df2[['pid', 'stim', 'val']]
df2
Out[23]:
| pid | stim | val | |
|---|---|---|---|
| 0 | 41 | flu | 16.024180 | 
| 1 | 11 | cmv | 15.569605 | 
| 2 | 22 | flu | 15.875906 | 
| 3 | 62 | flu | 12.788628 | 
| 4 | 59 | flu | 15.974466 | 
| 5 | 33 | flu | 12.993253 | 
In [24]:
pd.merge(df1, df2, on=['pid', 'stim'], suffixes=['_visit_1', '_visit_2'])
Out[24]:
| pid | stim | val_visit_1 | val_visit_2 | |
|---|---|---|---|---|
| 0 | 62 | flu | 11.379907 | 12.788628 | 
| 1 | 33 | flu | 11.083937 | 12.993253 | 
| 2 | 41 | flu | 11.451440 | 16.024180 | 
In [25]:
pd.merge(df1, df2, on=['pid', 'stim'], how='left', suffixes=['_visit_1', '_visit_2'])
Out[25]:
| pid | stim | val_visit_1 | val_visit_2 | |
|---|---|---|---|---|
| 0 | 22 | cmv | 10.044553 | NaN | 
| 1 | 11 | flu | 12.455353 | NaN | 
| 2 | 59 | cmv | 9.895324 | NaN | 
| 3 | 62 | flu | 11.379907 | 12.788628 | 
| 4 | 33 | flu | 11.083937 | 12.993253 | 
| 5 | 41 | flu | 11.451440 | 16.024180 | 
In [26]:
pd.merge(df1, df2, on=['pid', 'stim'], how='right', suffixes=['_visit_1', '_visit_2'])
Out[26]:
| pid | stim | val_visit_1 | val_visit_2 | |
|---|---|---|---|---|
| 0 | 62 | flu | 11.379907 | 12.788628 | 
| 1 | 33 | flu | 11.083937 | 12.993253 | 
| 2 | 41 | flu | 11.451440 | 16.024180 | 
| 3 | 11 | cmv | NaN | 15.569605 | 
| 4 | 22 | flu | NaN | 15.875906 | 
| 5 | 59 | flu | NaN | 15.974466 | 
In [27]:
pd.merge(df1, df2, on=['pid', 'stim'], how='outer', suffixes=['_visit_1', '_visit_2'])
Out[27]:
| pid | stim | val_visit_1 | val_visit_2 | |
|---|---|---|---|---|
| 0 | 22 | cmv | 10.044553 | NaN | 
| 1 | 11 | flu | 12.455353 | NaN | 
| 2 | 59 | cmv | 9.895324 | NaN | 
| 3 | 62 | flu | 11.379907 | 12.788628 | 
| 4 | 33 | flu | 11.083937 | 12.993253 | 
| 5 | 41 | flu | 11.451440 | 16.024180 | 
| 6 | 11 | cmv | NaN | 15.569605 | 
| 7 | 22 | flu | NaN | 15.875906 | 
| 8 | 59 | flu | NaN | 15.974466 | 
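If it is unclear which side an unmatched row came from, merge also accepts indicator=True, which adds a _merge column labelling each row as left_only, right_only, or both. A quick sketch:
# same outer merge as above, with provenance information per row
pd.merge(df1, df2, on=['pid', 'stim'], how='outer',
         suffixes=['_visit_1', '_visit_2'], indicator=True)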
Using join when indexes are the keys¶
Note that merge and join have a rich and powerful (but complex!) API, and they are deeply interconnected. Much more detail can be found in the official documentation.
Note in particular this section:
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join. The related DataFrame.join method uses merge internally for the index-on-index (by default) and column(s)-on-index join. If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing.
In [28]:
df1a
Out[28]:
| val | |
|---|---|
| pid | |
| 22 | 10.044553 | 
| 11 | 12.455353 | 
| 59 | 9.895324 | 
| 62 | 11.379907 | 
| 33 | 11.083937 | 
| 41 | 11.451440 | 
In [29]:
df2a
Out[29]:
| val | |
|---|---|
| pid | |
| 41 | 16.024180 | 
| 11 | 15.569605 | 
| 22 | 15.875906 | 
| 62 | 12.788628 | 
| 59 | 15.974466 | 
| 33 | 12.993253 | 
In [30]:
df1a.join(df2a, lsuffix='_1', rsuffix='_2')
Out[30]:
| val_1 | val_2 | |
|---|---|---|
| pid | ||
| 22 | 10.044553 | 15.875906 | 
| 11 | 12.455353 | 15.569605 | 
| 59 | 9.895324 | 15.974466 | 
| 62 | 11.379907 | 12.788628 | 
| 33 | 11.083937 | 12.993253 | 
| 41 | 11.451440 | 16.024180 | 
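DataFrame.join defaults to a left join on the index; other join types can be requested with the how argument, just as with merge. A small sketch:
df1a.join(df2a, how='inner', lsuffix='_1', rsuffix='_2')  # keep only pids present in both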
(2) Separate multiple values in a single column¶
Remember that for a tidy data frame, each column must contain a single variable.
In [31]:
from collections import OrderedDict
In [32]:
d = OrderedDict()
d['pid-visit-stim'] = ['1-1-cmv', '1-1-hiv', '1-2-cmv', '1-2-hiv', '1-3-cmv', '1-3-hiv', '2-1-cmv', '2-1-hiv', '2-2-cmv', '2-2-hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)
In [33]:
df.head()
Out[33]:
| pid-visit-stim | tnf | ifn | il2 | |
|---|---|---|---|---|
| 0 | 1-1-cmv | 1.0 | 11.0 | 0.0 | 
| 1 | 1-1-hiv | 2.0 | 12.0 | 0.0 | 
| 2 | 1-2-cmv | 1.1 | 11.1 | 0.1 | 
| 3 | 1-2-hiv | 2.1 | 12.1 | 0.1 | 
| 4 | 1-3-cmv | 1.2 | 11.2 | 0.2 | 
In [34]:
df1 = pd.DataFrame(df['pid-visit-stim'].str.split('-').tolist(),
                   columns=['pid', 'visit', 'stim'])
df1.head()
Out[34]:
| pid | visit | stim | |
|---|---|---|---|
| 0 | 1 | 1 | cmv | 
| 1 | 1 | 1 | hiv | 
| 2 | 1 | 2 | cmv | 
| 3 | 1 | 2 | hiv | 
| 4 | 1 | 3 | cmv | 
In [35]:
df1 = pd.concat([df1, df], axis=1)
df1.head()
Out[35]:
| pid | visit | stim | pid-visit-stim | tnf | ifn | il2 | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | cmv | 1-1-cmv | 1.0 | 11.0 | 0.0 | 
| 1 | 1 | 1 | hiv | 1-1-hiv | 2.0 | 12.0 | 0.0 | 
| 2 | 1 | 2 | cmv | 1-2-cmv | 1.1 | 11.1 | 0.1 | 
| 3 | 1 | 2 | hiv | 1-2-hiv | 2.1 | 12.1 | 0.1 | 
| 4 | 1 | 3 | cmv | 1-3-cmv | 1.2 | 11.2 | 0.2 | 
In [36]:
df1.drop('pid-visit-stim', axis=1, inplace=True)
df1.head()
Out[36]:
| pid | visit | stim | tnf | ifn | il2 | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 | 
| 1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 | 
| 2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 | 
| 3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 | 
| 4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 | 
Wrap into a convenient function¶
In [37]:
def separate(df, column, sep):
    # Split the combined column into its parts; the new column names come
    # from splitting the combined column name on the same separator.
    df1 = pd.DataFrame(df[column].str.split(sep).tolist(),
                       columns=column.split(sep))
    # Put the new columns in front of the original data, then drop the
    # combined column.
    df1 = pd.concat([df1, df], axis=1)
    return df1.drop(column, axis=1)
In [38]:
separate(df, 'pid-visit-stim', '-').head()
Out[38]:
| pid | visit | stim | tnf | ifn | il2 | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 | 
| 1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 | 
| 2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 | 
| 3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 | 
| 4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 | 
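As an aside, pandas can also do the split step directly: str.split with expand=True returns a DataFrame, which avoids the tolist() round trip. A sketch that reproduces the result above with the same column and separator:
parts = df['pid-visit-stim'].str.split('-', expand=True)  # DataFrame with columns 0, 1, 2
parts.columns = ['pid', 'visit', 'stim']
pd.concat([parts, df.drop('pid-visit-stim', axis=1)], axis=1).head()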
Exercises¶
You are given three DataFrames d1, d2, and d3. Answer the
questions below.
In [39]:
d1 = pd.DataFrame({'subject': ['a', 'b', 'c', 'd'], 'group': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
d2 = pd.DataFrame({'subject': ['w', 'x', 'y', 'z'], 'group': ['A', 'B', 'C', 'D'], 'value': [2, 1, 4, 3]})
d3 = pd.DataFrame({'subject': ['a', 'x', 'w', 'd', 'b', 'c', 'y', 'z'], 'age': [10, 20, 30, 40, 50, 60, 70, 80]})
In [40]:
d1
Out[40]:
| group | subject | value | |
|---|---|---|---|
| 0 | A | a | 1 | 
| 1 | B | b | 2 | 
| 2 | C | c | 3 | 
| 3 | D | d | 4 | 
In [41]:
d2
Out[41]:
| group | subject | value | |
|---|---|---|---|
| 0 | A | w | 2 | 
| 1 | B | x | 1 | 
| 2 | C | y | 4 | 
| 3 | D | z | 3 | 
In [42]:
d3
Out[42]:
| age | subject | |
|---|---|---|
| 0 | 10 | a | 
| 1 | 20 | x | 
| 2 | 30 | w | 
| 3 | 40 | d | 
| 4 | 50 | b | 
| 5 | 60 | c | 
| 6 | 70 | y | 
| 7 | 80 | z | 
1. Combine d1 and d2 to create a DataFrame called df1 with 8 rows and 3 columns.
2. Combine df1 with d3 to create a DataFrame called df2 with 8 rows and 4 columns.
3. Convert the given DataFrame d from wide to tall format and save the result in df.
4. Name the column in df that holds the group values group, and remove the group_ prefix from the values in each cell, leaving only the number.