{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Manipulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** This is an edited version of [Cliburn Chan's](http://people.duke.edu/~ccc14/cfar-data-2016/ManipulatingDataFrames_Solutions.html) original tutorial. All changes remain licensed as the original, under the terms of the MIT license." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (1) Combining data sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining rows" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df_versiocolor = pd.read_csv('data/versicolor.csv')\n", "df_virginica = pd.read_csv('data/virginica.csv')\n", "df_sertosa = pd.read_csv('data/setosa.csv')\n", "dfs = [df_versiocolor, df_virginica, df_sertosa]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "0 7.0 3.2 4.7 1.4 versicolor\n", "1 6.4 3.2 4.5 1.5 versicolor\n", "2 6.9 3.1 4.9 1.5 versicolor" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_versiocolor.head(3)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "0 6.3 3.3 6.0 2.5 virginica\n", "1 5.8 2.7 5.1 1.9 virginica\n", "2 7.1 3.0 5.9 2.1 virginica" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_virginica.head(3)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sertosa.head(3)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(50, 5)\n", "(50, 5)\n", "(50, 5)\n" ] } ], "source": [ "for df in dfs:\n", " print(df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Each DataFrame only contains data about one species of iris" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['versicolor']\n", "['virginica']\n", "['setosa']\n" ] } ], "source": [ "for df in dfs:\n", " print(df.Species.unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Combine with `concat`\n", "\n", "We can append rows of one data frame to another using `concat`. Here `axis = 0` means combining by rows, in contrast to `axis = 1` which means combining by columns." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 5)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.concat(dfs, axis=0)\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note that the indices are maintained\n", "\n", "Here we have to use `iloc` because `ix` and `loc` work with the integer index values, not the positions, and there are no values equal to 50 or 51." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
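The index behaviour described above can be seen in a tiny self-contained sketch (dummy frames, not the iris data):

```python
import pandas as pd

# Two small stand-in frames (not the iris data), stacked row-wise.
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

stacked = pd.concat([a, b], axis=0)

# Each frame keeps its own index, so labels repeat: 0, 1, 0, 1
print(stacked.index.tolist())  # [0, 1, 0, 1]
```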
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "48 5.1 2.5 3.0 1.1 versicolor\n", "49 5.7 2.8 4.1 1.3 versicolor\n", "0 6.3 3.3 6.0 2.5 virginica\n", "1 5.8 2.7 5.1 1.9 virginica" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[48:52]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We can ask for a new consecutive index" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df = df.reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
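The same renumbering can also be requested during the concatenation itself; a small sketch with dummy frames, using the standard `ignore_index` flag of `pd.concat`:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# ignore_index=True builds the fresh 0..n-1 index in one step,
# so no separate reset_index call is needed.
stacked = pd.concat([a, b], axis=0, ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3]
```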
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "48 5.1 2.5 3.0 1.1 versicolor\n", "49 5.7 2.8 4.1 1.3 versicolor\n", "50 6.3 3.3 6.0 2.5 virginica\n", "51 5.8 2.7 5.1 1.9 virginica" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[48:52]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Combined DataFrame contains all 3 species" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['versicolor', 'virginica', 'setosa'], dtype=object)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Species.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining columns\n", "\n", "When combining rows, we are usually just adding new observations with the same variables and there is little ambiguity about what we are trying to do.\n", "\n", "When combining columns, we are usually trying to merge information based on some unique identifier, for example, to combine clinical and laboratory information for specific patients with unique PIDs. Often the ordering of the PIDs in the two data frames are not exactly lined up so \"horizontal stacking\" will not work. Usually we need to `merge` the data on a unique identifier from one or more columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Simplest case - rows line up perfectly across data frames" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "df_sepal = pd.read_csv('data/sepal.csv')\n", "df_petal = pd.read_csv('data/petal.csv')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Species Sepal.Length Sepal.Width\n", "0 setosa 5.1 3.5\n", "1 setosa 4.9 3.0\n", "2 setosa 4.7 3.2" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sepal.head(3)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Species Petal.Length Petal.Width\n", "0 setosa 1.4 0.2\n", "1 setosa 1.4 0.2\n", "2 setosa 1.3 0.2" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_petal.head(3)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((150, 3), (150, 3))" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sepal.shape, df_petal.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using `concat`" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
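One caveat worth remembering: `concat` with `axis=1` aligns rows by index label, not by position. A quick sketch with made-up frames shows that non-matching labels produce missing values:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'b': [10, 20]}, index=[1, 2])

# Rows are matched on index labels, so label 0 has no 'b'
# and label 2 has no 'a'.
wide = pd.concat([left, right], axis=1)
print(wide)
```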
" ], "text/plain": [ " Species Sepal.Length Sepal.Width Species Petal.Length Petal.Width\n", "0 setosa 5.1 3.5 setosa 1.4 0.2\n", "1 setosa 4.9 3.0 setosa 1.4 0.2\n", "2 setosa 4.7 3.2 setosa 1.3 0.2\n", "3 setosa 4.6 3.1 setosa 1.5 0.2\n", "4 setosa 5.0 3.6 setosa 1.4 0.2" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df_sepal, df_petal], axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joining on a single unique column\n", "\n", "Combining values for the same subject across different variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make up dummy data sets for illustration" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid val\n", "0 22 10.044553\n", "1 11 12.455353\n", "2 59 9.895324\n", "3 62 11.379907\n", "4 33 11.083937\n", "5 41 11.451440" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(123456)\n", "pid1 = np.random.choice(100, 6, replace=False)\n", "val1 = np.random.normal(10, 1, 6)\n", "df1 = pd.DataFrame({'pid': pid1, 'val': val1})\n", "df1" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid val\n", "0 41 16.024180\n", "1 11 15.569605\n", "2 22 15.875906\n", "3 62 12.788628\n", "4 59 15.974466\n", "5 33 12.993253" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pid2 = np.random.permutation(pid1)\n", "val2 = np.random.normal(15, 1, 6)\n", "df2 = pd.DataFrame({'pid': pid2, 'val': val2})\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using merge to join on columns containing unique information\n", "\n", "Note that the `pid` order is not the same, so we cannot simply stack horizontally." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
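When a merge drops or duplicates rows unexpectedly, the standard `indicator=True` option of `pd.merge` is handy for debugging; a sketch with made-up visit data (not the frames above):

```python
import pandas as pd

visit1 = pd.DataFrame({'pid': [11, 22, 33], 'val': [1.0, 2.0, 3.0]})
visit2 = pd.DataFrame({'pid': [22, 33, 44], 'val': [2.5, 3.5, 4.5]})

# indicator=True adds a _merge column saying which side each row
# came from: 'both', 'left_only', or 'right_only'.
m = pd.merge(visit1, visit2, on='pid', how='outer',
             suffixes=['_visit_1', '_visit_2'], indicator=True)
print(m[['pid', '_merge']])
```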
" ], "text/plain": [ " pid val_visit_1 val_visit_2\n", "0 22 10.044553 15.875906\n", "1 11 12.455353 15.569605\n", "2 59 9.895324 15.974466\n", "3 62 11.379907 12.788628\n", "4 33 11.083937 12.993253\n", "5 41 11.451440 16.024180" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, on='pid', suffixes=['_visit_1', '_visit_2'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "df1a = df1.set_index('pid') # we'll use these later\n", "df2a = df2.set_index('pid')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using merge on multiple columns\n", "\n", "Sometimes we need multiple columns to define a unique identifier. \n", "\n", "**Note**: In the data frames being merged, not all the unique identifiers need to be shared in both data frames. For example, when merging laboratory and clinical data, a patient may have laboratory results but no clinical results (perhaps due to a typo) or have clinical data but no laboratory data (because results are still being processed by the lab). We illustrate different ways to resolve the merge in this scenario in the example below." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid stim val\n", "0 22 cmv 10.044553\n", "1 11 flu 12.455353\n", "2 59 cmv 9.895324\n", "3 62 flu 11.379907\n", "4 33 flu 11.083937\n", "5 41 flu 11.451440" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)\n", "df1 = df1[['pid', 'stim', 'val']]\n", "df1" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid stim val\n", "0 41 flu 16.024180\n", "1 11 cmv 15.569605\n", "2 22 flu 15.875906\n", "3 62 flu 12.788628\n", "4 59 flu 15.974466\n", "5 33 flu 12.993253" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)\n", "df2 = df2[['pid', 'stim', 'val']]\n", "df2" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid stim val_visit_1 val_visit_2\n", "0 62 flu 11.379907 12.788628\n", "1 33 flu 11.083937 12.993253\n", "2 41 flu 11.451440 16.024180" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, on = ['pid', 'stim'], suffixes = ['_visit_1', '_visit_2'])" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid stim val_visit_1 val_visit_2\n", "0 22 cmv 10.044553 NaN\n", "1 11 flu 12.455353 NaN\n", "2 59 cmv 9.895324 NaN\n", "3 62 flu 11.379907 12.788628\n", "4 33 flu 11.083937 12.993253\n", "5 41 flu 11.451440 16.024180" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, on = ['pid', 'stim'], how = 'left', suffixes = ['_visit_1', '_visit_2'])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid stim val_visit_1 val_visit_2\n", "0 62 flu 11.379907 12.788628\n", "1 33 flu 11.083937 12.993253\n", "2 41 flu 11.451440 16.024180\n", "3 11 cmv NaN 15.569605\n", "4 22 flu NaN 15.875906\n", "5 59 flu NaN 15.974466" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, on = ['pid', 'stim'], how = 'right', suffixes = ['_visit_1', '_visit_2'])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid stim val_visit_1 val_visit_2\n", "0 22 cmv 10.044553 NaN\n", "1 11 flu 12.455353 NaN\n", "2 59 cmv 9.895324 NaN\n", "3 62 flu 11.379907 12.788628\n", "4 33 flu 11.083937 12.993253\n", "5 41 flu 11.451440 16.024180\n", "6 11 cmv NaN 15.569605\n", "7 22 flu NaN 15.875906\n", "8 59 flu NaN 15.974466" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, on = ['pid', 'stim'], how = 'outer', suffixes = ['_visit_1', '_visit_2'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using `join` when indexes are the keys\n", "\n", "Note that merge/join have a rich and powerful (but complex!) API, and they are deeply interconnected. Much more details can be found in [the official documentation](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging).\n", "\n", "Note in particular this section:\n", "\n", "> `merge` is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.\n", "> \n", "> The related `DataFrame.join` method, uses `merge` internally for the index-on-index (by default) and column(s)-on-index join. If you are joining on index only, you may wish to use `DataFrame.join` to save yourself some typing." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
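`DataFrame.join` defaults to a left join on the index; like `merge`, it accepts a `how` argument. A small sketch with dummy indexed frames:

```python
import pandas as pd

a = pd.DataFrame({'val_1': [1.0, 2.0]}, index=pd.Index([11, 22], name='pid'))
b = pd.DataFrame({'val_2': [20.0]}, index=pd.Index([22], name='pid'))

# Default how='left': every row of `a` is kept; pids missing
# from `b` get NaN in the joined columns.
joined = a.join(b)
print(joined)
```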
" ], "text/plain": [ " val\n", "pid \n", "22 10.044553\n", "11 12.455353\n", "59 9.895324\n", "62 11.379907\n", "33 11.083937\n", "41 11.451440" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1a" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " val\n", "pid \n", "41 16.024180\n", "11 15.569605\n", "22 15.875906\n", "62 12.788628\n", "59 15.974466\n", "33 12.993253" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2a" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " val_1 val_2\n", "pid \n", "22 10.044553 15.875906\n", "11 12.455353 15.569605\n", "59 9.895324 15.974466\n", "62 11.379907 12.788628\n", "33 11.083937 12.993253\n", "41 11.451440 16.024180" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1a.join(df2a, lsuffix='_1', rsuffix='_2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (2) Separate multiple values in a single column\n", "\n", "Remember that for a tidy data frame, each column must contain a single variable. " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from collections import OrderedDict" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "d = OrderedDict()\n", "d['pid-visit-stim'] = ['1-1-cmv', '1-1-hiv', '1-2-cmv', '1-2-hiv', '1-3-cmv', '1-3-hiv', '2-1-cmv', '2-1-hiv', '2-2-cmv', '2-2-hiv']\n", "d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]\n", "d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]\n", "d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]\n", "df = pd.DataFrame(d)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid-visit-stim tnf ifn il2\n", "0 1-1-cmv 1.0 11.0 0.0\n", "1 1-1-hiv 2.0 12.0 0.0\n", "2 1-2-cmv 1.1 11.1 0.1\n", "3 1-2-hiv 2.1 12.1 0.1\n", "4 1-3-cmv 1.2 11.2 0.2" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid visit stim\n", "0 1 1 cmv\n", "1 1 1 hiv\n", "2 1 2 cmv\n", "3 1 2 hiv\n", "4 1 3 cmv" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame(df['pid-visit-stim'].str.split('-').tolist(), \n", " columns = ['pid', 'visit', 'stim'])\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid visit stim pid-visit-stim tnf ifn il2\n", "0 1 1 cmv 1-1-cmv 1.0 11.0 0.0\n", "1 1 1 hiv 1-1-hiv 2.0 12.0 0.0\n", "2 1 2 cmv 1-2-cmv 1.1 11.1 0.1\n", "3 1 2 hiv 1-2-hiv 2.1 12.1 0.1\n", "4 1 3 cmv 1-3-cmv 1.2 11.2 0.2" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.concat([df1, df], axis=1)\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pid visit stim tnf ifn il2\n", "0 1 1 cmv 1.0 11.0 0.0\n", "1 1 1 hiv 2.0 12.0 0.0\n", "2 1 2 cmv 1.1 11.1 0.1\n", "3 1 2 hiv 2.1 12.1 0.1\n", "4 1 3 cmv 1.2 11.2 0.2" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.drop('pid-visit-stim', axis=1, inplace=True)\n", "df1.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wrap into a convenient function" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "def separate(df, column, sep):\n", " df1 = pd.DataFrame(df[column].str.split(sep).tolist(), columns = column.split(sep))\n", " df1 = pd.concat([df1, df], axis=1)\n", " return df1.drop(column, axis = 1)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
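A more idiomatic sketch of the same idea (`separate2` is our own name, not part of the original tutorial) uses `str.split(..., expand=True)`, which returns the pieces as a DataFrame directly:

```python
import pandas as pd

def separate2(df, column, sep):
    # expand=True makes str.split return one column per piece
    parts = df[column].str.split(sep, expand=True)
    parts.columns = column.split(sep)
    return pd.concat([parts, df.drop(column, axis=1)], axis=1)

demo = pd.DataFrame({'pid-visit-stim': ['1-1-cmv', '1-2-hiv'],
                     'tnf': [1.0, 2.0]})
print(separate2(demo, 'pid-visit-stim', '-'))
```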
" ], "text/plain": [ " pid visit stim tnf ifn il2\n", "0 1 1 cmv 1.0 11.0 0.0\n", "1 1 1 hiv 2.0 12.0 0.0\n", "2 1 2 cmv 1.1 11.1 0.1\n", "3 1 2 hiv 2.1 12.1 0.1\n", "4 1 3 cmv 1.2 11.2 0.2" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "separate(df, 'pid-visit-stim', '-').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are given three DataFrames `d1`, `d2`, and `d3`. Answer the questions below." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "d1 = pd.DataFrame({'subject': ['a', 'b', 'c', 'd'], 'group': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})\n", "d2 = pd.DataFrame({'subject': ['w', 'x', 'y', 'z'], 'group': ['A', 'B', 'C', 'D'], 'value': [2, 1, 4, 3]})\n", "d3 = pd.DataFrame({'subject': ['a', 'x', 'w', 'd', 'b', 'c', 'y', 'z'], 'age': [10, 20, 30, 40, 50, 60, 70, 80]})" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group subject value\n", "0 A a 1\n", "1 B b 2\n", "2 C c 3\n", "3 D d 4" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group subject value\n", "0 A w 2\n", "1 B x 1\n", "2 C y 4\n", "3 D z 3" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d2" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age subject\n", "0 10 a\n", "1 20 x\n", "2 30 w\n", "3 40 d\n", "4 50 b\n", "5 60 c\n", "6 70 y\n", "7 80 z\n" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1**. Combine `d1` and `d2` to create a DataFrame called `df1` with 8 rows and 3 columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2**. Combine `df1` with `d3` to create a DataFrame called `df2` with 8 rows and 4 columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3**. Convert the given DataFrame `d` from wide to tall format and save the result in `df`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4**. Rename the column in `df` that holds the group labels to `group`, and remove the `group_` prefix from the values in each cell, leaving only the number." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 } }, "nbformat": 4, "nbformat_minor": 1 }