Exploratory analyses #1 & #2: the effects of place and time¶
We begin with an exploratory analysis of the various factors we hypothesize may influence a company's success. In this notebook, we investigate how two outcome measures, company valuation at IPO and total funding amount, are associated with office location and timing.
Set-up¶
Import stand-alone functions and necessary packages.
In [1]:
from modules import *
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
plt.style.use('seaborn-dark')
plt.rcParams['figure.figsize'] = (10, 6)
Connect to the MySQL database, read the cb_offices and cb_ipos tables into dataframes, and disconnect from the database.
In [3]:
conn = dbConnect()
offices = dbTableToDataFrame(conn, 'cb_offices')
ipos = dbTableToDataFrame(conn, 'cb_ipos')
conn.close()
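For reference, here is a minimal sketch of what the two helpers imported from modules might look like, assuming a local MySQL instance and the pymysql driver; the connection parameters are placeholders, and the real implementations live in modules.py.

import pymysql
import pandas as pd

def dbConnect():
    # placeholder credentials for a local MySQL instance
    return pymysql.connect(host='localhost', user='user',
                           password='password', db='crunchbase')

def dbTableToDataFrame(conn, table):
    # read an entire table into a pandas dataframe
    return pd.read_sql('SELECT * FROM ' + table, conn)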
Look at offices and ipos dataframes & merge them¶
In [4]:
offices.head()
Out[4]:
 | address1 | address2 | city | country_code | created_at | description | id | latitude | longitude | object_id | office_id | region | state_code | updated_at | zip_code
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 710 - 2nd Avenue | Suite 1100 | Seattle | USA | None | | 1 | 47.6031220000 | -122.3332530000 | c:1 | 1 | Seattle | WA | None | 98104
1 | 4900 Hopyard Rd | Suite 310 | Pleasanton | USA | None | Headquarters | 2 | 37.6929340000 | -121.9049450000 | c:3 | 3 | SF Bay | CA | None | 94588
2 | 135 Mississippi St | None | San Francisco | USA | None | None | 3 | 37.7647260000 | -122.3945230000 | c:4 | 4 | SF Bay | CA | None | 94107
3 | 1601 Willow Road | None | Menlo Park | USA | None | Headquarters | 4 | 37.4160500000 | -122.1518010000 | c:5 | 5 | SF Bay | CA | None | 94025
4 | Suite 200 | 654 High Street | Palo Alto | ISR | None | | 5 | None | None | c:7 | 7 | SF Bay | CA | None | 94301
In [5]:
ipos.head()
Out[5]:
 | created_at | id | ipo_id | object_id | public_at | raised_amount | raised_currency_code | source_description | source_url | stock_symbol | updated_at | valuation_amount | valuation_currency_code
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2008-02-09 05:17:45 | 1 | 1 | c:1654 | 1980-12-19 | None | USD | None | None | NASDAQ:AAPL | 2012-04-12 04:02:59 | None | USD |
1 | 2008-02-09 05:25:18 | 2 | 2 | c:1242 | 1986-03-13 | None | None | None | None | NASDAQ:MSFT | 2010-12-11 12:39:46 | None | USD |
2 | 2008-02-09 05:40:32 | 3 | 3 | c:342 | 1969-06-09 | None | None | None | None | NYSE:DIS | 2010-12-23 08:58:16 | None | USD |
3 | 2008-02-10 22:51:24 | 4 | 4 | c:59 | 2004-08-25 | None | None | None | None | NASDAQ:GOOG | 2011-08-01 20:47:08 | None | USD |
4 | 2008-02-10 23:28:09 | 5 | 5 | c:317 | 1997-05-01 | None | None | None | None | NASDAQ:AMZN | 2011-08-01 21:11:22 | 100000000000 | USD |
In [6]:
# merge on the 'object_id' field, which links records across the two dataframes
offices_ipos = pd.merge(offices, ipos, on='object_id')
In [7]:
# look at new merged dataframe
offices_ipos.head()
Out[7]:
 | address1 | address2 | city | country_code | created_at_x | description | id_x | latitude | longitude | object_id | ... | ipo_id | public_at | raised_amount | raised_currency_code | source_description | source_url | stock_symbol | updated_at_y | valuation_amount | valuation_currency_code
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1601 Willow Road | None | Menlo Park | USA | None | Headquarters | 4 | 37.4160500000 | -122.1518010000 | c:5 | ... | 847 | 2012-05-18 | 18400000000 | USD | Facebook Prices IPO at Record Value | http://online.wsj.com/news/articles/SB10001424... | NASDAQ:FB | 2013-11-21 19:40:55 | 104000000000 | USD |
1 | None | None | Dublin | IRL | None | Europe HQ | 6975 | 53.3441040000 | -6.2674940000 | c:5 | ... | 847 | 2012-05-18 | 18400000000 | USD | Facebook Prices IPO at Record Value | http://online.wsj.com/news/articles/SB10001424... | NASDAQ:FB | 2013-11-21 19:40:55 | 104000000000 | USD |
2 | 340 Madison Ave | None | New York | USA | None | New York | 9084 | 40.7557162000 | -73.9792469000 | c:5 | ... | 847 | 2012-05-18 | 18400000000 | USD | Facebook Prices IPO at Record Value | http://online.wsj.com/news/articles/SB10001424... | NASDAQ:FB | 2013-11-21 19:40:55 | 104000000000 | USD |
3 | 1355 Market St. | None | San Francisco | USA | None | | 10 | 37.7768052000 | -122.4169244000 | c:12 | ... | 1310 | 2013-11-07 | 1820000000 | USD | Twitter Prices IPO Above Estimates At $26 Per ... | http://techcrunch.com/2013/11/06/twitter-price... | NYSE:TWTR | 2013-11-07 04:18:48 | 18100000000 | USD
4 | 2145 Hamilton Avenue | None | San Jose | USA | None | Headquarters | 16 | 37.2950050000 | -121.9300350000 | c:20 | ... | 26 | 1998-10-02 | None | USD | None | None | NASDAQ:EBAY | 2012-04-12 04:24:15 | None | USD |
5 rows × 27 columns
In [8]:
offices_ipos.shape
Out[8]:
(1554, 27)
In [9]:
offices_ipos.columns
Out[9]:
Index(['address1', 'address2', 'city', 'country_code', 'created_at_x',
'description', 'id_x', 'latitude', 'longitude', 'object_id',
'office_id', 'region', 'state_code', 'updated_at_x', 'zip_code',
'created_at_y', 'id_y', 'ipo_id', 'public_at', 'raised_amount',
'raised_currency_code', 'source_description', 'source_url',
'stock_symbol', 'updated_at_y', 'valuation_amount',
'valuation_currency_code'],
dtype='object')
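Because a company can have several offices, the merge is one-to-many on the office side. A quick sanity check counts merged rows per company:

# number of merged rows (offices) per company
offices_ipos.groupby('object_id').size().sort_values(ascending=False).head()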
Distribution of valuation amount¶
We first want a sense of the total valuation amounts we are considering. The valuations are all < 2×10^10 except for 3 that are > 1×10^11. While there are 1,554 entries in the merged dataframe, only 167 of them have a valuation amount provided. We continue with valuation amount for now, then switch to total funding amount later in this notebook to increase our sample size.
In [10]:
# plot the valuation distribution; include the number of companies in the title
val = pd.to_numeric(offices_ipos[~offices_ipos.valuation_amount.isnull()].valuation_amount)
a=sns.distplot(val, rug=True);
a.set_title('valuation distribution (N = ' + str(len(val)) + ')');
[figure: distribution of valuation amounts with rug marks, titled 'valuation distribution (N = 167)']
We look at these 3 outliers and see that one is large simply because its valuation is denominated in JPY (currently 1 JPY = 0.0088 USD). The other two are Facebook and Amazon, which we know to be companies with very high valuations. Moving forward we restrict to rows with valuation_currency_code 'USD' to standardize our analysis on one currency.
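As an aside, an alternative to dropping the non-USD rows would be to convert them at an exchange rate. A rough sketch using the fixed 0.0088 USD/JPY rate quoted above (a careful analysis would instead use the rate on each IPO date):

# hypothetical fixed-rate conversion of the JPY valuation to USD
JPY_TO_USD = 0.0088
val_num = pd.to_numeric(offices_ipos.valuation_amount)
is_jpy = offices_ipos.valuation_currency_code == 'JPY'
val_usd_equiv = val_num.where(~is_jpy, val_num * JPY_TO_USD)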
In [11]:
# look at the 3 outliers with very large valuation_amount
offices_ipos[pd.to_numeric(offices_ipos.valuation_amount) > 0.8e11]
Out[11]:
 | address1 | address2 | city | country_code | created_at_x | description | id_x | latitude | longitude | object_id | ... | ipo_id | public_at | raised_amount | raised_currency_code | source_description | source_url | stock_symbol | updated_at_y | valuation_amount | valuation_currency_code
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1601 Willow Road | None | Menlo Park | USA | None | Headquarters | 4 | 37.4160500000 | -122.1518010000 | c:5 | ... | 847 | 2012-05-18 | 18400000000 | USD | Facebook Prices IPO at Record Value | http://online.wsj.com/news/articles/SB10001424... | NASDAQ:FB | 2013-11-21 19:40:55 | 104000000000 | USD |
1 | None | None | Dublin | IRL | None | Europe HQ | 6975 | 53.3441040000 | -6.2674940000 | c:5 | ... | 847 | 2012-05-18 | 18400000000 | USD | Facebook Prices IPO at Record Value | http://online.wsj.com/news/articles/SB10001424... | NASDAQ:FB | 2013-11-21 19:40:55 | 104000000000 | USD |
2 | 340 Madison Ave | None | New York | USA | None | New York | 9084 | 40.7557162000 | -73.9792469000 | c:5 | ... | 847 | 2012-05-18 | 18400000000 | USD | Facebook Prices IPO at Record Value | http://online.wsj.com/news/articles/SB10001424... | NASDAQ:FB | 2013-11-21 19:40:55 | 104000000000 | USD |
91 | 1200 12th Ave | S # 1200 | Seattle | USA | None | None | 288 | 47.5923000000 | -122.3172950000 | c:317 | ... | 5 | 1997-05-01 | None | None | None | None | NASDAQ:AMZN | 2011-08-01 21:11:22 | 100000000000 | USD |
360 | Roppongi-Hills Mori Tower, 6-10-1, Roppongi | Minato-ku | Tokyo | JPN | None | Headquarters | 9085 | None | None | c:15609 | ... | 78 | 2008-12-17 | None | None | None | None | 3632 | 2010-10-08 06:47:07 | 108960000000 | JPY |
5 rows × 27 columns
The above table reveals that we have duplicate entries for certain companies, one per office location. We therefore need the number of unique companies under consideration, which as shown below is 101 (not 167). Because the 'description' field does not sufficiently and uniformly describe the office type for all entries, we will not filter by that column; for this preliminary exploratory analysis we keep all office locations. We would ideally filter to 'Headquarters' or 'HQ', but the data are not documented well enough, so that would require substantial manual data entry to designate a headquarters for every company with multiple offices.
In [12]:
len(pd.unique(offices_ipos[~offices_ipos.valuation_amount.isnull()].object_id))
Out[12]:
101
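If we did want one row per company, one possible heuristic (a sketch only, and imperfect precisely because 'description' is inconsistently populated) is to prefer an office whose description mentions 'Headquarters' or 'HQ' and otherwise keep the first listed office:

def pick_office(group):
    # prefer offices whose description mentions HQ/Headquarters, else take the first row
    hq = group[group.description.fillna('').str.contains('HQ|Headquarters', case=False)]
    return hq.iloc[0] if len(hq) else group.iloc[0]

one_per_company = offices_ipos.groupby('object_id', group_keys=False).apply(pick_office)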
Valuation amount by region¶
We want to determine the effect of location on IPO valuation. In particular, we consider the median valuation and the number of IPOs by region to see whether certain regions are hotspots for companies that IPO. As mentioned above, we now limit the analysis to valuations in USD, which reduces our N from 167 to 161.
In [13]:
# restrict to valuations in USD currency
offices_ipos_us = offices_ipos[offices_ipos.valuation_currency_code == 'USD']
val_us = pd.to_numeric(offices_ipos_us[~offices_ipos_us.valuation_amount.isnull()].valuation_amount)
region = offices_ipos_us[~offices_ipos_us.valuation_amount.isnull()].region
# create dataframe with just region and valuation
df_region_val = pd.concat([region, val_us], axis=1)
df_region_val.describe()
Out[13]:
 | valuation_amount
---|---
count | 1.610000e+02 |
mean | 3.904637e+09 |
std | 1.615982e+10 |
min | 3.860000e+04 |
25% | 1.340000e+08 |
50% | 3.150000e+08 |
75% | 1.000000e+09 |
max | 1.040000e+11 |
In [14]:
# perform calculations on a per region basis
reg = df_region_val.groupby(['region'])
SF Bay has the most IPOs, followed by New York, London, and Seattle.
In [15]:
# sort regions by number of IPOs (with a valuation amount listed)
num = reg.count()
num.sort_values(by='valuation_amount', ascending=False)[0:10]
Out[15]:
region | valuation_amount
---|---
SF Bay | 34 |
New York | 11 |
London | 8 |
Seattle | 7 |
Denver | 6 |
Boston | 6 |
Los Angeles | 6 |
Chicago | 5 |
Beijing | 4 |
Singapore | 3 |
To create informative plots of valuation amount by region, we include only regions with more than 5 companies that IPO'd.
In [16]:
# barplot of mean valuation by region
top7 = ['SF Bay', 'New York', 'London', 'Seattle', 'Denver', 'Boston', 'Los Angeles']
df_top7 = df_region_val[df_region_val.region.isin(top7)]
fig, a = plt.subplots(figsize=(16,10))
a = sns.barplot(x='region', y='valuation_amount', data=df_top7, ci=None, order=top7)
a.set_ylabel('valuation amount')
a.set_title('valuation by region')
plt.xticks(rotation=90);
plt.savefig('results/valuation_region.png')
[figure: barplot of mean valuation by region for the top 7 regions, saved to results/valuation_region.png]
While SF Bay has the most companies that IPO (the barplot above is sorted from most IPOs to fewest), the region with the highest mean valuation amount is Seattle. However, the N for Seattle is only 7, so the effects of outliers like Microsoft are more pronounced (note the wide box for Seattle in the boxplot below).
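To put numbers behind this, we can compare the mean and median valuation per region, which shows how sensitive each region's mean is to its largest values:

# mean vs. median valuation for the top 7 regions
reg.agg(['mean', 'median']).loc[top7]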
In [17]:
# boxplot of valuation by region
top7 = ['SF Bay', 'New York', 'London', 'Seattle', 'Denver', 'Boston', 'Los Angeles']
df_top7 = df_region_val[df_region_val.region.isin(top7)]
fig, a = plt.subplots(figsize=(16,10))
a=sns.boxplot(x='region', y='valuation_amount',data=df_top7, order=top7)
a.set_ylabel('valuation amount')
a.set_title('valuation by region')
plt.xticks(rotation=90);
[figure: boxplot of valuation by region for the top 7 regions]
Analysis of number of IPOs over time¶
In addition to looking at the value and region of companies that IPO, we also examine historical trends in the number of IPOs over time to see how timing may affect a company's success. We no longer restrict the analysis to companies with a valuation amount provided.
In [18]:
# no longer filter to entries with a valuation amount
dt = pd.to_datetime(offices_ipos.public_at)
valAll = pd.to_numeric(offices_ipos.valuation_amount)
regionAll = offices_ipos.region
In [19]:
df_region_dt = pd.concat([regionAll, dt], axis=1)
time = df_region_dt.groupby([dt.dt.year])
In [20]:
# top counts by year
num = time.count()
num.sort_values(by='public_at', ascending=False).head(10)
Out[20]:
public_at (year) | region | public_at
---|---|---
2013.0 | 154 | 154 |
2004.0 | 88 | 88 |
2011.0 | 83 | 83 |
2010.0 | 77 | 77 |
2012.0 | 51 | 51 |
2009.0 | 47 | 47 |
1999.0 | 47 | 47 |
2007.0 | 46 | 46 |
2008.0 | 44 | 44 |
2000.0 | 43 | 43 |
In [21]:
# barplot showing number of ipos by year
a=sns.barplot(num.public_at.index, num.public_at)
a.set_ylabel('number of ipos')
a.set_xlabel('year')
a.set_title('number of ipos by year');
a.set_xticklabels(labels=num.public_at.index.astype(int), rotation=90);
plt.savefig('results/ipos_year.png')
[figure: barplot of number of IPOs by year, saved to results/ipos_year.png]
The number of IPOs is generally higher in more recent years (2004 is a notable exception). However, we cannot rule out the possibility that the Crunchbase dataset is simply becoming more complete over time.
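One rough probe of this confound, assuming the created_at_y timestamps indicate when each IPO record entered Crunchbase, is to tabulate when the records were created rather than when the IPOs took place:

# distribution of record-creation years for the merged IPO rows
pd.to_datetime(offices_ipos.created_at_y).dt.year.value_counts().sort_index()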
Incorporate cb_objects into the analysis to consider the funding_total_usd metric¶
While we have so far considered valuation amount from the IPO data, the analysis below uses funding_total_usd because it is a much richer field: we have N = 27,874 instead of N = 167.
Connect to database and look at dataframe from cb_objects¶
In [22]:
conn = dbConnect()
objs = dbTableToDataFrame(conn, 'cb_objects')
conn.close()
In [23]:
objs.head()
Out[23]:
 | category_code | city | closed_at | country_code | created_at | created_by | description | domain | entity_id | entity_type | ... | parent_id | permalink | region | relationships | short_description | state_code | status | tag_list | twitter_username | updated_at
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | web | Seattle | None | USA | 2007-05-25 06:51:27 | initial-importer | Technology Platform Company | wetpaint-inc.com | 1 | Company | ... | None | /company/wetpaint | Seattle | 17.0 | None | WA | operating | wiki, seattle, elowitz, media-industry, media-... | BachelrWetpaint | 2013-04-13 03:29:00 |
1 | games_video | Culver City | None | USA | 2007-05-31 21:11:51 | initial-importer | None | flektor.com | 10 | Company | ... | None | /company/flektor | Los Angeles | 6.0 | None | CA | acquired | flektor, photo, video | None | 2008-05-23 23:23:14 |
2 | games_video | San Mateo | None | USA | 2007-08-06 23:52:45 | initial-importer | | there.com | 100 | Company | ... | None | /company/there | SF Bay | 12.0 | None | CA | acquired | virtualworld, there, teens | None | 2013-11-04 02:09:48
3 | network_hosting | None | None | None | 2008-08-24 16:51:57 | None | None | mywebbo.com | 10000 | Company | ... | None | /company/mywebbo | unknown | NaN | None | None | operating | social-network, new, website, web, friends, ch... | None | 2008-09-06 14:19:18 |
4 | games_video | None | None | None | 2008-08-24 17:10:34 | None | None | themoviestreamer.com | 10001 | Company | ... | None | /company/the-movie-streamer | unknown | NaN | None | None | operating | watch, full-length, moives, online, for, free,... | None | 2008-09-06 14:19:18 |
5 rows × 40 columns
In [24]:
objs.columns
Out[24]:
Index(['category_code', 'city', 'closed_at', 'country_code', 'created_at',
'created_by', 'description', 'domain', 'entity_id', 'entity_type',
'first_funding_at', 'first_investment_at', 'first_milestone_at',
'founded_at', 'funding_rounds', 'funding_total_usd', 'homepage_url',
'id', 'invested_companies', 'investment_rounds', 'last_funding_at',
'last_investment_at', 'last_milestone_at', 'logo_height', 'logo_url',
'logo_width', 'milestones', 'name', 'normalized_name', 'overview',
'parent_id', 'permalink', 'region', 'relationships',
'short_description', 'state_code', 'status', 'tag_list',
'twitter_username', 'updated_at'],
dtype='object')
In [25]:
objs.shape
Out[25]:
(462651, 40)
In [26]:
len(objs.funding_total_usd[~objs.funding_total_usd.isnull()])
Out[26]:
27874
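The cells below wrap funding_total_usd in pd.to_numeric because it comes back from the database as a non-numeric dtype. Converting it once up front, coercing unparsable values to NaN, would simplify those cells (a sketch):

# convert once; anything unparsable becomes NaN
objs['funding_total_usd'] = pd.to_numeric(objs.funding_total_usd, errors='coerce')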
Look at the influence of state on total funding¶
Mean funding is greatest in WA, CA, MA, MD, and AL. However, the median is a better summary here because it is insensitive to extreme outliers; the boxplots below show that the median is highest in CA, MA, NJ, NH, and ME.
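As a quick numeric check on these orderings before plotting (a sketch; the ordering can shift for states with very few funded companies):

fund = pd.to_numeric(objs.funding_total_usd, errors='coerce')
by_state = fund.groupby(objs.state_code)
print(by_state.mean().nlargest(5))    # states with the greatest mean funding
print(by_state.median().nlargest(5))  # states with the greatest median funding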
In [27]:
# plot mean funding_total_usd by state
a = sns.barplot(objs['state_code'], pd.to_numeric(objs['funding_total_usd']), ci=None);
plt.xticks(rotation=90);
[figure: barplot of mean funding_total_usd by state]
In [28]:
# boxplot of funding_total_usd by state
a=sns.boxplot(objs['state_code'], pd.to_numeric(objs['funding_total_usd']));
a.set_ylim(0, 2.5e7)
plt.xticks(rotation=90);
plt.savefig('results/funding_state.png')
[figure: boxplot of funding_total_usd by state, saved to results/funding_state.png]
Just as we looked at the number of IPOs over time, we now look at the emergence of new companies over time.
In [29]:
# group by date founded to do analysis on the emergence of new companies over time
dt = pd.to_datetime(objs.founded_at)
df_fund_dt = pd.concat([objs.funding_total_usd, dt], axis=1)
founded = df_fund_dt.groupby([dt.dt.year])
In [30]:
# top counts by year
num_found = founded.count()
num_found.sort_values(by='founded_at', ascending=False)['founded_at'].head(10)
Out[30]:
founded_at
2011.0 11884
2012.0 11158
2010.0 10858
2009.0 9805
2008.0 8502
2007.0 6765
2013.0 6280
2006.0 4689
2005.0 3580
2004.0 2828
Name: founded_at, dtype: int64
In [31]:
# plot number of new companies by year; restrict to 1990-2012, where there is more data
a=sns.barplot(num_found.founded_at.index.astype(int)[(num_found.founded_at.index.astype(int) >= 1990)
& (num_found.founded_at.index.astype(int) < 2013)],
num_found.founded_at[(num_found.founded_at.index.astype(int) >= 1990)
& (num_found.founded_at.index.astype(int) < 2013)])
a.set_ylabel('number of new companies')
a.set_xlabel('year')
a.set_title('number of new companies by year');
plt.xticks(rotation=90);
plt.savefig('results/new_year.png')
[figure: barplot of number of new companies by year, 1990-2012, saved to results/new_year.png]
In [32]:
# company lifespan: closed_at - founded_at
start = pd.to_datetime(objs.founded_at)
end = pd.to_datetime(objs.closed_at)
life = end.dt.year - start.dt.year
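Subtracting calendar years is coarse: a company founded in December and closed the following January counts as a full year. A finer-grained sketch would use the actual date difference:

# lifespan in (fractional) years from the full timestamps
life_years = (end - start).dt.days / 365.25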
In [33]:
a=sns.distplot(life[~life.isnull()]);
a.set_xlim(-5, 50);
[figure: distribution of company lifespans in years]
In [34]:
# there are 27 companies with a lifespan < 0, showing that data entry in this dataset was flawed
life[life < 0].describe()
Out[34]:
count 27.000000
mean -7.851852
std 10.614590
min -40.000000
25% -9.500000
50% -3.000000
75% -1.000000
max -1.000000
dtype: float64
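Inspecting a few of the offending rows (name, founded_at, and closed_at are all columns of cb_objects) shows closed_at preceding founded_at:

# rows where closed_at precedes founded_at
objs.loc[life < 0, ['name', 'founded_at', 'closed_at']].head()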
In [35]:
# consider those with positive life
life_pos = life[life>0]
life_pos[~life_pos.isnull()].describe()
Out[35]:
count 2013.000000
mean 4.384004
std 4.359592
min 1.000000
25% 2.000000
50% 3.000000
75% 5.000000
max 53.000000
dtype: float64
In [36]:
# see whether companies tend to have longer or shorter lifespans based on when they were founded
a=sns.barplot(start[~life.isnull()].dt.year.astype(int), life[~life.isnull()], ci=None);
plt.xticks(rotation=90);
[figure: barplot of mean lifespan by founding year]
In [37]:
# see whether companies tend to have longer or shorter lifespans based on state
a=sns.barplot(objs.state_code[~objs.state_code.isnull()], life[~objs.state_code.isnull()], ci=None);
plt.xticks(rotation=90);
[figure: barplot of mean lifespan by state]
In [38]:
# boxplot version: lifespan by state
a=sns.boxplot(objs.state_code[~objs.state_code.isnull()], life[~objs.state_code.isnull()]);
a.set_ylim(0,15);
plt.xticks(rotation=90);
[figure: boxplot of lifespan by state]