Exploratory Analysis #4: Executive Education¶

So you want to start a company…¶

The leaders of the world’s top businesses have to start somewhere. Using the Crunchbase data, we take a look at where these business leaders received their educations. Is it true that you need to attend a world class institution to be successful, or is the business world much more varied in terms of educational backgrounds?

In [1]:

% matplotlib inline
# Overhead
import MySQLdb
import seaborn as sns
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from IPython.display import display
from modules import *

Since the crunchbase data is in the form of MySQL database dumps, we have already loaded this information into a MySQL database locally. See ___________ for instructions if you have not done this already.

Next, we connect to that database and save a couple information tables locally as pandas DataFrames.

In [2]:

conn = dbConnect()
people = dbTableToDataFrame(conn,'cb_people')
degrees = dbTableToDataFrame(conn,'cb_degrees')
funding_rounds = dbTableToDataFrame(conn,'cb_funding_rounds')
objects = dbTableToDataFrame(conn,'cb_objects')
conn.close()

Username: root
Password: ········

The most successful businesses¶

If we consider a successful business to be one that has raised the most capital through a variety of funding rounds, we can sort our dataset accordingly.

We can get the total raised by a company by using the funding_total_usd column in the objects database. Note that we only want to use companies, so we only select those with an entity_type of “company”. We will also reduce the size of our dataframe to include only a couple important metrics.

In [3]:

companies = objects[objects.entity_type=='Company'][['name','entity_id','funding_total_usd','closed_at']].copy()
# Only select companies who have raised some money
companies = companies.dropna(subset=['funding_total_usd'])
display(companies.sort_values('funding_total_usd',ascending=False).head(10))

comp_tot = len(companies)
print('There are {:,} companies in the dataset'.format(comp_tot))

	name	entity_id	funding_total_usd	closed_at
4575	Clearwire	13219	5700000000	None
160023	Verizon Communications, Inc.	4843	3985050000	None
97784	sigmacare	242735	2600000000	None
161551	Facebook	5	2425700000	None
175825	Carestream	64365	2400000000	None
81001	Solyndra	22568	1765504319	None
170979	Fisker Automotive	5951	1451000000	None
151715	O3b Networks	39799	1270283000	None
101987	Terra-Gen Power	24693	1200000000	None
2971	Twitter	12	1160166511	None

There are 27,874 companies in the dataset

Now that we have all of the companies, let’s define exactly what we mean by the most successful companies. First, to get a feel for the data visually, let’s look at the top 100 companies and their fundraising totals.

In [4]:

fig,ax = plt.subplots(figsize=(20,20))
n_top_companies = 100
plotset = companies.sort_values('funding_total_usd',ascending=False).head(n_top_companies)
ax = sns.barplot(x='funding_total_usd',y='name',data=plotset)
ax.set_title('Funding Totals for the Top {} Companies'.format(n_top_companies),fontsize=22)
ax.set_xscale('log')
ax.set_xlabel('Funding Total [USD] -- Log Scale',fontsize=18);
ax.set_yticklabels(labels=plotset.name)
ax.set_ylabel('Company',fontsize=18)
plt.savefig('results/funding_totals.png')

Note that the y-axis is on a logarithmic scale, and so the visible trend is dramatically underexaggerated.

It appears that only a handful of the companies fundraise greater than $1 billion ($:math:10^9), though many more companies appear to fundraise in the millions. Let’s find the exact percent of companies for each order of magnitude.

In [5]:

I,F = 3,10
for i in np.logspace(I,F,F-I+1):
    num_above = len(companies[companies['funding_total_usd']>i])
    frac = num_above/comp_tot
    print('{:,} companies ({}%) above ${:,}'.format(num_above,round(100*frac,1),int(i)))

27,854 companies (99.9%) above $1,000
27,608 companies (99.0%) above $10,000
24,960 companies (89.5%) above $100,000
18,112 companies (65.0%) above $1,000,000
7,309 companies (26.2%) above $10,000,000
630 companies (2.3%) above $100,000,000
15 companies (0.1%) above $1,000,000,000
0 companies (0.0%) above $10,000,000,000

We see from this that only slightly more than 25% of the companies fundraise more than $10 million. Let’s define these 7309 companies as the most successful.

The people behind the most successful businesses¶

Next, we can use the people database to pair individuals with the company they are affiliated with. Like for the funding rounds data, we don’t need all of the information provided in the database of people either. Since we ultimately want to look at educational factors, let’s limit the dataset to only a person’s name (first and last), object_id, and affiliation, in a key that will let us link to the degrees dataset.

In [6]:

people_key = people[['affiliation_name','first_name','last_name','object_id']]
display(people_key.head(10))

ppl_tot = len(people_key)
print('There are {:,} people in the dataset'.format(ppl_tot))

	affiliation_name	first_name	last_name	object_id
0	Blue Nile	Ben	Elowitz	p:2
1	Wetpaint	Kevin	Flaherty	p:3
2	Zoho	Raju	Vegesna	p:4
3	Zoho	Ian	Wenig	p:5
4	i/o Ventures	Kevin	Rose	p:6
5	Digg	Jay	Adelson	p:7
6	Digg	Owen	Byrne	p:8
7	Digg	Ron	Gorodetzky	p:9
8	Facebook	Mark	Zuckerberg	p:10
9	Facebook	Dustin	Moskovitz	p:11

There are 226,709 people in the dataset

Now we merge the two datasets together to get each funded company and people in the crunchbase dataset who were involved.

In [7]:

company_people = pd.merge(companies,people_key,left_on='name',right_on='affiliation_name').drop('affiliation_name',axis=1)
display(company_people.head())

cp_tot = len(company_people)
print('There are {:,} people affiliated with funded companies in the dataset'.format(cp_tot))

	name	entity_id	funding_total_usd	closed_at	first_name	last_name	object_id
0	Wetpaint	1	39750000	None	Kevin	Flaherty	p:3
1	Wetpaint	1	39750000	None	Alex	Weinstein	p:104800
2	FriendFeed	1001	5000000	None	Paul	Buchheit	p:640
3	FriendFeed	1001	5000000	None	Sanjeev	Singh	p:2716
4	Fitbit	10015	68069200	None	James	Park	p:16930

There are 12,168 people affiliated with funded companies in the dataset

Interestingly, there are significantly fewer people associated with companies than there are people in the dataset.

We can now repeat the merging process with the dataset of educational information. First, let’s limit the degrees dataset to include only degree type, institution, subject, and the id of the corresponding person (object_id). Then, we merge this into the database of companies and people.

In [8]:

edu = degrees[['object_id','institution','degree_type','subject']].dropna(subset=['institution'])
CPE = pd.merge(company_people,edu,on='object_id')
display(CPE.head(10))

	name	entity_id	funding_total_usd	closed_at	first_name	last_name	object_id	institution	degree_type	subject
0	Wetpaint	1	39750000	None	Kevin	Flaherty	p:3	Washington University in St Louis	BBA	None
1	Wetpaint	1	39750000	None	Kevin	Flaherty	p:3	Indiana University	MBA	None
2	FriendFeed	1001	5000000	None	Paul	Buchheit	p:640	Case Western Reserve University	MS	Computer Science
3	FriendFeed	1001	5000000	None	Sanjeev	Singh	p:2716	Stanford University	BS	Computer Science
4	Fitbit	10015	68069200	None	James	Park	p:16930	Harvard University	DNF	Computer Science
5	Fitbit	10015	68069200	None	Eric	Friedman	p:16931	George Washington University	BBA	Business Marketing
6	Fitbit	10015	68069200	None	Meena	Srinivasan	p:89236	India, Certified Public Accountant in California	Chartered Accountant	None
7	Fitbit	10015	68069200	None	Meena	Srinivasan	p:89236	East Carolina University	MBA	None
8	Fitbit	10015	68069200	None	Mark	Silverio	p:121130	Boston College	BS	Marketing
9	Fitbit	10015	68069200	None	Eric	Friedman	p:216979	Yale University	MS	Computer Science

Finally, we filter the dataset of company, people, and education to only those in the most successful businesses (those who fundraised more than $10 million).

In [9]:

successful_CPE = CPE[CPE['funding_total_usd']>10_000]
display(successful_CPE.head())

n_entries = len(CPE)
n_unique_ppl = CPE['object_id'].nunique()
print('There are {:,} entries in the dataset, and {:,} unique people in the dataset.'.format(n_entries,n_unique_ppl))
print('On average, successful businesses involve people with an average of {} degrees.'.format(round(n_entries/n_unique_ppl,1)))

	name	entity_id	funding_total_usd	closed_at	first_name	last_name	object_id	institution	degree_type	subject
0	Wetpaint	1	39750000	None	Kevin	Flaherty	p:3	Washington University in St Louis	BBA	None
1	Wetpaint	1	39750000	None	Kevin	Flaherty	p:3	Indiana University	MBA	None
2	FriendFeed	1001	5000000	None	Paul	Buchheit	p:640	Case Western Reserve University	MS	Computer Science
3	FriendFeed	1001	5000000	None	Sanjeev	Singh	p:2716	Stanford University	BS	Computer Science
4	Fitbit	10015	68069200	None	James	Park	p:16930	Harvard University	DNF	Computer Science

There are 10,823 entries in the dataset, and 6,768 unique people in the dataset.
On average, successful businesses involve people with an average of 1.6 degrees.

We see that out of the 12,168 people in the full dataset, more than half are associated with well performing businesses. We also see that it is quite common for those at successful businesses to have multiple degrees.

(Not so) humble beginnings¶

Next we take a look at the institutions who have produced the people affiliated with the successful businesses. If we were to take a closer look at the table above, you would see that many of the institutions are duplicated, but often with slight variations for each school.

We will try to eliminate some of these duplicates, while at the same time reducing the names in such a way that we can properly aggregate a school when it is named alternatively. We will do this by defining a function which processes a list of school names, removes punctuation and superfluous word (often we are not interested in specific business schools or law schools, and we want inputs like “University of California–Berkeley” and “University of California, Berkeley” to be treated the same. Note: the function defined here relies on a more complicated “SchoolTools” class which is contained in the modules module.

In [11]:

def identify_institutions(institutions):
    '''
    Identify institutions using a common name.

    Each person in the database has a university which was input by a user.
    These inputs are not uniform, and so names are inconsistent for some schools.
    (e.g. Berkeley could be UC Berkeley, Cal, or The University of California)

    Parameters
    ----------
    institutions : Series
        A series giving a list of institutions to be identified.

    Returns
    -------
    inst : Series
        A series where insitutions have been commonly identified.

    '''
    ST = SchoolTools()
    # Make all words lowercase
    inst=institutions.str.lower()
    #Remove stopwords
    inst = ST.remove_stopwords(inst)
    # Remove punctuation
    inst = ST.remove_punctuation(inst)
    # Replace common stems with just the stem, delete others, and replace keywords and nicknames with a common title.
    inst = ST.identify_schools(inst)
    return inst.str.title()

In [12]:

cpe = CPE.copy()
cpe.institution=identify_institutions(cpe.institution)

This dataset is still quite large by itself. To reduce it’s size, we will only focus on those schools which have contributed 15 or more people to successful businesses.

In [13]:

people_contributed_min = 15
inst_counts = pd.DataFrame(cpe.institution.value_counts())
inst_counts = inst_counts.rename(columns={'institution':'total'})
inst_mult = inst_counts[inst_counts.total>=people_contributed_min]
cpe_count = pd.merge(cpe,inst_mult,left_on='institution',right_index=True).sort_values('total',ascending=False)
cpe_count.head()

Out[13]:

	name	entity_id	funding_total_usd	closed_at	first_name	last_name	object_id	institution	degree_type	subject	total
10708	Zillow	959	96627980	None	Christopher	Roberts	p:2556	Stanford	BS	Mechanical Engineering	618
7501	Facebook	5	2425700000	None	Sandra Liu	Huang	p:15296	Stanford	BS	Computer Science	618
6930	Apture	4421	4600000	None	Can	Sar	p:18982	Stanford	BS	Computer Science	618
6931	Apture	4421	4600000	None	Can	Sar	p:18982	Stanford	MS	Computer Science	618
7040	MUBI	4524	8245630	None	Efe	Cakarel	p:9537	Stanford	MBA	None	618

To make more sense of this dataset, we will plot according to universities which have contributed the most people to successful businesses.

In [14]:

fig,edu_ax = plt.subplots(figsize=(16,25))
fig.suptitle('Graduates contributed to the most successful businesses by college',fontsize=22,x=0.55,y=1.01)

edu_ax = sns.barplot(x='total',y='institution',data=cpe_count)
edu_ax.tick_params(labelsize=18)
edu_ax.xaxis.tick_top()
edu_ax.set_xlabel('Number of graduates',fontsize=18);
edu_ax.xaxis.set_label_position('top')
edu_ax.set_yticklabels(labels=cpe_count.institution.unique(),fontsize=12)
edu_ax.set_ylabel('School',fontsize=18)

fig.tight_layout()
plt.savefig('results/affiliates_by_school.png')

We can see that there are several universities contributing far more graduates to the top businesses than the rest. Just the top three schools (Stanford, Harvard, and UC Berkeley) produce more than 13% of executives and founders of these companies.

In [15]:

cum_frac = 0
for top_school in ['Stanford','Harvard','Berkeley']:
    frac = inst_counts.total.loc[top_school]/inst_counts.total.sum()
    cum_frac += frac
    print('{} has {}% of the total graduates in top businesses.'.format(top_school,round(100*frac,1)))
print('Together, these three schools produced {}% of the total graduates in top businesses.'.format(round(100*cum_frac,1)))

Stanford has 5.7% of the total graduates in top businesses.
Harvard has 4.8% of the total graduates in top businesses.
Berkeley has 3.0% of the total graduates in top businesses.
Together, these three schools produced 13.5% of the total graduates in top businesses.

Exploratory Analysis #4: Executive Education¶

So you want to start a company…¶

The most successful businesses¶

The people behind the most successful businesses¶

(Not so) humble beginnings¶

Table Of Contents

Related Topics

This Page