Summary: Can we predict a start-up’s success?¶
Authors: Annie Maslan, Jack Moorer, Mitch Negus
Overview¶
We begin with exploratory analysis of the various factors we hypothesize may influence a company’s success. We consider the following factors:
1. Valuation and number of IPOs by place and time
2. Total funding by place and time
3. Investor portfolio correlations
Exploratory Analysis¶
1. Valuation and number of IPOs by place and time¶
We first consider the number of IPOs by region. The data provided in the ‘cb_ipos’ table is missing the valuation amount for many companies. When filtering for entries that include a valuation amount, we have an N of 167. We continue with this valuation analysis but will subsequently look at total funding in part 2, a much larger dataset that may also indicate company success. We find that SF Bay has the highest number of companies that IPO (34), with NYC second (11) and London third (8). In the barplot below, we consider the regions that had >5 companies IPO, in order of decreasing number of companies. Seattle has the highest average valuation for companies that IPO; however, a boxplot of the same data shows that Seattle, with only 7 companies, is more sensitive to outliers and has a broader distribution. This additional analysis can be found in the supplementary notebook ‘location_time.ipynb.’

valuation by region
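As a rough illustration, the per-region IPO counts and average valuations above might be computed with pandas along these lines; the file name and the `region` and `valuation_amount` column names are assumptions for the sketch, not the exact Crunchbase schema.

```python
import pandas as pd

# Hypothetical frame built from the cb_ipos table joined with company regions.
ipos = pd.read_csv("ipos_with_regions.csv")

# Keep only IPOs that report a valuation (N = 167 in our data).
valued = ipos.dropna(subset=["valuation_amount"])

# IPO count and average valuation per region, restricted to regions with >5 IPOs.
by_region = (valued.groupby("region")["valuation_amount"]
                   .agg(n_ipos="count", mean_valuation="mean"))
by_region = by_region[by_region["n_ipos"] > 5].sort_values("n_ipos", ascending=False)
print(by_region)
```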
We next look at the number of IPOs over time to determine how timing may
affect a company’s success. We no longer restrict our analysis to
companies where valuation amount is provided. More recent years have
more IPOs. However, we cannot rule out the possibility that the
Crunchbase dataset is also simply becoming more complete over time.
From this analysis of valuation and number of IPOs by place and time, we may predict that start-ups that originate in SF Bay and New York in the years since 2007 would be most successful. However, we have also found many shortcomings in the Crunchbase data, suggesting that some findings may depend more on the improved completeness of data entry in recent years than on true underlying trends.
2. Total funding by place and time¶
While we had been considering valuation amount from the IPO data, we
also consider funding_total_usd for the below analysis because it is a
much richer dataset. We have N=27,874 instead of N=167. The median
funding is greatest in CA and MA.
Just as we looked at the number of IPOs over time, we now look at the
emergence of new companies over time. Looking at the period from
1990-2012, we see that the number of new companies is on the rise. As
with the IPOs data, we have to consider that the Crunchbase dataset may
be becoming more exhaustive.
We also consider the lifespan of companies and find that the median is 3 years, and that the lifespan does not vary significantly by state. For all related supplementary analysis, refer to ‘location_time.ipynb.’
From the analysis of funding by state and number of new companies over time, we can hypothesize that companies in CA and MA that are founded in more recent years will be most successful.
3. Investor portfolio correlations¶
If a company can get a certain investor, are there other investors the company is likely to get as well? We consider the correlation among investors using multidimensional scaling, with the ultimate goal of determining whether certain investor portfolios cluster. We create a company matrix whose columns are company vectors for each investing company. Each matrix entry is a count of the number of occurrences of each company in the investor’s portfolio. The resulting matrix is very sparse; it is composed of 99.76% zeros. We normalize the matrix to represent, for each investor, the probability of investing in a given company. Using multidimensional scaling with both Euclidean distance and Jensen-Shannon distance calculations, no clusters are evident.
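A minimal sketch of the MDS step (not the exact notebook code) is below; for convenience it stores one row per investing company (the transpose of the matrix described above), and the file name is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, jensenshannon
from sklearn.manifold import MDS

# Hypothetical investor-by-company count matrix (one row per investor,
# one column per portfolio company); in our data it is 99.76% zeros.
counts = np.load("investor_company_counts.npy")
counts = counts[counts.sum(axis=1) > 0]  # drop investors with no recorded investments

# Normalize each investor's counts into a probability distribution over companies.
probs = counts / counts.sum(axis=1, keepdims=True)

# Pairwise Jensen-Shannon distances between investor portfolios.
distances = squareform(pdist(probs, metric=jensenshannon))

# Embed the investors in 2D with MDS on the precomputed distance matrix.
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(distances)
```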
From this very sparse investor-company distance matrix we cannot conclude any compelling correlations among investors. We also try a variety of subsets of the data in the hope that some correlations may arise. For example, we consider just the top 5% of companies in terms of funding total, just companies in CA, just biotech companies, etc. For each of the filters we tried, no clusters formed. Therefore, we do not identify any correlations among investor portfolios.
4. Executive Education¶
Executives at some of the highest performing businesses need to get their starts somewhere. For some, like Facebook’s Mark Zuckerberg, their backgrounds do not include a college degree. This, however, seems to be the minority. Still, the premise raises the question: are there schools which tend to produce more successful business executives? Is there a direct correlation between which institution you receive a degree from and eventual success?
To answer these questions, we first need to define a benchmark for a company’s success. This could simply be whether the company is still operating or not (presumably companies that are not still operating were unsuccessful), or it could be related to the company’s total valuation. Instead, we use the total funding acquired in all funding rounds for a company. Note that this is itself an imperfect metric. Some highly successful companies might require less capital investment than others, so this statistic could be biased in favor of capital-intensive companies. We will assume that this bias is not so overwhelming as to render the metric worthless. Funding totals for the top 100 businesses are plotted below.
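Before looking at the plot, here is a rough sketch of how this funding-total metric and the later >$10M cutoff might be computed; the file name and the `object_id` and `raised_amount_usd` columns are assumptions about the funding-rounds table rather than a documented schema.

```python
import pandas as pd

# Hypothetical frame of funding rounds from the Crunchbase export.
rounds = pd.read_csv("cb_funding_rounds.csv")

# Total funding raised across all rounds, per company.
totals = (rounds.groupby("object_id")["raised_amount_usd"]
                .sum()
                .sort_values(ascending=False))

# "Top businesses": companies that raised more than $10 million in total.
top = totals[totals > 10_000_000]
print(f"{len(top)} companies raised more than $10 million in total funding")
```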

Using this metric, we find that slightly more than 25% of the companies in the Crunchbase dataset raised over $10 million in funding rounds, and so we select this subset as “the top businesses”. For the more than 7,000 companies in that sample, we match individuals (who are paired with degree information) to those companies. Together, the degree information for these successful companies lets us create a visual showing how many degrees each university has conferred to affiliates of companies in the top tier of fundraising businesses.

5. Prediction of Success¶
The first question we asked when we found this dataset was whether we can predict startup success, and if so, what some important predictors of success in business are. Much of this dataset focuses on data we already know is highly correlated with success: for example, valuation, investments, funding rounds, etc. We easily could have used some of these features to predict start-up success, but there is a good chance the findings would be trivial: more likely than not, a company with a large number of investors, a lot of funding, and a high valuation will be successful, but that isn’t very interesting. Instead, we wanted to focus on purely categorical features (so discrete variables) describing the company itself and its founder. These variables include what industry the company is in, what highest level of degree the founder received, and what region the company is based in. I also included logo size, which is not discrete, just because I thought it would be funny. The full list of features we used is:
- company category_code
- whether they had funding (yes or no)
- number of investors
- number of relationships to other founders
- company logo height
- company logo width
- region
- degree_type (highest degree)
- institution (for highest degree)
- subject (for highest degree)
- first name
- last name
The way we measure success here is via company status and, in the binary case, whether the business was closed or not. The categories for status were closed, acquired, IPO, or operating. The closed response variable was encoded as “Yes” for closed and “No” for not closed.
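For illustration, the binary response might be derived from status roughly as follows; the file name, the `status` column, and its lowercase values are assumptions based on the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical merged company/founder frame with a four-category 'status' column
# (closed, acquired, ipo, operating).
df = pd.read_csv("merged_companies.csv")

# Binary response: "Yes" if the business closed, "No" otherwise.
df["closed"] = np.where(df["status"] == "closed", "Yes", "No")
```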
There were two big issues while preprocessing the data for classification. The first was that a lot of the Object dataframe, which contained the majority of the data, was empty. One of the main reasons is that, for many of the feature variables we are using, the data records values equivalent to 0 or ‘None’ (e.g., if the company did not have a logo) as missing values. In fact, the dataset usually used the string ‘unkown’ as its NaN/None value, which does not come up as a null entry in Python. After examining the data, we determined which variables had None or NaN entries that could be converted to useful values, and encoded them. We then switched ‘unkown’ to NaN so we could drop it later.
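A minimal sketch of that cleanup step is below; the file and column names are assumptions, and the missing-value marker string should be matched against whatever literally appears in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the Object table; column names are assumptions.
objects = pd.read_csv("cb_objects.csv", low_memory=False)

# Missing logo dimensions really mean "no logo", so treat them as 0 rather than NaN.
objects[["logo_height", "logo_width"]] = objects[["logo_height", "logo_width"]].fillna(0)

# The dataset marks unknown values with a literal string instead of a null entry;
# replace it with NaN so those rows can be dropped later.
MISSING_MARKER = "unknown"  # match this to the exact string used in the data
objects = objects.replace(MISSING_MARKER, np.nan)
```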
The next issue was that there was no easy way to combine the dataframes. The dataframes had object ids that corresponded to the id in the Object dataframe; however, data on a company had a different object id than information about the same company’s founder. When we determined we could merge the dataframes on company name, many duplicates appeared, since several founders from the same company existed in the dataframe. These duplicates certainly would have affected any classification model we tried to use.
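The merge-and-deduplicate step could look roughly like this sketch, with hypothetical frames and a shared `name` column assumed.

```python
import pandas as pd

# Hypothetical company-level and founder-level frames; no shared object id links
# them, so we join on the company name instead.
companies = pd.read_csv("company_features.csv")
founders = pd.read_csv("founder_features.csv")

merged = companies.merge(founders, on="name", how="inner")

# Joining on name duplicates companies with several founders, so keep one row each.
merged = merged.drop_duplicates(subset="name", keep="first")
```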
I want to take a moment to talk about the structure of this dataset. The data-preprocessing portion of this project took much longer than it would have if this were a well-documented, well-formatted dataset. There was no documentation we could find on what the features represented or how the data was organized. In the end, the assumptions we made during preprocessing, which were necessary in order to have a “large enough” dataset, may have severely affected model performance and inference. In addition, the time it took to understand how the data was structured, figure out what certain features represented, and combine the data in a usable way ended up taking time away from analysis. I was actually planning on doing a separate regression-based predictive model on continuous variables, but the issues I discussed in this paragraph left me so frustrated with the data that I decided it wasn’t worth it. I thought that my frustrations might be relevant in a course based on reproducibility, but I digress.
After merging all features and responses into one dataframe and dropping NaN values, we explored the relationship between the feature variables and response variables. For feature variables with numeric values, we looked at boxplots for each response category, for both the status and closed variables. Here is an example of status vs. number of milestones and closed vs. number of milestones.

num_milestones_status_boxplot

num_mile_closed_boxplot
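Boxplots like these can be produced with seaborn along the following lines; the file name and the `num_milestones`, `status`, and `closed` columns are assumed from the merged frame described above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical merged frame with num_milestones, status, and closed columns.
df = pd.read_csv("merged_companies.csv")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x="status", y="num_milestones", data=df, ax=axes[0])
sns.boxplot(x="closed", y="num_milestones", data=df, ax=axes[1])
plt.tight_layout()
plt.show()
```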
We also looked at the relationship between what I called the purely “string based” categorical variables and the responses by plotting the relative frequencies of the response variables in each category of the features. It was important to look at relative frequencies due to the unbalanced frequencies in the response and feature variables. Here are the plots for category code on status and had funding on closed.

cat_code_status

had_fund_closed
At this point we were ready to train a classification model on our data and make predictions. Before I talk about the methods I used and the process, let’s look at the distributions of the response variables.

status_dist

closed_dist
As you can see, there is a huge imbalance in the distributions of our response variables. This is going to be a serious issue for our predictive models. I detail this issue more in the classification notebook, but here is a quick example of why model performance is affected by response category imbalance. Say we are trying to predict whether someone has a very rare disease, one that only 0.01 percent of the population has. When we fit our model, it will look at the accuracy of prediction in order to determine its internal parameters. However, in this case, a model that predicts that you don’t have the disease every time has an accuracy of 99.99 percent; by that measure the model is very accurate, but in reality it does nothing.
One more note about the structure of our data: sklearn currently does not accept string-valued predictors. This was unfortunate, and my two options were to convert the data into a huge, high-dimensional dummy matrix, or to integer-encode each string variable. I decided to encode the variables; you can read why this is an issue in the classification notebook.
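The encoding route might look like this minimal sketch; the file name and column list are assumptions based on the feature list above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical merged frame from the preprocessing described earlier.
df = pd.read_csv("merged_companies.csv")

# Map each string category to an integer code instead of building a huge dummy matrix.
# Note: this imposes an arbitrary ordering on the categories, which is part of why
# encoding (rather than one-hot expansion) is an issue for tree-agnostic models.
string_cols = ["category_code", "had_funding", "region", "degree_type",
               "institution", "subject", "first_name", "last_name"]
encoders = {}
for col in string_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col].astype(str))
```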
The models we used for classification were a random forest and extremely randomized trees. The main reasons we chose these two methods were a) they have a fantastic built-in ability to provide inference about feature importance, and b) once we discovered extremely randomized trees, we wanted to compare its results to the similar random forest model. I also decided to work with random forests (and tackle this classification question in general) in order to explore sklearn. I (Jack) just spent the semester learning about machine learning methods in R, so I decided to work with random forests to teach myself how to use sklearn’s RandomForestClassifier and compare it to the randomForest R package.
Before fitting a model on the data, I had to tune hyperparameters, and luckily sklearn provides GridSearchCV to help with this process. For the random forest I tuned 3 hyperparameters: the number of random features considered at each split, the number of estimators (trees) used, and the maximum depth of each tree. Due to runtime issues, I later skipped tuning the number of estimators for the extremely randomized trees model. Sometimes the maximum depth of trees is not tuned, and originally this is what I did (leaving the maximum depth as the default None); however, because of the imbalance in response category frequency, leaving the depth untuned gave us horrible results, so it had to be tuned.
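A grid search over those three hyperparameters could be set up roughly like the sketch below; the file name, grid values, and column names are placeholders, and the scoring metric shown here is the one discussed in the next section.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical encoded frame from the preprocessing sketched above.
df = pd.read_csv("encoded_companies.csv")
X = df.drop(columns=["status", "closed"])
y = df["status"]

# Tune the number of features per split, the number of trees, and the max depth.
param_grid = {
    "max_features": [2, 4, 6, "sqrt"],
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 4, 6, 8, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="f1_weighted", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```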
Looking at the status response variables¶
Random Forest¶
I will start with the four-category status response variable. Originally, for the cross-validation metric used to tune the parameters of the models, I used the default, accuracy, and got useless results: accuracy was high, but the models generally overclassified the ‘operating’ category in order to be accurate. Instead, I used the scoring metric f1_weighted, which balances precision and recall for each class. For random forest I plotted the results of the hyperparameter search, as in this example.

fr_status_cv
Here you see how the weighted f1 score changed based on the max depth of the tree, grouped by the max features of the tree, in separate panels for the number of estimators used. Notice the shape of all of the plots. The reason this happens is that as depth increases, the model overpredicts the frequent ‘operating’ category. If I were to use accuracy as the cross-validation metric, you would see a monotonically increasing function for each line.
A ‘best’ random forest was picked using the best combination of parameters, then fit on the training data. I will cover feature importance at the end of this status section. Performance was mixed for the random forest. The model accuracy was about 82 percent; however, the confusion matrix showed sporadic results. Using an f1 score did allow our model to predict more than just the operating status, and at the end of the section on random forests for status I show an example of what the confusion matrix would have looked like had we used accuracy as our performance metric. In that confusion matrix, almost every test individual was predicted to be operating, providing high accuracy but showing the difficulty of working with unbalanced data.
Extremely Randomized Trees¶
I discovered ExtraTrees when looking up how to deal with unbalanced data in a random forest. An extremely randomized trees model (ExtraTreesClassifier in sklearn) is similar to a random forest, except that instead of searching over bootstrapped samples for the best node splits, it splits nodes at completely random cut-points. I read this paper that introduced Extra Trees in order to understand what is going on; the decision to increase randomness is centered on the bias-variance tradeoff: a model’s variance should decrease with increased randomness without a huge increase in bias.
Here is an example of the cross-validation plot using max depth and number of features for ExtraTrees.

ext_cv_status
As you can see, performance increases monotonically with depth. After fitting the “best” model and examining the results, I looked at the confusion matrices at different depth levels. In general, the ExtraTreesClassifier performed like the equivalent RandomForest with a max depth about 2 lower. We can see that the ExtraTrees classifier was less accurate, but slightly better at predicting the less frequent status categories compared to the random forest.
Feature Importance¶
Our predictive results for the random forest and extremely randomized trees models were mixed, but the nice thing about tree-based models is that even when they don’t provide fantastic predictive performance, we can still examine feature importance.

var_imp_rf_st

var_imp_ex_status
We can see the feature importances of the two models are similar, which is expected since they rely on a similar process. From these results we can see that the number of relationships with other founders is the feature that best splits the data here. Many of the features have similarly low importance, especially the string-based features. I will go over later why I think these features simply aren’t important for predicting company status.
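Feature importances like those plotted above can be pulled directly from a fitted forest; a sketch, reusing the `search` and `X` objects from the grid-search example earlier:

```python
import pandas as pd

# Per-feature importances from the best forest found by the earlier grid search.
best_rf = search.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```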
Looking at the closed response variables¶
I wasn’t very happy with the f1 scoring metric, but it was hard to find a better one. One great metric for evaluating imbalanced data, however, is the AUC, the area under an ROC curve. In order to use AUC as the performance metric, I had to use a two-category response, hence the closed response variable.
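Switching the scorer to AUC fits the same grid-search pattern; here is a sketch, with the closed response mapped to 0/1 so the ‘roc_auc’ scorer can be used directly (the file name, column names, and grid values remain assumptions).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical encoded frame from the earlier preprocessing sketches.
df = pd.read_csv("encoded_companies.csv")
X = df.drop(columns=["status", "closed"])
y = (df["status"] == "closed").astype(int)  # 1 = closed, 0 = not closed

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Tune with AUC rather than accuracy, which an always-'No' model would maximize.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"max_depth": [2, 4, 6, 8], "max_features": [2, 4, "sqrt"]},
                      scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

# Held-out AUC and the points of the ROC curve for plotting.
probs = search.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, probs))
fpr, tpr, _ = roc_curve(y_test, probs)
```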
Random Forest¶
Here is an example of the cross validation results based on AUC from the random forest model:

rf_closed_cv
As you can see, we are actually getting interesting results based on AUC. I fit the random forest based on the parameters with the best AUC and evaluated the performance. While accuracy was near 70 percent, we can see that this random forest was actually able to predict the vast majority of ‘Yes’ closed responses in the data set, even if it came at the cost of misclassifying many ‘No’ individuals. At this point in the project, I treated this as a success, as the problem had effectively transformed into finding a model that could predict the very rare closed response. Here is the ROC curve for this model.

rf_closed_roc
Extremely Randomized Trees¶
Here is an example of the cross-validation plot from ExtraTrees on the closed response.

ext_closed_cv
As you can see, we are once again getting interesting results. I fit the model based on the best combination of parameters and evaluated its performance. Performance of this model was not great, despite using AUC as the scoring metric. The model accuracy was in the low 60s (percent), and the ExtraTrees model in general had the same rate of success in predicting the ‘Yes’ responses as the random forest, with worse results in predicting the ‘No’ responses. In this case random forest performance was clearly superior. Here is the ExtraTrees ROC curve.

ext_roc
Feature Importance¶
Results for the closed section were mixed. I was expecting a much better model than the models on the status response variable; however, that was not really the case. Unlike with the previous response variable, we were able to say that the random forest model outperformed the extremely randomized trees model, which surprised me given that I expected ExtraTrees to perform better on this highly imbalanced dataset. Once again we can look at the feature importance, despite the fact that the predictive power of the models is poor.

rf_cl_imp

ex_closed_imp
Here we get pretty consistent feature importance statistics for both models. For whether a business was closed or not, whether the company had funding was the most important predictor, followed by category code and founder degree subject.
Discussion On Predictors Chosen¶
I mentioned that for this section I purposely picked features that do not naturally seem to have predictive power for business status. After fitting our tree-based models, I decided to look at the string-based categorical variables, which generally had very low importance in our data, and see whether they were able to separate our data.
This section at first developed from my plan to fit a KNN on a matrix of dummy features and then, due to the curse of dimensionality, reduce the dimensions via feature selection to see if performance improved. The rationale was that, while we probably couldn’t get feature importance out of a KNN, at least we could have high predictive power.
However, KNN is based on Euclidean distances, so first I decided to look at a low-dimensional view of our data using MDS on the Euclidean distances from the dummy matrix of categorical variables. This process was very similar to part of project 2. Remember that the features involved in this 2D MDS projection do not include num_investments, num_relationships, num_milestones, logo_height, and logo_width, which were some of the most important features for the status section.
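A rough sketch of that projection, using pd.get_dummies for the dummy matrix and sklearn’s MDS with its default Euclidean dissimilarity; the file name and column list are assumptions, and on the full dataset this is slow, so a sample may be needed.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.manifold import MDS

# Hypothetical merged frame; keep only the string-based categorical features.
df = pd.read_csv("merged_companies.csv")
cat_cols = ["category_code", "had_funding", "region", "degree_type",
            "institution", "subject", "first_name", "last_name"]

# One-hot ("dummy") matrix, then a 2D MDS embedding of its Euclidean distances.
dummies = pd.get_dummies(df[cat_cols]).astype(float)
coords = MDS(n_components=2, random_state=0).fit_transform(dummies.values)

# Color points by status to look for separation, as in the plots below.
plt.scatter(coords[:, 0], coords[:, 1], c=pd.factorize(df["status"])[0], s=5)
plt.show()
```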

mds_status

mds_closed
As you can see, especially for the status plot, there is no real clear pattern to the categories, and the MDS plot does not really show that Euclidean distances truly separate the data. The results are slightly different for closed, where the categorical variables used in this MDS projection did have more importance, but there is just so much blue that it is hard to find the orange section of the plot.
The reason I included this is that these plots led me to believe that these highly categorical features, like school subject or first name, did not separate the data in any real way, at least as we have currently set up the data. This leads me to say that the results of this section are ambiguous. While we were able to find important categorical features for predicting start-up status, the model performance and MDS plots imply that these features may not actually separate the data. Obviously, more extensive analysis is needed to come to a final conclusion.
Author Contributions¶
Annie Maslan¶
Annie focused on the exploratory analysis sections about valuation and number of IPOs by place and time, total funding by place and time, and investor correlations. She also made the Makefile, did the MySQL/environment set-up, and Sphinx build.
Jack Moorer¶
Jack focused on the classification section of this project through the classifcation_data_prep notebook and the classification notebook.
Mitch Negus¶
Mitch focused on the analysis of how education is correlated with business success. Since institution names were recorded inconsistently in the Crunchbase dataset, he developed a tool to aggregate data for a single institution recorded under many names. He also worked to make the Travis CI builds and tests compatible with MySQL databases and improve functionality for databases requiring custom credentials.