Exploratory Analysis #5: Prepare Data for Classification

In [66]:
from modules import *
In [67]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import sklearn
import missingno as msno

plt.style.use('seaborn-dark')
plt.rcParams['figure.figsize'] = (10, 6)
In [68]:
conn = dbConnect()
objects = dbTableToDataFrame(conn, 'cb_objects')
degrees = dbTableToDataFrame(conn, 'cb_degrees')
people = dbTableToDataFrame(conn, 'cb_people')
conn.close()

In this notebook I am going to prepare data to be analyzed in the classification notebook. The classification will attempt to predict company success from categorical predictors about a company and its founder, like the company's region or the founder's academic degree. Success here is based on status, with a status of “closed” representing failure for the company. This is an oversimplification of success, but instead of looking at company value, I want to specifically run classification models.

The predictors I am using are:

- company category_code
- whether they had funding (yes or no)
- number of investors
- number of relationships to other founders
- company logo height
- company logo width
- region
- degree_type (highest degree)
- institution (for highest degree)
- subject (for highest degree)
- first name
- last name

One big issue with this data set, though, is that it is mostly empty. Many of the variables have missing values in place of a “0”, and for a variable like ‘closed_at’, a None value means the company is not closed. This means that if we don’t create dummy variables, the data frame will be almost entirely empty. Let’s take a look at how empty our objects data frame is using missingno.
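The distinction matters for how we treat nulls. A minimal sketch on a toy frame (hypothetical data, just mirroring the ‘closed_at’ pattern described above) shows why naively dropping nulls would be wrong here, and why recoding the null as its own category works:

```python
import pandas as pd
import numpy as np

# Toy frame mirroring the pattern above: a missing 'closed_at'
# means the company is still open, not that the value is unknown.
toy = pd.DataFrame({'name': ['A', 'B', 'C'],
                    'closed_at': ['2012-01-01', None, None]})

# Naively dropping rows with any NaN discards the open companies.
print(toy.dropna().shape)   # (1, 2): only company A survives

# Instead, encode the null as its own category.
toy['closed'] = np.where(toy['closed_at'].isnull(), 'No', 'Yes')
print(toy['closed'].tolist())   # ['Yes', 'No', 'No']
```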

In [69]:
#look at missing values of objects
msno.matrix(objects)
_images/classification_data_prep._7_0.png

The other issue is how we can go about combining the rows of the data frames. We can look at the variable ‘id’ in objects, and see that every other data frame has a matching ‘object_id’. At first, I thought this was great, and would let me combine any of the data frames. However, the ‘id’ in objects (and thus ‘object_id’) is different for founders and companies, and is denoted with a “p” for a person or “c” for a company. Objects has information on everything, but separates information for companies and founders. This means I need to find a way to combine information about companies and their founders.
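To make the prefix issue concrete, here is a sketch on a toy ‘objects’ table. The exact prefix format (e.g. `"c:1"` vs `"c1"`) is an assumption here; the point is only that company and person rows share one table and can be separated by the id prefix:

```python
import pandas as pd

# Toy 'objects' table: ids prefixed with "c" for companies, "p" for people
# (the exact prefix format is assumed for illustration).
objects_toy = pd.DataFrame({'id': ['c:1', 'c:2', 'p:1', 'p:2'],
                            'name': ['Acme', 'Globex', 'Ada', 'Grace']})

companies = objects_toy[objects_toy['id'].str.startswith('c')]
people_rows = objects_toy[objects_toy['id'].str.startswith('p')]
print(companies['name'].tolist())    # ['Acme', 'Globex']
print(people_rows['name'].tolist())  # ['Ada', 'Grace']
```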

In [70]:
#in degrees and offices 'object_id' is matched to 'id' in objects
#I am going to create an 'object_id' in objects
ids = objects['id'].values
objects['object_id'] = ids

I am not starting with all variables as predictors for a couple of reasons. For one, I want to avoid multicollinearity issues. Many of the variables are highly correlated, which is why for now I am only using region, instead of state and country as well. These specific variables were chosen because they are largely categorical, and don’t deal directly with the company’s total funding or worth. I am now going to create some dummy variables to get rid of the emptiness of the data.

In [71]:
#function to convert a column with None values into a filled dummy variable
def convert_null(column, df, positive, negative, pos_val = False):
    """
    description: create a two-class dummy variable for a data frame column with None values

    inputs:
        column: String, must be the name of a column in df
        df: Pandas Data Frame
        positive: String/Boolean/Integer, label for non-null values in the dummy variable
        negative: String/Boolean/Integer, label for null values in the dummy variable
        pos_val: Boolean (optional), if True then non-null entries keep their own value

    output:
        dummy list
    """
    if not isinstance(column, str):
        raise ValueError("column must be a string")
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df must be a pandas data frame")
    col = df[column]
    bools = col.isnull().values
    if pos_val:
        #null entries get the negative label; the rest keep their own value
        return [negative if null else val for null, val in zip(bools, col.values)]
    #null entries get the negative label; the rest get the positive label
    return [negative if null else positive for null in bools]
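As a sanity check on `convert_null`, both of its behaviors can be reproduced with vectorized pandas on a toy frame (a sketch with hypothetical data, using the same column names as below): `np.where` over `isnull` gives the plain positive/negative recode, and `fillna` gives the keep-own-value mode.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'relationships': [3.0, None, 7.0],
                    'first_funding_at': ['2010-05-01', None, '2011-02-03']})

# Equivalent of convert_null(..., 'Yes', 'No'): null -> negative label.
had_funding = np.where(toy['first_funding_at'].isnull(), 'No', 'Yes')
print(list(had_funding))    # ['Yes', 'No', 'Yes']

# Equivalent of convert_null(..., pos_val=True): null -> 0.0, else keep value.
num_rel = toy['relationships'].fillna(0.0)
print(num_rel.tolist())     # [3.0, 0.0, 7.0]
```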
In [72]:
#create the response variable: Yes if the company has a closed date, No if not
closed = convert_null('closed_at', objects, 'Yes', 'No', False)
objects['closed'] = closed
In [73]:
#make dummy variable for funding: Yes if they got funding, No if not
funding = convert_null('first_funding_at', objects, 'Yes', 'No')
objects['had_funding'] = funding
In [74]:
#make variable for number of investments: keep the count, 0.0 if null
investment = convert_null('invested_companies', objects, 'Yes', 0.0, pos_val = True)
objects['num_investment'] = investment
In [75]:
#get number of relationships
num_relationships = convert_null('relationships', objects, 'Yes', 0.0, pos_val = True)
objects['num_relationships'] = num_relationships
In [76]:
#get number of milestones
num_milestones = convert_null('milestones', objects, 'Yes', 0.0, pos_val = True)
objects['num_milestones'] = num_milestones
In [80]:
#deal with logo sizes
width = convert_null('logo_width', objects, 'Yes', 0.0, pos_val = True)
objects['logo_width'] = width

height = convert_null('logo_height', objects, 'Yes', 0.0, pos_val = True)
objects['logo_height'] = height

Now I can store the predictors in one data frame.

In [81]:
def keep_variables(df, variables, object_id = True):
    """
    description: select a subset of columns from a data frame

    inputs:
        df: Pandas Data Frame from which columns are selected
        variables: a list of strings of column names of variables to select
        object_id: (optional) boolean, if False do not keep object_id, else add object_id to variables

    output:
        Pandas Data Frame of the variables we want to keep
    """
    #check inputs are good
    if not isinstance(variables, list):
        raise TypeError("variables must be a list of strings")
    if not all(isinstance(item, str) for item in variables):
        raise ValueError("all items in variables must be strings")
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df must be a pandas data frame")
    if object_id:
        variables = ['object_id'] + variables
    return df[variables]
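`keep_variables` boils down to column selection with an optional `object_id` prepended; on a toy frame (hypothetical data, not the real schema) the behavior looks like this:

```python
import pandas as pd

toy = pd.DataFrame({'object_id': ['p:1', 'p:2'],
                    'first_name': ['Ada', 'Grace'],
                    'last_name': ['Lovelace', 'Hopper'],
                    'extra': [1, 2]})

# keep_variables(toy, ['first_name', 'last_name']) reduces to:
kept = toy[['object_id'] + ['first_name', 'last_name']]
print(list(kept.columns))   # ['object_id', 'first_name', 'last_name']
```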
In [82]:
#keep variables from people
people['name'] = people['affiliation_name']
vars_people = keep_variables(people, ['birthplace', 'first_name', 'last_name', 'name'])
In [83]:
#keep variables from degrees
vars_degrees = keep_variables(degrees, ['degree_type', 'institution', 'subject'])
In [84]:
#keep variables from objects
vars_objects = keep_variables(objects, ['closed', 'status', 'name', 'category_code', 'had_funding', 'num_investment', 'num_relationships', 'num_milestones', 'logo_height', 'logo_width', 'region'])
In [85]:
def drop_duplicates(dfs_info):
    """
    description: drop duplicate rows from several data frames at once

    inputs:
        dfs_info: list of tuples. First item is a data frame, second is the subset
            argument for drop_duplicates, last is the keep argument for drop_duplicates

    output:
        list of pandas data frames equal in length to dfs_info
    """
    dfs = []
    for cur_df, cur_subset, cur_keep in dfs_info:
        edit_df = cur_df.drop_duplicates(subset = cur_subset, keep = cur_keep)
        dfs.append(edit_df)
    return dfs
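The effect of the `keep` argument is worth seeing on a toy ‘degrees’ frame (hypothetical data): one person may have several degree rows, and `keep='last'` retains only the last row per `object_id`, which is what the wrapper above applies.

```python
import pandas as pd

# Toy 'degrees' table: p:1 has two degrees; keep='last' keeps one row per id.
toy_degrees = pd.DataFrame({'object_id': ['p:1', 'p:1', 'p:2'],
                            'degree_type': ['BS', 'MS', 'PhD']})

deduped = toy_degrees.drop_duplicates(subset = 'object_id', keep = 'last')
print(deduped['degree_type'].tolist())   # ['MS', 'PhD']
```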
In [86]:
#set up function
first = (vars_people, 'name', 'first')
second = (vars_degrees, 'object_id', 'last')
dfs_info = [first, second]

#edit dfs
vars_people, vars_degrees = drop_duplicates(dfs_info)
In [88]:
def multi_merge(dfs, on = 'object_id', leave_out = []):
    """
    description: merge multiple data frames

    inputs:
        dfs: list of Pandas Data Frames to be merged
        on: (optional) String, column to be merged on
        leave_out: (optional) list of 2-item tuples. Each tuple's first item is a
            data frame from the dfs list and the second item is a list of columns
            to leave out of the merge.

    output:
        Pandas Data Frame of merged dfs
    """
    if not isinstance(dfs, list):
        raise TypeError("dfs must be a list of Pandas dfs")
    if not all(isinstance(item, pd.DataFrame) for item in dfs):
        raise ValueError("all items in dfs must be Pandas DFs")

    if len(leave_out) != 0:
        if type(leave_out) is not list or not all(isinstance(t, tuple) for t in leave_out):
            raise TypeError("leave_out must be a list of tuples")
        for df_to_alter, cols in leave_out:
            #match by identity: comparing data frames with == is ambiguous
            matches = [i for i, d in enumerate(dfs) if d is df_to_alter]
            if not matches:
                raise ValueError("data frames in leave_out must be in dfs")
            dfs[matches[0]] = df_to_alter.drop(columns = cols)

    first = dfs[0]
    for cur_df in dfs[1:]:
        first = first.merge(cur_df, on = on)
    return first
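The core of `multi_merge` is a chain of inner merges on the shared key, which can be sketched with `functools.reduce` on toy frames (hypothetical data). Note how the inner merge shrinks the result to ids present in every frame:

```python
import pandas as pd
from functools import reduce

# Three toy frames sharing 'object_id'.
a = pd.DataFrame({'object_id': ['c:1', 'c:2'], 'name': ['Acme', 'Globex']})
b = pd.DataFrame({'object_id': ['c:1', 'c:2'], 'region': ['Boston', 'Paris']})
c = pd.DataFrame({'object_id': ['c:1'], 'had_funding': ['Yes']})

# Chained inner merges: only ids present in all three frames survive.
merged = reduce(lambda left, right: left.merge(right, on='object_id'), [a, b, c])
print(merged.shape)   # (1, 4): only c:1 appears in all three frames
```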
In [89]:
#merge the data frames
deg_pep = vars_degrees.merge(vars_people, on = 'object_id')

#drop object id
#df1 = obj_off.drop('object_id', 1)
df1 = vars_objects.drop(columns = 'object_id')
df2 = deg_pep.drop(columns = 'object_id')

dat = df1.merge(df2, on = 'name')
dat.shape
Out[89]:
(14051, 17)
In [90]:
#replace missing value
rep_na = dat.replace("", np.nan)
rep_na = rep_na.replace("unknown", np.nan)

We still have an issue in that much of this data will have completely empty rows. I am going to deal with this by dropping any row that has NA values.

In [91]:
#look at emptiness of data
msno.matrix(rep_na)
_images/classification_data_prep._29_0.png
In [92]:
data = rep_na.dropna(axis = 0, how = 'any')
data.shape
Out[92]:
(2348, 17)
In [93]:
data.head()
Out[93]:
closed status name category_code had_funding num_investment num_relationships num_milestones logo_height logo_width region degree_type institution subject birthplace first_name last_name
7 No operating Fundable.com finance No 3.0 3.0 4.0 120.0 120.0 Columbus BBA Ohio State University Entrepreneurship, Marketing Ohio Eric Corl
8 No operating Wevod games_video Yes 0.0 2.0 0.0 89.0 250.0 Paris MS EPITECH computer sciences and technologies France Nassim Ghandour
11 No acquired Jumptap mobile Yes 0.0 45.0 3.0 165.0 650.0 Boston BA American University Economics St. Petersburg, Russia Veronika Sonsev
18 Yes closed FairSoftware web Yes 0.0 1.0 1.0 67.0 250.0 SF Bay MS Supelec Electrical Engineering Paris, France Alain Raynaud
22 No operating WPP public_relations No 21.0 23.0 3.0 59.0 86.0 New York Degree University of Kentucky Degree Kentucky Jack Smith

Let’s look at the relationship between our data and the response variables. First let’s look at the relationship of the more numerical features to the response variables.

In [94]:
def make_plots(response, feat, data, kind):

    """
    description: helper function to view the response and feature relationship
    inputs:
        response: string name of response variable in data
        feat: string name of feature in data
        data: dataframe
        kind: kind of plot ('box' or 'bar')
    output:
        plot that is saved with a title

    """

    end = ".".join([kind, "png"])
    fig = "_".join([response, end])
    title = " ".join([feat, response, kind])
    fig_end = "_".join([feat, fig])
    save_fig = "/".join(['results', fig_end])

    if (kind == 'box'):
        plt.figure()
        sns.boxplot(data = data, y = feat, x = response)


    if (kind == 'bar'):
        #relative frequency: normalize by response frequency, then by category frequency
        group = data.groupby([feat, response])[[response, feat]].size().unstack()

        resp_freqs = group.sum(axis = 0).values
        cat_code_resp_rel = group / resp_freqs

        cat_freqs = cat_code_resp_rel.sum(axis = 1).values
        cat_code_rel = cat_code_resp_rel.div(cat_freqs, axis  = 0)

        cat_code_rel.plot(kind = 'bar')


    plt.title(title)
    plt.savefig(save_fig)
In [95]:
features = ['num_relationships', 'num_milestones', 'num_investment', 'logo_height', 'logo_width']
for f in features:
    make_plots(response = 'status', feat = f, data = data, kind = 'box')
    make_plots(response = 'closed', feat = f, data = data, kind = 'box')
_images/classification_data_prep._34_0.png
_images/classification_data_prep._34_1.png
_images/classification_data_prep._34_2.png
_images/classification_data_prep._34_3.png
_images/classification_data_prep._34_4.png
_images/classification_data_prep._34_5.png
_images/classification_data_prep._34_6.png
_images/classification_data_prep._34_7.png
_images/classification_data_prep._34_8.png
_images/classification_data_prep._34_9.png

We can see some of these variables have a different relationship depending on the response variable. For example, the number of milestones looks like it could help predict status, and a higher number of relationships seems to correlate with ipo status, but there doesn’t seem to be much separation in number of relationships for the closed response. Interestingly, the features that seem to best separate both responses are the logo sizes (width and height).

I am now going to look at the relationship between the purely string-based categorical features and the response variables. Before I do, I should note that we have a huge imbalance of frequencies in the response variables. You can see the plots of these frequencies at the beginning of the classification notebook; this imbalance had major implications for our ability to classify and for which models we picked. For now, because of this imbalance, I’m going to look at the relative frequency of the response variable. Here is an example of what happens without relative frequency:

In [96]:
data.groupby(['category_code', 'status'])[['status', 'category_code']].size().unstack().plot(kind = 'bar')
plt.show()
_images/classification_data_prep._37_0.png

This doesn’t tell us much, because operating is just so frequent for every variable.

In addition, there are differences in frequencies of each category of the response variable. I want to get an idea of the impact of each category on the status variable. Here is what happens if I don’t divide by feature frequencies.

In [97]:
cat_code = data.groupby(['category_code', 'status'])[['status', 'category_code']].size().unstack()

resp_freqs = cat_code.sum(axis = 0).values
cat_code_resp_rel = cat_code / resp_freqs

cat_code_resp_rel.plot(kind = 'bar')
plt.show()
_images/classification_data_prep._39_0.png
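The double normalization used in the bar branch of `make_plots` (first dividing by response frequency, then by category frequency) can be checked on a toy frame (hypothetical data). After both divisions, each category's bars sum to 1, which is what makes sparse and dominant categories comparable:

```python
import pandas as pd

toy = pd.DataFrame({'category_code': ['web', 'web', 'web', 'games'],
                    'status': ['operating', 'operating', 'closed', 'closed']})

group = toy.groupby(['category_code', 'status']).size().unstack(fill_value=0)

# Normalize by response frequency (columns), then category frequency (rows).
resp_rel = group / group.sum(axis=0).values
cat_rel = resp_rel.div(resp_rel.sum(axis=1).values, axis=0)

# Each category's bars now sum to (approximately) 1.
print(cat_rel.sum(axis=1).tolist())
```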

It is hard to visualize what is going on when web and software companies clearly overpower the rest. I am now going to plot the relationship between most of the string-based categorical features and the two types of response. Note that I am not including birthplace, first name, or last name, simply because there were so many possibilities that the plots were hard to read. You will be able to see what I mean for degree_type, institution, and subject. In retrospect, I should have used a stemmer on these variables; however, I didn’t think of this until after I completed the classification notebook, and the runtime of that notebook is so significant that a change to stemmed categories wasn’t worth it.

In [100]:
features = ['category_code', 'had_funding', 'region', 'degree_type', 'institution', 'subject']
for f in features:
    make_plots(response = 'status', feat = f, data = data, kind = 'bar')
    make_plots(response = 'closed', feat = f, data = data, kind = 'bar')
_images/classification_data_prep._41_0.png
_images/classification_data_prep._41_1.png
_images/classification_data_prep._41_2.png
_images/classification_data_prep._41_3.png
_images/classification_data_prep._41_4.png
_images/classification_data_prep._41_5.png
_images/classification_data_prep._41_6.png
_images/classification_data_prep._41_7.png
_images/classification_data_prep._41_8.png
_images/classification_data_prep._41_9.png
_images/classification_data_prep._41_10.png
_images/classification_data_prep._41_11.png
In [99]:
data.head()
Out[99]:
closed status name category_code had_funding num_investment num_relationships num_milestones logo_height logo_width region degree_type institution subject birthplace first_name last_name
7 No operating Fundable.com finance No 3.0 3.0 4.0 120.0 120.0 Columbus BBA Ohio State University Entrepreneurship, Marketing Ohio Eric Corl
8 No operating Wevod games_video Yes 0.0 2.0 0.0 89.0 250.0 Paris MS EPITECH computer sciences and technologies France Nassim Ghandour
11 No acquired Jumptap mobile Yes 0.0 45.0 3.0 165.0 650.0 Boston BA American University Economics St. Petersburg, Russia Veronika Sonsev
18 Yes closed FairSoftware web Yes 0.0 1.0 1.0 67.0 250.0 SF Bay MS Supelec Electrical Engineering Paris, France Alain Raynaud
22 No operating WPP public_relations No 21.0 23.0 3.0 59.0 86.0 New York Degree University of Kentucky Degree Kentucky Jack Smith

Now I can save the data file and use it in the next notebook.

In [101]:
#save data
data.to_hdf('results/classification_data.h5', 'classification_data')

Move on to the classification notebook next.