Titanic Machine Learning Case

This project is my take on machine learning applied to the famous Titanic case.

Here, we have two datasets (train and test) that contain information on passengers.

The train set contains one extra piece of information: whether the passenger survived. The goal is to find the patterns among survivors and predict whether each passenger in the test set survived.

This is a machine learning project using supervised learning, and more specifically classification.

Get the data

In [1]:
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt
In [2]:
#Get the data

train_set = pd.read_csv("train.csv") 
test_set = pd.read_csv("test.csv") 
In [3]:
# Let's stack datasets together to apply modifications simultaneously

stacked = pd.concat([train_set, test_set], sort=True) #sort=True keeps the sorted column order and silences a FutureWarning
stacked
Out[3]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
0 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 1 3 male 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 female 1 1.0 PC 17599
2 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 3 female 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 female 1 1.0 113803
4 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 5 3 male 0 0.0 373450
... ... ... ... ... ... ... ... ... ... ... ... ...
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668

1309 rows × 12 columns

Data preparation and feature engineering

First impression: Sex and Pclass seem to be determining factors for survival on board.

In [4]:
stacked[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Out[4]:
Sex Survived
0 female 0.742038
1 male 0.188908
In [5]:
stacked[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Out[5]:
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363

This shows that upper-class passengers (in terms of ticket class) and women were more likely to survive.

This analysis helps us select both of these features as important ones.

And now, we can transform Sex into a numerical value!

In [6]:
#Sex
#Conditional function definition:

def s(stacked):
    if stacked['Sex'] == 'male':
        val = 1
    else:
        val = 0
    return val

#Modification of column 'Sex'
stacked['Sex'] = stacked.apply(s, axis=1)
stacked
Out[6]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
0 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 1 3 1 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 0 1 1.0 PC 17599
2 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 3 0 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 0 1 1.0 113803
4 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 5 3 1 0 0.0 373450
... ... ... ... ... ... ... ... ... ... ... ... ...
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 1 0 NaN A.5. 3236
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 0 0 NaN PC 17758
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 1 0 NaN SOTON/O.Q. 3101262
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 1 0 NaN 359309
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 1 1 NaN 2668

1309 rows × 12 columns
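
As an aside, the row-wise apply above can be replaced by a vectorized one-liner (a sketch that would take the place of s() and the apply, not run in addition to them):

#Vectorized equivalent of s(): True/False cast to 1/0
stacked['Sex'] = (stacked['Sex'] == 'male').astype(int)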

However, we can logically remove the names: your name itself does not affect whether you survive.

What can make a difference, however, is your title: it may affect your chances of survival during an evacuation (because of social hierarchy). Let's split the titles out of the names.

In [7]:
#Get the title out of the name
split1 = stacked['Name'].str.split('.').str[0]
title = split1.str.split(' ').str[1]
title
Out[7]:
0          Mr
1         Mrs
2        Miss
3         Mrs
4          Mr
        ...  
413        Mr
414         y
415        Mr
416        Mr
417    Master
Name: Name, Length: 1309, dtype: object
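
Note that this split is fragile: for 'Oliva y Ocana, Dona. Fermina' it picks up 'y' instead of 'Dona' (row 414 above). A regex extraction is more robust (a sketch; not used here, so the outputs below keep the original encoding):

#Hypothetical alternative: grab the word immediately preceding the first '.'
title = stacked['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)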
In [8]:
#Add the column to the dataset
stacked['Title'] = title
stacked = stacked.drop(['Name'], axis=1)
stacked
Out[8]:
Age Cabin Embarked Fare Parch PassengerId Pclass Sex SibSp Survived Ticket Title
0 22.0 NaN S 7.2500 0 1 3 1 1 0.0 A/5 21171 Mr
1 38.0 C85 C 71.2833 0 2 1 0 1 1.0 PC 17599 Mrs
2 26.0 NaN S 7.9250 0 3 3 0 0 1.0 STON/O2. 3101282 Miss
3 35.0 C123 S 53.1000 0 4 1 0 1 1.0 113803 Mrs
4 35.0 NaN S 8.0500 0 5 3 1 0 0.0 373450 Mr
... ... ... ... ... ... ... ... ... ... ... ... ...
413 NaN NaN S 8.0500 0 1305 3 1 0 NaN A.5. 3236 Mr
414 39.0 C105 C 108.9000 0 1306 1 0 0 NaN PC 17758 y
415 38.5 NaN S 7.2500 0 1307 3 1 0 NaN SOTON/O.Q. 3101262 Mr
416 NaN NaN S 8.0500 0 1308 3 1 0 NaN 359309 Mr
417 NaN NaN C 22.3583 1 1309 3 1 1 NaN 2668 Master

1309 rows × 12 columns

Let's change it to numerical values:

In [9]:
stacked['Title'].nunique()
Out[9]:
34

There are 34 different titles.

In [10]:
#Create a second dataframe with the list of titles and assign a number (the index) to each title
t = stacked['Title'].unique()
df = pd.DataFrame(t, columns=['Title'])
df['index'] = df.index
df.head()
Out[10]:
Title index
0 Mr 0
1 Mrs 1
2 Miss 2
3 Master 3
4 Planke, 4
In [11]:
#Merge df with stacked
stacked = pd.merge(stacked, df[['Title', 'index']], on='Title', how='left')
#Delete Title and rename index
stacked = stacked.drop('Title', axis=1)
stacked = stacked.rename(columns={'index':'Title'})
stacked
Out[11]:
Age Cabin Embarked Fare Parch PassengerId Pclass Sex SibSp Survived Ticket Title
0 22.0 NaN S 7.2500 0 1 3 1 1 0.0 A/5 21171 0
1 38.0 C85 C 71.2833 0 2 1 0 1 1.0 PC 17599 1
2 26.0 NaN S 7.9250 0 3 3 0 0 1.0 STON/O2. 3101282 2
3 35.0 C123 S 53.1000 0 4 1 0 1 1.0 113803 1
4 35.0 NaN S 8.0500 0 5 3 1 0 0.0 373450 0
... ... ... ... ... ... ... ... ... ... ... ... ...
1304 NaN NaN S 8.0500 0 1305 3 1 0 NaN A.5. 3236 0
1305 39.0 C105 C 108.9000 0 1306 1 0 0 NaN PC 17758 13
1306 38.5 NaN S 7.2500 0 1307 3 1 0 NaN SOTON/O.Q. 3101262 0
1307 NaN NaN S 8.0500 0 1308 3 1 0 NaN 359309 0
1308 NaN NaN C 22.3583 1 1309 3 1 1 NaN 2668 3

1309 rows × 12 columns
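
Instead of building a lookup dataframe and merging, pandas categoricals give an equivalent integer encoding in one line (a sketch; the codes may come out in a different order than the mapping above):

#Hypothetical one-liner replacing cells In [10] and In [11]
stacked['Title'] = stacked['Title'].astype('category').cat.codes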

If we check our data we will see that some columns have missing values:

In [12]:
#Check for missing values
print(stacked.isnull().sum().sort_values(ascending=False))
Cabin          1014
Survived        418
Age             263
Embarked          2
Fare              1
Title             0
Ticket            0
SibSp             0
Sex               0
Pclass            0
PassengerId       0
Parch             0
dtype: int64

We can see that 'Age', 'Cabin', 'Embarked', 'Fare' and 'Survived' have missing values.

AGE: with almost 300 missing values, it could be interesting to fill them with a value such as the average or the median, which won't skew our analysis.

CABIN: there are too many missing values in Cabin. Therefore, we will just drop the column.

EMBARKED and FARE: they are missing only two and one values respectively, so we will use the most common value to fill the missing ones.

SURVIVED: these missing values correspond to the test dataset. Indeed, since we are trying to predict the survival of these passengers, they don't have any value yet.

In [13]:
#Age description
stacked['Age'].describe()
Out[13]:
count    1046.000000
mean       29.881138
std        14.413493
min         0.170000
25%        21.000000
50%        28.000000
75%        39.000000
max        80.000000
Name: Age, dtype: float64

AGE: here, we can see that the average is almost 30 and the median is 28. Since we have extreme values (up to 80) and a big gap between the third quartile and the maximum, we will use the median to fill the missing rows.

In [14]:
#Replace NaN in Age with the median
median = stacked['Age'].median()
stacked['Age'] = stacked['Age'].fillna(median)

Here, we will use a bit of data visualization to understand the impact of age and sex on survival.

In [15]:
#Visualization with seaborn
import seaborn as sns

grid = sns.FacetGrid(stacked, col='Survived', row='Sex', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

As we can see, age is best handled in ranges: let's create those ranges.

In [16]:
#Range of age
stacked['AgeRange'] = pd.cut(stacked['Age'], 5)
ar = stacked[['AgeRange', 'Survived']].groupby(['AgeRange'], as_index=False).mean().sort_values(by='AgeRange', ascending=True)
ar
Out[16]:
AgeRange Survived
0 (0.0902, 16.136] 0.550000
1 (16.136, 32.102] 0.344168
2 (32.102, 48.068] 0.404255
3 (48.068, 64.034] 0.434783
4 (64.034, 80.0] 0.090909
In [17]:
#Transform the ranges into numerical values
ar['index'] = ar.index
stacked = pd.merge(stacked, ar[['AgeRange', 'index']], on='AgeRange', how='left')
stacked = stacked.drop('AgeRange', axis=1)
stacked = stacked.drop('Age', axis=1)
stacked = stacked.rename(columns={'index':'Age'})
stacked.head()
Out[17]:
Cabin Embarked Fare Parch PassengerId Pclass Sex SibSp Survived Ticket Title Age
0 NaN S 7.2500 0 1 3 1 1 0.0 A/5 21171 0 1
1 C85 C 71.2833 0 2 1 0 1 1.0 PC 17599 1 2
2 NaN S 7.9250 0 3 3 0 0 1.0 STON/O2. 3101282 2 1
3 C123 S 53.1000 0 4 1 0 1 1.0 113803 1 2
4 NaN S 8.0500 0 5 3 1 0 0.0 373450 0 2
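
The same encoding can be obtained in one step: pd.cut can return the bin index directly (a sketch that would replace the two cells above, since the bins are already in ascending order):

#labels=False yields the integer bin number (0 to 4), matching the merge-based encoding
stacked['Age'] = pd.cut(stacked['Age'], 5, labels=False)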
In [18]:
#Drop Cabin
stacked = stacked.drop('Cabin', axis = 1)

Now, we can do the same with Embarked

In [19]:
#Extract most common value
d = stacked.groupby('Embarked')['PassengerId'].count().sort_values(ascending = False).idxmax()
d
Out[19]:
'S'

S is the most common value.
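
pandas can also give the most common value directly (a sketch, equivalent to the groupby above):

d = stacked['Embarked'].mode()[0] #mode() ignores NaN; returns 'S' here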

In [20]:
# Fillna with the most common value of Embarked

stacked['Embarked'] = stacked['Embarked'].fillna(d)

#Change Embarked into numerical values
e = stacked['Embarked'].unique()
dfa = pd.DataFrame(e, columns=['Embarked'])
dfa['index'] = dfa.index
stacked = pd.merge(stacked, dfa[['Embarked', 'index']],left_on = 'Embarked', right_on = 'Embarked', how = 'left')
stacked = stacked.drop('Embarked', axis=1)
stacked = stacked.rename(columns={'index':'Embarked'})
stacked.head(5)
Out[20]:
Fare Parch PassengerId Pclass Sex SibSp Survived Ticket Title Age Embarked
0 7.2500 0 1 3 1 1 0.0 A/5 21171 0 1 0
1 71.2833 0 2 1 0 1 1.0 PC 17599 1 2 1
2 7.9250 0 3 3 0 0 1.0 STON/O2. 3101282 2 1 0
3 53.1000 0 4 1 0 1 1.0 113803 1 2 0
4 8.0500 0 5 3 1 0 0.0 373450 0 2 0
In [21]:
#Same for Fare
most_common_val = stacked.groupby('Fare')['PassengerId'].count().sort_values(ascending = False).idxmax()

stacked['Fare'] = stacked['Fare'].fillna(most_common_val)
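
Since Fare is continuous, filling with the median is a common alternative (a sketch; not applied here, so the results below are unchanged):

#Hypothetical alternative: fill the missing Fare with the median instead of the most common value
fare_median = stacked['Fare'].median()
#stacked['Fare'] = stacked['Fare'].fillna(fare_median)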

Ticket has 929 different values. We can therefore say that this feature is going to be useless: too many distinct values compared to the number of passengers.

In [22]:
stacked['Ticket'].nunique()
Out[22]:
929
In [23]:
#Drop Ticket
stacked = stacked.drop('Ticket', axis =1)

Now, we are going to flag whether a person is travelling alone or not.

In [24]:
# Intermediate column: Family_count
stacked['Family_count'] = stacked['SibSp']+stacked['Parch']
stacked['Family_count'].head()
Out[24]:
0    1
1    1
2    0
3    1
4    0
Name: Family_count, dtype: int64
In [25]:
#Conditional function definition:

def f(stacked):
    if stacked['Family_count'] < 1:
        val = 1
    else:
        val = 0
    return val
In [26]:
#Creation of column 'IsAlone'
stacked['IsAlone'] = stacked.apply(f, axis=1)
stacked[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
Out[26]:
IsAlone Survived
0 0 0.505650
1 1 0.303538
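
As an aside, the same flag can be computed without a row-wise apply (a sketch that would replace f and the apply above):

#A passenger is alone when Family_count is 0
stacked['IsAlone'] = (stacked['Family_count'] == 0).astype(int)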

This information is relevant; we can now drop the following columns: SibSp, Parch and Family_count.

In [27]:
#Drop SibSp, Family_count and Parch
stacked = stacked.drop('Parch', axis = 1)
stacked = stacked.drop('SibSp', axis = 1)
stacked = stacked.drop('Family_count', axis = 1)
In [28]:
stacked.head()
Out[28]:
Fare PassengerId Pclass Sex Survived Title Age Embarked IsAlone
0 7.2500 1 3 1 0.0 0 1 0 0
1 71.2833 2 1 0 1.0 1 2 1 0
2 7.9250 3 3 0 1.0 2 1 0 1
3 53.1000 4 1 0 1.0 1 2 0 0
4 8.0500 5 3 1 0.0 0 2 0 1

Now that our data is ready, we can re-split it into train and test sets :)!

Splitting data

In [29]:
#historical is the part of the data that contains the 'Survived' column
historical = stacked.loc[stacked['Survived'].notnull()]
historical
Out[29]:
Fare PassengerId Pclass Sex Survived Title Age Embarked IsAlone
0 7.2500 1 3 1 0.0 0 1 0 0
1 71.2833 2 1 0 1.0 1 2 1 0
2 7.9250 3 3 0 1.0 2 1 0 1
3 53.1000 4 1 0 1.0 1 2 0 0
4 8.0500 5 3 1 0.0 0 2 0 1
... ... ... ... ... ... ... ... ... ...
886 13.0000 887 2 1 0.0 6 1 0 1
887 30.0000 888 1 0 1.0 2 1 0 1
888 23.4500 889 3 0 0.0 2 1 0 0
889 30.0000 890 1 1 1.0 0 1 1 1
890 7.7500 891 3 1 0.0 0 1 2 1

891 rows × 9 columns

In [30]:
#Does not contain 'Survived'
topredict_set = stacked.loc[stacked['Survived'].isnull()]
topredict_set = topredict_set.drop('Survived', axis = 1)
topredict_set 
Out[30]:
Fare PassengerId Pclass Sex Title Age Embarked IsAlone
891 7.8292 892 3 1 0 2 2 1
892 7.0000 893 3 0 1 2 0 0
893 9.6875 894 2 1 0 3 2 1
894 8.6625 895 3 1 0 1 0 1
895 12.2875 896 3 0 1 1 0 0
... ... ... ... ... ... ... ... ...
1304 8.0500 1305 3 1 0 1 0 1
1305 108.9000 1306 1 0 13 2 1 1
1306 7.2500 1307 3 1 0 2 0 1
1307 8.0500 1308 3 1 0 1 0 1
1308 22.3583 1309 3 1 3 1 1 0

418 rows × 8 columns

Now, let's set aside a test set for later validation.

In [31]:
#create the test set:
def split_train_test(data, test_ratio):
    np.random.seed(42) #always generate the same random data
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]


train_set, test_set = split_train_test(historical, 0.20)
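scikit-learn ships an equivalent helper (a sketch; it draws its own random permutation, so the exact rows may differ from the function above):

#Same 80/20 split with a fixed seed
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(historical, test_size=0.20, random_state=42)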
In [32]:
#Defining X_train and Y_train
X_train = train_set.drop('Survived', axis=1)
Y_train = train_set['Survived']

Let's start playing with classifiers

In [33]:
#Define display_scores to print the metrics that will allow us to select the best model
def display_scores(scores):
    print('Scores:', scores)
    print('Mean', scores.mean())
    print('Standard deviation', scores.std())

Random Forest Classifier

In [34]:
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators=1000)
In [35]:
RFC.fit(X_train, Y_train)
predicted_Y_train = RFC.predict(X_train)

from sklearn.model_selection import cross_val_score #cross-validated accuracy scores
RFC_scores = cross_val_score(RFC, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(RFC_scores) #10-fold cross-validated accuracy
print("\n")

from sklearn.model_selection import cross_val_predict #cross-validated predictions
RFC_train_Y_pred = cross_val_predict(RFC, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
RFC_precision= precision_score(Y_train, RFC_train_Y_pred) 
RFC_recall= recall_score(Y_train, RFC_train_Y_pred)
print('Precision is:', RFC_precision, 'and Recall is:', RFC_recall)
Scores: [0.83333333 0.76388889 0.77777778 0.86111111 0.83098592 0.76056338
 0.8028169  0.78873239 0.76056338 0.91428571]
Mean 0.8094058797227811
Standard deviation 0.04798494769564089


Precision is: 0.7922077922077922 and Recall is: 0.6802973977695167
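
For reference, precision = TP / (TP + FP) and recall = TP / (TP + FN). A quick sanity check against the confusion matrix (a sketch using the cross-validated predictions above):

from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(Y_train, RFC_train_Y_pred).ravel()
print('Precision:', tp / (tp + fp), 'Recall:', tp / (tp + fn)) #matches precision_score / recall_score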

SGD classifier

In [36]:
from sklearn.linear_model import SGDClassifier
SGD = SGDClassifier(random_state = 42)
In [37]:
SGD.fit(X_train, Y_train)
predicted_Y_train = SGD.predict(X_train)

from sklearn.model_selection import cross_val_score #cross-validated accuracy scores
SGD_scores = cross_val_score(SGD, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(SGD_scores) #10-fold cross-validated accuracy
print("\n")

from sklearn.model_selection import cross_val_predict #cross-validated predictions
SGD_train_Y_pred = cross_val_predict(SGD, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
SGD_precision= precision_score(Y_train, SGD_train_Y_pred) 
SGD_recall= recall_score(Y_train, SGD_train_Y_pred)
print('Precision is:', SGD_precision, 'and Recall is:', SGD_recall)
Scores: [0.66666667 0.68055556 0.65277778 0.72222222 0.61971831 0.63380282
 0.63380282 0.64788732 0.63380282 0.37142857]
Mean 0.6262664878157838
Standard deviation 0.08950582124891168


Precision is: 0.4375 and Recall is: 0.6245353159851301

Logistic Regression

In [38]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(random_state = 42)


from warnings import simplefilter #every iteration with logreg output an annoying future warning
simplefilter(action = 'ignore', category = FutureWarning)
In [39]:
LR.fit(X_train, Y_train)
predicted_Y_train = LR.predict(X_train)

from sklearn.model_selection import cross_val_score #cross-validated accuracy scores
LR_scores = cross_val_score(LR, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(LR_scores) #10-fold cross-validated accuracy
print("\n")

from sklearn.model_selection import cross_val_predict #cross-validated predictions
LR_train_Y_pred = cross_val_predict(LR, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
LR_precision= precision_score(Y_train, LR_train_Y_pred) 
LR_recall= recall_score(Y_train, LR_train_Y_pred)
print('Precision is:', LR_precision, 'and Recall is:', LR_recall)
Scores: [0.79166667 0.76388889 0.70833333 0.90277778 0.8028169  0.70422535
 0.70422535 0.74647887 0.77464789 0.91428571]
Mean 0.7813346747149564
Standard deviation 0.0719566042132041


Precision is: 0.724 and Recall is: 0.6728624535315985

K Neighbors Classifier

In [40]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing #optionally, to scale X later
KN = KNeighborsClassifier()
In [41]:
KN.fit(X_train, Y_train)
predicted_Y_train = KN.predict(X_train)

from sklearn.model_selection import cross_val_score #cross-validated accuracy scores
KN_scores = cross_val_score(KN, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(KN_scores) #10-fold cross-validated accuracy
print("\n")

from sklearn.model_selection import cross_val_predict #cross-validated predictions
KN_train_Y_pred = cross_val_predict(KN, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
KN_precision= precision_score(Y_train, KN_train_Y_pred) 
KN_recall= recall_score(Y_train, KN_train_Y_pred)
print('Precision is:', KN_precision, 'and Recall is:', KN_recall)
Scores: [0.65277778 0.70833333 0.56944444 0.69444444 0.57746479 0.6056338
 0.6056338  0.61971831 0.56338028 0.65714286]
Mean 0.625397384305835
Standard deviation 0.048467685683036686


Precision is: 0.5022421524663677 and Recall is: 0.4163568773234201

Decision Tree Classifier

In [42]:
from sklearn import tree
DT = tree.DecisionTreeClassifier()
In [43]:
DT.fit(X_train, Y_train)
predicted_Y_train = DT.predict(X_train)

from sklearn.model_selection import cross_val_score #cross-validated accuracy scores
DT_scores = cross_val_score(DT, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(DT_scores) #10-fold cross-validated accuracy
print("\n")

from sklearn.model_selection import cross_val_predict #cross-validated predictions
DT_train_Y_pred = cross_val_predict(DT, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
DT_precision= precision_score(Y_train, DT_train_Y_pred) 
DT_recall= recall_score(Y_train, DT_train_Y_pred)
print('Precision is:', DT_precision, 'and Recall is:', DT_recall)
Scores: [0.72222222 0.58333333 0.70833333 0.77777778 0.74647887 0.71830986
 0.74647887 0.67605634 0.77464789 0.82857143]
Mean 0.7282209926224011
Standard deviation 0.06292182587870337


Precision is: 0.6498194945848376 and Recall is: 0.6691449814126395

Bernoulli Naive Bayes Classifier

In [44]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

BNB = BernoulliNB(binarize=True) #features are thresholded at 1.0 (True is cast to 1.0)
In [45]:
BNB.fit(X_train, Y_train)
predicted_Y_train = BNB.predict(X_train)

from sklearn.model_selection import cross_val_score #cross-validated accuracy scores
BNB_scores = cross_val_score(BNB, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(BNB_scores) #10-fold cross-validated accuracy
print("\n")

from sklearn.model_selection import cross_val_predict #cross-validated predictions
BNB_train_Y_pred = cross_val_predict(BNB, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
BNB_precision= precision_score(Y_train, BNB_train_Y_pred) 
BNB_recall= recall_score(Y_train, BNB_train_Y_pred)
print('Precision is:', BNB_precision, 'and Recall is:', BNB_recall)
Scores: [0.625      0.66666667 0.59722222 0.65277778 0.67605634 0.56338028
 0.6056338  0.64788732 0.57746479 0.68571429]
Mean 0.629780348759222
Standard deviation 0.040372320930531114


Precision is: 0.5815217391304348 and Recall is: 0.39776951672862454

Gaussian Naive Bayes Classifier

In [46]:
from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB() 
In [47]:
GNB.fit(X_train, Y_train)
predicted_Y_train = GNB.predict(X_train)

from sklearn.model_selection import cross_val_score #cross-validated accuracy scores
GNB_scores = cross_val_score(GNB, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(GNB_scores) #10-fold cross-validated accuracy
print("\n")

from sklearn.model_selection import cross_val_predict #cross-validated predictions
GNB_train_Y_pred = cross_val_predict(GNB, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
GNB_precision= precision_score(Y_train, GNB_train_Y_pred) 
GNB_recall= recall_score(Y_train, GNB_train_Y_pred)
print('Precision is:', GNB_precision, 'and Recall is:', GNB_recall)
Scores: [0.76388889 0.77777778 0.70833333 0.93055556 0.78873239 0.67605634
 0.69014085 0.71830986 0.70422535 0.87142857]
Mean 0.7629448915716521
Standard deviation 0.078753213947817


Precision is: 0.7211155378486056 and Recall is: 0.6728624535315985

Linear Discriminant Analysis

In [48]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
LDA = LinearDiscriminantAnalysis()
In [49]:
LDA.fit(X_train, Y_train)
predicted_Y_train = LDA.predict(X_train)

from sklearn.model_selection import cross_val_score #cross-validated accuracy scores
LDA_scores = cross_val_score(LDA, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(LDA_scores) #10-fold cross-validated accuracy
print("\n")

from sklearn.model_selection import cross_val_predict #cross-validated predictions
LDA_train_Y_pred = cross_val_predict(LDA, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
LDA_precision= precision_score(Y_train, LDA_train_Y_pred) 
LDA_recall= recall_score(Y_train, LDA_train_Y_pred)
print('Precision is:', LDA_precision, 'and Recall is:', LDA_recall)
Scores: [0.81944444 0.75       0.69444444 0.91666667 0.81690141 0.67605634
 0.73239437 0.78873239 0.77464789 0.9       ]
Mean 0.7869287949921754
Standard deviation 0.07543905358783627


Precision is: 0.7346938775510204 and Recall is: 0.6691449814126395
In [50]:
#We create a matrix to compare results (each list must line up with the Index order)
MCA = {'accuracy': [RFC_scores.mean(), SGD_scores.mean(), LR_scores.mean(), KN_scores.mean(), DT_scores.mean(), BNB_scores.mean(), GNB_scores.mean(), LDA_scores.mean()],
     'precision': [RFC_precision, SGD_precision, LR_precision, KN_precision, DT_precision, BNB_precision, GNB_precision, LDA_precision],
    'Index' :['RFC','SGD','LR','KN','DT','BNB','GNB','LDA']
    }
accuracy_matrix = pd.DataFrame(MCA)
accuracy_matrix = accuracy_matrix.set_index('Index')
print(accuracy_matrix.sort_values('accuracy', ascending = False))
       accuracy  precision
Index                     
RFC    0.809406   0.792208
LDA    0.786929   0.734694
LR     0.781335   0.724000
GNB    0.762945   0.721116
DT     0.728221   0.649819
BNB    0.629780   0.581522
SGD    0.626266   0.437500
KN     0.625397   0.502242
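
As an aside, keeping each model's metrics together in one tuple makes it impossible for the accuracy and precision lists to fall out of alignment (a sketch, equivalent to the cell above):

#One row per model: (name, CV scores, precision)
rows = [('RFC', RFC_scores, RFC_precision), ('SGD', SGD_scores, SGD_precision),
        ('LR', LR_scores, LR_precision), ('KN', KN_scores, KN_precision),
        ('DT', DT_scores, DT_precision), ('BNB', BNB_scores, BNB_precision),
        ('GNB', GNB_scores, GNB_precision), ('LDA', LDA_scores, LDA_precision)]
comparison = pd.DataFrame({'accuracy': [s.mean() for _, s, _ in rows],
                           'precision': [p for _, _, p in rows]},
                          index=[name for name, _, _ in rows])
print(comparison.sort_values('accuracy', ascending=False))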

Here, we decide to keep the top 3: RFC, LDA and LR,

and we will now try to find the best parameters for these models!

Hyperparameter tuning

RFC tuned

In [51]:
#Hyperparameter tuning using RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'n_estimators': [150, 200, 400, 700, 1000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'criterion': ['entropy', 'gini'],
    'max_depth': [5, 50, 150],
    'min_samples_leaf': [2, 10, 50]}

random = RandomizedSearchCV(estimator=RFC, param_distributions=param_grid, cv=3, n_jobs=-1)

random_result = random.fit(X_train, Y_train)

#Summarize results
print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))
Best: 0.823282 using {'n_estimators': 400, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 50, 'criterion': 'entropy'}
In [52]:
RFC = RandomForestClassifier(**random_result.best_params_)

RFC.fit(X_train, Y_train)

from sklearn.model_selection import cross_val_score
RFC_scores_tuned = cross_val_score(RFC, X_train, Y_train, cv=15, scoring='accuracy')

display_scores(RFC_scores_tuned)

print('\n')

from sklearn.model_selection import cross_val_predict #cross-validated predictions
RFC_train_Y_pred_tuned = cross_val_predict(RFC, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
RFC_precision_tuned = precision_score(Y_train, RFC_train_Y_pred_tuned)
RFC_recall_tuned = recall_score(Y_train, RFC_train_Y_pred_tuned)
print('Precision is:', RFC_precision_tuned, 'and Recall is:', RFC_recall_tuned)
Scores: [0.79166667 0.83333333 0.72916667 0.75       0.875      0.875
 0.85416667 0.83333333 0.77083333 0.74468085 0.80851064 0.76595745
 0.72340426 0.87234043 0.91304348]
Mean 0.8093624730188096
Standard deviation 0.0587079320945668


Precision is: 0.8287037037037037 and Recall is: 0.6654275092936803

LR tuned

In [53]:
from sklearn.model_selection import GridSearchCV
In [54]:
LogReg = LogisticRegression()

param_grid = [{'penalty' : ['l2', 'none'], 'solver' : ['newton-cg', 'lbfgs', 'sag', 'saga']}]
grid_search = GridSearchCV(LogReg, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, Y_train)
print(grid_search.best_params_)
C:\Users\garan\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:947: ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations.
  "of iterations.", ConvergenceWarning)
(the same ConvergenceWarning repeats for each fold, and analogously for the sag solver)
{'penalty': 'l2', 'solver': 'lbfgs'}
In [55]:
LR = LogisticRegression(**grid_search.best_params_)

LR.fit(X_train, Y_train)

from sklearn.model_selection import cross_val_score
LR_scores_tuned = cross_val_score(LR, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(LR_scores_tuned)

print('\n')

from sklearn.model_selection import cross_val_predict #cross-validated predictions
LR_train_Y_pred_tuned = cross_val_predict(LR, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
LR_precision_tuned = precision_score(Y_train, LR_train_Y_pred_tuned)
LR_recall_tuned = recall_score(Y_train, LR_train_Y_pred_tuned)
print('Precision is:', LR_precision_tuned, 'and Recall is:', LR_recall_tuned)
Scores: [0.77777778 0.76388889 0.69444444 0.93055556 0.8028169  0.70422535
 0.67605634 0.74647887 0.76056338 0.88571429]
Mean 0.7742521797451374
Standard deviation 0.07724280654985156


Precision is: 0.7295081967213115 and Recall is: 0.6617100371747212

LDA tuned

In [56]:
LDA = LinearDiscriminantAnalysis()

param_grid = [{'solver' : ['svd', 'lsqr'], 'n_components' : [None, 2, 5, 10, 15]}]
grid_search = GridSearchCV(LDA, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, Y_train)
print(grid_search.best_params_)
C:\Users\garan\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:466: ChangedBehaviorWarning: n_components cannot be larger than min(n_features, n_classes - 1). Using min(n_features, n_classes - 1) = min(8, 2 - 1) = 1 components.
  ChangedBehaviorWarning)
(the same ChangedBehaviorWarning repeats for every fold where n_components exceeds 1)
{'n_components': None, 'solver': 'svd'}
In [57]:
LDA = LinearDiscriminantAnalysis(**grid_search.best_params_)

LDA.fit(X_train, Y_train)

from sklearn.model_selection import cross_val_score
LDA_scores_tuned = cross_val_score(LDA, X_train, Y_train, cv=10, scoring='accuracy')

display_scores(LDA_scores_tuned)

print('\n')

from sklearn.model_selection import cross_val_predict #cross-validated predictions
LDA_train_Y_pred_tuned = cross_val_predict(LDA, X_train, Y_train, cv=3)

from sklearn.metrics import precision_score, recall_score
LDA_precision_tuned = precision_score(Y_train, LDA_train_Y_pred_tuned)
LDA_recall_tuned = recall_score(Y_train, LDA_train_Y_pred_tuned)
print('Precision is:', LDA_precision_tuned, 'and Recall is:', LDA_recall_tuned)
Scores: [0.81944444 0.75       0.69444444 0.91666667 0.81690141 0.67605634
 0.73239437 0.78873239 0.77464789 0.9       ]
Mean 0.7869287949921754
Standard deviation 0.07543905358783627


Precision is: 0.7346938775510204 and Recall is: 0.6691449814126395
In [58]:
#We create a matrix to compare results
MCA2 = {'accuracy': [RFC_scores_tuned.mean(), LR_scores_tuned.mean(), LDA_scores_tuned.mean()],
     'precision': [RFC_precision_tuned, LR_precision_tuned, LDA_precision_tuned],
    'Index' :['RFC','LR','LDA']
    }
accuracy_matrix = pd.DataFrame(MCA2)
accuracy_matrix = accuracy_matrix.set_index('Index')
print(accuracy_matrix.sort_values('accuracy', ascending = False))
       accuracy  precision
Index                     
RFC    0.809362   0.828704
LDA    0.786929   0.734694
LR     0.774252   0.729508

Here we can see that RFC is our best model so far.

These results are not perfect; however, they are the best I've got, even after reworking my features and tuning the hyperparameters. Besides, this is my first machine learning project, so I am going to stick with these results.

Preparing the test_set

In [59]:
X_test = test_set.drop('Survived', axis=1)
Y_test = test_set['Survived']
In [60]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
In [61]:
#Applying RFC to the test set
RFC = RandomForestClassifier(**random_result.best_params_)

RFC.fit(X_train, Y_train)
Y_test_pred = RFC.predict(X_test)

accuracy_test = accuracy_score(Y_test, Y_test_pred)
precision_test = precision_score(Y_test, Y_test_pred)

print('Accuracy :', accuracy_test)
print('Precision :', precision_test)
Accuracy : 0.8314606741573034
Precision : 0.8028169014084507

Our results are similar on both the train and test sets. We can therefore move on to prediction.

Prediction

In [62]:
RFC = RandomForestClassifier(**random_result.best_params_)

RFC.fit(X_train, Y_train)
prediction = RFC.predict(topredict_set)
topredict_set['prediction'] = prediction
topredict_set
Out[62]:
Fare PassengerId Pclass Sex Title Age Embarked IsAlone prediction
891 7.8292 892 3 1 0 2 2 1 0.0
892 7.0000 893 3 0 1 2 0 0 0.0
893 9.6875 894 2 1 0 3 2 1 0.0
894 8.6625 895 3 1 0 1 0 1 0.0
895 12.2875 896 3 0 1 1 0 0 1.0
... ... ... ... ... ... ... ... ... ...
1304 8.0500 1305 3 1 0 1 0 1 0.0
1305 108.9000 1306 1 0 13 2 1 1 1.0
1306 7.2500 1307 3 1 0 2 0 1 0.0
1307 8.0500 1308 3 1 0 1 0 1 0.0
1308 22.3583 1309 3 1 3 1 1 0 0.0

418 rows × 9 columns

In [63]:
#Cast the prediction to an integer column
topredict_set['Prediction'] = topredict_set['prediction'].astype(int)
topredict_set
Out[63]:
Fare PassengerId Pclass Sex Title Age Embarked IsAlone prediction Prediction
891 7.8292 892 3 1 0 2 2 1 0.0 0
892 7.0000 893 3 0 1 2 0 0 0.0 0
893 9.6875 894 2 1 0 3 2 1 0.0 0
894 8.6625 895 3 1 0 1 0 1 0.0 0
895 12.2875 896 3 0 1 1 0 0 1.0 1
... ... ... ... ... ... ... ... ... ... ...
1304 8.0500 1305 3 1 0 1 0 1 0.0 0
1305 108.9000 1306 1 0 13 2 1 1 1.0 1
1306 7.2500 1307 3 1 0 2 0 1 0.0 0
1307 8.0500 1308 3 1 0 1 0 1 0.0 0
1308 22.3583 1309 3 1 3 1 1 0 0.0 0

418 rows × 10 columns
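
For a Kaggle submission, the expected format is a CSV with just PassengerId and Survived (a sketch; the filename is an arbitrary choice):

#Build a submission file from the integer predictions
submission = topredict_set[['PassengerId', 'Prediction']].rename(columns={'Prediction': 'Survived'})
submission.to_csv('submission.csv', index=False)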

This is the end!

Thank you for checking my work!
