This project is my take on machine learning applied to the famous Titanic case.
Here, we have two datasets (train and test) that contain information on passengers.
The train set contains one extra piece of information: whether the passenger survived. The goal is to find the patterns among survivors in order to predict whether a passenger from the test set would survive.
This is a machine learning project using supervised learning, and more specifically classification.
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt
#Get the data
train_set = pd.read_csv("train.csv")
test_set = pd.read_csv("test.csv")
# Let's stack the datasets together to apply modifications to both at once
stacked = pd.concat([train_set, test_set], ignore_index=True, sort=False)
stacked
First impression: Sex and Pclass seem to be determining factors for survival on board.
stacked[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
stacked[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
This shows that passengers in the upper classes (in terms of ticket class) and women were more likely to survive.
This analysis helps us select both of these features as important ones.
And now, we can transform Sex into a numerical value!
#Sex
#Conditional function definition:
def s(row):
    if row['Sex'] == 'male':
        val = 1
    else:
        val = 0
    return val
#Modification of column 'Sex'
stacked['Sex'] = stacked.apply(s, axis=1)
stacked
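As a side note, the same encoding can be obtained without `apply`, using the vectorized `Series.map`. A minimal sketch on a toy frame (not the real data):

```python
import pandas as pd

# Toy frame standing in for `stacked`
demo = pd.DataFrame({"Sex": ["male", "female", "male"]})

# Vectorized mapping: usually faster than DataFrame.apply on large frames
demo["Sex"] = demo["Sex"].map({"male": 1, "female": 0})
print(demo["Sex"].tolist())  # [1, 0, 1]
```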
However, we can remove the names: indeed, your name by itself does not impact whether you survive or not.
What can make a difference, however, is your title: it may impact your survival in case of evacuation (because of hierarchy). Let's split the titles out of the names.
#Get the title out of the name
split1 = stacked['Name'].str.split('.').str[0]
title = split1.str.split(' ').str[1]
title
#Add the column to the dataset
stacked['Title'] = title
stacked = stacked.drop(['Name'], axis=1)
stacked
Let's change it to numerical values:
stacked['Title'].nunique()
There are 34 different titles
#we create a second dataframe with title list and we attribute one number (index) to each title
t = stacked['Title'].unique()
df = pd.DataFrame(t, columns=['Title'])
df['index'] = df.index
df.head()
#merge df with stacked
stacked = pd.merge(stacked, df[['Title', 'index']], on='Title', how='left')
#delete title and rename index
stacked = stacked.drop('Title', axis=1)
stacked = stacked.rename(columns={'index':'Title'})
stacked
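For reference, `pd.factorize` performs this unique-value-to-integer mapping in a single call, which could replace the helper dataframe and the merge. A sketch on hypothetical titles:

```python
import pandas as pd

# Hypothetical Title column
demo = pd.DataFrame({"Title": ["Mr", "Mrs", "Mr", "Miss"]})

# factorize assigns one integer per distinct value, in order of appearance
codes, uniques = pd.factorize(demo["Title"])
demo["Title"] = codes
print(demo["Title"].tolist())  # [0, 1, 0, 2]
```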
If we check our data we will see that some columns have missing values:
#Check for missing values
print(stacked.isnull().sum().sort_values(ascending=False))
We can see that 'Age', 'Cabin', 'Embarked', 'Fare' and 'Survived' have missing values.
AGE: with almost 300 missing values, it is worth filling them with a value such as the average or the median, which won't distort our analysis.
CABIN: there are too many missing values in Cabin. Therefore, we will just drop the column.
EMBARKED and FARE: they are missing only two and one values respectively; we will use the most common value to fill them.
SURVIVED: these missing values correspond to the test dataset. Indeed, since we are looking to predict the survival of these passengers, they don't have any value yet.
#Age description
stacked['Age'].describe()
AGE: here, we can see that the average is almost 30 and the median is 28. Since we have extreme values (80) and a big gap between the third quartile and the max value, we choose the median to fill the missing rows.
#Replace NaN in Age by the median
median = stacked['Age'].median()
stacked['Age'] = stacked['Age'].fillna(median) #avoid relying on inplace, which returns None
Here, we will use a bit of data visualization to understand the impact of age and sex on survival.
#Visualization with seaborn
import seaborn as sns
grid = sns.FacetGrid(stacked, col='Survived', row='Sex', height=2.2, aspect=1.6) #seaborn >= 0.9 uses height instead of size
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();
As we can see, age can be grouped into ranges. Let's create those ranges.
#Range of age
stacked['AgeRange'] = pd.cut(stacked['Age'], 5)
ar = stacked[['AgeRange', 'Survived']].groupby(['AgeRange'], as_index=False).mean().sort_values(by='AgeRange', ascending=True)
ar
# transform range into numerical values
ar['index']= ar.index
stacked = pd.merge(stacked, ar[['AgeRange', 'index']], on='AgeRange', how='left')
stacked = stacked.drop('AgeRange', axis=1)
stacked = stacked.drop('Age', axis=1)
stacked = stacked.rename(columns={'index':'Age'})
stacked.head()
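Incidentally, `pd.cut` can return the bin index directly with `labels=False`, which would skip the groupby/merge detour. A toy sketch:

```python
import pandas as pd

# Hypothetical ages spanning the five bins
ages = pd.Series([2, 25, 40, 60, 79])

# labels=False yields the integer bin index instead of an interval
bins = pd.cut(ages, 5, labels=False)
print(bins.tolist())  # [0, 1, 2, 3, 4]
```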
#Drop Cabin
stacked = stacked.drop('Cabin', axis = 1)
Now, we can do the same with Embarked
#Extract most common value
d = stacked.groupby('Embarked')['PassengerId'].count().idxmax() #idxmax returns the most frequent value directly
d
S is the most common value
# Fillna with the most common value of Embarked
stacked['Embarked'] = stacked['Embarked'].fillna(d)
#Change Embarked into numerical values
e = stacked['Embarked'].unique()
dfa = pd.DataFrame(e, columns=['Embarked'])
dfa['index'] = dfa.index
stacked = pd.merge(stacked, dfa[['Embarked', 'index']], on='Embarked', how='left')
stacked = stacked.drop('Embarked', axis=1)
stacked = stacked.rename(columns={'index':'Embarked'})
stacked.head(5)
#Idem for Fare
most_common_val = stacked.groupby('Fare')['PassengerId'].count().idxmax()
stacked['Fare'] = stacked['Fare'].fillna(most_common_val)
Ticket has 929 different values. We can therefore say that this feature is going to be useless: too many distinct types compared to the number of passengers.
stacked['Ticket'].nunique()
#Drop Ticket
stacked = stacked.drop('Ticket', axis=1)
Now, we are going to flag whether a person is alone or not
# Intermediate column: Family_count
stacked['Family_count'] = stacked['SibSp']+stacked['Parch']
stacked['Family_count'].head()
#Conditional function definition:
def f(row):
    if row['Family_count'] < 1:
        val = 1
    else:
        val = 0
    return val
#Creation of column 'IsAlone'
stacked['IsAlone'] = stacked.apply(f, axis=1)
stacked[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
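The same flag can also be computed without `apply`, with a single vectorized comparison. A sketch on toy data:

```python
import pandas as pd

demo = pd.DataFrame({"SibSp": [0, 1, 0], "Parch": [0, 0, 2]})

# A passenger is alone when they have no siblings/spouse and no parents/children aboard
demo["IsAlone"] = ((demo["SibSp"] + demo["Parch"]) == 0).astype(int)
print(demo["IsAlone"].tolist())  # [1, 0, 0]
```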
This information is relevant; we can now drop the following columns: SibSp, Family_count and Parch.
#Drop SibSp, Family_count and Parch
stacked = stacked.drop('Parch', axis = 1)
stacked = stacked.drop('SibSp', axis = 1)
stacked = stacked.drop('Family_count', axis = 1)
stacked.head()
Now that our Data is ready, we can re-split it into train and test set :) !
#historical is the share of the data that contains the 'Survived' column
historical = stacked.loc[stacked['Survived'].notnull()]
historical
#Does not contain 'Survived'
topredict_set = stacked.loc[stacked['Survived'].isnull()]
topredict_set = topredict_set.drop('Survived', axis = 1)
topredict_set
Now, let's create a test set for cross validation.
#create the test set:
def split_train_test(data, test_ratio):
np.random.seed(42) #always generate the same random data
shuffled_indices = np.random.permutation(len(data))
test_set_size = int(len(data) * test_ratio)
test_indices = shuffled_indices[:test_set_size]
train_indices = shuffled_indices[test_set_size:]
return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(historical, 0.20)
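For reference, scikit-learn ships an equivalent helper, `train_test_split`, which does the same shuffle-and-split with a reproducible seed. A sketch on a toy frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `historical`
demo = pd.DataFrame({"x": range(10), "Survived": [0, 1] * 5})

# random_state plays the role of np.random.seed(42) above
tr, te = train_test_split(demo, test_size=0.20, random_state=42)
print(len(tr), len(te))  # 8 2
```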
#Defining X_train and Y_train
X_train = train_set.drop('Survived', axis=1)
Y_train = train_set['Survived']
#we define display_scores to display different scores that will allow us to select the best model
def display_scores(scores):
    print('Scores:', scores)
    print('Mean:', scores.mean())
    print('Standard deviation:', scores.std())
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators=1000)
RFC.fit(X_train, Y_train)
predicted_Y_train = RFC.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
RFC_scores = cross_val_score(RFC, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(RFC_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
RFC_train_Y_pred = cross_val_predict(RFC, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
RFC_precision= precision_score(Y_train, RFC_train_Y_pred)
RFC_recall= recall_score(Y_train, RFC_train_Y_pred)
print('Precision is:', RFC_precision, 'and Recall is:', RFC_recall)
from sklearn.linear_model import SGDClassifier
SGD = SGDClassifier(random_state = 42)
SGD.fit(X_train, Y_train)
predicted_Y_train = SGD.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
SGD_scores = cross_val_score(SGD, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(SGD_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
SGD_train_Y_pred = cross_val_predict(SGD, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
SGD_precision= precision_score(Y_train, SGD_train_Y_pred)
SGD_recall= recall_score(Y_train, SGD_train_Y_pred)
print('Precision is:', SGD_precision, 'and Recall is:', SGD_recall)
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(random_state = 42)
from warnings import simplefilter #every iteration with logreg output an annoying future warning
simplefilter(action = 'ignore', category = FutureWarning)
LR.fit(X_train, Y_train)
predicted_Y_train = LR.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
LR_scores = cross_val_score(LR, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(LR_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
LR_train_Y_pred = cross_val_predict(LR, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
LR_precision= precision_score(Y_train, LR_train_Y_pred)
LR_recall= recall_score(Y_train, LR_train_Y_pred)
print('Precision is:', LR_precision, 'and Recall is:', LR_recall)
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing #eventually to scale X
KN = KNeighborsClassifier()
KN.fit(X_train, Y_train)
predicted_Y_train = KN.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
KN_scores = cross_val_score(KN, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(KN_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
KN_train_Y_pred = cross_val_predict(KN, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
KN_precision= precision_score(Y_train, KN_train_Y_pred)
KN_recall= recall_score(Y_train, KN_train_Y_pred)
print('Precision is:', KN_precision, 'and Recall is:', KN_recall)
from sklearn import tree
DT = tree.DecisionTreeClassifier()
DT.fit(X_train, Y_train)
predicted_Y_train = DT.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
DT_scores = cross_val_score(DT, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(DT_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
DT_train_Y_pred = cross_val_predict(DT, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
DT_precision= precision_score(Y_train, DT_train_Y_pred)
DT_recall= recall_score(Y_train, DT_train_Y_pred)
print('Precision is:', DT_precision, 'and Recall is:', DT_recall)
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB(binarize=1.0) #binarize each feature at the 1.0 threshold
BNB.fit(X_train, Y_train)
predicted_Y_train = BNB.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
BNB_scores = cross_val_score(BNB, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(BNB_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
BNB_train_Y_pred = cross_val_predict(BNB, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
BNB_precision= precision_score(Y_train, BNB_train_Y_pred)
BNB_recall= recall_score(Y_train, BNB_train_Y_pred)
print('Precision is:', BNB_precision, 'and Recall is:', BNB_recall)
from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB()
GNB.fit(X_train, Y_train)
predicted_Y_train = GNB.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
GNB_scores = cross_val_score(GNB, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(GNB_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
GNB_train_Y_pred = cross_val_predict(GNB, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
GNB_precision= precision_score(Y_train, GNB_train_Y_pred)
GNB_recall= recall_score(Y_train, GNB_train_Y_pred)
print('Precision is:', GNB_precision, 'and Recall is:', GNB_recall)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
LDA = LinearDiscriminantAnalysis()
LDA.fit(X_train, Y_train)
predicted_Y_train = LDA.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
LDA_scores = cross_val_score(LDA, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(LDA_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
LDA_train_Y_pred = cross_val_predict(LDA, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
LDA_precision= precision_score(Y_train, LDA_train_Y_pred)
LDA_recall= recall_score(Y_train, LDA_train_Y_pred)
print('Precision is:', LDA_precision, 'and Recall is:', LDA_recall)
#We create a matrix to compare results
MCA = {'accuracy': [np.mean(RFC_scores), np.mean(SGD_scores), np.mean(LR_scores), np.mean(KN_scores), np.mean(DT_scores), np.mean(BNB_scores), np.mean(GNB_scores), np.mean(LDA_scores)],
       'precision': [RFC_precision, SGD_precision, LR_precision, KN_precision, DT_precision, BNB_precision, GNB_precision, LDA_precision],
       'Index': ['RFC','SGD','LR','KN','DT','BNB','GNB','LDA']
      }
accuracy_matrix = pd.DataFrame(MCA)
accuracy_matrix = accuracy_matrix.set_index('Index')
print(accuracy_matrix.sort_values('accuracy', ascending = False))
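The eight near-identical evaluation blocks above could also be written as one loop over a dict of models, which makes it harder to misalign the result lists. A sketch on synthetic data (not the Titanic features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import precision_score

# Synthetic stand-in for X_train / Y_train
X, y = make_classification(n_samples=200, random_state=42)

models = {"LR": LogisticRegression(max_iter=1000),
          "DT": DecisionTreeClassifier(random_state=42)}

rows = []
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    pred = cross_val_predict(model, X, y, cv=3)
    rows.append({"Index": name, "accuracy": acc, "precision": precision_score(y, pred)})

matrix = pd.DataFrame(rows).set_index("Index")
print(matrix.sort_values("accuracy", ascending=False))
```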
Here, we decide to keep the top 3: RFC, LR and LDA, and we will now try to find the best parameters for these models!
#Hypertuning using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
param_grid = {
    'n_estimators': [150, 200, 400, 700, 1000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'criterion': ['entropy', 'gini'],
    'max_depth': [5, 50, 150],
    'min_samples_leaf': [2, 10, 50]}
random_search = RandomizedSearchCV(estimator=RFC, param_distributions=param_grid, cv=3, n_jobs=-1)
random_result = random_search.fit(X_train, Y_train)
# Summarize results
print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))
RFC = RandomForestClassifier(**random_result.best_params_)
RFC.fit(X_train, Y_train)
from sklearn.model_selection import cross_val_score
RFC_scores_tuned = cross_val_score(RFC, X_train, Y_train, cv=15, scoring ='accuracy')
display_scores(RFC_scores_tuned)
print('\n')
from sklearn.model_selection import cross_val_predict #returns the predictions
RFC_train_Y_pred_tuned = cross_val_predict(RFC, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
RFC_precision_tuned = precision_score(Y_train, RFC_train_Y_pred_tuned)
RFC_recall_tuned = recall_score(Y_train, RFC_train_Y_pred_tuned)
print('Precision is:', RFC_precision_tuned, 'and Recall is:', RFC_recall_tuned)
from sklearn.model_selection import GridSearchCV
LogReg = LogisticRegression()
param_grid = [{'penalty' : ['l2', 'none'], 'solver' : ['newton-cg', 'lbfgs', 'sag', 'saga']}]
grid_search = GridSearchCV(LogReg, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, Y_train)
print(grid_search.best_params_)
LR = LogisticRegression(**grid_search.best_params_)
LR.fit(X_train, Y_train)
from sklearn.model_selection import cross_val_score
LR_scores_tuned = cross_val_score(LR, X_train, Y_train, cv=10, scoring ='accuracy')
display_scores(LR_scores_tuned)
print('\n')
from sklearn.model_selection import cross_val_predict #returns the predictions
LR_train_Y_pred_tuned = cross_val_predict(LR, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
LR_precision_tuned = precision_score(Y_train, LR_train_Y_pred_tuned)
LR_recall_tuned = recall_score(Y_train, LR_train_Y_pred_tuned)
print('Precision is:', LR_precision_tuned, 'and Recall is:', LR_recall_tuned)
LDA = LinearDiscriminantAnalysis()
param_grid = [{'solver': ['svd', 'lsqr'], 'n_components': [None, 1]}] #with 2 classes, LDA allows at most n_classes - 1 = 1 component
grid_search = GridSearchCV(LDA, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, Y_train)
print(grid_search.best_params_)
LDA = LinearDiscriminantAnalysis(**grid_search.best_params_)
LDA.fit(X_train, Y_train)
from sklearn.model_selection import cross_val_score
LDA_scores_tuned = cross_val_score(LDA, X_train, Y_train, cv=10, scoring ='accuracy')
display_scores(LDA_scores_tuned)
print('\n')
from sklearn.model_selection import cross_val_predict #returns the predictions
LDA_train_Y_pred_tuned = cross_val_predict(LDA, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
LDA_precision_tuned = precision_score(Y_train, LDA_train_Y_pred_tuned)
LDA_recall_tuned = recall_score(Y_train, LDA_train_Y_pred_tuned)
print('Precision is:', LDA_precision_tuned, 'and Recall is:', LDA_recall_tuned)
#We create a matrix to compare results
MCA2 = {'accuracy': [np.mean(RFC_scores_tuned), np.mean(LR_scores_tuned), np.mean(LDA_scores_tuned)],
        'precision': [RFC_precision_tuned, LR_precision_tuned, LDA_precision_tuned],
        'Index': ['RFC','LR','LDA']
       }
accuracy_matrix = pd.DataFrame(MCA2)
accuracy_matrix = accuracy_matrix.set_index('Index')
print(accuracy_matrix.sort_values('accuracy', ascending = False))
Here we can see that RFC is our best model so far.
These results are not perfect; however, they are the best I've got, even after reworking my features and hypertuning. Besides, this is my first machine learning project, so I'm going to stick with these results.
X_test = test_set.drop('Survived', axis=1)
Y_test = test_set['Survived']
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
#Applying RFC to test_set
RFC = RandomForestClassifier(**random_result.best_params_)
RFC.fit(X_train, Y_train)
Y_test_pred = RFC.predict(X_test)
accuracy_test = accuracy_score(Y_test, Y_test_pred)
precision_test = precision_score(Y_test, Y_test_pred)
print('Accuracy :', accuracy_test)
print('Precision :', precision_test)
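Precision and recall, used throughout this notebook, can also be read off a confusion matrix. A tiny self-contained example on toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: rows of the matrix are actual classes, columns are predictions
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[TN FP], [FN TP]] -> [[2 0], [1 2]]
```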
Our results are similar on both the train and test sets. We can therefore move on to prediction.
RFC = RandomForestClassifier(**random_result.best_params_)
RFC.fit(X_train, Y_train)
prediction = RFC.predict(topredict_set)
topredict_set = topredict_set.copy() #work on a copy to avoid SettingWithCopyWarning
topredict_set['prediction'] = prediction
topredict_set
#Changing column type
topredict_set['prediction'] = topredict_set['prediction'].astype(int)
topredict_set
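If the goal is a Kaggle-style submission, the integer predictions can be written to a CSV with the two expected columns. A sketch on a toy frame (the PassengerIds and the column holding the predictions are hypothetical here):

```python
import pandas as pd

# Toy stand-in for the frame of predictions
demo = pd.DataFrame({"PassengerId": [892, 893], "pred": [0, 1]})

# Kaggle expects exactly two columns: PassengerId and Survived
submission = demo.rename(columns={"pred": "Survived"})[["PassengerId", "Survived"]]
submission.to_csv("submission.csv", index=False)
print(submission.to_csv(index=False))
```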
This is the end!
Thank you for checking my work!