This project is my take on machine learning applied to the famous Titanic case.
Here, we have two datasets (train and test) that contain information on passengers.
The train set contains one extra piece of information: whether the passenger survived. The goal is to find the patterns among survivors in order to predict whether a passenger from the test set would survive.
This is a machine learning project using supervised learning, and more specifically classification.
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt
#Get the data
train_set = pd.read_csv("train.csv")
test_set = pd.read_csv("test.csv")
# Let's stack the datasets together to apply modifications to both at once
stacked = pd.concat([train_set, test_set], ignore_index=True, sort=False)
stacked
First impression: Sex and Pclass seem to be determining factors for survival on board.
stacked[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
stacked[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
This shows that passengers in the upper classes (in terms of ticket class) and women were more likely to survive.
This analysis helps us select both of these features as important ones.
And now, we can transform Sex into a numerical value!
#Sex
#Conditional function definition:
def s(row):
    if row['Sex'] == 'male':
        val = 1
    else:
        val = 0
    return val
#Modification of column 'Sex'
stacked['Sex'] = stacked.apply(s, axis=1)
stacked
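As a side note, the same encoding can be obtained without `apply`, using the vectorized `Series.map`. A minimal sketch on a toy frame (not the real data):

```python
import pandas as pd

# Toy frame standing in for `stacked`
demo = pd.DataFrame({"Sex": ["male", "female", "male"]})

# Vectorized mapping: usually faster than DataFrame.apply on large frames
demo["Sex"] = demo["Sex"].map({"male": 1, "female": 0})
print(demo["Sex"].tolist())  # [1, 0, 1]
```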
However, we can remove the names: indeed, your name by itself does not impact whether you survive or not.
What can make a difference, however, is your title: it may impact your survival in case of evacuation (because of hierarchy). Let's split the titles out of the names.
#Get the title out of the name
split1 = stacked['Name'].str.split('.').str[0]
title = split1.str.split(' ').str[1]
title
#Add the column to the dataset
stacked['Title'] = title
stacked = stacked.drop(['Name'], axis=1)
stacked
Let's change it to numerical values:
stacked['Title'].nunique()
There are 34 different titles
#we create a second dataframe with title list and we attribute one number (index) to each title
t = stacked['Title'].unique()
df = pd.DataFrame(t, columns=['Title'])
df['index'] = df.index
df.head()
#merge df with stacked
stacked = pd.merge(stacked, df[['Title', 'index']], on='Title', how='left')
#delete title and rename index
stacked = stacked.drop('Title', axis=1)
stacked = stacked.rename(columns={'index':'Title'})
stacked
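For reference, `pd.factorize` performs this unique-value-to-integer mapping in a single call, which could replace the helper dataframe and the merge. A sketch on hypothetical titles:

```python
import pandas as pd

# Hypothetical Title column
demo = pd.DataFrame({"Title": ["Mr", "Mrs", "Mr", "Miss"]})

# factorize assigns one integer per distinct value, in order of appearance
codes, uniques = pd.factorize(demo["Title"])
demo["Title"] = codes
print(demo["Title"].tolist())  # [0, 1, 0, 2]
```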
If we check our data we will see that some columns have missing values:
#Check for missing values
print(stacked.isnull().sum().sort_values(ascending=False))
We can see that 'Age', 'Cabin', 'Embarked', 'Fare' and 'Survived' have missing values.
AGE: with almost 300 missing values, it is worth filling them with a value such as the average or the median, which won't distort our analysis.
CABIN: there are too many missing values in Cabin. Therefore, we will just drop the column.
EMBARKED and FARE: they are missing only two and one values respectively; we will use the most common value to fill them.
SURVIVED: these missing values correspond to the test dataset. Indeed, since we are looking to predict the survival of these passengers, they don't have any value yet.
#Age description
stacked['Age'].describe()
AGE: here, we can see that the average is almost 30 and the median is 28. Since we have extreme values (80) and a big gap between the third quartile and the max value, we choose the median to fill the missing rows.
#Replace NaN in Age by the median
median = stacked['Age'].median()
stacked['Age'] = stacked['Age'].fillna(median) #avoid relying on inplace, which returns None
Here, we will use a bit of data visualization to understand the impact of age and sex on survival.
#Visualization with seaborn
import seaborn as sns
grid = sns.FacetGrid(stacked, col='Survived', row='Sex', height=2.2, aspect=1.6) #seaborn >= 0.9 uses height instead of size
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();
As we can see, age can be grouped into ranges. Let's create those ranges.
#Range of age
stacked['AgeRange'] = pd.cut(stacked['Age'], 5)
ar = stacked[['AgeRange', 'Survived']].groupby(['AgeRange'], as_index=False).mean().sort_values(by='AgeRange', ascending=True)
ar
# transform range into numerical values
ar['index']= ar.index
stacked = pd.merge(stacked, ar[['AgeRange', 'index']], on='AgeRange', how='left')
stacked = stacked.drop('AgeRange', axis=1)
stacked = stacked.drop('Age', axis=1)
stacked = stacked.rename(columns={'index':'Age'})
stacked.head()
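Incidentally, `pd.cut` can return the bin index directly with `labels=False`, which would skip the groupby/merge detour. A toy sketch:

```python
import pandas as pd

# Hypothetical ages spanning the five bins
ages = pd.Series([2, 25, 40, 60, 79])

# labels=False yields the integer bin index instead of an interval
bins = pd.cut(ages, 5, labels=False)
print(bins.tolist())  # [0, 1, 2, 3, 4]
```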
#Drop Cabin
stacked = stacked.drop('Cabin', axis = 1)
Now, we can do the same with Embarked
#Extract most common value
d = stacked.groupby('Embarked')['PassengerId'].count().idxmax() #idxmax returns the most frequent value directly
d
S is the most common value
# Fillna with the most common value of Embarked
stacked['Embarked'] = stacked['Embarked'].fillna(d)
#Change Embarked into numerical values
e = stacked['Embarked'].unique()
dfa = pd.DataFrame(e, columns=['Embarked'])
dfa['index'] = dfa.index
stacked = pd.merge(stacked, dfa[['Embarked', 'index']], on='Embarked', how='left')
stacked = stacked.drop('Embarked', axis=1)
stacked = stacked.rename(columns={'index':'Embarked'})
stacked.head(5)
#Idem for Fare
most_common_val = stacked.groupby('Fare')['PassengerId'].count().idxmax()
stacked['Fare'] = stacked['Fare'].fillna(most_common_val)
Ticket has 929 different values. We can therefore say that this feature is going to be useless: too many distinct types compared to the number of passengers.
stacked['Ticket'].nunique()
#Drop Ticket
stacked = stacked.drop('Ticket', axis=1)
Now, we are going to flag whether a person is alone or not
# Intermediate column: Family_count
stacked['Family_count'] = stacked['SibSp']+stacked['Parch']
stacked['Family_count'].head()
#Conditional function definition:
def f(row):
    if row['Family_count'] < 1:
        val = 1
    else:
        val = 0
    return val
#Creation of column 'IsAlone'
stacked['IsAlone'] = stacked.apply(f, axis=1)
stacked[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
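The same flag can also be computed without `apply`, with a single vectorized comparison. A sketch on toy data:

```python
import pandas as pd

demo = pd.DataFrame({"SibSp": [0, 1, 0], "Parch": [0, 0, 2]})

# A passenger is alone when they have no siblings/spouse and no parents/children aboard
demo["IsAlone"] = ((demo["SibSp"] + demo["Parch"]) == 0).astype(int)
print(demo["IsAlone"].tolist())  # [1, 0, 0]
```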
This information is relevant; we can now drop the following columns: SibSp, Family_count and Parch.
#Drop SibSp, Family_count and Parch
stacked = stacked.drop('Parch', axis = 1)
stacked = stacked.drop('SibSp', axis = 1)
stacked = stacked.drop('Family_count', axis = 1)
stacked.head()
Now that our Data is ready, we can re-split it into train and test set :) !
#historical is the share of the data that contains the 'Survived' column
historical = stacked.loc[stacked['Survived'].notnull()]
historical
#Does not contain 'Survived'
topredict_set = stacked.loc[stacked['Survived'].isnull()]
topredict_set = topredict_set.drop('Survived', axis = 1)
topredict_set
Now, let's create a test set for cross validation.
#create the test set:
def split_train_test(data, test_ratio):
np.random.seed(42) #always generate the same random data
shuffled_indices = np.random.permutation(len(data))
test_set_size = int(len(data) * test_ratio)
test_indices = shuffled_indices[:test_set_size]
train_indices = shuffled_indices[test_set_size:]
return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(historical, 0.20)
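For reference, scikit-learn ships an equivalent helper, `train_test_split`, which does the same shuffle-and-split with a reproducible seed. A sketch on a toy frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `historical`
demo = pd.DataFrame({"x": range(10), "Survived": [0, 1] * 5})

# random_state plays the role of np.random.seed(42) above
tr, te = train_test_split(demo, test_size=0.20, random_state=42)
print(len(tr), len(te))  # 8 2
```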
#Defining X_train and Y_train
X_train = train_set.drop('Survived', axis=1)
Y_train = train_set['Survived']
#we define display_scores to display different scores that will allow us to select the best model
def display_scores(scores):
    print('Scores:', scores)
    print('Mean:', scores.mean())
    print('Standard deviation:', scores.std())
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators=1000)
RFC.fit(X_train, Y_train)
predicted_Y_train = RFC.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
RFC_scores = cross_val_score(RFC, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(RFC_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
RFC_train_Y_pred = cross_val_predict(RFC, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
RFC_precision= precision_score(Y_train, RFC_train_Y_pred)
RFC_recall= recall_score(Y_train, RFC_train_Y_pred)
print('Precision is:', RFC_precision, 'and Recall is:', RFC_recall)
from sklearn.linear_model import SGDClassifier
SGD = SGDClassifier(random_state = 42)
SGD.fit(X_train, Y_train)
predicted_Y_train = SGD.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
SGD_scores = cross_val_score(SGD, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(SGD_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
SGD_train_Y_pred = cross_val_predict(SGD, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
SGD_precision= precision_score(Y_train, SGD_train_Y_pred)
SGD_recall= recall_score(Y_train, SGD_train_Y_pred)
print('Precision is:', SGD_precision, 'and Recall is:', SGD_recall)
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(random_state = 42)
from warnings import simplefilter #every iteration with logreg output an annoying future warning
simplefilter(action = 'ignore', category = FutureWarning)
LR.fit(X_train, Y_train)
predicted_Y_train = LR.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
LR_scores = cross_val_score(LR, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(LR_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
LR_train_Y_pred = cross_val_predict(LR, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
LR_precision= precision_score(Y_train, LR_train_Y_pred)
LR_recall= recall_score(Y_train, LR_train_Y_pred)
print('Precision is:', LR_precision, 'and Recall is:', LR_recall)
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing #eventually to scale X
KN = KNeighborsClassifier()
KN.fit(X_train, Y_train)
predicted_Y_train = KN.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
KN_scores = cross_val_score(KN, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(KN_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
KN_train_Y_pred = cross_val_predict(KN, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
KN_precision= precision_score(Y_train, KN_train_Y_pred)
KN_recall= recall_score(Y_train, KN_train_Y_pred)
print('Precision is:', KN_precision, 'and Recall is:', KN_recall)
from sklearn import tree
DT = tree.DecisionTreeClassifier()
DT.fit(X_train, Y_train)
predicted_Y_train = DT.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
DT_scores = cross_val_score(DT, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(DT_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
DT_train_Y_pred = cross_val_predict(DT, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
DT_precision= precision_score(Y_train, DT_train_Y_pred)
DT_recall= recall_score(Y_train, DT_train_Y_pred)
print('Precision is:', DT_precision, 'and Recall is:', DT_recall)
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB(binarize=1.0) #binarize each feature at the 1.0 threshold
BNB.fit(X_train, Y_train)
predicted_Y_train = BNB.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
BNB_scores = cross_val_score(BNB, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(BNB_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
BNB_train_Y_pred = cross_val_predict(BNB, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
BNB_precision= precision_score(Y_train, BNB_train_Y_pred)
BNB_recall= recall_score(Y_train, BNB_train_Y_pred)
print('Precision is:', BNB_precision, 'and Recall is:', BNB_recall)
from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB()
GNB.fit(X_train, Y_train)
predicted_Y_train = GNB.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
GNB_scores = cross_val_score(GNB, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(GNB_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
GNB_train_Y_pred = cross_val_predict(GNB, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
GNB_precision= precision_score(Y_train, GNB_train_Y_pred)
GNB_recall= recall_score(Y_train, GNB_train_Y_pred)
print('Precision is:', GNB_precision, 'and Recall is:', GNB_recall)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
LDA = LinearDiscriminantAnalysis()
LDA.fit(X_train, Y_train)
predicted_Y_train = LDA.predict(X_train)
from sklearn.model_selection import cross_val_score #returns the validation scores
LDA_scores = cross_val_score(LDA, X_train, Y_train, cv=10, scoring='accuracy')
display_scores(LDA_scores)
print("\n")
from sklearn.model_selection import cross_val_predict #returns the predictions
LDA_train_Y_pred = cross_val_predict(LDA, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
LDA_precision= precision_score(Y_train, LDA_train_Y_pred)
LDA_recall= recall_score(Y_train, LDA_train_Y_pred)
print('Precision is:', LDA_precision, 'and Recall is:', LDA_recall)
#We create a matrix to compare results
MCA = {'accuracy': [np.mean(RFC_scores), np.mean(SGD_scores), np.mean(LR_scores), np.mean(KN_scores), np.mean(DT_scores), np.mean(BNB_scores), np.mean(GNB_scores), np.mean(LDA_scores)],
       'precision': [RFC_precision, SGD_precision, LR_precision, KN_precision, DT_precision, BNB_precision, GNB_precision, LDA_precision],
       'Index': ['RFC','SGD','LR','KN','DT','BNB','GNB','LDA']
      }
accuracy_matrix = pd.DataFrame(MCA)
accuracy_matrix = accuracy_matrix.set_index('Index')
print(accuracy_matrix.sort_values('accuracy', ascending = False))
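The eight near-identical evaluation blocks above could also be written as one loop over a dict of models, which makes it harder to misalign the result lists. A sketch on synthetic data (not the Titanic features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import precision_score

# Synthetic stand-in for X_train / Y_train
X, y = make_classification(n_samples=200, random_state=42)

models = {"LR": LogisticRegression(max_iter=1000),
          "DT": DecisionTreeClassifier(random_state=42)}

rows = []
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    pred = cross_val_predict(model, X, y, cv=3)
    rows.append({"Index": name, "accuracy": acc, "precision": precision_score(y, pred)})

matrix = pd.DataFrame(rows).set_index("Index")
print(matrix.sort_values("accuracy", ascending=False))
```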
Here, we decide to keep the top 3: RFC, LR and LDA, and we will now try to find the best parameters for these models!
#Hypertuning using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
param_grid = {
    'n_estimators': [150, 200, 400, 700, 1000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'criterion': ['entropy', 'gini'],
    'max_depth': [5, 50, 150],
    'min_samples_leaf': [2, 10, 50]}
random_search = RandomizedSearchCV(estimator=RFC, param_distributions=param_grid, cv=3, n_jobs=-1)
random_result = random_search.fit(X_train, Y_train)
# Summarize results
print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))
RFC = RandomForestClassifier(**random_result.best_params_)
RFC.fit(X_train, Y_train)
from sklearn.model_selection import cross_val_score
RFC_scores_tuned = cross_val_score(RFC, X_train, Y_train, cv=15, scoring ='accuracy')
display_scores(RFC_scores_tuned)
print('\n')
from sklearn.model_selection import cross_val_predict #returns the predictions
RFC_train_Y_pred_tuned = cross_val_predict(RFC, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
RFC_precision_tuned = precision_score(Y_train, RFC_train_Y_pred_tuned)
RFC_recall_tuned = recall_score(Y_train, RFC_train_Y_pred_tuned)
print('Precision is:', RFC_precision_tuned, 'and Recall is:', RFC_recall_tuned)
from sklearn.model_selection import GridSearchCV
LogReg = LogisticRegression()
param_grid = [{'penalty' : ['l2', 'none'], 'solver' : ['newton-cg', 'lbfgs', 'sag', 'saga']}]
grid_search = GridSearchCV(LogReg, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, Y_train)
print(grid_search.best_params_)
LR = LogisticRegression(**grid_search.best_params_)
LR.fit(X_train, Y_train)
from sklearn.model_selection import cross_val_score
LR_scores_tuned = cross_val_score(LR, X_train, Y_train, cv=10, scoring ='accuracy')
display_scores(LR_scores_tuned)
print('\n')
from sklearn.model_selection import cross_val_predict #returns the predictions
LR_train_Y_pred_tuned = cross_val_predict(LR, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
LR_precision_tuned = precision_score(Y_train, LR_train_Y_pred_tuned)
LR_recall_tuned = recall_score(Y_train, LR_train_Y_pred_tuned)
print('Precision is:', LR_precision_tuned, 'and Recall is:', LR_recall_tuned)
LDA = LinearDiscriminantAnalysis()
param_grid = [{'solver': ['svd', 'lsqr'], 'n_components': [None, 1]}] #with 2 classes, LDA allows at most n_classes - 1 = 1 component
grid_search = GridSearchCV(LDA, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, Y_train)
print(grid_search.best_params_)
LDA = LinearDiscriminantAnalysis(**grid_search.best_params_)
LDA.fit(X_train, Y_train)
from sklearn.model_selection import cross_val_score
LDA_scores_tuned = cross_val_score(LDA, X_train, Y_train, cv=10, scoring ='accuracy')
display_scores(LDA_scores_tuned)
print('\n')
from sklearn.model_selection import cross_val_predict #returns the predictions
LDA_train_Y_pred_tuned = cross_val_predict(LDA, X_train, Y_train, cv=3)
from sklearn.metrics import precision_score, recall_score
LDA_precision_tuned = precision_score(Y_train, LDA_train_Y_pred_tuned)
LDA_recall_tuned = recall_score(Y_train, LDA_train_Y_pred_tuned)
print('Precision is:', LDA_precision_tuned, 'and Recall is:', LDA_recall_tuned)
#We create a matrix to compare results
MCA2 = {'accuracy': [np.mean(RFC_scores_tuned), np.mean(LR_scores_tuned), np.mean(LDA_scores_tuned)],
        'precision': [RFC_precision_tuned, LR_precision_tuned, LDA_precision_tuned],
        'Index': ['RFC','LR','LDA']
       }
accuracy_matrix = pd.DataFrame(MCA2)
accuracy_matrix = accuracy_matrix.set_index('Index')
print(accuracy_matrix.sort_values('accuracy', ascending = False))
Here we can see that RFC is our best model so far.
These results are not perfect; however, they are the best I've got, even after reworking my features and hypertuning. Besides, this is my first machine learning project, so I'm going to stick with these results.
X_test = test_set.drop('Survived', axis=1)
Y_test = test_set['Survived']
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
#Applying RFC to test_set
RFC = RandomForestClassifier(**random_result.best_params_)
RFC.fit(X_train, Y_train)
Y_test_pred = RFC.predict(X_test)
accuracy_test = accuracy_score(Y_test, Y_test_pred)
precision_test = precision_score(Y_test, Y_test_pred)
print('Accuracy :', accuracy_test)
print('Precision :', precision_test)
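Precision and recall, used throughout this notebook, can also be read off a confusion matrix. A tiny self-contained example on toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: rows of the matrix are actual classes, columns are predictions
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[TN FP], [FN TP]] -> [[2 0], [1 2]]
```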
Our results are similar on both the train and test sets. We can therefore move on to prediction.
RFC = RandomForestClassifier(**random_result.best_params_)
RFC.fit(X_train, Y_train)
prediction = RFC.predict(topredict_set)
topredict_set = topredict_set.copy() #work on a copy to avoid SettingWithCopyWarning
topredict_set['prediction'] = prediction
topredict_set
#Changing column type
topredict_set['prediction'] = topredict_set['prediction'].astype(int)
topredict_set
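If the goal is a Kaggle-style submission, the integer predictions can be written to a CSV with the two expected columns. A sketch on a toy frame (the PassengerIds and the column holding the predictions are hypothetical here):

```python
import pandas as pd

# Toy stand-in for the frame of predictions
demo = pd.DataFrame({"PassengerId": [892, 893], "pred": [0, 1]})

# Kaggle expects exactly two columns: PassengerId and Survived
submission = demo.rename(columns={"pred": "Survived"})[["PassengerId", "Survived"]]
submission.to_csv("submission.csv", index=False)
print(submission.to_csv(index=False))
```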
This is the end!
Thank you for checking my work!