Model Building and Experimentation: End-to-End Series (Part — 3)
In our previous article, we saw how data preprocessing is done.
Now that our data has been cleaned and preprocessed, it is ready to be fed to a machine learning model so that it can learn the underlying patterns and predict unseen data.
Even though model building is only a small part of the whole Data Science life cycle, it is still time-consuming, because experimentation takes a lot of iteration.
- Depending on the data type of the target variable (commonly referred to as the Y variable), which can be qualitative or quantitative, we will build either a classification model (if Y is qualitative) or a regression model (if Y is quantitative).
- Machine learning algorithms can be broadly categorised into two types (a minimal sketch contrasting them follows this list):
→ Supervised learning
* A machine learning task that establishes the mathematical relationship between the input X and output Y variables.
* Such X, Y pairs constitute the labelled data used for model building, in an effort to learn how to predict the output from the input.
→ Unsupervised learning
* A machine learning task that makes use of only the input X variables.
* Such X variables are unlabelled data that the learning algorithm uses to model the inherent structure of the data.
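To make the distinction concrete, here is a minimal sketch on a synthetic toy dataset (not the churn data used below): the supervised estimator is fitted on (X, y) pairs, while the unsupervised one sees only X.
# Toy illustration only; not part of the churn workflow that follows.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Supervised: the model learns the mapping from X_toy to the labels y_toy.
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
print(clf.predict(X_toy[:3]))   # predicted labels

# Unsupervised: the model sees only X_toy and finds structure (clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
print(km.labels_[:3])           # cluster assignments; no labels were used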
Also, check out our other articles in this series:
- Data Abstraction: End-to-End Series (Part — 1)
- Data Preprocessing and EDA: End-to-End Series (Part — 2)
- Creating a WebApp using Flask+Gunicorn on Ubuntu Server: End-to-End Series (Part — 4)
- Containerizing the WebApp using Docker: End-to-End Series (Part — 5)
- Scaling our Docker Container using Kubernetes: End-to-End Series (Part — 6)
- Automating building and deployment using Jenkins: End-to-End Series (Part — 7)
→ Importing Libraries
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix, classification_report
→ Reading data
data = pd.read_csv("/content/Cleaned_Churn.csv")
data.head()
Splitting the independent and dependent features:
X = data.drop(['Churn'], axis=1)
y = data['Churn']
→ Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.2)
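Churn targets are usually imbalanced, so it can help to keep the churn ratio identical in both splits and to fix the random seed for reproducibility. This is an optional variant, not the original code; stratify and random_state are assumptions added here.
# Optional variant (assumption): stratified, reproducible split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)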
→ Checking feature importance
model_rf = RandomForestClassifier()
model_rf.fit(X_train,y_train)
# Checking the feature importances of various features.
for score, name in sorted(zip(model_rf.feature_importances_,
                              X_train.columns), reverse=True):
    print('Feature importance of', name, ':', score*100, '%')
# Plotting the Feature Importance of each feature.
plt.figure(figsize=(12, 7))
plt.bar(X_train.columns, model_rf.feature_importances_*100, color='green')
plt.xlabel('Features', fontsize=14)
plt.ylabel('Importance', fontsize=14)
plt.xticks(rotation=90)
plt.title('Feature Importance of each Feature', fontsize=16)
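Impurity-based importances from a random forest can be biased toward high-cardinality features. As an optional cross-check (not part of the original workflow), scikit-learn's permutation_importance can be run on the held-out test split:
# Optional cross-check (assumption, not in the original workflow).
from sklearn.inspection import permutation_importance

perm = permutation_importance(model_rf, X_test, y_test,
                              n_repeats=10, random_state=0, n_jobs=-1)
# Features sorted by mean drop in score when shuffled.
for idx in perm.importances_mean.argsort()[::-1]:
    print(X_train.columns[idx], ':', round(perm.importances_mean[idx], 4))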
→ Model Building
- LogisticRegression
# Logistic Regression
log_model = LogisticRegression(max_iter=1000)
scores = cross_val_score(estimator=log_model, X=X_train, y=y_train, cv=5, scoring='roc_auc')
print(scores)
print("Mean", scores.mean())
- DecisionTree
# Decision Tree
decision_tree = DecisionTreeClassifier(max_depth = 9, random_state = 123,
splitter = "best", criterion = "gini")
scores = cross_val_score(estimator=decision_tree, X=X_train, y=y_train, cv=5, scoring='roc_auc')
print(scores)
print("Mean", scores.mean())
- RandomForest
# Random Forest
# max_features="sqrt" replaces the deprecated "auto" (equivalent for classifiers)
model_rf = RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1,
                                  random_state=50, max_features="sqrt",
                                  max_leaf_nodes=30)
scores = cross_val_score(estimator=model_rf, X=X_train, y=y_train, cv=5, scoring='roc_auc')
print(scores)
print("Mean", scores.mean())
→ Trying Hyperparameter Tuning
param_grid = [{'n_estimators': [100, 200, 300],
               'max_depth': [None, 2, 3, 10, 20],
               'max_features': ['sqrt', 2, 4, 8, 16, 'log2', None]}]

temp_rf = RandomForestClassifier(random_state=0, n_jobs=-1)

grid_search = GridSearchCV(estimator=temp_rf, param_grid=param_grid,
                           scoring='roc_auc', cv=5, n_jobs=-1)

%%time
grid_search.fit(X_train, y_train)
# Best cross-validated ROC AUC score found by the grid search
grid_search.best_score_
grid_search.best_params_
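Beyond the single best score, the whole grid can be inspected through cv_results_, which is handy for judging how sensitive the model is to each hyperparameter. A short optional sketch:
# Optional: top parameter combinations ranked by mean cross-validated ROC AUC.
cv_results = pd.DataFrame(grid_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(cv_results.sort_values('rank_test_score')[cols].head())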
→ Finalizing the best model
# Creating the final random forest model from the grid search
final_rf = grid_search.best_estimator_

# Fitting the final model with the training set
final_rf.fit(X_train, y_train)
# Making predictions on the train set
y_train_pred = final_rf.predict(X_train)

# Making predictions on the test set
y_test_pred = final_rf.predict(X_test)
→ Saving the model
with open("randomforest.pkl", "wb") as f:
    pickle.dump(final_rf, f)
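To confirm the saved artifact round-trips correctly, the pickle can be loaded back for a quick sanity-check prediction:
# Quick sanity check: reload the saved model and predict on a few test rows.
with open("randomforest.pkl", "rb") as f:
    loaded_rf = pickle.load(f)
print(loaded_rf.predict(X_test[:5]))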
→ Model Evaluation
- Confusion Matrix
confusion_mat = pd.DataFrame(confusion_matrix(y_test, y_test_pred))
confusion_mat.index = ['Actual Negative', 'Actual Positive']
confusion_mat.columns = ['Predicted Negative', 'Predicted Positive']
confusion_mat
- Accuracy Score
# Accuracy score on the training set.
print('Accuracy score for train data is:', accuracy_score(y_train,
y_train_pred))
# Accuracy score on the test set.
print('Accuracy score for test data is:', accuracy_score(y_test,
y_test_pred))
- Precision Score
# Precision score on the training set.
print(precision_score(y_train, y_train_pred))
# Precision score on the test set.
print(precision_score(y_test, y_test_pred))
- Recall Score
# Recall score on the training set.
print(recall_score(y_train, y_train_pred))
# Recall score on the test set.
print(recall_score(y_test, y_test_pred))
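classification_report, imported above but not yet used, summarises precision, recall, and F1 per class in one call. Since cross-validation was scored with ROC AUC, it is also worth reporting that on the test set; roc_auc_score is an extra import, and the sketch below assumes the Churn target is encoded as 0/1.
# Per-class precision/recall/F1, plus test-set ROC AUC.
# roc_auc_score is an additional import not in the original list;
# assumes a binary 0/1 target.
from sklearn.metrics import roc_auc_score

print(classification_report(y_test, y_test_pred))
print('Test ROC AUC:', roc_auc_score(y_test, final_rf.predict_proba(X_test)[:, 1]))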
Follow us for more upcoming articles on Data Science, Machine Learning, and Artificial Intelligence.
Also, do give us a Clap👏 if you found this article useful; your encouragement inspires us to create more cool stuff like this.