Classification in PyCaret!!

Accredian Publication
8 min read · Sep 25, 2020


What is Classification?

A classification problem is one in which the output variable is a discrete category, such as “red” or “blue”, or “spam” or “not spam”. A classification model learns from observed values and, given one or more inputs, tries to predict the category of one or more outcomes.
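As a toy illustration of mapping inputs to a discrete label, here is a one-nearest-neighbour classifier in plain Python. The feature values, the labels, and the function name predict_1nn are all made up for this example and do not come from the dataset used later in the article.

```python
import math

# Hypothetical training data: (number of links, number of exclamation marks)
train_X = [(0, 0), (1, 0), (7, 5), (9, 8)]
train_y = ["not spam", "not spam", "spam", "spam"]

def predict_1nn(x, X, y):
    """Return the label of the training point closest to x (Euclidean distance)."""
    distances = [math.dist(x, xi) for xi in X]
    nearest = distances.index(min(distances))
    return y[nearest]

print(predict_1nn((8, 6), train_X, train_y))  # spam
print(predict_1nn((1, 1), train_X, train_y))  # not spam
```

The output is always one of the labels seen during training, which is what makes this a classification task rather than a regression task.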

There are many classification algorithms, including logistic regression, decision trees, random forests, gradient-boosted trees, multilayer perceptrons, and Naive Bayes. Strategies such as one-vs-rest extend binary classifiers to problems with more than two classes.

Also, check out our articles on:

Complete Guide to PyCaret
Regression in PyCaret
Anomaly Detection using PyCaret
Clustering using PyCaret

Classification using Scikit-learn

→ Importing the necessary Libraries.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,f1_score,roc_auc_score,recall_score,precision_score,matthews_corrcoef,cohen_kappa_score
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

→ Loading the data and checking for missing values

data = pd.read_csv("/contents/diabetes.csv")
data.isnull().sum()

→ Then we need to separate the data into independent and dependent features and split it into train and test sets.

x = data.drop('Class variable',axis = 1)
y = data['Class variable']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 123)
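To show what a 70/30 split with a fixed seed amounts to, here is a plain-Python sketch. It is not scikit-learn's implementation: simple_train_test_split is a hypothetical helper, and the same seed will not pick the same rows as train_test_split.

```python
import random

def simple_train_test_split(rows, test_size=0.3, seed=123):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))          # stand-in for 100 data rows
train, test = simple_train_test_split(rows)
print(len(train), len(test))     # 70 30
```

Fixing the seed makes the split reproducible, so every model is trained and evaluated on exactly the same rows.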

→ Then we start Building Models

Log_model = LogisticRegression().fit(x_train,y_train)
tree_model = DecisionTreeClassifier().fit(x_train,y_train)
random_model = RandomForestClassifier().fit(x_train,y_train)
extra_model = ExtraTreesClassifier().fit(x_train,y_train)
cat_model = CatBoostClassifier().fit(x_train,y_train)
xgb_model = XGBClassifier().fit(x_train,y_train)
KNN_model = KNeighborsClassifier().fit(x_train,y_train)

→ Making a function to evaluate the model

def evaluate_classification_model(model,x_test,y_test):
    # Predict hard class labels for the test set
    pred = model.predict(x_test)
    print("Accuracy Score : ",accuracy_score(y_test,pred))
    # Note: AUC is computed here from hard labels, not predicted probabilities
    print("Auc score : ",roc_auc_score(y_test,pred))
    print("Recall Score : ",recall_score(y_test,pred))
    print("Precision Score : ",precision_score(y_test,pred))
    print("F1 Score : ",f1_score(y_test,pred))
    print("Kappa Score : ",cohen_kappa_score(y_test,pred))
    print("MCC Score : ",matthews_corrcoef(y_test,pred))

We then evaluate each of the following models with this function:

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • Extra Trees
  • XGBoost
  • CatBoost
  • K-Nearest Neighbors
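For reference, most of the metrics printed by the evaluation function can be derived from the four cells of a binary confusion matrix. The sketch below is a plain-Python illustration, not scikit-learn's implementation. The name binary_metrics and the sample labels are made up for the example, and it assumes both classes occur in the true and the predicted labels, so there are no zero-division guards. AUC is left out because it needs predicted scores rather than hard labels.

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute common binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa, "mcc": mcc}

# Made-up labels: 3 true positives, 1 false negative, 1 false positive, 5 true negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, round(value, 4))
```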

Put together in a single frame, this comes to about 30 lines, even though no hyperparameter tuning or outlier handling has been done.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,f1_score,roc_auc_score,recall_score,precision_score,matthews_corrcoef,cohen_kappa_score
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
data = pd.read_csv("/contents/diabetes.csv")
data.isnull().sum()
x = data.drop('Class variable',axis = 1)
y = data['Class variable']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 123)
Log_model = LogisticRegression().fit(x_train,y_train)
tree_model = DecisionTreeClassifier().fit(x_train,y_train)
random_model = RandomForestClassifier().fit(x_train,y_train)
extra_model = ExtraTreesClassifier().fit(x_train,y_train)
cat_model = CatBoostClassifier().fit(x_train,y_train)
xgb_model = XGBClassifier().fit(x_train,y_train)
KNN_model = KNeighborsClassifier().fit(x_train,y_train)
def evaluate_classification_model(model,x_test,y_test):
    pred = model.predict(x_test)
    print("Accuracy Score : ",accuracy_score(y_test,pred))
    print("Auc score : ",roc_auc_score(y_test,pred))
    print("Recall Score : ",recall_score(y_test,pred))
    print("Precision Score : ",precision_score(y_test,pred))
    print("F1 Score : ",f1_score(y_test,pred))
    print("Kappa Score : ",cohen_kappa_score(y_test,pred))
    print("MCC Score : ",matthews_corrcoef(y_test,pred))

These roughly 30 lines of code can be brought down to about 10 lines using PyCaret.

Getting Started with Classification in PyCaret!!

If you are not familiar with PyCaret, I suggest you first go through the article below before moving on:

Complete Guide to PyCaret

→ Importing PyCaret and reading the data.

import pandas as pd
import pycaret
from pycaret.classification import *
data = pd.read_csv('/contents/diabetes.csv')

We will use the diabetes dataset.

Setting up the PyCaret environment

Before running any experiment in PyCaret, we need to set up the environment with the setup function.
This step is mandatory and must be done before any other machine learning step.

clf = setup(data = DataFrame_name, target = 'target_variable_name')

PyCaret also helps with model deployment: every step of the experiment is saved in a pipeline, and this pipeline can be deployed into production with ease.

When you run setup, PyCaret displays the data types it has inferred for each column. Press Enter to confirm them, and it then shows a summary of the experiment settings.

Compare models

This function compares every model available in PyCaret for the given problem type.
Each model is trained with its default hyperparameters, and its performance metrics are evaluated using cross-validation.

compare_models()

The output of the function is a table showing the average score of each model across the folds. The number of folds can be set with the fold parameter of compare_models; by default it is 10. The table is sorted from highest to lowest by the metric of choice, which can be set with the sort parameter. By default, the table is sorted by Accuracy for classification experiments and by R2 for regression experiments. Certain models are excluded from the comparison because of their longer run-time; to include them, set the turbo parameter to False.

To return the top n models, pass the n_select parameter to the compare_models function.

compare_models(n_select = n)

We can also sort the results by a metric of our choice.

compare_models(n_select = 3, sort = 'F1')
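To make the idea behind this comparison concrete, here is a plain-Python sketch: score each candidate with k-fold cross-validation, sort by the mean score, and keep the top n. This is not PyCaret's implementation. The two baseline "models", the toy data, and the helper names are all hypothetical, the folds are simple contiguous blocks without shuffling or stratification, and only accuracy is computed.

```python
from collections import Counter
from statistics import mean

def k_fold_indices(n_rows, k):
    """Split range(n_rows) into k contiguous, near-equal folds."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def majority_model(train_X, train_y):
    """Baseline: always predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label

def threshold_model(train_X, train_y):
    """Baseline: predict 1 when the single feature exceeds the training mean."""
    cutoff = mean(train_X)
    return lambda x: int(x > cutoff)

def cross_val_accuracy(fit, X, y, fold=10):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for test_idx in k_fold_indices(len(X), fold):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        predict = fit(train_X, train_y)
        hits = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)

def compare(models, X, y, fold=10, n_select=1):
    """Rank models by mean cross-validated accuracy and return the top n_select."""
    table = [(name, round(cross_val_accuracy(fit, X, y, fold), 4))
             for name, fit in models.items()]
    table.sort(key=lambda row: row[1], reverse=True)
    return table[:n_select]

# Toy data: one feature, label is 1 when the feature is 5 or more
X = [i % 10 for i in range(40)]
y = [int(x >= 5) for x in X]
models = {"majority": majority_model, "threshold": threshold_model}
print(compare(models, X, y, fold=10, n_select=2))
```

Because every candidate is scored on the same folds, the ranking reflects differences between the models rather than differences between the splits.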

Creating Models

Creating a model in PyCaret is one of the simplest tasks.

The “create_model” function takes in just the model ID as a string and performs the task.

create_model('model_ID')

After running this, we get a table of all the metrics, rounded to four decimal places, as output.
Classification Metrics: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC

Model ID for Classification Models.

+------------+---------------------------------+
| ID         | Name                            |
+------------+---------------------------------+
| 'lr'       | Logistic Regression             |
| 'knn'      | K Nearest Neighbour             |
| 'nb'       | Naive Bayes                     |
| 'dt'       | Decision Tree Classifier        |
| 'svm'      | SVM – Linear Kernel             |
| 'rbfsvm'   | SVM – Radial Kernel             |
| 'gpc'      | Gaussian Process Classifier     |
| 'mlp'      | Multi-Layer Perceptron          |
| 'ridge'    | Ridge Classifier                |
| 'rf'       | Random Forest Classifier        |
| 'qda'      | Quadratic Discriminant Analysis |
| 'ada'      | Ada Boost Classifier            |
| 'gbc'      | Gradient Boosting Classifier    |
| 'lda'      | Linear Discriminant Analysis    |
| 'et'       | Extra Trees Classifier          |
| 'xgboost'  | Extreme Gradient Boosting       |
| 'lightgbm' | Light Gradient Boosting         |
| 'catboost' | CatBoost Classifier             |
+------------+---------------------------------+

Tune Model

PyCaret provides a one-line function, tune_model, for hyperparameter tuning of any model in the library.

It tunes the hyperparameters of the model passed as an estimator using a random grid search over pre-defined grids that are fully customizable.

dt = create_model('dt')
tuned = tune_model(dt, n_iter = 50)
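The idea of a random grid search can be sketched in a few lines of plain Python: draw n_iter random combinations from a grid and keep the one with the best score. This is only an illustration under stated assumptions, not PyCaret's code. The grid values, the random_search helper, and the toy_score function are hypothetical; in a real run the score would be a cross-validated metric of a model trained with those parameters.

```python
import random

def random_search(score_fn, grid, n_iter=50, seed=0):
    """Sample n_iter random combinations from grid and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search grid for a decision tree
grid = {"max_depth": [2, 4, 6, 8, 10], "min_samples_leaf": [1, 2, 5, 10]}

def toy_score(params):
    """Stand-in for a cross-validated metric; highest at depth 6 and leaf size 2."""
    return -abs(params["max_depth"] - 6) - abs(params["min_samples_leaf"] - 2)

best_params, best_score = random_search(toy_score, grid, n_iter=50)
print(best_params, best_score)
```

A larger n_iter covers more of the grid at the cost of more training runs, which is the trade-off behind the n_iter argument above.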

Plot a Model

The plot_model function lets us check the performance of a model with different plots in one line of code.

model = create_model('Model_name')
plot_model(model)

By default, the function plots the AUC.

Other plots, such as the decision boundary, can be selected by passing the corresponding plot ID through the plot parameter.

Plot ID for Classification Models

+-----------------------------+--------------------+
| Name                        | Plot               |
+-----------------------------+--------------------+
| Area Under the Curve        | 'auc'              |
| Discrimination Threshold    | 'threshold'        |
| Precision Recall Curve      | 'pr'               |
| Confusion Matrix            | 'confusion_matrix' |
| Class Prediction Error      | 'error'            |
| Classification Report       | 'class_report'     |
| Decision Boundary           | 'boundary'         |
| Recursive Feature Selection | 'rfe'              |
| Learning Curve              | 'learning'         |
| Manifold Learning           | 'manifold'         |
| Calibration Curve           | 'calibration'      |
| Validation Curve            | 'vc'               |
| Dimension Learning          | 'dimension'        |
| Feature Importance          | 'feature'          |
| Model Hyperparameter        | 'parameter'        |
+-----------------------------+--------------------+

Interpret Model

After building a model, one of the most important tasks is to interpret the results.

Model Interpretability helps debug the model by analyzing what the model really thinks is important.

model = create_model('Model_name')
# interpret_model is based on SHAP and supports tree-based models
interpret_model(model)

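Predict Model

The predict_model function takes a trained model and generates predictions, by default on the hold-out set created during setup.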

Finalize Model

It is the last step of building a model in PyCaret.

This function takes a trained model object and returns a model that has been trained on the entire dataset.

model = create_model('Model_name')
final_model = finalize_model(model)
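What "trained on the entire dataset" means can be shown with a toy model in plain Python. Everything here is hypothetical and unrelated to PyCaret's internals: MajorityClassifier is a stand-in model, and finalize is an illustrative helper that refits the same kind of model on all rows, including the ones held out during the experiment.

```python
from collections import Counter

class MajorityClassifier:
    """Toy model: always predicts the most common label it was trained on."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        self.n_rows_seen_ = len(y)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def finalize(model, X_all, y_all):
    """Refit a fresh copy of the same model type on every available row."""
    return type(model)().fit(X_all, y_all)

X_all = [[i] for i in range(10)]
y_all = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]

# Experiment phase: train on the first 7 rows, hold out the last 3.
model = MajorityClassifier().fit(X_all[:7], y_all[:7])

# Final phase: retrain on all 10 rows before deployment.
final_model = finalize(model, X_all, y_all)
print(model.n_rows_seen_, final_model.n_rows_seen_)  # 7 10
```

Once the model has been refit on all rows, there is no untouched hold-out data left to score it on, which is why this step comes last.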

Save Models

Saving a trained model in PyCaret is as simple as writing save_model. The function takes a trained model object and saves the entire transformation pipeline and trained model object as a transferable binary pickle file for later use.
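The save-and-reload round trip can be sketched with Python's standard pickle module. This is a stand-in that does not call PyCaret: the dictionary below is a hypothetical placeholder for a fitted pipeline, and the file name is arbitrary.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted preprocessing + model pipeline
pipeline = {
    "steps": ["impute_missing", "scale"],
    "model": "decision_tree",
    "threshold": 0.5,
}

# Write the object to a binary pickle file
path = os.path.join(tempfile.mkdtemp(), "my_model.pkl")
with open(path, "wb") as f:
    pickle.dump(pipeline, f)

# Later (for example in production), load it back
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == pipeline)  # True
```

Saving the preprocessing steps together with the model means new data goes through the same transformations at prediction time. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.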

Looking at all the PyCaret code in a single frame.

import pycaret
import pandas as pd
from pycaret.classification import *
data = pd.read_csv('/contents/diabetes.csv')
clf = setup(data = data, target = 'Class variable')
compare_models()
dt = create_model('dt')
tuned = tune_model(dt, n_iter = 50)
plot_model(tuned)
interpret_model(tuned)
predict_model(tuned)
final_model = finalize_model(tuned)
save_model(final_model, 'model_name')  # 'model_name' is a placeholder file name

These dozen or so lines of code not only build a model and save it, but also handle missing values, split the data, and perform cross-validation and hyperparameter tuning. Outliers can be removed as well when that option is enabled in setup.

