Applying AutoML (Part-3) with TPOT

Accredian Publication
4 min read · Mar 12, 2021


  • It can be thought of as a Data Scientists’ assistant.
  • TPOT generates Python code for the pipeline it builds. This file is available even if you end the execution manually.
  • TPOT selects the best pipeline it generated and exports the corresponding Python file.
  • TPOT is built on the scikit-learn library and can be used for regression and classification tasks.
  • TPOT is open source and under active development.
  • It is among the most popular AutoML packages, with 7.6k GitHub stars.

Also, Check out our Article on:

Introduction to AutoML-The future of industry ML execution
Applying AutoML (Part-1) using Auto-Sklearn
Applying AutoML(Part-2) with MLBox
Applying AutoML (Part-4) using H2O
Automated Hyperparameter tuning
AutoML on the Cloud

  • TPOT uses a tree-based structure to represent a model pipeline for a predictive modeling problem, including data preparation and modeling algorithms and model hyperparameters.
  • An optimization procedure is then performed to find a tree structure that performs best for a given dataset.
  • A stochastic global optimization on programs represented as trees is performed using a genetic programming algorithm.
  • TPOT might run for hours or maybe days depending on the size of the data it has to fit.
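A tree-structured pipeline of the kind TPOT evolves corresponds, once exported, to an ordinary scikit-learn pipeline. The sketch below is illustrative only (the preprocessing steps, model, and hyperparameters are assumptions, not actual TPOT output); it shows the shape of one candidate "tree": data preparation, then feature selection, then a model, each node carrying its own hyperparameters.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# One candidate pipeline "tree": scaling -> feature selection -> model.
# TPOT's optimization searches over structures and hyperparameters like these.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=10),
    RandomForestClassifier(n_estimators=100, max_depth=5),
)
print(pipeline)
```

TPOT's genetic programming mutates and recombines such trees across generations, keeping the best performers.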

TPOT Pipeline includes

TPOT automates tasks that generally take days to complete manually.

How does TPOT work?

  • TPOT creates multiple copies of the dataset and sends them in parallel through different feature-engineering pipelines.
  • The engineered or selected features are then combined, and the features with the highest importance are retained.
  • These features are then sent forward to the model.
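The steps above can be sketched with scikit-learn's FeatureUnion, which runs feature-engineering branches in parallel and concatenates their outputs before modeling. This is an illustrative analogy for the flow described, not TPOT's internal code; the branch choices (PCA, SelectKBest) and the final classifier are assumptions.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Two parallel feature-engineering branches whose outputs are combined...
union = FeatureUnion([
    ("pca", PCA(n_components=5)),
    ("kbest", SelectKBest(f_classif, k=5)),
])

# ...and then sent forward to the model.
model = Pipeline([("features", union), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X, y)
print(model.score(X, y))
```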

Why is TPOT Time-Consuming?

  • TPOT will take a while to run on larger datasets, but it’s important to realize why.
  • With the default TPOT settings (100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing.
  • To put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search will take.
  • That is 10,000 model configurations to evaluate with 10-fold cross-validation, which means that roughly 100,000 models are fit and evaluated on the training data in one grid search.
  • That’s a time-consuming procedure, even for simpler models like decision trees.
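The back-of-the-envelope arithmetic above can be written out directly, using the article's figures (default generations and population size of 100, with 10-fold cross-validation):

```python
generations = 100      # TPOT default
population_size = 100  # TPOT default
cv_folds = 10          # 10-fold cross-validation

pipeline_configs = generations * population_size  # pipelines evaluated
model_fits = pipeline_configs * cv_folds          # individual model fits

print(pipeline_configs)  # 10000
print(model_fits)        # 100000
```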

Advantages

  • It produces Python code for the best-performing model.
  • Finds the best hyperparameters automatically.
  • Low-code methodology.
  • Compares several machine learning models.

Disadvantages

  • It is time-consuming.
  • Produces highly complex pipelines when not given appropriate hyperparameters.

Python Implementation

→ Installing Library

!pip install tpot

→ Importing Library

import pandas as pd
from sklearn.model_selection import train_test_split
import tpot
from tpot import TPOTClassifier

→ Reading data

data = pd.read_csv("/content/sonar_csv.csv")

→ Splitting data into Train and Test sets

data['Class'].replace(['Rock','Mine'], [0,1], inplace = True)
x = data.drop('Class',axis = 1)
y = data['Class']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2)

→ Fitting it to TPOT

%%time
tpot = TPOTClassifier(verbosity=2, generations=20)
tpot.fit(x_train, y_train)

Key parameters of TPOTClassifier:
  1. generations
    Number of iterations to run the pipeline optimization process. It must be a positive number or None.
  2. population_size
    Number of individuals to retain in the genetic programming population every generation.
  3. mutation_rate
    This parameter tells the GP algorithm how many pipelines to apply random changes to every generation.
  4. crossover_rate
    This parameter tells the genetic programming algorithm how many pipelines to "breed" every generation.
  5. scoring
    Function used to evaluate the quality of a given pipeline for the classification problem. It takes accuracy by default.

The following scoring functions can also be used:

  • ‘accuracy’
  • ‘adjusted_rand_score’
  • ‘average_precision’
  • ‘balanced_accuracy’
  • ‘f1’
  • ‘f1_macro’
  • ‘f1_micro’
  • ‘f1_samples’
  • ‘f1_weighted’
  • ‘neg_log_loss’
  • ‘precision’ etc. (suffixes apply as with ‘f1’)
  • ‘recall’ etc. (suffixes apply as with ‘f1’)
  • ‘jaccard’ etc. (suffixes apply as with ‘f1’)
  • ‘roc_auc’
  • ‘roc_auc_ovr’
  • ‘roc_auc_ovo’
  • ‘roc_auc_ovr_weighted’
  • ‘roc_auc_ovo_weighted’

For full details, refer to the TPOT documentation.

→ Evaluating TPOT

tpot.score(x_test, y_test)

→ Exporting .py file

tpot.export("Sonar_data.py")

This pipeline was exported by TPOT; you can experiment with it and modify it as needed.
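An exported script typically has the shape sketched below: it reads the data, splits it, and fits the single best pipeline found during the search. The model and hyperparameters here are placeholders (TPOT fills in whatever pipeline scored best), and synthetic data with the sonar dataset's dimensions (208 rows, 60 features) is used so the sketch runs standalone; TPOT's actual export reads your CSV instead.

```python
# Illustrative shape of a TPOT-exported script (model choice is a placeholder).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# TPOT's export reads your dataset here; synthetic data keeps the sketch runnable.
X, y = make_classification(n_samples=208, n_features=60, random_state=42)
training_features, testing_features, training_target, testing_target = \
    train_test_split(X, y, random_state=42)

# The best pipeline found during the search (example model, not real output).
exported_pipeline = GradientBoostingClassifier(
    learning_rate=0.1, max_depth=3, n_estimators=100, random_state=42
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
print(results[:5])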


Follow us for more upcoming future articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a Clap👏 if you find this article useful, as your encouragement helps inspire us to create more content like this.

Visit us on https://www.insaid.co/


Accredian Publication

One of India’s leading institutions providing world-class Data Science & AI programs for working professionals with a mission to groom Data leaders of tomorrow!