Applying AutoML (Part-1) using Auto-Sklearn

Accredian Publication
5 min read · Mar 12, 2021

  • Auto-sklearn is built around the scikit-learn library.
  • It automatically searches for the machine learning models, and their hyperparameters, that best fit the data.
  • Auto-sklearn builds an ensemble from the models tested during the global optimization process.
  • To speed up the optimization process, auto-sklearn uses meta-learning: it identifies similar datasets and reuses the knowledge gathered on them in the past.
  • Auto-sklearn wraps a total of 15 classification algorithms and 14 feature preprocessing algorithms, and takes care of data scaling, the encoding of categorical features, and missing values.
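
The meta-learning idea can be illustrated with a toy sketch. This is not auto-sklearn's implementation: the meta-features, dataset names, and stored configurations below are all hypothetical, and the real system relies on a much richer description of each dataset. The sketch only shows the principle of warm-starting a search with configurations that worked well on the most similar past datasets.

```python
import math

# Hypothetical meta-features of previously seen datasets:
# (log10 of row count, log10 of feature count, minority-class fraction)
PAST_DATASETS = {
    "dataset_a": (3.0, 1.0, 0.50),
    "dataset_b": (5.4, 1.5, 0.02),
    "dataset_c": (4.0, 2.5, 0.30),
}

# Hypothetical best configuration found earlier on each of those datasets.
BEST_CONFIGS = {
    "dataset_a": {"model": "k_nearest_neighbors", "n_neighbors": 5},
    "dataset_b": {"model": "random_forest", "n_estimators": 200},
    "dataset_c": {"model": "linear_svm", "C": 1.0},
}


def warm_start_configs(new_meta_features, k=2):
    """Return the stored configs of the k past datasets closest to the new one."""
    ranked = sorted(
        PAST_DATASETS,
        key=lambda name: math.dist(PAST_DATASETS[name], new_meta_features),
    )
    return [BEST_CONFIGS[name] for name in ranked[:k]]


# A new, large, heavily imbalanced dataset (made-up values).
suggestions = warm_start_configs((5.5, 1.4, 0.01), k=2)
print(suggestions)
```

The search would then evaluate these suggested configurations first, before exploring the rest of the space.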

Advantages

  • Along with data preparation and model building, it also learns from models that have been used on similar datasets and can create automatic ensemble models for better accuracy.
  • It uses Bayesian optimization, which steers the search toward promising configurations instead of trying them blindly, for faster results.

Disadvantages

  • Auto-sklearn is fully automatic and largely a black box. It searches a vast space of models and constructs complex, highly accurate ensembles, which takes a substantial amount of computation and time. Its goal is to build the best possible model for the given data, so speed and interpretability come second.

Also, check out our articles on:

Introduction to AutoML-The future of industry ML execution
Applying AutoML(Part-2) with MLBox
Applying AutoML (Part-3) with TPOT
Applying AutoML (Part-4) using H2O
Automated Hyperparameter tuning
AutoML on the Cloud

Python Implementation

→ Installing library

!apt-get install swig -y
!pip install Cython numpy

After installing the above dependencies, you may proceed to install the auto-sklearn package:

!pip install auto-sklearn

Restart the kernel once everything is installed so that the newly installed packages are picked up.
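
After the restart, a quick sanity check can confirm that the packages are visible to the interpreter. The helper below is a small sketch using only the standard library; the name is_installed is hypothetical and not part of auto-sklearn.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Import names of the packages installed above.
for name in ("numpy", "Cython", "autosklearn"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If autosklearn is reported as missing, re-running the installation cells and restarting the kernel again is usually the first thing to try.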

→ Importing library

import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (to generate univariate analysis)
pd.set_option('display.max_columns', None)                          # Show all columns, however many there are
pd.set_option('display.max_colwidth', None)                         # Show full column contents for better clarity
pd.set_option('display.max_rows', None)                             # Show all rows, however many there are
pd.set_option('mode.chained_assignment', None)                      # Silence warnings about chained assignment
pd.set_option('display.float_format', lambda x: '%.5f' % x)         # Suppress scientific notation for float values
#-------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (for Numerical Python)
np.set_printoptions(precision=4)                                    # Display values only up to four decimal places
#-------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing the pyplot interface of matplotlib
from matplotlib.pylab import rcParams                               # Matplotlib runtime configuration (figure size, fonts, ...)
import seaborn as sns                                               # Importing the seaborn library for statistical visualization
# To display graphs inside the notebook.
%matplotlib inline
#-------------------------------------------------------------------------------
import time                                                         # To time the execution
#-------------------------------------------------------------------------------
from smac.tae import StatusType                                     # To get the status of the execution
#-------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split                # To split the data into training and testing parts
from sklearn.metrics import accuracy_score, f1_score                # For checking the accuracy and F1-score of our model
import autosklearn.classification                                   # For using AutoML

→ Reading data

data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-2/master/Data/credit_fraud.csv")
data.head()

→ Splitting into Train and Test data

x = data.drop('Class', axis = 1)
y = np.array(data['Class'])
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2)
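
Fraud datasets are usually heavily imbalanced, so a purely random split can leave very few fraud cases in the test set. train_test_split accepts stratify=y (and random_state for reproducibility) to keep the class ratio the same in both parts. The following is a simplified, standard-library sketch of what stratification does; the function name and the toy labels are assumptions for illustration, not scikit-learn's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one row of every class goes to the test part.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 95 legitimate transactions (0) and 5 fraudulent ones (1).
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels, test_size=0.2)

n_fraud_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_fraud_test)
```

With these toy labels, the 20-row test part is guaranteed to contain a fraud case, which a plain random split does not promise.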

→ Fitting Auto-Sklearn

  • Preprocessing in auto-sklearn is divided into data preprocessing and feature preprocessing.
  • Data preprocessing includes one-hot encoding of categorical features, imputation of missing values, and the normalization of features or samples.

# configure auto-sklearn
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    per_run_time_limit=10)

There are a lot of hyperparameters one can pass to configure the model; we will look at a few of the most important ones. For the full list, check out the auto-sklearn API documentation.

  • time_left_for_this_task
    - Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
    - By default, it is 3600.
  • per_run_time_limit
    - Time limit in seconds for a single call to the machine learning model.
    - Model fitting will be terminated if the machine learning algorithm runs over the time limit.
    - By default, it is 1/10 of time_left_for_this_task.
  • ensemble_size
    - Number of models added to the ensemble built by ensemble selection from libraries of models.
    - By default, it is 50.
  • memory_limit
    - auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.
    - By default, it is 3072 MB.

automl.fit(x_train, y_train)
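
The two time limits interact, and a rough calculation helps when choosing them. The sketch below is a back-of-the-envelope estimate only: the helper names are hypothetical, and it ignores the overhead auto-sklearn spends outside of model fitting, so the real numbers will be lower.

```python
# Settings used for the first model in this article.
search_settings = {
    "time_left_for_this_task": 30,   # total search budget, in seconds
    "per_run_time_limit": 10,        # cap for a single model fit, in seconds
}


def max_capped_runs(time_left_for_this_task, per_run_time_limit):
    """Upper bound on how many fits could each run for their full time cap."""
    return time_left_for_this_task // per_run_time_limit


def default_per_run_limit(time_left_for_this_task):
    """Default described above: one tenth of the total budget."""
    return time_left_for_this_task // 10


print(max_capped_runs(**search_settings))
print(default_per_run_limit(3600))
```

With a 30-second budget and a 10-second cap, at most three slow candidates can be evaluated in full, which is why such a short budget is only suitable for a quick demonstration.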

→ Evaluating Performance

# evaluate
pred = automl.predict(x_test)
test_acc = accuracy_score(y_test, pred)
print("Test Accuracy score {0}".format(test_acc))
test_f1 = f1_score(y_test, pred)
print(f"Test F1-Score {test_f1}")
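
Both metrics are reported because accuracy alone can be misleading when one class is rare, as is typical for fraud data. The toy example below uses made-up labels and a hand-written helper (not the scikit-learn functions used above) to show how a model that never predicts fraud still reaches a high accuracy while its F1-score is zero.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Toy test set: 98 legitimate transactions and 2 frauds (made-up numbers).
y_true = [0] * 98 + [1] * 2
always_legit = [0] * 100          # a "model" that never predicts fraud
acc, f1 = accuracy_and_f1(y_true, always_legit)
print(acc, f1)
```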

→ Checking reports of models built by Auto-Sklearn

print(automl.show_models())

→ Using Resampling Technique to fit Auto-Sklearn

  • Auto-sklearn supports several resampling strategies for validating candidate models during the search.
  • To choose one, we pass two hyperparameters to the AutoSklearnClassifier constructor: resampling_strategy and resampling_strategy_arguments.

automl_Hold = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    resampling_strategy='holdout',
    resampling_strategy_arguments={'train_size': 0.8})

The available resampling strategies include:

  • ‘holdout’: 67:33 (train:test) split by default
  • ‘holdout-iterative-fit’: 67:33 (train:test) split by default, calls iterative fit where possible
  • ‘cv’: cross-validation, requires ‘folds’
  • ‘cv-iterative-fit’: cross-validation, calls iterative fit where possible
  • ‘partial-cv’: cross-validation with intensification, requires ‘folds’
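
A cross-validated search can be configured in the same way as the holdout example. The sketch below is an assumption-laden illustration: the time limits are copied from the holdout example, 5 folds is an arbitrary choice, and the variable name automl_cv is chosen to match the refit step of this article. The import is guarded so the snippet also runs where auto-sklearn is not installed.

```python
# Keyword arguments for a cross-validated search (5 folds is an assumed value).
cv_settings = {
    "time_left_for_this_task": 120,
    "per_run_time_limit": 30,
    "resampling_strategy": "cv",
    "resampling_strategy_arguments": {"folds": 5},
}

try:
    import autosklearn.classification
except ImportError:
    automl_cv = None  # auto-sklearn is not installed in this environment
else:
    automl_cv = autosklearn.classification.AutoSklearnClassifier(**cv_settings)
    # automl_cv.fit(x_train, y_train)  # x_train / y_train come from the earlier split

print(cv_settings["resampling_strategy"], cv_settings["resampling_strategy_arguments"])
```

Cross-validation trains one model per fold for every candidate, so it needs a noticeably larger time budget than holdout to evaluate the same number of candidates.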

→ Refit function

  • During fit() with a cross-validation resampling strategy, models are fit on individual cross-validation folds.
  • To use all available data, we call refit(), which trains all models in the final ensemble on the whole training set.

# automl_cv is assumed to be an AutoSklearnClassifier already fitted with a 'cv' resampling strategy
automl_cv.refit(x_train.copy(), y_train.copy())

# evaluate
pred = automl_cv.predict(x_test)
print("After Re-fit")
print("-----------------------------")
test_acc = accuracy_score(y_test, pred)
print("Accuracy score {0}".format(test_acc))
test_f1 = f1_score(y_test, pred)
print(f"F1-Score {test_f1}")
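
To see why refitting helps, it is worth counting how many rows each fold model actually trains on. The following is a simplified illustration of k-fold splitting in general (contiguous, unshuffled folds), not auto-sklearn's internal code; the helper name and the training-set size are made up.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


n_rows, n_folds = 1000, 5   # made-up training-set size
train_sizes = [len(train) for train, _ in kfold_indices(n_rows, n_folds)]
print(train_sizes)                 # rows seen by each fold model
print(train_sizes[0] / n_rows)     # fraction of the data each fold model sees
```

With 5 folds, every model built during the search has seen only four fifths of the training rows; refit() closes that gap by training the ensemble members on all of them.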
