Applying AutoML (Part-1) using Auto-Sklearn
- It is built around the scikit-learn library.
- It automatically searches for the machine learning models that best fit the data.
- Auto-sklearn builds an ensemble of all models tested during the global optimization process.
- In order to speed up the optimization process, auto-sklearn uses meta-learning to identify similar datasets and reuse the knowledge gathered in the past.
- Auto-Sklearn wraps a total of 15 classification algorithms and 14 feature preprocessing algorithms, and takes care of data scaling, encoding of categorical features, and missing values.
Advantages
- Along with data preparation and model building, it also learns from models that have been used on similar datasets and can create automatic ensemble models for better accuracy.
- Uses Bayesian Optimization to search the model space efficiently, yielding faster results.
Disadvantages
- auto-sklearn is completely automatic and black-box: because its goal is to build the best model possible given the data, it searches a vast space of models and constructs complex, high-accuracy ensembles, taking a substantial amount of computation and time in the process.
Python Implementation
→ Installing library
!apt-get install swig -y
!pip install Cython numpy
After installing the above dependencies, you may proceed to install the auto-sklearn package.
!pip install auto-sklearn
Restart the kernel once everything is installed.
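Once the kernel has restarted, a quick sanity check confirms the installation (this check cell is our addition, not part of the original walkthrough):
import autosklearn                # verify the package imports cleanly after the restart
print(autosklearn.__version__)    # print the installed version to confirm the setup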
→ Importing library
import pandas as pd                                     # Importing for panel data analysis
from pandas_profiling import ProfileReport              # Import Pandas Profiling (to generate univariate analysis)
pd.set_option('display.max_columns', None)              # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)             # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                 # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)          # Removing restriction over chained assignment operations
pd.set_option('display.float_format', lambda x: '%.5f' % x)  # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                      # Importing numpy (Numerical Python)
np.set_printoptions(precision=4)                        # To display values only up to four decimal places
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                         # Importing pyplot interface of matplotlib
from matplotlib.pylab import rcParams                   # Backend used for rendering and GUI integration
import seaborn as sns                                   # Importing seaborn library for interactive visualization
# To render graphs in the notebook
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import time                                             # To time the execution
#-------------------------------------------------------------------------------------------------------------------------------
from smac.tae import StatusType                         # To get the status of the execution
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split    # To split the data into training and testing parts
from sklearn.metrics import accuracy_score, f1_score    # For checking the accuracy and F1-score of our model
import autosklearn.classification                       # For using AutoML
→ Reading data
data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-2/master/Data/credit_fraud.csv")
data.head()
→ Splitting into Train and Test data
x = data.drop('Class', axis=1)
y = np.array(data['Class'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
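Credit-card fraud data is typically highly imbalanced, so a stratified split keeps the class ratio identical in both partitions. A minimal sketch (the stratify and random_state arguments are our addition, not part of the original):
# Stratified variant (assumption: 'Class' is heavily imbalanced, as is typical for fraud data)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)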
→ Fitting Auto-Sklearn
- Preprocessing in auto-sklearn is divided into data preprocessing and feature preprocessing.
- Data preprocessing includes One-Hot encoding of categorical features, imputation of missing values, and normalization of features or samples.
# Configure auto-sklearn
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    per_run_time_limit=10)
There are many hyperparameters one can pass to configure the model; we will look at the most important ones here (a combined sketch follows the list). To see all available hyperparameters, check out this link.
- time_left_for_this_task
- Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
- By default, it is 3600.
- per_run_time_limit
- Time limit for a single call to the machine learning model.
- Model fitting will be terminated if the machine learning algorithm runs over the time limit.
- By default, it is 1/10 of time_left_for_this_task.
- ensemble_size
- Number of models added to the ensemble built by ensemble selection from libraries of models.
- By default, it is 50.
- memory_limit
- auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.
- By default, it is 3072 MB.
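Putting these together, a more fully configured classifier might look like the sketch below. The variable name and values are illustrative assumptions, not recommendations:
# Illustrative configuration combining the hyperparameters above (hypothetical example)
automl_configured = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,   # total search budget in seconds (default 3600)
    per_run_time_limit=30,         # cap for any single model fit (default: 1/10 of the budget)
    ensemble_size=50,              # models kept by ensemble selection (default 50)
    memory_limit=3072)             # MB a fit may allocate before being aborted (default 3072)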
automl.fit(x_train, y_train)
→ Evaluating Performance
# Evaluate
pred = automl.predict(x_test)
test_acc = accuracy_score(y_test, pred)
print("Test Accuracy score {0}".format(test_acc))
test_f1 = f1_score(y_test, pred)
print(f"Test F1-Score {test_f1}")
→ Checking reports of models built by Auto-Sklearn
print(automl.show_models())
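Beyond show_models(), auto-sklearn exposes a couple of other inspection helpers (a sketch; availability of leaderboard() varies by release):
print(automl.sprint_statistics())   # summary of the search: runs attempted, succeeded, timed out
print(automl.leaderboard())         # ranked table of the models in the final ensemble (newer releases)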
→ Using Resampling Technique to fit Auto-Sklearn
- There are multiple resampling techniques available in Auto-Sklearn.
- In order to use a resampling technique, we need to pass two hyperparameters to the AutoML function: resampling_strategy and resampling_strategy_arguments.
automl_Hold = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    resampling_strategy='holdout',
    resampling_strategy_arguments={'train_size': 0.8})
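The configured object is then fitted in the usual way (this call is assumed here, since it is not shown in the original):
automl_Hold.fit(x_train, y_train)   # assumed fitting call, mirroring the earlier fit step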
The available resampling methods include:
- ‘holdout’: 67:33 (train:test) split
- ‘holdout-iterative-fit’: 67:33 (train: test) split, calls iterative fit where possible
- ‘cv’: cross-validation, requires ‘folds’
- ‘cv-iterative-fit’: cross-validation, calls iterative fit where possible
- ‘partial-cv’: cross-validation with intensification, requires ‘folds’
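The refit step below uses an automl_cv object trained with the 'cv' strategy. A minimal sketch of how it would be created (this fitting cell is assumed, as it does not appear above):
automl_cv = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5})  # 'cv' requires the number of folds
automl_cv.fit(x_train, y_train)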
→ Refit function
- During fit(), models are fit on individual cross-validation folds.
- To use all available data, we call refit(), which trains all models in the final ensemble on the whole dataset.
automl_cv.refit(x_train.copy(), y_train.copy())
# Evaluate
pred = automl_cv.predict(x_test)
print("After Re-fit")
print("-----------------------------")
test_acc = accuracy_score(y_test, pred)
print("Accuracy score {0}".format(test_acc))
test_f1 = f1_score(y_test, pred)
print(f"F1-Score {test_f1}")
Also, check out our articles on:
Introduction to AutoML-The future of industry ML execution
Applying AutoML(Part-2) with MLBox
Applying AutoML (Part-3) with TPOT
Applying AutoML (Part-4) using H2O
Automated Hyperparameter tuning
AutoML on the Cloud
Follow us for more upcoming future articles related to Data Science, Machine Learning, and Artificial Intelligence.
Also, do give us a Clap👏 if you find this article useful, as your encouragement catalyzes inspiration and helps us create more cool stuff like this.