Applying AutoML (Part-1) using Auto-Sklearn
- It is built around the scikit-learn library.
- It automatically searches for the machine learning models that best fit the data.
- Auto-sklearn builds an ensemble of all models tested during the global optimization process.
- In order to speed up the optimization process, auto-sklearn uses meta-learning to identify similar datasets and reuse the knowledge gathered in the past.
- Auto-Sklearn wraps a total of 15 classification algorithms and 14 feature preprocessing algorithms, and takes care of data scaling, encoding of categorical features, and missing values.
Advantages
- Along with data preparation and model building, it also learns from models that have been used on similar datasets and can create automatic ensemble models for better accuracy.
- Uses Bayesian Optimization to search the model space efficiently, yielding faster results.
Disadvantages
- auto-sklearn is completely automatic and black-box: because its goal is to build the best model possible given the data, it searches a vast space of models and constructs complex, high-accuracy ensembles, taking a substantial amount of computation and time in the process.
Python Implementation
→ Installing library
!apt-get install swig -y
!pip install Cython numpy
After installing the above dependencies, you may proceed to install the auto-sklearn package.
!pip install auto-sklearn
Restart the kernel once everything is installed.
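Once the kernel has restarted, a quick sanity check confirms the installation (this check cell is our addition, not part of the original walkthrough):
import autosklearn                # verify the package imports cleanly after the restart
print(autosklearn.__version__)    # print the installed version to confirm the setup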
→ Importing library
import pandas as pd                                     # Importing for panel data analysis
from pandas_profiling import ProfileReport              # Import Pandas Profiling (to generate univariate analysis)
pd.set_option('display.max_columns', None)              # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)             # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                 # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)          # Removing restriction over chained assignment operations
pd.set_option('display.float_format', lambda x: '%.5f' % x)  # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                      # Importing numpy (Numerical Python)
np.set_printoptions(precision=4)                        # To display values only up to four decimal places
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                         # Importing pyplot interface of matplotlib
from matplotlib.pylab import rcParams                   # Backend used for rendering and GUI integration
import seaborn as sns                                   # Importing seaborn library for interactive visualization
# To render graphs in the notebook
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import time                                             # To time the execution
#-------------------------------------------------------------------------------------------------------------------------------
from smac.tae import StatusType                         # To get the status of the execution
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split    # To split the data into training and testing parts
from sklearn.metrics import accuracy_score, f1_score    # For checking the accuracy and F1-score of our model
import autosklearn.classification                       # For using AutoML
→ Reading data
data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-2/master/Data/credit_fraud.csv")
data.head()
→ Splitting into Train and Test data
x = data.drop('Class', axis=1)
y = np.array(data['Class'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
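Credit-card fraud data is typically highly imbalanced, so a stratified split keeps the class ratio identical in both partitions. A minimal sketch (the stratify and random_state arguments are our addition, not part of the original):
# Stratified variant (assumption: 'Class' is heavily imbalanced, as is typical for fraud data)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)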
→ Fitting Auto-Sklearn
- Preprocessing in auto-sklearn is divided into data preprocessing and feature preprocessing.
- Data preprocessing includes One-Hot encoding of categorical features, imputation of missing values, and normalization of features or samples.
# Configure auto-sklearn
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    per_run_time_limit=10)
There are many hyperparameters one can pass to configure the model; we will look at the most important ones here (a combined sketch follows the list). To see all available hyperparameters, check out this link.
- time_left_for_this_task
- Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
- By default, it is 3600.
- per_run_time_limit
- Time limit for a single call to the machine learning model.
- Model fitting will be terminated if the machine learning algorithm runs over the time limit.
- By default, it is 1/10 of time_left_for_this_task.
- ensemble_size
- Number of models added to the ensemble built by ensemble selection from libraries of models.
- By default, it is 50.
- memory_limit
- auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.
- By default, it is 3072 MB.
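Putting these together, a more fully configured classifier might look like the sketch below. The variable name and values are illustrative assumptions, not recommendations:
# Illustrative configuration combining the hyperparameters above (hypothetical example)
automl_configured = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,   # total search budget in seconds (default 3600)
    per_run_time_limit=30,         # cap for any single model fit (default: 1/10 of the budget)
    ensemble_size=50,              # models kept by ensemble selection (default 50)
    memory_limit=3072)             # MB a fit may allocate before being aborted (default 3072)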
automl.fit(x_train, y_train)
→ Evaluating Performance
# Evaluate
pred = automl.predict(x_test)
test_acc = accuracy_score(y_test, pred)
print("Test Accuracy score {0}".format(test_acc))
test_f1 = f1_score(y_test, pred)
print(f"Test F1-Score {test_f1}")
→ Checking reports of models built by Auto-Sklearn
print(automl.show_models())
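Beyond show_models(), auto-sklearn exposes a couple of other inspection helpers (a sketch; availability of leaderboard() varies by release):
print(automl.sprint_statistics())   # summary of the search: runs attempted, succeeded, timed out
print(automl.leaderboard())         # ranked table of the models in the final ensemble (newer releases)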
→ Using Resampling Technique to fit Auto-Sklearn
- There are multiple resampling techniques available in Auto-Sklearn.
- In order to use a resampling technique, we need to pass two hyperparameters to the AutoML function: resampling_strategy and resampling_strategy_arguments.
automl_Hold = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    resampling_strategy='holdout',
    resampling_strategy_arguments={'train_size': 0.8})
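The configured object is then fitted in the usual way (this call is assumed here, since it is not shown in the original):
automl_Hold.fit(x_train, y_train)   # assumed fitting call, mirroring the earlier fit step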
The available resampling methods include:
- ‘holdout’: 67:33 (train:test) split
- ‘holdout-iterative-fit’: 67:33 (train: test) split, calls iterative fit where possible
- ‘cv’: cross-validation, requires ‘folds’
- ‘cv-iterative-fit’: cross-validation, calls iterative fit where possible
- ‘partial-cv’: cross-validation with intensification, requires ‘folds’
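The refit step below uses an automl_cv object trained with the 'cv' strategy. A minimal sketch of how it would be created (this fitting cell is assumed, as it does not appear above):
automl_cv = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5})  # 'cv' requires the number of folds
automl_cv.fit(x_train, y_train)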
→ Refit function
- During fit(), models are fit on individual cross-validation folds.
- To use all available data, we call refit(), which trains all models in the final ensemble on the whole dataset.
automl_cv.refit(x_train.copy(), y_train.copy())
# Evaluate
pred = automl_cv.predict(x_test)
print("After Re-fit")
print("-----------------------------")
test_acc = accuracy_score(y_test, pred)
print("Accuracy score {0}".format(test_acc))
test_f1 = f1_score(y_test, pred)
print(f"F1-Score {test_f1}")
Also, check out our articles on:
Introduction to AutoML-The future of industry ML execution
Applying AutoML(Part-2) with MLBox
Applying AutoML (Part-3) with TPOT
Applying AutoML (Part-4) using H2O
Automated Hyperparameter tuning
AutoML on the Cloud
Follow us for more upcoming future articles related to Data Science, Machine Learning, and Artificial Intelligence.
Also, do give us a Clap👏 if you find this article useful, as your encouragement catalyzes inspiration and helps us create more cool stuff like this.