Applying AutoML (Part-4) using H2O

Accredian Publication
5 min read · Mar 12, 2021

--

  • H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit.
  • It provides a simple wrapper function that performs a large number of modeling-related tasks that would typically require many lines of code.
  • This frees up time to focus on other parts of the data science pipeline, such as data preprocessing, feature engineering, and model deployment.
  • Two Stacked Ensembles are created:
    — one based on all previously trained models
    — another based on the best model of each family
  • Stacked Ensembles will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top-performing models in the AutoML Leaderboard.
  • The H2O AutoML interface is designed to have as few parameters as possible, so that all the user needs to do is point to their dataset.
  • H2O AutoML performs (simple) data preprocessing, automates the process of training a large selection of candidate models, tunes hyperparameters of the models, and creates stacked ensembles.
  • Identify the response column and optionally specify a time constraint or limit on the number of total models trained.
  • H2O’s AutoML is also a helpful tool for the advanced user.
  • It can be used in both R and Python

Also, Check out our Article on:

Introduction to AutoML-The future of industry ML execution
Applying AutoML (Part-1) using Auto-Sklearn
Applying AutoML (Part-2) with MLBox
Applying AutoML (Part-3) with TPOT
Automated Hyperparameter tuning
AutoML on the Cloud

Advantages

  • Fully integrated with the latest GPUs so that it can take advantage of the latest and greatest hardware
  • Readily available algorithms, easy to use in your analytical projects
  • Faster than Python’s scikit-learn on many supervised machine learning workloads
  • Ability to scale up horizontally by provisioning dynamic clusters

Python Implementation

→ Installing H2O AutoML

!pip install h2o

→ Importing Packages

import h2o 
from h2o.automl import H2OAutoML

→ Initializing the H2O Cluster

h2o.init()

This is a necessary step: it starts (or connects to) a local H2O cluster, which must be running before you can load data or build models.
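If you want to control how much of your machine the local cluster uses, h2o.init() accepts a few commonly used resource arguments. The values below are illustrative choices, not defaults:

```python
import h2o

# nthreads=-1 uses all available CPU cores; max_mem_size caps the JVM heap.
# Both values here are assumptions -- tune them for your machine.
h2o.init(nthreads=-1, max_mem_size="4G")

# Print cluster details (H2O version, number of nodes, total memory)
h2o.cluster().show_status()
```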

→ Reading the data

crime = h2o.import_file("https://raw.githubusercontent.com/insaid2018/Term-2/master/CaseStudy/criminal_train.csv")

We will use the h2o.import_file() function to load the CSV file into our notebook.
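h2o.import_file() can also be told how to parse specific columns at load time. For example, the col_types mapping below (an assumption about this dataset's schema) would load the target column as a categorical ("enum") right away:

```python
import h2o

# Hypothetical variant: parse the target column as categorical at import time,
# so it arrives as a factor instead of a numeric column.
crime = h2o.import_file(
    "https://raw.githubusercontent.com/insaid2018/Term-2/master/CaseStudy/criminal_train.csv",
    col_types={"Criminal": "enum"},
)
```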

→ Checking data statistics

H2O’s describe() not only provides summary statistics for our data but also describes the characteristics of each feature.

crime.describe()

# Convert the target column to a factor (categorical) so H2O
# treats this as a classification problem
crime['Criminal'] = crime['Criminal'].asfactor()

→ Splitting the data

H2O provides a function that splits the data into train, validation, and test sets.

We will use the train and validation sets to build and validate our model, and the test set to see how it performs on unseen data.

train, valid, test = crime.split_frame(ratios=[0.7, 0.2], seed=1234)

print("Number of rows in train      : ", train.shape[0])
print("Number of rows in validation : ", valid.shape[0])
print("Number of rows in test       : ", test.shape[0])
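Note that split_frame’s ratios list the fractions for the first frames only; whatever remains goes to the last frame. A quick back-of-the-envelope check of the expected sizes in plain Python (the actual H2O split is probabilistic, so real row counts will vary slightly, and the row count here is hypothetical):

```python
ratios = [0.7, 0.2]              # fractions for train and valid
test_fraction = 1 - sum(ratios)  # the remainder goes to the test frame

n_rows = 10000                   # hypothetical row count
expected = {
    "train": round(n_rows * ratios[0]),
    "valid": round(n_rows * ratios[1]),
    "test":  round(n_rows * test_fraction),
}
print(expected)  # {'train': 7000, 'valid': 2000, 'test': 1000}
```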

→ Configuring H2O AutoML

aml = H2OAutoML(max_models=10, max_runtime_secs=300,
                exclude_algos=['StackedEnsemble', 'DeepLearning'],
                seed=1)
  • Here, we excluded Stacked Ensembles and Deep Learning because, in the real world, we want our model to be as simple as possible.
  • Even though Stacked Ensembles are usually among the best-performing models in H2O, we will avoid them here for that reason.

A few more important hyperparameters are:

  • nfolds=5: number of folds for k-fold cross-validation (nfolds=0 disables cross-validation)
  • balance_classes=False: balance training data class counts via over/under-sampling
  • max_runtime_secs=3600: how long the AutoML run may execute, in seconds
  • max_models=None: the maximum number of models to build in an AutoML run (None means no limit)
  • include_algos=None: list of algorithms to restrict the model-building phase to
  • exclude_algos=None: list of algorithms to skip during the model-building phase; None tries every available algorithm
  • seed=None: random seed for reproducibility
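Putting several of those options together, a configuration might look like this (the specific values are illustrative, not recommendations):

```python
from h2o.automl import H2OAutoML

# Illustrative configuration combining the hyperparameters listed above
aml = H2OAutoML(
    nfolds=5,                        # 5-fold cross-validation
    balance_classes=True,            # over/under-sample to balance class counts
    max_runtime_secs=600,            # stop the whole run after 10 minutes...
    max_models=20,                   # ...or after 20 models, whichever comes first
    exclude_algos=["DeepLearning"],  # skip slow-to-tune neural networks
    seed=42,                         # for reproducibility
)
```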

→ Training H2O AutoML

y = 'Criminal'                                 # the response column
x = [col for col in train.columns if col != y] # all remaining columns are predictors

aml.train(x=x, y=y, training_frame=train, validation_frame=valid)

→ Checking the Leaderboard

lb = aml.leaderboard
lb.head(10)

→ Getting the model explanation

explain_model = aml.leader.explain(train)
  • Explanations can be generated automatically with a single function call, providing a simple interface to exploring and explaining the AutoML models.
  • A large number of multi-model comparison plots and single-model (AutoML leader) plots can be generated automatically with a single call to h2o.explain().

→ Predicting on test data

preds = aml.predict(test)

H2O’s predict function outputs the predicted class as well as the probability of each class.
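For binomial models, the predicted label is not simply the class whose probability exceeds 0.5: H2O applies a decision threshold chosen during training (by default the one that maximizes F1 on the training metrics). A tiny sketch of that thresholding logic, where both the probability and the threshold values are hypothetical:

```python
# Hypothetical per-row probability of the positive class, as found in the
# probability columns of the predict frame
p1 = 0.42

# Hypothetical decision threshold (H2O picks one during training, e.g. max-F1)
threshold = 0.35

label = 1 if p1 >= threshold else 0
print(label)  # 1, because 0.42 >= 0.35 even though it is below 0.5
```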

→ Checking model performance on test data

This gives us a high-level report of the performance of our model.

aml.leader.model_performance(test)


Follow us for more upcoming future articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a Clap👏 if you find this article useful; your encouragement helps us create more content like this.

Visit us on https://www.insaid.co/


Accredian Publication

One of India’s leading institutions providing world-class Data Science & AI programs for working professionals with a mission to groom Data leaders of tomorrow!