Applying AutoML (Part-4) using H2O
- H2O’s AutoML automates the machine learning workflow, which includes the automatic training and tuning of many models within a user-specified time limit.
- It provides a simple wrapper function that performs a large number of modeling-related tasks that would typically require many lines of code.
- This helps free up time to focus on other aspects of the data science pipeline, such as data preprocessing, feature engineering, and model deployment.
- Two Stacked Ensembles are created: one based on all previously trained models, and another based on the best model of each family.
- Stacked Ensembles are automatically trained on collections of individual models to produce highly predictive ensemble models, which in most cases will be the top-performing models in the AutoML Leaderboard.
- The H2O AutoML interface is designed to have as few parameters as possible, so that all the user needs to do is point to their dataset.
- H2O AutoML performs (simple) data preprocessing, automates the process of training a large selection of candidate models, tunes hyperparameters of the models, and creates stacked ensembles.
- The user identifies the response column and optionally specifies a time constraint or a limit on the total number of models trained.
- H2O’s AutoML is also a helpful tool for the advanced user.
- It can be used in both R and Python.
Advantages
- Integrates with modern GPUs, so it can take advantage of the latest hardware
- Readily available algorithms that are easy to use in your analytical projects
- Often faster than Python’s scikit-learn for supervised machine learning tasks
- Ability to scale up horizontally by provisioning dynamic clusters (see the connection sketch after this list)
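For instance, to take advantage of horizontal scaling you can point your client at an already-running (possibly multi-node) H2O cluster instead of starting a local one. A minimal sketch, assuming a cluster is reachable at a hypothetical address:
import h2o
# Connect to an existing H2O cluster (the address below is a placeholder)
h2o.init(ip="10.0.0.1", port=54321)
# or, equivalently:
# h2o.connect(url="http://10.0.0.1:54321")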
Python Implementation
→ Installing H2O AutoML
!pip install h2o
→ Importing Packages
import h2o
from h2o.automl import H2OAutoML
→ Initializing the H2O Cluster
h2o.init()
This is a necessary step: it starts (or connects to) a local H2O cluster so that you can start using H2O and building models.
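If you want more control over the local cluster, h2o.init() also accepts options such as nthreads and max_mem_size. A small sketch (the values shown are arbitrary choices, not requirements):
import h2o
# Use all available CPU cores and cap the cluster's memory at 4 GB
h2o.init(nthreads=-1, max_mem_size="4G")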
→ Reading the data
crime = h2o.import_file("https://raw.githubusercontent.com/insaid2018/Term-2/master/CaseStudy/criminal_train.csv")
We use the h2o.import_file() function to import the CSV file into our notebook as an H2OFrame.
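If the data already lives in a pandas DataFrame (or on local disk), it can be converted to an H2OFrame as well. A sketch, assuming a hypothetical local copy of the same CSV:
import h2o
import pandas as pd
# Hypothetical local file with the same contents
df = pd.read_csv("criminal_train.csv")
crime = h2o.H2OFrame(df)   # convert the pandas DataFrame to an H2OFrame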
→ Checking data statistics
H2O’s describe() not only provides summary statistics for our data but also shows the characteristics of each feature, such as its type and missing-value count.
crime.describe()
We also convert the target column Criminal to a factor so that H2O treats the task as a classification problem rather than a regression.
crime['Criminal'] = crime['Criminal'].asfactor()
→ Splitting the data
H2O provides a function that helps split the data into train, validation, and test sets.
We will use the train and validation sets to build and validate our model, and the test set to see how it performs on unseen data. With ratios=[0.7, 0.2], roughly 70% of the rows go to training, 20% to validation, and the remaining 10% to test.
train, valid, test = crime.split_frame(ratios=[0.7, 0.2], seed=1234)
print("Number of rows in train      : ", train.shape[0])
print("Number of rows in validation : ", valid.shape[0])
print("Number of rows in test       : ", test.shape[0])
→ Setting up H2O AutoML
aml = H2OAutoML(max_models=10, max_runtime_secs=300,
                exclude_algos=['StackedEnsemble', 'DeepLearning'], seed=1)
- Here, we excluded the Stacked Ensemble and Deep Learning algorithms, because in the real world we often want our model to be as simple as possible.
- Even though Stacked Ensembles are usually among the best-performing models in H2O, we avoid them here for that reason.
A few more important H2OAutoML parameters are listed below (see the sketch after this list for how they might be combined):
- nfolds=5 : number of folds for k-fold cross-validation (nfolds=0 disables cross-validation)
- balance_classes=False : balance training data class counts via over/under-sampling
- max_runtime_secs=3600 : how long the AutoML run will execute (in seconds)
- max_models=None : the maximum number of models to build in an AutoML run (None means no limit)
- include_algos=None : list of algorithms to restrict the model-building phase to (None means no restriction)
- exclude_algos=None : list of algorithms to skip during the model-building phase (None uses every available algorithm)
- seed=None : random seed for reproducibility
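As an illustration only (the tutorial itself uses the configuration above), here is a sketch of how these parameters might be combined; the algorithm names and values are arbitrary choices:
from h2o.automl import H2OAutoML
aml_custom = H2OAutoML(
    nfolds=5,                      # 5-fold cross-validation (0 disables it)
    balance_classes=True,          # over/under-sample to balance class counts
    max_runtime_secs=3600,         # stop the run after one hour
    max_models=None,               # no cap on the number of models
    include_algos=["GBM", "GLM"],  # restrict the run to GBM and GLM models
    seed=42                        # random seed for reproducibility
)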
→ Training H2O AutoML
dependent_feature = 'Criminal'
independent_features = [col for col in train.columns if col != dependent_feature]
aml.train(x=independent_features, y=dependent_feature,
          training_frame=train, validation_frame=valid)
→ Checking the Leaderboard
lb = aml.leaderboard
lb.head(10)
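Recent H2O versions also expose a helper that adds extra columns (such as per-model training time) to the leaderboard; a sketch, assuming h2o.automl.get_leaderboard is available in your installed version:
from h2o.automl import get_leaderboard
# Leaderboard with extra columns such as per-model training time
lb_full = get_leaderboard(aml, extra_columns="ALL")
lb_full.head(rows=10)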
→ Getting model explanations
explain_model = aml.leader.explain(train)
- Explanations can be generated automatically with a single function call, providing a simple interface for exploring and explaining the AutoML models.
- A large number of multi-model comparison plots and single-model (AutoML leader) plots can be generated automatically with a single call to h2o.explain().
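Newer H2O releases also let you explain an individual prediction. A small sketch, assuming the explain_row() method is available in your version:
# Explain a single observation (row 0 of the training frame) for the leader model
aml.leader.explain_row(train, row_index=0)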
→ Predicting on test data
preds = aml.predict(test)
H2O’s predict function outputs the predicted class as well as the probability of each class.
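If you prefer to continue the analysis in pandas, the predictions can be pulled back out of H2O. A sketch, assuming a binary 0/1 target so the probability columns are named p0 and p1:
# Convert the prediction frame to pandas and attach it to the test rows
preds_df = preds.as_data_frame()              # columns: predict, p0, p1
scored = test.cbind(preds).as_data_frame()    # test features plus predictions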
→ Checking model performance on test data
This gives us a high-level report of our model's performance on the test data.
aml.leader.model_performance(test)
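Assuming a binary target, individual metrics can also be pulled out of the performance object, and the leader model can be saved for later use. A sketch (the save path is a placeholder):
perf = aml.leader.model_performance(test)
print("AUC     :", perf.auc())
print("LogLoss :", perf.logloss())
print(perf.confusion_matrix())
# Persist the leader model to disk (placeholder path)
model_path = h2o.save_model(model=aml.leader, path="./models", force=True)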
Also, check out our articles on:
Introduction to AutoML-The future of industry ML execution
Applying AutoML (Part-1) using Auto-Sklearn
Applying AutoML(Part-2) with MLBox
Applying AutoML (Part-3) with TPOT
Automated Hyperparameter tuning
AutoML on the Cloud
Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.
Also, do give us a Clap👏 if you find this article useful; your encouragement inspires us to create more content like this.