Automated Hyperparameter tuning
Ever since the introduction of advanced algorithms in the field of Machine Learning, hyperparameter tuning has been a tedious task.
- Hyperparameters are tunable and can be adjusted to get the best-performing model.
- It’s always tricky to find the optimal combination of hyperparameters for an ML model on a specific task.
- Not only does it take time to write lines and lines of code, it also takes time to train each candidate model.
- A hyperparameter controls the behavior and results of an ML model.
Also, Check out our Article on:
Introduction to AutoML-The future of industry ML execution
Applying AutoML (Part-1) using Auto-Sklearn
Applying AutoML(Part-2) with MLBox
Applying AutoML (Part-3) with TPOT
Applying AutoML (Part-4) using H2O
AutoML on the Cloud
Traditional Hyperparameter Tuning!
Let’s look at the traditional way to tune a RandomForest model.
We are taking the RandomForest model as most of us are quite comfortable with it and know most of the hyperparameters associated with it.
There are three types of hyperparameter searches:
a. GridSearch
b. RandomSearch
c. ManualSearch
Read about it in detail in our article:
Hyper-parameter Tuning
We will look at the time taken by both Grid and Random Search to find the optimal hyperparameters.
→ Grid Search:
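The code for this step is not reproduced in the text, but a grid search over our RandomForest would look roughly like the sketch below (the regressor, the training data x_train/y_train, and the grid values are assumptions):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical grid of candidate values for each hyperparameter
param_grid = {
    "n_estimators": [200, 600, 900, 1200, 1500],
    "max_depth": [10, 40, 80],
    "max_features": ["auto", "sqrt"],
    "min_samples_split": [2, 5, 10, 15],
    "min_samples_leaf": [1, 4, 9],
    "bootstrap": [True, False],
}

# Exhaustively evaluates every combination in the grid with 5-fold cross-validation
grid_search = GridSearchCV(RandomForestRegressor(), param_grid,
                           cv=5, n_jobs=-1, verbose=1)
grid_search.fit(x_train, y_train)
print(grid_search.best_params_)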
As you can see, it took 3 hours to find a set of hyperparameters for only 800 data points. Imagine passing millions of data points!
Grid Search would take days to run.
→ Random Search:
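Again, a minimal sketch, reusing the hypothetical param_grid above as the sampling space:
from sklearn.model_selection import RandomizedSearchCV

# Samples a fixed number of random combinations instead of trying them all
random_search = RandomizedSearchCV(RandomForestRegressor(), param_grid,
                                   n_iter=50, cv=5, n_jobs=-1,
                                   random_state=42)
random_search.fit(x_train, y_train)
print(random_search.best_params_)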
Even though RandomSearch took only 2 minutes, it never guarantees an optimal set of hyperparameters.
→ Problems with the traditional methods:
- Increased time complexity.
- GridSearch is an exhaustive search that runs over every value we pass to tune the model, whereas RandomSearch samples the values at random.
- GridSearch suffers from the curse of dimensionality:
→ The number of model evaluations grows exponentially with the number of hyperparameters being tuned.
→ Additionally, it is not even guaranteed to find the best solution.
- The drawback of RandomSearch is unnecessarily high variance.
The method is entirely random and uses no intelligence in selecting which points to try.
Why use Automated Hyperparameter tuning?
- GridSearch and RandomSearch are hands-off; however, they take very long to run.
This is because they waste most of their time evaluating points in the search space that add no value.
- Increasingly, hyperparameter tuning is done by automated methods that aim to find optimal hyperparameters in less time, using an informed search with no manual effort necessary beyond the initial set-up.
Before moving on to the packages for automated hyperparameter tuning in Python, let us understand what Bayesian Optimization is.
Bayesian Optimization
- Sequential model-based optimization (also known as Bayesian optimization) is one of the most efficient strategies (per function evaluation) for function minimization.
- This efficiency makes it suitable for tuning the hyperparameters of ML models that are slow to train.
- SMBO methods are used where a user wants to minimize some scalar-valued function f(x) that takes a lot of time to evaluate.
- The advantages of SMBO are that it:
• Leverages smoothness without an analytic gradient.
• Handles every type of variable (real-valued, discrete, and conditional variables).
• Handles evaluation of the function f(x) in parallel.
• Scales to hundreds of variables, even with a budget of just a few hundred function evaluations.
- Bayesian Optimization is an approach that uses the Bayes Theorem to direct the search in order to find the minimum or maximum of an objective function.
- This is mostly useful for objective functions that can be complex, noisy, and/or expensive to evaluate.
- Bayesian optimization considers past evaluations when picking the next set of hyperparameters.
- By using this informed way of picking hyperparameter sets, it can focus on the areas of the parameter space that it believes will bring the most promising validation scores.
Idea Behind Bayesian Optimization
To understand this, let’s go back to the Bayes Theorem.
The optimization uses the Bayes Theorem to direct the search in order to find the minimum or maximum of an objective function.
- We know that the Bayes Theorem describes the probability of an event based on prior knowledge of conditions that may be related to the event.
In simple terms, the Bayes Theorem calculates the conditional probability of an event:
P(A|B) = P(B|A) * P(A) / P(B)
Now, when we apply the same logic to hyperparameter tuning, we get:
P(metric | combination of hyperparameters) = P(combination of hyperparameters | metric) * P(metric) / P(combination of hyperparameters)
Here,
* P(metric | combination of hyperparameters) gives the probability of the metric value to be minimized/maximized given the combination of hyperparameter values.
* P(combination of hyperparameters | metric) is the probability of a certain hyperparameter combination given that metric value.
* P(metric) is the prior probability of the metric value (a scalar).
* P(combination of hyperparameters) is the probability of getting that particular hyperparameter combination.
- However, we do not want to calculate the conditional probabilities exactly; instead we want to optimize a quantity.
We can simplify the equation by removing the normalizing value P(B), which turns the conditional probability equation into a proportionality.
Now, we get:
P(A|B) ∝ P(B|A) * P(A)
This can also be written as: posterior ∝ likelihood * prior
- Various hyperparameter configurations are then explored, each one taking advantage of the previous evaluations, which eventually helps the given machine learning model train with a better hyperparameter combination with each passing run.
The posterior captures the updated beliefs about the unknown objective function. One may also interpret this step of Bayesian optimization as approximating the objective function with a surrogate function (also called a response surface).
- Bayesian optimization builds a posterior distribution over the function to be optimized, then uses an acquisition function to sample from that posterior and pick the next set of parameters to explore.
Surrogate Function
- The surrogate function can be interpreted as a substitute for the objective function.
- It is used to propose parameter sets to the objective function that are likely to yield an improvement in the score.
- The surrogate function approximates the mapping from input examples (sets of hyperparameters) to an output score.
- Building a surrogate model is usually handled as a regression problem: we provide a set of hyperparameters as input and it returns an estimate of the objective function, parameterized by a mean and a standard deviation.
The common choices for surrogate models are:
→ Gaussian Process Regression
* Gaussian Processes are considered a good method for modeling loss functions in a model-based optimization context.
* The Gaussian Process works by building a joint probability distribution between the input features and the observed values of the objective function. With sufficient iterations, it is able to capture a valid estimate of the objective function (see the sketch after this list).
→ Tree-structured Parzen Estimator(TPE)
* The Tree-structured Parzen Estimator (TPE) algorithm optimizes the hyperparameters to find a configuration that reaches the expected accuracy target with as little evaluation cost as possible.
* TPE is an iterative process that uses the history of evaluated hyperparameters to create a probabilistic model, which is used to suggest the next set of hyperparameters to evaluate.
* TPE models p(x|y) by transforming the generative process of the configuration prior, replacing its distributions with non-parametric densities.
* TPE supports a wide variety of variables in parameter search space e.g., uniform, log-uniform, quantized log-uniform, normally-distributed real value, categorical.
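As mentioned above, the Gaussian Process surrogate is essentially a regression model over the observed (hyperparameter, score) pairs. A minimal sketch, with made-up observations for a single hyperparameter:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Made-up observations: hyperparameter values already evaluated and their CV scores
X_observed = np.array([[10.0], [50.0], [200.0], [800.0]])   # e.g. n_estimators
y_observed = np.array([0.71, 0.78, 0.82, 0.81])             # e.g. mean CV accuracy

# Fit the surrogate: a regression model over the search space
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_observed, y_observed)

# The surrogate returns a mean and a standard deviation for unseen candidate points
X_candidates = np.linspace(10, 1500, 50).reshape(-1, 1)
mu, sigma = gp.predict(X_candidates, return_std=True)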
Acquisition Function
The surrogate function is used to test a range of candidate samples in the domain.
- The acquisition function is maximized at every iteration to decide where to sample the objective function next; it takes into account the mean and variance of the surrogate's predictions over the space to model the usefulness of sampling each point.
- From these results, one or more candidates can be selected and evaluated with the real (and, in practice, computationally expensive) cost function.
- The objective function is then sampled at the argmax of the acquisition function, the Gaussian process is updated, and the whole procedure is repeated.
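One common acquisition function is Expected Improvement (EI). A minimal sketch, continuing the Gaussian Process surrogate from the previous snippet (it reuses gp, X_candidates, and y_observed):
from scipy.stats import norm

def expected_improvement(X, gp, y_best, xi=0.01):
    # EI for a maximization problem: how much improvement over the best observed score we expect at each candidate
    mu, sigma = gp.predict(X, return_std=True)
    sigma = np.maximum(sigma, 1e-9)   # avoid division by zero
    improvement = mu - y_best - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

# The next point to evaluate is the argmax of the acquisition function
ei = expected_improvement(X_candidates, gp, y_best=y_observed.max())
next_point = X_candidates[np.argmax(ei)]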
Packages for Automatic Hyperparameter tuning
Scikit-Optimize
- Scikit-Optimize is a library that is easier to use than many other hyperparameter optimization libraries, and it has good community support and documentation.
- The library implements several methods for sequential model-based optimization of expensive and noisy black-box functions.
→ Installing Scikit-Optimize
!pip install scikit-optimize
→ Importing necessary modules
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

import skopt
from skopt import gp_minimize
from skopt.space import Integer, Categorical
from skopt.utils import use_named_args
from skopt.plots import plot_convergence
→ Defining the parameter space
space = [
Integer(200,1500,name = "n_estimators"),
Integer(10, 80, name = "max_depth"),
Categorical(["auto", "sqrt"], name = "max_features"),
Integer(2,15, name = "min_samples_split"),
Integer(1,9, name = "min_samples_leaf"),
Categorical([True,False], name = "bootstrap")
]
→ Initializing the objective function
# rf is the RandomForest model being tuned; a regressor is assumed here, given the MAE scoring
rf = RandomForestRegressor()

@use_named_args(space)
def objective(**params):
    rf.set_params(**params)
    return -np.mean(cross_val_score(rf, x_train, y_train, cv=5,
                                    n_jobs=-1,
                                    scoring="neg_mean_absolute_error"))
→ Optimizing the function
%%time
tune_rand_gp = gp_minimize(objective, space, random_state=1234)
We can see that in a total time of about 9 minutes and 34 seconds, the skopt package found the best set of parameters for our RandomForest model.
→ Checking the best parameters
print(f"Best parameters: \n")
print(f'n_estimators={tune_rand_gp.x[0]}')
print(f'max_depth={tune_rand_gp.x[1]}')
print(f'max_features={tune_rand_gp.x[2]}')
print(f'min_samples_split={tune_rand_gp.x[3]}')
print(f'min_samples_leaf={tune_rand_gp.x[4]}')
print(f'bootstrap = {tune_rand_gp.x[5]}')
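To train a final model with these values, we can map the dimension names defined in space back onto the result (a small sketch reusing rf, space, and the training data from above):
best_params = {dim.name: value for dim, value in zip(space, tune_rand_gp.x)}
rf.set_params(**best_params)
rf.fit(x_train, y_train)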
→ Visualizing the convergence
plot_convergence(tune_rand_gp)
Hyperopt
- HyperOpt takes as input a space of hyperparameters in which it will search, and it moves according to the results of past trials. This means we get an optimizer that can minimize/maximize any function for us.
- The Hyperopt library provides different algorithms and a way to parallelize by building an infrastructure for performing hyperparameter optimization (model selection) in Python.
- HyperOpt provides an optimization interface that accepts a configuration space and an evaluation function that assigns real-valued loss values to points within that configuration space.
→ Importing the necessary modules:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
→ Initializing the parameters:
Hyperopt provides us with a range of parameter expressions:
- hp.choice(label, options):
Returns one of the options provided; options should be a list or a tuple.
- hp.randint(label, upper):
Returns a random integer in the range [0, upper).
- hp.uniform(label, lower, upper):
Returns a value uniformly distributed between the lower and the upper limit.
- hp.quniform(label, lower, upper, q):
Returns a value like round(uniform(lower, upper) / q) * q.
Suitable for a discrete value with respect to which the objective is still somewhat “smooth”, but which should be bounded both above and below.
- hp.loguniform(label, lower, upper):
Returns a value drawn according to exp(uniform(lower, upper)), so that the logarithm of the returned value is uniformly distributed.
To read more about the parameter expressions, refer to the Hyperopt documentation.
space = {"n_estimators": hp.choice("n_estimators",[200,600,900,1200,1500]),"max_depth": hp.quniform("max_depth", 10, 80,5),"max_features": hp.choice("criterion", ["auto", "sqrt"]),"min_samples_split":hp.choice("min_samples_split",[2, 5, 10,12,15]),"min_samples_leaf":hp.choice("min_samples_leaf",[1, 2, 4,7,9]),"bootstrap": hp.choice("bootstrap",[True,False])}
→ Defining the function to minimize/maximize:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def tune_random(params):
    params["max_depth"] = int(params["max_depth"])  # hp.quniform returns floats
    rand = RandomForestClassifier(**params, n_jobs=-1)
    acc = cross_val_score(rand, x, y, scoring="accuracy").mean()
    return {"loss": -acc, "status": STATUS_OK}
→ Minimizing the function:
%%time
trials = Trials()
best = fmin(fn=tune_random, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
print("Best: {}".format(best))
After minimizing the function, fmin returns the best set of hyperparameters.
It can be seen that the hyperparameters that help our model perform better were found in 11 minutes and 31 seconds.
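Note that for hp.choice parameters, fmin returns the index of the chosen option rather than the value itself; hyperopt's space_eval can map the result back to the actual values:
from hyperopt import space_eval
print(space_eval(space, best))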
→ Checking out results, losses and statuses:
trials.results
This returns the loss and status recorded for each trial.
trials.losses()
trials.statuses()
To check only the losses or only the statuses of the individual trials, we can use these two methods.
Optuna
- Optuna is framework agnostic, which makes it usable with any kind of framework, such as TensorFlow, PyTorch, and scikit-learn.
- It has a basic template:
import optuna

def objective(trial):
    # ML logic here
    return evaluation_score

study = optuna.create_study()
study.optimize(objective, n_trials=...)  # n_trials is the number of trials you want to run
- It helps minimize or maximize any function we want.
- It provides an easy mechanism to distribute the optimization, so if we have multiple machines, we can simultaneously run multiple trials that are asynchronous and show near-linear scalability.
- To set up the distribution we just need to change a few lines of the code.
study = optuna.create_study(study_name = '',       # name of the experiment
                            storage = '',          # URL of the database
                            load_if_exists = True)
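For example, with a hypothetical SQLite file as the shared backend (any database URL that Optuna supports would work):
study = optuna.create_study(study_name = "rf_tuning",
                            storage = "sqlite:///rf_tuning.db",
                            load_if_exists = True)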
- Optuna can then share the trial history across multiple processes running in parallel.
Let’s look at it Practically
→ Installing Optuna
!pip install optuna
→ Importing Optuna and defining the objective function to minimize/maximize
import optuna

def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 200, 1500)
    max_features = trial.suggest_categorical("max_features",
                                             ["auto", "sqrt"])
    max_depth = trial.suggest_int("max_depth", 10, 80, log=True)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 15)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 9)
    bootstrap = trial.suggest_categorical("bootstrap", [True, False])
    rand = RandomForestClassifier(n_estimators=n_estimators,
                                  max_features=max_features,
                                  max_depth=max_depth,
                                  min_samples_leaf=min_samples_leaf,
                                  min_samples_split=min_samples_split,
                                  bootstrap=bootstrap)
    score = cross_val_score(rand, x, y, n_jobs=-1, cv=5)
    accuracy = score.mean()
    return accuracy
→ Creating and optimizing the optimization task:
- Creating:
study = optuna.create_study(direction='maximize')
- Optimizing:
%%time
study.optimize(objective, n_trials=100)
It took Optuna roughly 6 minutes to come up with the best set of hyperparameters.
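The tuned values and the best score can then be read off the study object (study.best_params and study.best_value are part of the Optuna API):
print(study.best_params)
print(study.best_value)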
Conclusions:
- When dealing with huge data, Bayesian Optimization tends to reduce the time spent on hyperparameter tuning.
- The set of hyperparameters found by Bayesian Optimization methods is usually close to optimal, and it is found with far fewer evaluations than an exhaustive search.
References:
- https://conference.scipy.org/proceedings/scipy2013/pdfs/bergstra_hyperopt.pdf
- https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
- https://arxiv.org/pdf/1012.2599.pdf
- https://machinelearningmastery.com/what-is-bayesian-optimization/
Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.
Also, do give us a Clap👏 if you found this article useful, as your encouragement helps us create more content like this.