Cross-Validation Techniques
Imagine building a model on a dataset and watching it fail on unseen data.
We cannot just fit the model on our training data and sit back, hoping it will perform brilliantly on the real unseen data.
This is a case of over-fitting, where our model has learned all the patterns as well as the noise of the training data. To avoid this, we need some way to verify that our model has captured most of the patterns without picking up every bit of noise in the data (low bias and low variance). One of the many techniques to handle this is cross-validation.
What is Cross-Validation?
- In machine learning, cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subsets of the data.
- It is mainly used to estimate any quantitative measure of fit that is appropriate for both the data and the model.
- In the cross-validation approach, the data used for training and testing do not overlap, so the test results that are reported are not biased. A minimal end-to-end sketch follows the figure below.
[Figure: a typical cross-validation workflow in model training]
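As a minimal end-to-end sketch, the whole workflow can be run in a few lines with scikit-learn's cross_val_score; the iris dataset and logistic regression model here are illustrative assumptions, not part of the original example:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 5 runs trains on a subset and evaluates on its complement
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # averaged estimate of model fit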
Different methods of Cross-Validation are:
→ Validation (Holdout) Method:
- It is a simple train-test split method.
- In this method, we split the data into a train set and a test set.
- Further, the test data is split into validation data and test data.
For example:
1. Suppose there are 1000 data points; we split the data into 80% train and 20% test.
2. The train data then consists of 800 data points and the test data of 200 data points.
3. Then we split the test data into 50% validation data and 50% test data.
from sklearn.model_selection import train_test_split

# First split: 80% train, 20% temporary hold-out
x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=0.2)

# Second split: divide the hold-out into validation and test halves
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=0.5)
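With 1000 data points, this leaves 800 for training, 100 for validation, and 100 for the final test.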
→ K-Folds Method:
- In this method, we split the dataset into k subsets (known as folds), then train on k-1 of the subsets and hold out the remaining one for evaluation of the trained model.
- We iterate k times, with a different subset reserved for testing each time.
- This ensures that every observation from the original dataset has the chance of appearing in both the training and the test set.
- The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate; a short averaging sketch follows the snippet below. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])  # independent variable
y = np.array([1, 2, 3, 4])                      # dependent variable

kf = KFold(n_splits=2)  # n_splits defines the number of folds
kf.get_n_splits(X)      # returns the number of splitting iterations in the cross-validator
print(kf)
# KFold(n_splits=2, random_state=None, shuffle=False)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# TRAIN: [2 3] TEST: [0 1]
# TRAIN: [0 1] TEST: [2 3]
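To make the averaging step concrete, here is a small sketch of manually scoring each fold and averaging the k results; the iris dataset and logistic regression model are illustrative assumptions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

scores = []
for train_index, test_index in kf.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_index], y[train_index])
    scores.append(model.score(X[test_index], y[test_index]))  # one result per fold
print(np.mean(scores))  # the k results averaged into a single estimate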
→ Stratified K-Fold Method:
- The difference between K-Fold cross-validation and Stratified K-Fold is that K-Fold splits the data into k "random" folds, meaning the subsets consist of data points picked and placed at random, whereas Stratified K-Fold splits the data into k folds while making sure each fold is an appropriate representative of the original data.
- One of the biggest issues in any classification problem is class imbalance.
If we use the K-Fold CV method on imbalanced data, training may become biased towards one class: since K-Fold forms its k subsets at random, there is a high chance of getting folds that consist mostly of the majority class. To handle this type of issue, Stratified K-Fold is used, with the help of the stratification process.
Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole. For example, in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold each class makes up about half the instances (the comparison sketch after the snippet below illustrates this).
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
for train_index, test_index in skf.split(X, y):  # split preserves class proportions
    print("TRAIN:", train_index, "TEST:", test_index)
→ Leave-One-Out Method:
- Leave-One-Out Cross-Validation, or LOOCV, is a procedure used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.
- This approach leaves 1 data point out of the training data, i.e. if there are n data points in the original sample, then n-1 points are used to train the model and the single left-out point is used as the validation set.
- This is repeated for all n ways in which the original sample can be separated this way, and the error is then averaged over all trials to give the overall effectiveness (a sketch of this averaging follows the snippet below).
from sklearn.model_selection import LeaveOneOut
import numpy as np

X = np.array([[1, 2], [3, 4]])  # independent variable
y = np.array([1, 2])            # dependent variable

loo = LeaveOneOut()
loo.get_n_splits(X)  # returns the number of splitting iterations (one per data point)

for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
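The averaging of the per-trial errors described above can be done directly with cross_val_score; the linear regression model and toy data are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + np.random.RandomState(0).normal(size=10)

# One model and one squared error per left-out point; average over all n trials
errors = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(-errors.mean())  # overall LOOCV error estimate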
→ Leave-P-Out Method:
- This is similar to the Leave-One-Out method; however, in this method we leave p points out of a dataset containing n data points (the note after the snippet below shows how quickly the number of such splits grows).
- The model is trained on (n-p) data points and tested on p data points.
from sklearn.model_selection import LeavePOut
import numpy as np

data = np.array([10, 20, 30, 40])  # toy dataset

lpo = LeavePOut(p=2)  # leave 2 points out of training each iteration
for train, validate in lpo.split(data):
    print("Train set: {}".format(data[train]), "Test set: {}".format(data[validate]))
Conclusion
In machine learning we often encounter the problem of over-fitting; cross-validation is one of the few techniques that help tackle it.
However, it comes at a cost: overall training time increases, because the model is fitted multiple times.
Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.
Also, do give us a Clap👏 if you find this article useful, as your encouragement catalyzes inspiration and helps us create more cool stuff like this.