Regression in RAPIDS vs Sklearn
Introduction
So far, we have explained how to set up RAPIDS to use the GPU for machine learning. In this article, we will show what RAPIDS can actually do in comparison to Sklearn by building regression models with both libraries and assessing their performance. But before diving in, let's cover a few things you need to know about the use case.
About the Use Case & Data
The use case is predicting the prices of used cars, using a dataset obtained from the UCI Machine Learning Repository. The original dataset is quite small, so to properly compare RAPIDS against Sklearn we enlarged it with the Synthetic Data Vault (SDV) library, which you can download from here. If you want to download the cleaned version of the data, click here.
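For reference, synthesizing additional rows with SDV looks roughly like the sketch below. This is not the exact script used for this article: original_df is a hypothetical name for the original UCI dataframe, and the synthesizer classes shown are those of SDV 1.x, so the exact API may differ depending on your SDV version.

# Minimal sketch of enlarging a tabular dataset with SDV (SDV >= 1.0 API assumed);
# `original_df` is a placeholder for the original UCI used-cars dataframe.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(original_df)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(original_df)

# Sample 100,000 synthetic rows, matching the size used in this article
synthetic_df = synthesizer.sample(num_rows=100000)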
Data Loading
To load the data, we will be using the pandas read_csv() method.
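Before that, here is a consolidated sketch of the imports assumed by the code snippets in the rest of this article. The sk/cu aliases match how the models are referenced later; the exact cuML module paths can vary slightly between RAPIDS versions.

import time
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Sklearn models, aliased with an "sk" prefix
from sklearn.linear_model import LinearRegression as skLinearRegression
from sklearn.linear_model import Ridge as skRidge
from sklearn.linear_model import Lasso as skLasso
from sklearn.linear_model import ElasticNet as skElasticNet
from sklearn.ensemble import RandomForestRegressor as skRandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor as skKNeighborsRegressor

# RAPIDS cuML models, aliased with a "cu" prefix
from cuml.linear_model import LinearRegression as cuLinearRegression
from cuml.linear_model import Ridge as cuRidge
from cuml.linear_model import Lasso as cuLasso
from cuml.linear_model import ElasticNet as cuElasticNet
from cuml.ensemble import RandomForestRegressor as cuRandomForestRegressor
from cuml.neighbors import KNeighborsRegressor as cuKNeighborsRegressor

# Plotly for the performance charts
import plotly.graph_objects as go
from plotly.subplots import make_subplots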
path = "https://github.com/insaid2018/articles/blob/main/data/01-used_cars.csv"
data = pd.read_csv(filepath_or_buffer=path)
print('Data Shape:', data.shape)

Output:
Data Shape: (100000, 27)
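As a side note, if you want the data to live on the GPU from the start, cuDF provides a read_csv with the same interface. This is optional, and the rest of the article keeps the dataframe in pandas; whether a remote URL can be read directly depends on your cuDF version, so you may need a local copy of the file.

# Optional alternative: read the CSV straight into GPU memory with cuDF
# (assumes the file is reachable from your RAPIDS environment)
import cudf

gpu_data = cudf.read_csv(path)
print('Data Shape:', gpu_data.shape)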
Data Cleaning
Next, we will clean the data to prepare it for model development. Please note that we have already identified the cleaning steps in advance; you may want to explore and analyze the data yourself before taking any action.
# Replacing ? with null values
data.replace("?", np.nan, inplace=True)

# Estimating median values for missing features
median_normalizedlosses = data['normalized-losses'].median()
median_bore = data['bore'].median()
median_stroke = data['stroke'].median()
median_horsepower = data['horsepower'].median()
median_peakrpm = data['peak-rpm'].median()
mode_numofdoors = data['num-of-doors'].mode()[0]

# Imputing missing values with median value of features
data['normalized-losses'] = data['normalized-losses'].replace(np.nan, median_normalizedlosses)
data['bore'] = data['bore'].replace(np.nan, median_bore)
data['stroke'] = data['stroke'].replace(np.nan, median_stroke)
data['horsepower'] = data['horsepower'].replace(np.nan, median_horsepower)
data['peak-rpm'] = data['peak-rpm'].replace(np.nan, median_peakrpm)
data['num-of-doors'] = data['num-of-doors'].replace(np.nan, mode_numofdoors)

# Dropping the entire rows of missing price value
data.dropna(subset=['price'], axis=0, inplace=True)

# Converting features having inconsistent data types to consistent ones
data['normalized-losses'] = data['normalized-losses'].astype(int)
data['bore'] = data['bore'].astype(float)
data['stroke'] = data['stroke'].astype(float)
data['horsepower'] = data['horsepower'].astype(float)
data['peak-rpm'] = data['peak-rpm'].astype(float)
data['price'] = data['price'].astype(float)

print('Success!')

Output:
Success!
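An optional sanity check after cleaning: confirm that imputation left no missing values behind and see how many rows survived the drop of missing prices.

# Optional sanity check: no missing values should remain after imputation
print('Missing values remaining:', data.isnull().sum().sum())
print('Data Shape after cleaning:', data.shape)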
Feature Encoding
In this step, we will encode all the categorical features as numeric values and then proceed with data scaling. We will use a technique known as K-Fold Target Encoding: the rows are split into K folds, and within each fold every category is replaced by the mean of the target (price) computed on the remaining folds, which captures the category-price relationship while limiting target leakage.
class KFoldTargetEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, colnames, targetName, n_fold=5, verbosity=True, discardOriginal_col=False):
        self.colnames = colnames
        self.targetName = targetName
        self.n_fold = n_fold
        self.verbosity = verbosity
        self.discardOriginal_col = discardOriginal_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert(type(self.targetName) == str)
        assert(type(self.colnames) == str)
        assert(self.colnames in X.columns)
        assert(self.targetName in X.columns)

        mean_of_target = X[self.targetName].mean()
        kf = KFold(n_splits=self.n_fold, shuffle=False, random_state=None)

        col_mean_name = 'E_' + self.colnames
        X[col_mean_name] = np.nan

        # Encode each fold's rows with target means computed on the other folds
        for tr_ind, val_ind in kf.split(X):
            X_tr, X_val = X.iloc[tr_ind], X.iloc[val_ind]
            X.loc[X.index[val_ind], col_mean_name] = X_val[self.colnames].map(
                X_tr.groupby(self.colnames)[self.targetName].mean())

        # Categories unseen in the training folds fall back to the global target mean
        X[col_mean_name].fillna(mean_of_target, inplace=True)

        if self.verbosity:
            encoded_feature = X[col_mean_name].values
            print('Correlation between [{}] and [{}] is {}.'.format(
                col_mean_name, self.targetName,
                np.corrcoef(X[self.targetName].values, encoded_feature)[0][1]))

        if self.discardOriginal_col:
            X = X.drop(self.colnames, axis=1)

        return X


columns = ['make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
           'engine-location', 'engine-type', 'num-of-cylinders', 'fuel-system']

for column in columns:
    k_object = KFoldTargetEncoder(colnames=column, targetName='price', discardOriginal_col=True)
    data = k_object.fit_transform(X=data)

encodedData = data
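To make the idea concrete, here is a small, hypothetical illustration of what the encoder produces for a single categorical column. The toy values below are made up purely for demonstration and are not part of the actual pipeline.

# Hypothetical toy example: encode a tiny categorical column against a tiny target
toy = pd.DataFrame({'fuel-type': ['gas', 'gas', 'diesel', 'diesel', 'gas', 'diesel'],
                    'price': [10000, 12000, 15000, 16000, 11000, 14000]})

toy_encoder = KFoldTargetEncoder(colnames='fuel-type', targetName='price', n_fold=2)
toy_encoded = toy_encoder.fit_transform(X=toy)

# Each row's E_fuel-type is the mean price of its category in the *other* fold
print(toy_encoded[['fuel-type', 'E_fuel-type']])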
Feature Scaling
In this step, we will scale all the features to a common range; afterwards, we will split the data into input and target form and move on to feature selection.
cols = encodedData.columns.to_list()
price_index = cols.index('price')
print('Location of Price feature in encoded dataframe is', price_index)

# Creating scalers for each feature
scalers = [MinMaxScaler() for obj in range(len(cols))]

for scaler, col in zip(scalers, cols):
    encodedData[col] = scaler.fit_transform(encodedData[[col]])

normalizedData = encodedData
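An optional check that the scaling worked: after MinMax scaling, every column should lie in the [0, 1] range. Note that a single MinMaxScaler fitted on the whole dataframe would scale each column independently as well; keeping one scaler per column simply makes it easy to invert an individual feature later.

# Optional sanity check: all scaled values should fall inside [0, 1]
print('Min value across features:', normalizedData.min().min())
print('Max value across features:', normalizedData.max().max())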
Feature Selection
Now that we have all the features on one scale, we will use an algorithmic approach to find the features that matter most for model development. In this step, we will use a random forest model, wrapped in SelectFromModel, to pick out the best subset of features.
X = normalizedData.drop(labels='price', axis=1)
y = normalizedData['price']

# Have some patience, may take some time :)
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
selector.fit(X, y)

# Extracting list of important features
selected_feat = X.columns[(selector.get_support())].tolist()
print('Total Features Selected are', len(selected_feat))

# Estimated by taking mean (default) of feature importance
print('Threshold set by Model:', np.round(selector.threshold_, decimals=2))
print('Features:', selected_feat)

Output:
Total Features Selected are 13
Threshold set by Model: 0.04
Features: ['ID', 'normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'E_make']
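If you want to see why these features made the cut, the fitted random forest inside the selector exposes its importances. This quick inspection is optional and not part of the original pipeline.

# Optional: inspect the importances SelectFromModel compared against its mean threshold
importances = pd.Series(selector.estimator_.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(len(selected_feat)))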
Data Splitting
Now, we will keep only the features that the random forest identified as most useful and split the data for model development.
# Get only essential features
X = normalizedData[selected_feat]
y = normalizedData['price']

# Split the data based on the essential features for model development
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Display the split data shapes
print('Training Data Shape:', X_train.shape, y_train.shape)
print('Testing Data Shape:', X_test.shape, y_test.shape)

Output:
Training Data Shape: (89036, 12) (89036,)
Testing Data Shape: (9893, 12) (9893,)
Model Development
In this step, we will develop machine learning models using both Sklearn and RAPIDS cuML. The cuML models run on the GPU, so please make sure a GPU is available in your environment.
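If you are not sure whether a GPU is visible to your environment, one quick way to check (numba ships with RAPIDS environments) is shown below; running nvidia-smi in a terminal works just as well.

# Optional: confirm that a GPU is visible before running the cuML models
from numba import cuda

print('GPU available:', cuda.is_available())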
Model Development using Sklearn Algorithms
sk_results = list()

models = [skLinearRegression(), skRidge(random_state=42), skLasso(random_state=42),
          skElasticNet(random_state=42), skRandomForestRegressor(random_state=42),
          skKNeighborsRegressor()]

for model in models:
    start_time = time.time()
    model.fit(X_train, y_train)
    duration = time.time() - start_time

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # RMSE on the training and testing sets
    rmse_train = np.round(np.sqrt(mean_squared_error(y_true=y_train, y_pred=y_train_pred)), decimals=2)
    rmse_test = np.round(np.sqrt(mean_squared_error(y_true=y_test, y_pred=y_test_pred)), decimals=2)

    # R-squared on the training and testing sets
    r_squared_train = np.round(model.score(X_train, y_train), decimals=2)
    r_squared_test = np.round(model.score(X_test, y_test), decimals=2)

    record = [type(model).__name__, duration, rmse_train, rmse_test, r_squared_train, r_squared_test]
    sk_results.append(record)
    print(type(model).__name__, 'Finished Training!')

Output:
LinearRegression Finished Training!
Ridge Finished Training!
Lasso Finished Training!
ElasticNet Finished Training!
RandomForestRegressor Finished Training!
KNeighborsRegressor Finished Training!
Model Development using RAPIDS Algorithms
cu_results = list()

models = [cuLinearRegression(algorithm='svd'), cuRidge(), cuLasso(), cuElasticNet(),
          cuRandomForestRegressor(random_state=42), cuKNeighborsRegressor()]

for model in models:
    start_time = time.time()
    model.fit(X_train, y_train)
    duration = time.time() - start_time

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # RMSE on the training and testing sets
    rmse_train = np.round(np.sqrt(mean_squared_error(y_true=y_train, y_pred=y_train_pred)), decimals=2)
    rmse_test = np.round(np.sqrt(mean_squared_error(y_true=y_test, y_pred=y_test_pred)), decimals=2)

    # R-squared on the training and testing sets
    r_squared_train = np.round(model.score(X_train, y_train), decimals=2)
    r_squared_test = np.round(model.score(X_test, y_test), decimals=2)

    record = [type(model).__name__, duration, rmse_train, rmse_test, r_squared_train, r_squared_test]
    cu_results.append(record)
    print(type(model).__name__, 'Finished Training!')

Output:
LinearRegression Finished Training!
Ridge Finished Training!
Lasso Finished Training!
ElasticNet Finished Training!
RandomForestRegressor Finished Training!
KNeighborsRegressor Finished Training!
Performance Analysis
Now that we have developed models using both the Sklearn and RAPIDS libraries, it's time to visualize their execution performance with Plotly. We plot the results twice, once with all the algorithms and once excluding the random forest, because the random forest's much longer training time makes the differences between the other models hard to see. But before that, we need to create data frames of the results.
columns = ['Model (Sklearn)', 'Execution Time', 'RMSE (Train)', 'RMSE (Test)', 'R² (Train)', 'R² (Test)']
sk_frame = pd.DataFrame(data=sk_results, columns=columns)

columns = ['Model (RAPIDS)', 'Execution Time', 'RMSE (Train)', 'RMSE (Test)', 'R² (Train)', 'R² (Test)']
cu_frame = pd.DataFrame(data=cu_results, columns=columns)
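Before plotting, you can optionally put the two result tables next to each other for a quick side-by-side look.

# Optional: view the CPU and GPU result tables side by side
comparison = pd.concat([sk_frame, cu_frame], axis=1)
print(comparison)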
Let’s plot the graph.
# Saving results excluding random forest
sk_exclude_forest = sk_frame[sk_frame['Model (Sklearn)'] != 'RandomForestRegressor']
cu_exclude_forest = cu_frame[cu_frame['Model (RAPIDS)'] != 'RandomForestRegressor']

fig = make_subplots(rows=2, cols=1,
                    subplot_titles=['Execution Performance (All Models)',
                                    'Execution Performance (Excluding Random Forest)'])

fig.add_trace(go.Scatter(x=sk_frame['Model (Sklearn)'], y=sk_frame['Execution Time']*1000,
                         mode='lines+markers', name='Sklearn'), row=1, col=1)
fig.add_trace(go.Scatter(x=cu_frame['Model (RAPIDS)'], y=cu_frame['Execution Time']*1000,
                         mode='lines+markers', name='RAPIDS'), row=1, col=1)
fig.add_trace(go.Scatter(x=sk_exclude_forest['Model (Sklearn)'], y=sk_exclude_forest['Execution Time']*1000,
                         mode='lines+markers', name='Sklearn'), row=2, col=1)
fig.add_trace(go.Scatter(x=cu_exclude_forest['Model (RAPIDS)'], y=cu_exclude_forest['Execution Time']*1000,
                         mode='lines+markers', name='RAPIDS'), row=2, col=1)

# Updating figure layout
fig.update_layout(height=600, width=1200)

# Updating axis labels
fig['layout']['xaxis']['title'] = 'Models'
fig['layout']['xaxis2']['title'] = 'Models'
fig['layout']['yaxis']['title'] = 'Execution Time (in milliseconds)'
fig['layout']['yaxis2']['title'] = 'Execution Time (in milliseconds)'

# Display the figure
fig.show()
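If you want to keep the chart, Plotly figures can be exported, for example as a standalone HTML file (the filename below is just an example).

# Optional: save the interactive chart to a standalone HTML file
fig.write_html('rapids_vs_sklearn_performance.html')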
And that's it. With a GPU available, the RAPIDS cuML algorithms finish training noticeably faster than their Sklearn counterparts, especially for heavier models such as the random forest, which makes RAPIDS a compelling choice for GPU-accelerated machine learning.