Data Preprocessing and EDA: End-to-End Series (Part — 2)

4 min read · Mar 12, 2021

In our previous article, we looked at ways to extract data from a database and convert it into a human-readable CSV format.

Now that we have retrieved our data from the database, our main task as data scientists begins.

Data cleaning is one of the most important tasks in the field. As the saying goes, “Garbage In, Garbage Out”: our model is only as good as our data.
Data cleaning and preprocessing involve:
→ Removing any unwanted information that may hamper our model’s performance.
→ Creating new features that let us reduce complexity for our model without losing information.
Data preprocessing is a very important first step for anyone dealing with datasets.
It helps us convert raw, noisy data into cleaner, more sensible datasets, an absolute necessity for any business attempting to get meaningful information from the data it collects.
EDA (Exploratory Data Analysis) is meant to uncover the underlying structure of a dataset. It matters to an organization because it reveals patterns, trends, and relationships that are not immediately obvious.

Also, check out our other articles in this series:

Data Abstraction: End-to-End Series (Part — 1)
Model Building and Experimentation: End-to-End Series (Part — 3)
Creating a WebApp using Flask+Gunicorn on Ubuntu Server: End-to-End Series (Part — 4)
Containerizing the WebApp using Docker: End-to-End Series (Part — 5)
Scaling our Docker Container using Kubernetes: End-to-End Series (Part — 6)
Automating building and deployment using Jenkins: End-to-End Series (Part — 7)

→ Loading Churn data

import pandas as pd
# Load the churn data from the CSV file
data = pd.read_csv("Churn_data.csv")
data.head()

→ Data Description

→ Getting Data information

data.info()

Observations:

The data has 7043 samples (rows) and 9 columns.
There are 3 columns with a numeric data type and 6 columns with an object datatype.
There are 0 missing values in the data.
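
The observations above can also be checked programmatically rather than read off the `data.info()` output. The sketch below is a minimal example on a made-up toy frame: `summarize_structure` and the `PaymentMethod` column are hypothetical and used for illustration only. It also counts blank strings, because `isna()` only counts true NaN values and would not flag an empty or whitespace-only entry in a text column.

```python
import pandas as pd

def summarize_structure(df: pd.DataFrame) -> dict:
    """Return row/column counts, dtype counts, missing values and blank strings."""
    numeric = df.select_dtypes(include="number")
    non_numeric = df.select_dtypes(exclude="number")
    # Count entries that are empty or whitespace-only in non-numeric columns.
    blank = sum(
        int(non_numeric[c].astype(str).str.strip().eq("").sum())
        for c in non_numeric.columns
    )
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "numeric_columns": numeric.shape[1],
        "non_numeric_columns": non_numeric.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "blank_strings": blank,
    }

# Toy frame standing in for the churn data (values are made up).
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "PaymentMethod": ["Mailed check", " ", "Electronic check"],
    "Churn": ["No", "No", "Yes"],
})
summary = summarize_structure(toy)
print(summary)
```

On the real data, we would call `summarize_structure(data)` and compare the result with the counts listed above.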

→ Checking data Statistics

data.describe()

Observations:

The distribution of MonthlyCharges looks roughly symmetric: the mean (about 65) and the median (about 70) are close. A small gap between mean and median suggests limited skew but does not by itself establish normality, so a distribution plot is still worth checking.
The max value of MonthlyCharges is 118.
No outliers are present.
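
Claims about skew and outliers are easier to trust when backed by a number. The sketch below is one possible check, not the method used in this article: it computes the sample skewness and flags outliers with the common 1.5 × IQR rule. The helper `iqr_outliers`, the factor `k`, and the sample values are assumptions made for illustration.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Made-up sample standing in for the MonthlyCharges column.
charges = pd.Series([20.0, 45.0, 65.0, 70.0, 80.0, 95.0, 118.0], name="MonthlyCharges")

print("mean:", round(charges.mean(), 2), "median:", charges.median())
print("skewness:", round(charges.skew(), 2))  # values near 0 suggest little skew
print("outliers:", iqr_outliers(charges).tolist())
```

On the real data, `iqr_outliers(data["MonthlyCharges"])` returning an empty Series would support the "no outliers" observation under this particular rule.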

→ Checking for unique values in every column

print(data.nunique())
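
Unique-value counts are a quick way to tell category-like columns apart from continuous ones. As a rough sketch, the hypothetical helper below lists the columns whose number of distinct values is at or below a threshold. The threshold `max_unique` and the toy values are assumptions and should be adjusted to the data.

```python
import pandas as pd

def low_cardinality_columns(df: pd.DataFrame, max_unique: int = 5) -> list:
    """Return the names of columns with at most `max_unique` distinct values."""
    counts = df.nunique()
    return counts[counts <= max_unique].index.tolist()

# Toy frame with made-up values.
toy = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48],
    "Contract": ["Month-to-month", "One year", "Two year",
                 "Month-to-month", "One year", "Two year"],
    "Churn": ["No", "Yes", "No", "No", "Yes", "No"],
})
categorical_like = low_cardinality_columns(toy, max_unique=3)
print(categorical_like)
```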

→ Handling zero values in the tenure column

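import seaborn as sns

# Inspect the rows where tenure is 0
data[data["tenure"] == 0]

# Plot the tenure distribution (histplot replaces the deprecated distplot)
sns.histplot(data["tenure"], kde=True, color="red")

# Replace zero tenure values with the column mean
data["tenure"] = data["tenure"].replace(0, data["tenure"].mean())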
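
One detail worth knowing about the replacement above: `data["tenure"].mean()` is computed over all rows, including the zero rows themselves, which pulls the fill value slightly down. The sketch below shows a possible variant that fills zeros with the mean (or median) of the non-zero values only. The helper `replace_zeros` and the sample values are hypothetical, and whether this variant is preferable depends on the data.

```python
import pandas as pd

def replace_zeros(series: pd.Series, strategy: str = "mean") -> pd.Series:
    """Replace zeros with the mean or median of the non-zero values."""
    nonzero = series[series != 0]
    fill = nonzero.mean() if strategy == "mean" else nonzero.median()
    # Keep non-zero values as they are and substitute `fill` elsewhere.
    return series.astype(float).where(series != 0, fill)

# Made-up tenure values with two zeros.
tenure = pd.Series([0, 2, 4, 6, 0, 8], name="tenure")

filled_mean = replace_zeros(tenure)
filled_median = replace_zeros(tenure, strategy="median")
print(filled_mean.tolist())
print("mean including zeros:", round(tenure.mean(), 2))
```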

→ Encoding Categorical columns

from sklearn.preprocessing import LabelEncoder

cols = data.columns
label_encoded_train = data.copy()
# Label encoding: every column (numeric ones included) is mapped to integer codes
le = LabelEncoder()
for i in cols:
    label_encoded_train[i] = le.fit_transform(label_encoded_train[i])
data_train = label_encoded_train
data_train.head()
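
The loop above applies `LabelEncoder` to every column, so numeric columns are also turned into integer codes. A possible alternative, sketched below under our own assumptions, is to encode only the non-numeric columns and keep one fitted encoder per column so the codes can be mapped back to labels later. The helper `encode_categoricals` and the toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(df: pd.DataFrame):
    """Label-encode non-numeric columns; return the encoded frame and the encoders."""
    encoded = df.copy()
    encoders = {}
    for col in encoded.select_dtypes(exclude="number").columns:
        le = LabelEncoder()
        encoded[col] = le.fit_transform(encoded[col].astype(str))
        encoders[col] = le  # kept so the codes can be inverted later
    return encoded, encoders

# Toy frame with made-up values.
toy = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes"],
})
encoded, encoders = encode_categoricals(toy)
print(encoded)
print(list(encoders["Contract"].inverse_transform([0, 1])))
```

Keeping the encoders around is useful when predictions need to be reported with the original labels instead of integer codes.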

→ Plotting the distribution of Churn column

import matplotlib.pyplot as plt

# Plotting our dependent variable, the y column
data_train['Churn'].value_counts().plot(kind='pie', legend=True, explode=[0, 0.09], autopct="%3.1f%%",
                                        shadow=True, figsize=(8, 8), fontsize=14)
plt.title('Proportional distribution of Churn', fontsize=16)
plt.ylabel('Churn', fontsize=14)

→ Checking for correlation between columns

corr_mat = label_encoded_train.corr().round(2)
plt.figure(figsize=(12, 9))
sns.heatmap(corr_mat, annot=True, cmap='viridis')
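
A heatmap is handy for a quick look, but with many columns it can be easier to list the strongest pairs directly. The sketch below is one way to do that: it takes the absolute correlation matrix, keeps only the upper triangle so each pair appears once, and sorts the pairs. The helper `top_correlations` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the highest absolute Pearson correlation."""
    corr = df.corr().abs()
    # Mask out the diagonal and the lower triangle so each pair is listed once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy numeric frame with made-up values.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5],
    "MonthlyCharges": [50, 30, 40, 10, 20],
    "TotalCharges": [20, 40, 60, 80, 100],
})
top = top_correlations(toy, n=2)
print(top)
```

On the encoded data, `top_correlations(label_encoded_train)` would give the same information as the heatmap in a sorted list.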

→ Checking the relationship between columns tenure, MonthlyCharges and TotalCharges

label_encoded_train['product_of_tenure_MonthCharges'] = (
    label_encoded_train['tenure'] * label_encoded_train['MonthlyCharges']
)
label_encoded_train[['product_of_tenure_MonthCharges', 'TotalCharges']].corr()

Observations:

Our created feature is highly correlated with ‘TotalCharges’.
Dropping TotalCharges reduces multicollinearity in the data with little loss of information, since tenure and MonthlyCharges together carry nearly the same signal.
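
The reasoning here is that TotalCharges is approximately tenure × MonthlyCharges, so keeping all three columns adds little new information. The sketch below illustrates the check on a handful of made-up rows. The helper `redundancy_check` and the numbers are hypothetical and are not taken from the churn data.

```python
import pandas as pd

def redundancy_check(df: pd.DataFrame) -> float:
    """Pearson correlation between tenure * MonthlyCharges and TotalCharges."""
    product = df["tenure"] * df["MonthlyCharges"]
    return float(product.corr(df["TotalCharges"]))

# Made-up rows where TotalCharges is close to tenure * MonthlyCharges.
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 36, 60],
    "MonthlyCharges": [30.0, 50.0, 70.0, 90.0, 100.0],
    "TotalCharges": [30.5, 590.0, 1700.0, 3200.0, 6050.0],
})
r = redundancy_check(toy)
print(round(r, 4))
```

A correlation close to 1 suggests that one of the columns can be dropped with little loss, which is the decision taken in the next step.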

→ Dropping unwanted columns

label_encoded_train.drop(
    ['product_of_tenure_MonthCharges', 'TotalCharges'],
    axis=1, inplace=True
)

→ Saving the cleaned data

label_encoded_train.to_csv("clean_data.csv")
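
By default, `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a small made-up frame written to a temporary directory. Passing `index=False` is a suggestion on our part, not something the rest of this series necessarily relies on.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for the cleaned data.
df = pd.DataFrame({"tenure": [1.0, 34.0], "Churn": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean_data.csv")

    df.to_csv(path)                 # default: the index is written as a column
    with_index = pd.read_csv(path)

    df.to_csv(path, index=False)    # the index is left out of the file
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```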

Follow us for more upcoming articles on Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a Clap👏 if you find this article useful. Your encouragement inspires us and helps us create more cool stuff like this.

Visit us on https://www.insaid.co/

Written by Accredian Publication
