Machine Learning Pipeline
The initial process in any machine learning implementation is to explore the data.
The purpose is to understand the data, interpret its hidden information, and visualize and engineer the features to be used by the machine learning model.
A few things to consider:
– What questions do you want to answer or prove true or false?
– What kind of data do you have: numeric, categorical, text, image? How are you going to treat each of them?
– Do you have any missing values, wrong formats, etc.?
– How is the data spread? Do you have any outliers? How are you going to deal with them?
– Which features are important? Can we add or remove features to get more from the data?
Data Wrangling
– Understand the data
– Get basic summary statistics
– Handle missing values
– Handle outliers
– Typecast and transform columns (a pandas sketch of these steps follows)
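A minimal pandas sketch of these wrangling steps, on a made-up dataframe (column names and values are illustrative assumptions):

    import numpy as np
    import pandas as pd

    # Made-up example data with a missing value and an obvious outlier
    df = pd.DataFrame({
        "age": [25, 32, np.nan, 47, 350],
        "signup_date": ["2021-01-05", "2021-02-17", "2021-03-02", None, "2021-04-11"],
    })

    # Understand the data and get basic summary statistics
    df.info()
    print(df.describe())

    # Handle missing values: impute the numeric column, drop rows missing the date
    df["age"] = df["age"].fillna(df["age"].median())
    df = df.dropna(subset=["signup_date"])

    # Handle outliers: clip 'age' to an assumed plausible range
    df["age"] = df["age"].clip(lower=0, upper=100)

    # Typecasting and transformation: parse the date strings into datetimes
    df["signup_date"] = pd.to_datetime(df["signup_date"])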
Data Visualization
– Univariate Analysis: histogram and distribution plots (distplot, boxplot, violin)
– Multivariate Analysis: scatter plot, pair plot, etc. (see the sketch below)
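A short seaborn sketch of these plots, assuming a dataframe df with numeric columns 'age' and 'income' (names and values are illustrative). Note that seaborn's distplot is deprecated in recent versions; histplot is the current equivalent:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"age": [23, 35, 45, 29, 52], "income": [40, 55, 72, 48, 90]})

    # Univariate analysis
    sns.histplot(df["age"])              # histogram / distribution
    plt.show()
    sns.boxplot(x=df["age"])             # boxplot
    plt.show()
    sns.violinplot(x=df["age"])          # violin plot
    plt.show()

    # Multivariate analysis
    sns.scatterplot(x="age", y="income", data=df)   # scatter plot
    plt.show()
    sns.pairplot(df)                                # pair plot of numeric columns
    plt.show()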
Feature Engineering
Why Feature Engineering?
– Better representation of data
– Better performing models
– Essential for model building and evaluation
– More flexibility on data types
– Emphasis on the business and domain
Types of data for feature engineering range across numerical, categorical, text, temporal, and image data
Numerical data
– Can be used in raw form
– Rounding
– Counts of numeric data, e.g., how many times each user listened to a song
– Binarization, e.g., instead of a frequency we can use a '0' or '1' value to state whether a song has been listened to by a user (for a recommender system)
– Binning, e.g., categorize users into age groups
– Interaction or combination, for example by using polynomial features
– Transformation, e.g., log transform, polynomial transform (see the sketch after this list)
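A sketch of these numeric transformations with pandas and scikit-learn, using a made-up 'listen_count' column (all names and values are illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import PolynomialFeatures

    df = pd.DataFrame({"listen_count": [0, 3, 1, 42, 7],
                       "age": [16, 24, 31, 45, 60]})

    # Binarization: 1 if the user listened to the song at all, 0 otherwise
    df["listened"] = (df["listen_count"] > 0).astype(int)

    # Binning: categorize users into age groups
    df["age_group"] = pd.cut(df["age"], bins=[0, 18, 30, 50, 100],
                             labels=["teen", "young", "adult", "senior"])

    # Log transform: compresses the long tail of counts (log1p handles zeros)
    df["log_listens"] = np.log1p(df["listen_count"])

    # Interaction/polynomial features combining the raw numeric columns
    poly = PolynomialFeatures(degree=2, include_bias=False)
    poly_feats = poly.fit_transform(df[["listen_count", "age"]])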
Categorical data
Transform the data into a nominal feature, e.g., for a movie genre you can have {0: 'action', 1: 'thriller', 2: 'drama', 3: 'horror', 4: 'comedy', 5: 'family', 6: 'other'}
Transform into an ordinal value, e.g., similar to the above, but there is an order in which the category or genre is introduced in the data
Encoding:
– Use dummy encoding
– Transform a categorical feature of m distinct labels into m-1 binary features
Consider a dataframe with a country column whose values include USA, Canada, and others.
After the dummy encoding scheme, each label becomes a binary indicator column (is_USA, ..., is_Canada).
If we drop the first (is_USA) or the last (is_Canada), it does not destroy the dataset: the dropped label is implied whenever all remaining indicator columns are 0, which is exactly why m labels need only m-1 binary features.
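A minimal sketch of dummy encoding with pandas (the 'country' column and its values are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"country": ["USA", "Canada", "Mexico", "USA"]})

    # One-hot encoding: one binary column per distinct label
    print(pd.get_dummies(df["country"], prefix="is"))

    # Dummy encoding: m labels -> m-1 binary columns; the dropped label
    # is implied whenever all remaining columns are 0
    print(pd.get_dummies(df["country"], prefix="is", drop_first=True))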
Feature Scaling
Using the features' raw values might make models biased towards features with very high magnitudes
– Outliers will skew the algorithm
– Affects machine learning algorithms that use the magnitude of features, e.g., regression
Scikit-learn's preprocessing module provides three commonly used feature scalers: StandardScaler, MinMaxScaler, and RobustScaler
StandardScaler (aka Z-score scaling) removes the mean and scales the variance to 1
MinMaxScaler scales feature values into the range [0, 1] using the feature's minimum and maximum values
– Be careful with outliers when using MinMaxScaler: a single extreme value compresses all other values into a narrow band
RobustScaler uses statistical measures such as the median and percentiles to scale the data
– IQR, the Inter-Quartile Range, is the difference between the 75th percentile and the 25th percentile
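A sketch comparing the three scalers on a toy feature with an outlier:

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # note the outlier

    # Z-score scaling: zero mean, unit variance
    print(StandardScaler().fit_transform(X).ravel())

    # Min-max scaling to [0, 1]; the outlier squashes the other values
    print(MinMaxScaler().fit_transform(X).ravel())

    # Robust scaling: centers on the median, scales by the IQR,
    # so the outlier has far less influence
    print(RobustScaler().fit_transform(X).ravel())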
Filter Methods
– Based on metrics like correlation or the features' own values, and do not depend on results from any model
– Popular approaches are threshold methods and statistical tests such as ANOVA and chi-square
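A sketch of a statistical filter with scikit-learn's SelectKBest, scoring features with the ANOVA F-test and with chi-square (which requires non-negative features):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2, f_classif

    X, y = load_iris(return_X_y=True)

    # Keep the 2 features with the highest ANOVA F-scores
    X_anova = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

    # Chi-square scoring works here because iris measurements are non-negative
    X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

    print(X_anova.shape, X_chi2.shape)   # (150, 2) (150, 2)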
Wrapper Methods
– Use a recursive approach to build multiple models on feature subsets in order to select the best subset. RFE (Recursive Feature Elimination) from sklearn.feature_selection is one such example (see the sketch below)
– Utilize a regression or classification model together with cross-validation to evaluate each subset
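A sketch of RFE wrapping a logistic regression model:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # Recursively drop the weakest feature until only 2 remain
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
    rfe.fit(X, y)

    print(rfe.support_)   # boolean mask of the selected features
    print(rfe.ranking_)   # rank 1 = selected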
Embedded Methods
– Use machine learning algorithms like random forests, decision trees and ensemble methods to rank and score features
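A sketch of ranking features with a random forest's impurity-based importances:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(data.data, data.target)

    # Higher score = the feature contributed more to impurity reduction
    for name, score in zip(data.feature_names, forest.feature_importances_):
        print(f"{name}: {score:.3f}")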
Threshold Methods
– You can analyze the features' variance
– Features with very low variance, i.e., mostly constant across all observations, can be removed (see the sketch below)
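A sketch of the variance-threshold filter:

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    # The second column is nearly constant across all observations
    X = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.1]])

    # Remove features whose variance falls below the threshold
    X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X)
    print(X_reduced.shape)   # (4, 1): the near-constant column is dropped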
Dimensionality Reduction
Dealing with a lot of features can lead to issues like model overfitting and overly complex models, all of which roll up into what is called the curse of dimensionality
Dimensionality reduction is the process of reducing the total number of features in our feature set using strategies like feature selection or feature extraction.
A very popular technique for dimensionality reduction is PCA (Principal Component Analysis)
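A sketch of PCA reducing the four iris features to two components:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    # Project the 4 original features onto 2 principal components
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)

    print(X_2d.shape)                      # (150, 2)
    print(pca.explained_variance_ratio_)   # variance captured per component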
REGRESSION
We build regression models to explain and to predict phenomena
Regression Model
– Continuous (Multiple Regression)
• Linear and non-linear
– Discrete (Logistic Regression)
• Binary and multinomial
Regression analysis attempts to explain the influence that input (independent) variables have on the outcome (dependent) variable
Linear Multiple Regression
Models the relationship between some input variables and a continuous outcome variable
– Assumption is that the relationship is linear
– Transformations can be used to achieve a linear relationship; remember the transformations we applied to the dataframe (a sketch follows)
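A minimal sketch of this idea on made-up data: the outcome depends on the log of one input, so log-transforming that input makes the relationship linear:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Made-up data: y depends linearly on x1 and on log(x2)
    x1 = rng.uniform(0, 10, 100)
    x2 = rng.uniform(1, 100, 100)
    y = 3.0 * x1 + 5.0 * np.log(x2) + rng.normal(0, 0.5, 100)

    # Transform x2 so the relationship with y becomes linear
    X = np.column_stack([x1, np.log(x2)])

    model = LinearRegression().fit(X, y)
    print(model.coef_)        # close to [3.0, 5.0]
    print(model.intercept_)   # close to 0.0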
Multiple Linear Regression Use Case
Real estate example
– Predict residential home prices
• Possible inputs – living area, #bathrooms, #bedrooms, lot size, property taxes
Demand forecasting example
– Restaurant predicts quantity of food needed
• Possible inputs – weather, day of week, etc.
Medical example
– Analyze effect of proposed radiation treatment
• Possible inputs – radiation treatment duration, frequency