Normalizing Numeric Data For a Column
Sometimes we have a dataset in which the values are too far from each other. As an example, let's examine the following dataset:
Age Income Year-of-Education Purchase-Amount
#1 Customer 35 120,000 20 2000
#2 Customer 25 60,000 16 4000
#3 Customer 75 20,000 18 200
We can see that the values of "income" and "purchase-amount" are too large. This would be problematic for some data mining techniques. Because of this, sometimes we need to normalize the dataset. To do this, we transform all the relevant values into numbers that fall between 0 and 1, bringing them into the required range.
Take, for example, the "age" column, which has values of 35, 25, and 75. To bring these values into the needed range, we do the following:
First, we find the maximal, (which is 75)
Second, we find the minimal (which is 25)
We then transform the original values, using the following formulae:
Plugging our starting values into the formulae, we generate the following numbers:
Next, let's do the same for our "income" values:
First, we find the maximal (which is 120,000)
Second, we find the minimal (which is 20,000)
Therefore, our formulae become:
The standardized process for transforming a series of numeric data into values within range [0, 1] is listed below:
Using this process, we can similarly transform our remaining two columns.
Finally this:
Becomes:
Categorical to Numerical Conversion
Categorical Data Conversion
Customer ID Age Income Product
1 35 47,000 Computer
2 25 31,000 Computer
3 55 108,000 Computer
4 63 136,000 Computer
5 55 82,000 Computer
6 34 43,000 Printer
7 49 144,000 Printer
8 65 66,000 Printer
...
...
If we are given a table of data like the one above, we see that the "product" variable has five kinds of value: Computer, Printer, Paper, DVD, and Hard Drive. We know that the "product" variable is a categorical variable.
We can convert a categorical variable into numerical variables in the following manner: 1. We separate all the different values into four columns
...
...
Note: Pay attention to the column "product 1" which has both "computer" and "printer." The other columns have just one item. 2. We change the table into a numerical one.
...
...
So a categorical variable can be converted to multiple numerical variables.
Note: if a categorical variable takes only two values, then after conversion, it is still one variable (column).
If a categorical variable takes more than two values, then after conversion, we have more variables (columns).
Data Transformation
Given Dataset
We can transform variables to get more variables or create new variables by squaring a variable, or multiplying two variables, or transforming a variable using mathematical functions such as log, square root.
Example: the above table can be transformed to the new dataset which contains 3 new columns: "Square of Age" , " Multiplication of Age and Year-of_education" and " logarithm of income "
(Note: for the first row, 1225=35*35, 700=35*20, 4.414973348=log(26000))
Often we receive a dataset with missing values.
For example the above data have 4 missing values:
row 7, age is missing
row 12, favorite is missing
row 16, age is missing
row 17, favorite is missing
There are 2 ways to deal with missing values
Delete those rows with missing values
Impute missing values
Imputation for missing values
For numerical values, we can use the average of the variable (column) to impute missing values. For the variable Age, we have 18 values, the average of these 18 values is 46, so we impute the row 7 and 16 missing values with 46.
For categorical values, we can use "Mode method". We use the most frequent value to impute missing values. For the variable Favorite, we have 18 values, 11 of them are "YES", 7 of them are "NO". The most frequent value is "YES", so we impute the row 12 and 17 missing values with "YES".
After the imputation the dataset is now:
Customer ID Age Favorite
1 35 YES
2 25 YES
...
...
Dealing with Unbalanced Data
Given a dataset as
Customer ID Age Income Product
1 35 47,000 Computer
2 25 31,000 Computer
3 55 108,000 Computer
4 63 136,000 Computer
5 55 82,000 Computer
6 54 56,000 DVD
7 29 78,000 DVD
We notice that the variable "Product" has unbalanced values. There are 5 values of "Computer", only 2 values of "DVD".
We can make a less unbalanced dataset by repeating some rows.
Customer ID Age Income Product
1 35 47,000 Computer
2 25 31,000 Computer
3 55 108,000 Computer
4 63 136,000 Computer
5 55 82,000 Computer
6 54 56,000 DVD
7 29 78,000 DVD
7 54 56,000 DVD
8 29 78,000 DVD
For example we repeat the last two rows to make the dataset less unbalanced.
Here you can also get help in data preprocessing related help. If you face any issues in data cleaning and data analysis related help then don't worry. Here you get top rated project related help with an reasonable price.
To get any help you can contact us or send your requirement details at:
realcode4you@gmail.com
Comments