Data wrangling is the process of converting data from its initial format into one that is easier to read and better suited for analysis.
We use the following data set:
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Import pandas
Open Jupyter Notebook, or any online Jupyter notebook editor, and import pandas:
import pandas as pd
import matplotlib.pyplot as plt
Read the data and add headers
filename = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base",
           "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
Read CSV
df = pd.read_csv(filename, names = headers)
Show the first five rows in tabular form
df.head()
The data is displayed as a table, and a few problems become visible. The tasks ahead are:
Identify missing data
Deal with missing data
Correct the data format
Identify and handle missing values
Identify missing values
Convert "?" to NaN
In this data set, missing values are marked with a question mark "?". We replace "?" with NaN (Not a Number), the marker pandas uses for missing data.
Example:
import numpy as np
# replace "?" with NaN
df.replace("?", np.nan, inplace = True)
df.head(5)
The replacement applies to the whole DataFrame; df.head(5) displays the first five rows, where every "?" now appears as NaN.
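As an alternative to replacing "?" after loading, pd.read_csv accepts an na_values argument that marks the given strings as missing while the file is parsed. The following is a minimal sketch, assuming a tiny inline CSV instead of the real file; the three rows and the reduced column list (make, horsepower, price) are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same style as the autos data.
raw = io.StringIO(
    "audi,102,13950\n"
    "bmw,?,16430\n"
    "isuzu,78,?\n"
)

# na_values="?" tells pandas to treat "?" as missing while parsing.
sample = pd.read_csv(raw, names=["make", "horsepower", "price"], na_values="?")

print(sample)
print(sample.dtypes)
```

A side effect of this approach is that columns holding only numbers and "?" are parsed as numeric right away, so a later type conversion may not be needed for them.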
How to detect missing data:
Two methods are used to detect missing data.
.isnull() - Returns True where a value is missing and False everywhere else.
.notnull() - Returns True where a value is present and False where it is missing.
Example:
mis_value = df.isnull()
mis_value.head(5)
Count missing values in each column
Using a for loop:
Example:
In the output of this loop, True counts the missing values in a column and False counts the values that are present.
for column in mis_value.columns.values.tolist():
    print(column)
    print(mis_value[column].value_counts())
    print("")
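The same counts can be obtained without a loop, because True is treated as 1 when a boolean table is summed. This is a small sketch on a made-up DataFrame (the values are hypothetical, not taken from the data set), and it also shows .notnull() for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few missing entries.
sample = pd.DataFrame({
    "make": ["audi", "bmw", "isuzu", "mazda"],
    "horsepower": [102, np.nan, 78, np.nan],
    "price": [13950, 16430, np.nan, 10245],
})

# isnull() gives True/False per cell; True counts as 1 when summed.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# notnull() is the complement: it counts the values that are present.
present_per_column = sample.notnull().sum()
print(present_per_column)
```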
How to deal with missing data
Drop data
Drop the whole row - Suppose a value we cannot do without, such as price, is missing in a row; then we remove the whole row.
Drop the whole column - Suppose a column is missing so many values that it cannot be used; then we remove the whole column.
Replace data
Replace it with the mean
Replace it with the most frequent value - For example, if 84% of the values in a column are "good", fill the missing entries with "good", because that is the most likely value.
Replace it based on other functions
Calculate the average of a column
Example
avg = df["column_name"].astype("float").mean(axis=0)
print("Average of column_name:", avg)
Replace NaN with the mean value of the column
Example
# assign the result back; inplace=True on a selected column may not update df in recent pandas versions
df["column_name"] = df["column_name"].replace(np.nan, avg)
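To see the two steps working together, here is a minimal sketch on a made-up column. The horsepower values are hypothetical, and fillna is used in place of replace(np.nan, ...); for this purpose the two give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical column that still holds strings after "?" was replaced by NaN.
sample = pd.DataFrame({"horsepower": ["100", np.nan, "80", "120"]})

# Convert to float first, then average; mean() skips NaN by default.
avg = sample["horsepower"].astype("float").mean()

# Assign the result back instead of relying on inplace=True on a column.
sample["horsepower"] = sample["horsepower"].astype("float").fillna(avg)
print(avg)
print(sample)
```

Because mean() ignores NaN, the average is computed from the three known values only, and the gap is then filled with that average.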
How to count the values in a column
Use the value_counts() function
Example:
df['column_name'].value_counts()
The output looks like this. Suppose column_name is qualification; each qualification is listed with its count.
mca 78
bca 45
Find the most common value automatically
df['column_name'].value_counts().idxmax()
Output:
'mca'
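Replace NaN with the most frequent value
Example
# "four" stands for the most frequent value found with idxmax()
df["column_name"] = df["column_name"].replace(np.nan, "four")
Every NaN in the column is replaced by the most frequent value, "four".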
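Rather than typing the most frequent value by hand, the result of idxmax() can be passed straight to the replacement step. This is a sketch on a made-up num-of-doors column; the values are hypothetical and fillna stands in for replace(np.nan, ...).

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
sample = pd.DataFrame(
    {"num-of-doors": ["four", "two", "four", np.nan, "four", np.nan]}
)

# value_counts() ignores NaN; idxmax() returns the label with the highest count.
most_frequent = sample["num-of-doors"].value_counts().idxmax()

# Fill the gaps with that label and assign the column back.
sample["num-of-doors"] = sample["num-of-doors"].fillna(most_frequent)
print(most_frequent)
print(sample["num-of-doors"].value_counts())
```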
Drop every row with NaN in the "column_name" column
Suppose column_name is "price"
df.dropna(subset=["price"], axis=0, inplace=True)
# reset the index, because rows were dropped
df.reset_index(drop=True, inplace=True)
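The reason for resetting the index is easier to see on a small example. In this sketch (made-up rows, hypothetical values), dropping a row leaves a gap in the row labels, and reset_index renumbers them from 0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample where one row has no price.
sample = pd.DataFrame({
    "make": ["audi", "bmw", "isuzu"],
    "price": [13950.0, np.nan, 6785.0],
})

# Drop rows whose "price" is NaN; other columns are not checked.
cleaned = sample.dropna(subset=["price"], axis=0)
print(cleaned.index.tolist())  # the label of the dropped row is gone

# Renumber the rows from 0 and discard the old index.
cleaned = cleaned.reset_index(drop=True)
print(cleaned.index.tolist())
```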
Correct data format
In pandas, we use
.dtypes to check the data types (it is an attribute, so it takes no parentheses)
.astype() to change the data type
Show the list of data types:
Use this syntax to list the data type of each column:
df.dtypes
How to convert columns to the proper data type
Columns often need different data types, so convert each group of columns to the type that fits it:
Syntax
df[["column1", "column2"]] = df[["column1", "column2"]].astype("float")
df[["column3"]] = df[["column3"]].astype("int")
df[["column4"]] = df[["column4"]].astype("float")
df[["column5"]] = df[["column5"]].astype("float")
Check the result again with the following:
It lists the data type of each column so that you can verify whether the types have changed.
Syntax:
df.dtypes
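The check-convert-check cycle can be tried on a small made-up table. The column names bore and peak-rpm come from the header list, but the values here are hypothetical.

```python
import pandas as pd

# Hypothetical columns that were read as text.
sample = pd.DataFrame({
    "bore": ["3.47", "2.68"],
    "peak-rpm": ["5000", "5500"],
})
print(sample.dtypes)  # both columns hold text at this point

# Convert each column to the numeric type that matches its meaning.
sample["bore"] = sample["bore"].astype("float")
sample["peak-rpm"] = sample["peak-rpm"].astype("int")
print(sample.dtypes)
```

Note that astype("int") raises an error if the column still contains NaN, so missing values should be filled or dropped before converting to an integer type.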
Data Standardization
What is Standardization?
Standardization is the process of transforming data into a common format so that meaningful comparisons can be made.
Example
Transform mpg to L/100km
The formula for unit conversion is
L/100km = 235 / mpg
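The constant 235 is a rounded conversion factor. The sketch below shows where it comes from, assuming that mpg means miles per US gallon (an assumption; an imperial gallon would give a different factor). The helper name mpg_to_l_per_100km is made up for illustration.

```python
# Exact definitions of the two units involved.
KM_PER_MILE = 1.609344              # international mile in kilometres
LITRES_PER_US_GALLON = 3.785411784  # US liquid gallon in litres

# L/100km = 100 * litres_per_gallon / (km_per_mile * mpg) = factor / mpg
factor = 100 * LITRES_PER_US_GALLON / KM_PER_MILE
print(round(factor, 2))


def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to litres per 100 km with the rounded factor."""
    return 235 / mpg


print(mpg_to_l_per_100km(23.5))
```

The relationship is inverse: a higher mpg value means a lower L/100km value.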
First, look at the data to check the current values:
Syntax:
df.head()
Example:
Convert mpg to L/100km with a mathematical operation
df['city-L/100km'] = 235/df["city-mpg"]
This adds a new column, city-L/100km, computed from city-mpg; the city-mpg column itself is unchanged.
# check your transformed data
df.head()
Data Normalization
Why normalization?
Normalization is the process of transforming values of several variables into a similar range.
Example:
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
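Dividing by the maximum is one of several common ways to normalize. The sketch below compares it with two other widely used methods, min-max scaling and the z-score, which are not part of the steps above; the length values are made up.

```python
import pandas as pd

# Hypothetical lengths; any numeric column works the same way.
sample = pd.DataFrame({"length": [150.0, 175.0, 200.0]})

# Simple feature scaling (the method used above): value / max.
sample["length_scaled"] = sample["length"] / sample["length"].max()

# Min-max scaling: (value - min) / (max - min), always spans 0 to 1.
span = sample["length"].max() - sample["length"].min()
sample["length_minmax"] = (sample["length"] - sample["length"].min()) / span

# Z-score: (value - mean) / standard deviation, centred on 0.
sample["length_zscore"] = (
    sample["length"] - sample["length"].mean()
) / sample["length"].std()

print(sample)
```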
Binning
Why binning?
Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.
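One possible way to do this in pandas is pd.cut. The following is a minimal sketch, assuming three equal-width bins; the horsepower values and the labels Low, Medium, and High are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values.
sample = pd.DataFrame({"horsepower": [48.0, 70.0, 111.0, 154.0, 262.0]})

# Four equally spaced edges define three equal-width bins.
edges = np.linspace(sample["horsepower"].min(), sample["horsepower"].max(), 4)
labels = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin.
sample["horsepower-binned"] = pd.cut(
    sample["horsepower"], bins=edges, labels=labels, include_lowest=True
)
print(sample)
print(sample["horsepower-binned"].value_counts())
```

The number of bins and their edges are a design choice; equal-width bins are the simplest option, but they can leave some bins nearly empty when the data is skewed.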
Indicator variable (or dummy variable)
What is an indicator variable?
An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning.
Why do we use indicator variables?
So that categorical variables can be used in regression analysis, which works on numbers rather than text labels.
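In pandas, indicator variables can be created with pd.get_dummies. This is a small sketch on a made-up fuel-type column (the column name is from the header list, the three values are hypothetical).

```python
import pandas as pd

# Hypothetical sample of the fuel-type column.
sample = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})

# One 0/1 column per category; dtype=int avoids True/False output.
dummies = pd.get_dummies(sample["fuel-type"], dtype=int)

# Attach the indicator columns and drop the original text column.
sample = pd.concat([sample, dummies], axis=1).drop(columns="fuel-type")
print(sample)
```

Each row has a 1 in exactly one of the new columns, marking which category it belonged to.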