Basic Terminology
What is EDA?
Data-point/vector/Observation
Data-set.
Feature/Variable/Input-variable/Dependent-varibale
Label/Indepdendent-variable/Output-varible/Class/Class-label/Response label
Vector: 2-D, 3-D, 4-D,.... n-D
Q. What is a 1-D vector: Scalar
Iris Flower dataset Toy Dataset: Iris Dataset: [https://en.wikipedia.org/wiki/Iris_flower_data_set]
A simple dataset to learn the basics.
3 flowers of Iris species. [see images on wikipedia link above]
1936 by Ronald Fisher.
Petal and Sepal: http://terpconnect.umd.edu/~petersd/666/html/iris_with_labels.jpg
Objective: Classify a new flower as belonging to one of the 3 classes given the 4 features.
Importance of domain knowledge.
Why use petal and sepal dimensions as features?
Why do we not use 'color' as a feature?
Load Dataset:
#import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
#Load Dataset
'''downlaod iris.csv from https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'''
#Load Iris.csv into a pandas dataFrame.
iris = pd.read_csv("iris.csv")
Shape of dataset
# (Q) how many data-points and features?
print (iris.shape)
Columns of dataset:
#(Q) What are the column names in our dataset?
print (iris.columns)
Output:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
Count flowers of each species:
#(Q) How many data points for each class are present?
#(or) How many flowers for each species are present?
iris["species"].value_counts()
# balanced-dataset vs imbalanced datasets
#Iris is a balanced dataset as the number of data points for every class is 50.
Output:
virginica 50
setosa 50
versicolor 50
Name: species, dtype: int64
2-D Scatter Plot
#2-D scatter plot:
#ALWAYS understand the axis: labels and scale.
iris.plot(kind='scatter', x='sepal_length', y='sepal_width') ;
plt.show()
#cannot make much sense out it.
#What if we color the points by thier class-label/flower-type.
Output:
# 2-D Scatter plot with color-coding for each flower type/class.
# Here 'sns' corresponds to seaborn.
sns.set_style("whitegrid");
sns.FacetGrid(iris, hue="species", size=4) \
.map(plt.scatter, "sepal_length", "sepal_width") \
.add_legend();
plt.show();
# Notice that the blue points can be easily seperated
# from red and green by drawing a line.
# But red and green data points cannot be easily seperated.
# Can we draw multiple 2-D scatter plots for each combination of features?
# How many cobinations exist? 4C2 = 6.
Output:
Observation(s):
Using sepal_length and sepal_width features, we can distinguish Setosa flowers from others.
Seperating Versicolor from Viginica is much harder as they have considerable overlap.
3D Scatter plot
Needs a lot to mouse interaction to interpret data.
What about 4-D, 5-D or n-D scatter plot?
Pair-plot
# pairwise scatter plot: Pair-Plot
# Dis-advantages:
##Can be used when number of features are high.
##Cannot visualize higher dimensional patterns in 3-D and 4-D.
#Only possible to view 2D patterns.
plt.close();
sns.set_style("whitegrid");
sns.pairplot(iris, hue="species", size=3);
plt.show()
# NOTE: the diagnol elements are PDFs for each feature. PDFs are expalined below.
Output:
Observations
petal_length and petal_width are the most useful features to identify various flower types.
While Setosa can be easily identified (linearly seperable), Virnica and Versicolor have some overlap (almost linearly seperable).
We can find "lines" and "if-else" conditions to build a simple model to classify the flower types.
Histogram, PDF, CDF
# What about 1-D scatter plot using just one feature?
#1-D scatter plot of petal-length
import numpy as np
iris_setosa = iris.loc[iris["species"] == "setosa"];
iris_virginica = iris.loc[iris["species"] == "virginica"];
iris_versicolor = iris.loc[iris["species"] == "versicolor"];
#print(iris_setosa["petal_length"])
plt.plot(iris_setosa["petal_length"], np.zeros_like(iris_setosa['petal_length']), 'o')
plt.plot(iris_versicolor["petal_length"], np.zeros_like(iris_versicolor['petal_length']), 'o')
plt.plot(iris_virginica["petal_length"], np.zeros_like(iris_virginica['petal_length']), 'o')
plt.show()
#Disadvantages of 1-D scatter plot: Very hard to make sense as points
#are overlapping a lot.
#Are there better ways of visualizing 1-D scatter plots?
Output:
sns.FacetGrid(iris, hue="species", size=5) \
.map(sns.distplot, "petal_length") \
.add_legend();
plt.show();
Output:
sns.FacetGrid(iris, hue="species", size=5) \
.map(sns.distplot, "petal_width") \
.add_legend();
plt.show();
Output:
sns.FacetGrid(iris, hue="species", size=5) \
.map(sns.distplot, "sepal_length") \
.add_legend();
plt.show();
output:
sns.FacetGrid(iris, hue="species", size=5) \
.map(sns.distplot, "sepal_width") \
.add_legend();
plt.show();
Output:
# Need for Cumulative Distribution Function (CDF)
# We can visually see what percentage of versicolor flowers have a
# petal_length of less than 5?
# How to construct a CDF?
# How to read a CDF?
#Plot CDF of petal_length
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10, density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges);
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=20,
density = True)
pdf = counts/(sum(counts))
plt.plot(bin_edges[1:],pdf);
plt.show();
Output:
# Need for Cumulative Distribution Function (CDF)
# We can visually see what percentage of versicolor flowers have a
# petal_length of less than 1.6?
# How to construct a CDF?
# How to read a CDF?
#Plot CDF of petal_length
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10, density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
#compute CDF
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.show();
Output:
# Plots of CDF of petal_length for various types of flowers.
# Misclassification error if you use petal_length only.
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
# virginica
counts, bin_edges = np.histogram(iris_virginica['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
#versicolor
counts, bin_edges = np.histogram(iris_versicolor['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.show();
Output:
[ 0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0. 0.04] [ 1. 1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ] [ 0.02 0.1 0.24 0.08 0.18 0.16 0.1 0.04 0.02 0.06] [ 4.5 4.74 4.98 5.22 5.46 5.7 5.94 6.18 6.42 6.66 6.9 ] [ 0.02 0.04 0.06 0.04 0.16 0.14 0.12 0.2 0.14 0.08] [ 3. 3.21 3.42 3.63 3.84 4.05 4.26 4.47 4.68 4.89 5.1 ]
Mean, Variance and Std-dev
#Mean, Variance, Std-deviation,
print("Means:")
print(np.mean(iris_setosa["petal_length"]))
#Mean with an outlier.
print(np.mean(np.append(iris_setosa["petal_length"],50)));
print(np.mean(iris_virginica["petal_length"]))
print(np.mean(iris_versicolor["petal_length"]))
print("\nStd-dev:");
print(np.std(iris_setosa["petal_length"]))
print(np.std(iris_virginica["petal_length"]))
print(np.std(iris_versicolor["petal_length"]))
Output:
Means: 1.464 2.41568627451 5.552 4.26 Std-dev: 0.171767284429 0.546347874527 0.465188133985
Median, Percentile, Quantile, IQR, MAD
#Median, Quantiles, Percentiles, IQR.
print("\nMedians:")
print(np.median(iris_setosa["petal_length"]))
#Median with an outlier
print(np.median(np.append(iris_setosa["petal_length"],50)));
print(np.median(iris_virginica["petal_length"]))
print(np.median(iris_versicolor["petal_length"]))
print("\nQuantiles:")
print(np.percentile(iris_setosa["petal_length"],np.arange(0, 100, 25)))
print(np.percentile(iris_virginica["petal_length"],np.arange(0, 100, 25)))
print(np.percentile(iris_versicolor["petal_length"], np.arange(0, 100, 25)))
print("\n90th Percentiles:")
print(np.percentile(iris_setosa["petal_length"],90))
print(np.percentile(iris_virginica["petal_length"],90))
print(np.percentile(iris_versicolor["petal_length"], 90))
from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(iris_setosa["petal_length"]))
print(robust.mad(iris_virginica["petal_length"]))
print(robust.mad(iris_versicolor["petal_length"]))
Output:
Medians: 1.5 1.5 5.55 4.35 Quantiles: [ 1. 1.4 1.5 1.575] [ 4.5 5.1 5.55 5.875] [ 3. 4. 4.35 4.6 ] 90th Percentiles: 1.7 6.31 4.8 Median Absolute Deviation 0.148260221851 0.667170998328 0.518910776477
Box plot and Whiskers
#Box-plot with whiskers: another method of visualizing the 1-D scatter plot more intuitivey.
# The Concept of median, percentile, quantile.
# How to draw the box in the box-plot?
# How to draw whiskers: [no standard way] Could use min and max or use other complex statistical techniques.
# IQR like idea.
#NOTE: IN the plot below, a technique call inter-quartile range is used in plotting the whiskers.
#Whiskers in the plot below donot correposnd to the min and max values.
#Box-plot can be visualized as a PDF on the side-ways.
sns.boxplot(x='species',y='petal_length', data=iris)
plt.show()
Output:
Violin plots
# A violin plot combines the benefits of the previous two plots
#and simplifies them
# Denser regions of the data are fatter, and sparser ones thinner
#in a violin plot
sns.violinplot(x="species", y="petal_length", data=iris, size=8)
plt.show()
Output:
Multivariate probability density, contour plot.
#2D Density plot, contors-plot
sns.jointplot(x="petal_length", y="petal_width", data=iris_setosa, kind="kde");
plt.show();
Send Your Project Details at:
realcode4you@gmail.com
Comments