Requirement Details
The dataset contains 65k+ traffic-violation records, and the attributes are described in a separate attached text file.
Task 1:
Cleaning, if required (please mention details and steps)
Preprocessing, if required (please mention details and steps)
Data discretization, if required (please mention details and steps)
List down the attributes, data types, categories of the dataset
Generate the charts below. For each chart, provide a description and an analysis of why it is plotted and how it is applicable. Generate an R Markdown file covering all of the charts below.
Parallel Coordinates chart
Density Plot
Column Chart
Bar Graph
Stacked Bar Graph
Grouped Bar Chart
Stacked Column Chart
Area Chart
Dual Axis Chart
Line Graph
Candlestick chart
Box and whisker plot
Mekko Chart
Pie Chart
Bubble Chart
Scatter Plot Chart
Grouped Scatter Chart
Scatter Plot Matrix
Radar Chart
Radial Bar Chart
Donut chart
Bullet Graph
Funnel Chart
TreeMap
Dendrogram
Heat Map
Violin Chart
Violin plot
Area graph
stacked Area graph.
If any chart is not applicable, please provide the reason why it is not applicable.
Task 2
Identify any two business strategies and discuss the link between strategy and business analytics (content should not be copied wholesale from the internet, as it will be verified).
Code Implementation
## Preliminary EDA of traffic violations and traffic accidents on a yearly, monthly, hourly and weekly basis ##
Import Necessary Packages
#using the libraries
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(dplyr)
library(treemap)
library(plotly)
library(devtools)
devtools::install_github("dkahle/ggmap")
library(ggmap)
library(maps)
library(mapdata)
library(tseries)
library(lubridate)
library(extrafont)
# reading the csv file
traffic_violation <- read_csv("/cloud/project/traffic_violations.csv")
df<-data.frame(traffic_violation)
# giving the first rows of dataframe
head(df)
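Before any cleaning, it helps to list the attributes, their data types and the number of missing values per column. The sketch below does this on a small made-up data frame (`toy`) standing in for the real file, so the column values are assumptions for illustration only.

```r
# Toy stand-in for the traffic data; the values are invented for illustration
toy <- data.frame(
  stop_date = c("1/5/2012", "2/17/2012", "2/20/2012", NA),
  driver_gender = c("M", "F", NA, "M"),
  driver_age = c(23, 41, NA, 35),
  is_arrested = c(FALSE, TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Attribute names with their data types
attr_types <- sapply(toy, class)
print(attr_types)

# Number of missing values per attribute, to decide which columns need cleaning
na_counts <- colSums(is.na(toy))
print(na_counts)

# Distinct categories of a qualitative attribute (missing values excluded)
gender_levels <- sort(unique(toy$driver_gender[!is.na(toy$driver_gender)]))
print(gender_levels)
```

The same three calls can be pointed at `df` to document the attributes, data types and categories of the actual dataset.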
Data preprocessing
## Create derived columns for month name, month code, year, hour and weekday ##
## these derived variables simplify the grouping done later
df[,"stop_date"]<-as.Date(df[,"stop_date"],format="%m/%d/%Y")
df$month<-format(df[,"stop_date"],"%B")
df$month_code<-as.numeric(format(df[,"stop_date"],"%m"))
df$year<-as.numeric(format(df[,"stop_date"],"%Y"))
df$hour<-hour(hms(as.character(df[,"stop_time"])))+1
df$weekday<-as.numeric(format(df[,"stop_date"],"%u"))
df$weekday_full<-format(df[,"stop_date"],"%a")
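The requirement also asks for data discretization. One possible approach, sketched here on invented ages, is to bin `driver_age` into labelled groups with `cut()`. The break points and labels are assumptions chosen for illustration, not values taken from the dataset.

```r
# Invented ages for illustration; NA marks a missing value
ages <- c(17, 22, 25, 31, 44, 58, 67, NA)

# Bin ages into right-closed intervals: (0,24], (24,39], (39,59], (59,Inf]
age_group <- cut(
  ages,
  breaks = c(0, 24, 39, 59, Inf),
  labels = c("under 25", "25-39", "40-59", "60+")
)

# Count of records per age group; missing ages stay NA and are reported separately
group_counts <- table(age_group, useNA = "ifany")
print(group_counts)
```

A discretized age column of this kind can then be used as a category in the bar, pie and treemap charts.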
## Create a monthly time series of traffic violation counts to study the pattern and trend across the years ##
tab_ptn<-data.frame(table(df$month_code,df$year))
names(tab_ptn)<-c("month","year","count")
time_ser=ts(tab_ptn[which(tab_ptn$count!=0),3],frequency=12,start=c(2012,1))
print(time_ser)
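To see how a count table turns into a monthly series, here is a small self-contained sketch on invented dates. It relies on `table()` varying its first factor (month) fastest, so the flattened counts run from January to December within each year.

```r
# Invented stop dates spanning two years, for illustration only
dates <- as.Date(c("2012-01-03", "2012-01-15", "2012-02-10",
                   "2013-01-07", "2013-02-01", "2013-02-20"))

# Month and year as factors; fixed month levels keep empty months as 0
month_code <- factor(as.numeric(format(dates, "%m")), levels = 1:12)
year <- factor(format(dates, "%Y"))

# Rows come out ordered by year, then month, because the first factor varies fastest
counts <- data.frame(table(month_code, year))
names(counts) <- c("month", "year", "count")

# Monthly series starting in January 2012; zero-count months are kept
monthly <- ts(counts$count, frequency = 12, start = c(2012, 1))
print(monthly)
```

Unlike the code above, this sketch keeps zero-count months, because dropping them would shift later observations to the wrong calendar month.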
Task 1 Useful visualization with data insights
## i) Line Graph
# data visualization with plot and boxplot
dev.new()
## This plot gives an insight into the pattern and trend of traffic violation incidents across the years ##
par(mfrow=c(1,2))
plot(time_ser,ylab="Total Traffic Incidents",type="b",pch=5,lwd=2,col="#00AFBB")
abline(reg=lm(time_ser~time(time_ser)),col="red",lty=2, lwd=3)
plot(aggregate(time_ser,FUN=mean),ylab="Traffic Incidents Trend ",lty=2,lwd=2,col="#00AFBB")
## ii) Box and Whisker plot
boxplot(time_ser~cycle(time_ser), xlab = "Month (cycle of time series)",
ylab = "Total Traffic Incidents", main = "Traffic incidents by month")
## Analysis based on the box plot
## The pattern looks seasonal; the trend rises until 2015 and then gradually decreases ##
## Seasonally, traffic incidents peak in the first half of the year and then gradually decrease in the second half ##
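The seasonal and trend reading above is made by eye. As a possible cross-check, base R's `decompose()` splits a monthly series into trend, seasonal and random parts. The series below is simulated (an assumption for illustration), so its numbers say nothing about the real data.

```r
# Simulated monthly counts for 2012-2015: linear trend plus a yearly cycle plus noise
set.seed(1)
months <- 1:48
sim <- ts(200 + 2 * months + 30 * sin(2 * pi * months / 12) + rnorm(48, sd = 2),
          frequency = 12, start = c(2012, 1))

# Additive decomposition into trend, seasonal and random components
parts <- decompose(sim)

# The seasonal component repeats every 12 months, so the first year is enough
seasonal_by_month <- parts$seasonal[1:12]
print(round(seasonal_by_month, 1))

# Month of the year with the largest seasonal effect
peak_month <- which.max(seasonal_by_month)
print(peak_month)
```

Running `decompose(time_ser)` in the same way, followed by `plot()` on the result, would show whether the first-half peak holds once the trend is removed.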
# Analyse the timing of arrests (stops where is_arrested is TRUE) by month and year
tab_accidents<-data.frame(table(df[which(df$is_arrested=="TRUE"),c("month_code","year")]))
names(tab_accidents)<-c("month","year","count")
time_accidents=ts(tab_accidents[which(tab_accidents$count!=0),3],frequency=12,start=c(2012,1))
print(time_accidents)
## Creating time series table to study the pattern for No of traffic violation counts hourly in a day ##
tab_hr<-data.frame(table(df$year,df$hour))
names(tab_hr)<-c("year","hour","count")
tab_hr<-tab_hr[order(tab_hr$year),]
time_hr=ts(tab_hr[which(tab_hr$count!=0),3],frequency=24,start=c(2012,1),end=c(2018,24))
## Boxplot to view the pattern of traffic violation incidents hourly in a day ##
boxplot(time_hr~cycle(time_hr),xlab="hours",ylab="Traffic Incidents hourly",col="#00AFBB")
## The counts vary most between hours 9 and 23 of the day, which is also when more incidents occur.
## Creating time series table to study the pattern for No of traffic violation counts during weekdays ##
tab_weekday<-data.frame(table(df$weekday,df$year))
names(tab_weekday)<-c("weekday","year","count")
time_weekday=ts(tab_weekday[which(tab_weekday$count!=0),3],frequency=7,start=c(2012,1))
## Boxplot to view the pattern of traffic violation incidents on weekdays ##
boxplot(time_weekday~cycle(time_weekday),xlab="weekdays",ylab="Traffic Violations on weekdays",names=c('MON','TUE',"WED","THU","FRI","SAT","SUN"),col="#FFFF00")
## The box plot indicates that traffic violations trend higher on weekdays and drop during weekends ##
## Wednesday registers the highest and Saturday the lowest number of traffic violations ##
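The busiest and quietest weekdays can also be read directly from a count table rather than from the box plot. A minimal sketch on invented dates follows; the dates are assumptions, and only the `%u` convention (1 = Monday, 7 = Sunday) is taken from the code above.

```r
# Invented stop dates for illustration
dates <- as.Date(c("2012-01-02", "2012-01-04", "2012-01-04",
                   "2012-01-07", "2012-01-11", "2012-01-13"))

# ISO weekday number: 1 = Monday ... 7 = Sunday, mapped to readable labels
wd <- factor(as.numeric(format(dates, "%u")), levels = 1:7,
             labels = c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"))

# Number of records per weekday; days without records appear as 0
weekday_counts <- table(wd)
print(weekday_counts)

# Weekday with the highest count
busiest <- names(weekday_counts)[which.max(weekday_counts)]
print(busiest)
```

Fixing the factor levels keeps all seven weekdays in the table, so the labels stay aligned even when a day has no records.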
## iii) Density plot
## Plotting based on driver's age
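df$driver_age[is.na(df$driver_age)] <- 0 # missing ages are coded as 0
## Exclude the placeholder 0s from the estimate so they do not create a false peak at age 0
den <- density(df$driver_age[df$driver_age > 0])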
plot(den, frame = FALSE, col = "blue",main = "Density plot")
## The density is higher for younger drivers.
## iv) Column Chart
## Column chart of the number of stops by stop_outcome and gender
## Citation is the major stop outcome for female drivers, while male drivers account for most of the stops
ggplot(dplyr::count(df, stop_outcome, driver_gender),
aes(x = stop_outcome, y = n, fill = driver_gender)) +
geom_col(position = "dodge") +
scale_fill_brewer(palette = "Set1")
dev.off()
## v) Bar plot
## Total number of traffic violations on each weekday; the third bar (Wednesday) is the highest
barplot(table(df$weekday),names.arg=c("MON","TUE","WED","THU","FRI","SAT","SUN"),col="gold3")
## vi) Grouped Bar chart
## Analysis of violation categories with respect to gender
## Speeding is the main violation for female drivers, while male drivers dominate all the
## other violations.
ggplot(df, aes(x = violation, fill = driver_gender)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1")
## vii) Area chart
## Analysis of the frequency of records by driver age. Younger drivers are involved in more incidents.
p <- ggplot(df, aes(x=driver_age))
p + geom_area(stat = "bin", binwidth=30)
## viii) Pie chart
## Analysis of the most frequent stop outcomes, of which citation is the major one
# Create a table of stop_outcome record counts and sort
tbl <- sort(table(df$stop_outcome),
decreasing = TRUE)
pie(tbl, radius=1)
dev.off()
## Analysis of violation groups, in which speeding accounts for the largest share of the records.
# Create a table of violation_raw record counts and sort
tbl <- sort(table(df$violation_raw),
decreasing = TRUE)
pie(tbl, radius=1)
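Pie slices are easier to compare when each label carries its percentage share. The sketch below computes the shares with `prop.table()` on invented outcomes; the outcome values are assumptions for illustration, not counts from the dataset.

```r
# Invented stop outcomes for illustration
outcomes <- c("Citation", "Citation", "Citation", "Warning",
              "Arrest Driver", "Citation", "Warning", "Citation")

# Counts sorted from most to least frequent
tbl <- sort(table(outcomes), decreasing = TRUE)

# Percentage share of each outcome, rounded to one decimal place
share <- round(100 * prop.table(tbl), 1)

# Slice labels that combine the outcome name with its share
slice_labels <- paste0(names(tbl), " (", share, "%)")
print(slice_labels)

pie(tbl, labels = slice_labels, radius = 1)
```

The same labelling can be applied to the `stop_outcome` and `violation_raw` tables built above.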
## ix) Radial Bar Chart
## Analysis of weekdays and hours
## It suggests that Wednesday has more hours in which traffic violations happen frequently
ggplot(df) +
geom_bar(aes(x=weekday, y=hour), width = 1, stat="identity",
colour = "black", fill="lightblue") +
coord_polar(theta = "x", start=0)
## x) Treemap
## Analysis of records by stop outcome and violation. The two are related, with citation as the dominant outcome.
## Moving violation shows the strongest association with the other categories
treemap(df, #Your data frame object
index=c("stop_outcome","violation"), #A list of your categorical variables
vSize = "driver_age", #This is your quantitative variable
type="index", #Type sets the organization and color scheme of your treemap
palette = "Reds", #Select your color palette from the RColorBrewer presets or make your own.
title="treemap", #Customize your title
fontsize.title = 10 #Change the font size of the title
)
## xi) Violin plot
## Analysis of conducting search and age of the driver
## Most of the records in which a search is conducted fall in the 30-50 age range.
ggplot(df, aes(x=search_conducted, y=driver_age)) +
# geom_violin() function is used to plot the violin plot
geom_violin()
## Task 2
## Plots/Graphs that did not help us from a business perspective
## i) Stacked column chart
## Not much analysis can be done because there are many missing values when comparing arrests and stop outcomes
ggplot(data = df, aes(x = stop_outcome , y = is_arrested, fill = drugs_related_stop)) + geom_col()
## ii) Donut chart
## Share of arrest records per year, built from the tab_accidents table (month, year, count)
## Not much information can be seen here from a business viewpoint
donut <- aggregate(count ~ year, data = tab_accidents, FUN = sum)
donut
donut$fraction <- donut$count / sum(donut$count)
# Compute the cumulative fractions (top of each rectangle)
donut$ymax <- cumsum(donut$fraction)
# Compute the bottom of each rectangle
donut$ymin <- c(0, head(donut$ymax, n=-1))
# Make the plot
ggplot(donut, aes(ymax=ymax, ymin=ymin, xmax=2, xmin=1, fill=year)) +
geom_rect() +
coord_polar(theta="y") + # Try to remove that to understand how the chart is built initially
xlim(c(0, 2)) # The empty range from 0 to 1 forms the hole; use xlim(c(1, 2)) to get a pie chart
## Task 3
Plots/Graphs that cannot be used for visualization with the given dataset
i) Parallel coordinates chart - It cannot be plotted because the dataset does not have multiple numeric inputs.
ii) Dual axis chart - It needs at least two numeric inputs to compare against a category; in this data, age is the only numeric input.
iii) Candlestick chart - It is based on the fractional distribution of numeric inputs; as there are no suitable numeric inputs, we can't use it.
iv) Mekko chart - It relies on numeric inputs. As age is not a business variable with respect to the other categories, we can't plot it.
v) Bubble plot, Scatter plot, Grouped scatter chart, Scatter plot matrix, Radar chart, Violin chart - These are useful when there are many numeric variables to analyse; they cannot be used for qualitative parameters.
vi) Bullet graph, Funnel chart - These also work on the fractional distribution of a continuous variable, which the data in the csv file does not contain.
vii) Dendrogram - A dendrogram takes statistical numeric input on the y axis and forms a tree structure that splits the records into groups. As the records only indicate whether a traffic violation leads to an arrest or not, there are not enough groups to serve as leaf nodes.
viii) Heat map - It analyses multiple numeric variables for correlation. As age is the only quantitative attribute and the others are qualitative, a heat map cannot be formed here.
ix) Area and stacked area graph - To form an area, we again need two numeric variables that are related and give insights.
# Exercise 2. Business strategies
i. The main focus is to study the record set, build a model for analysis, and check for predictive patterns, trends and driving behaviors in terms of age, gender, duration, violation type, etc.
ii. To focus more on regulations and rules for traffic accidents.
It seems there is no formal link between strategy and the business analytics used for various strategies.
Most companies set objectives and goals around production cost or procurement strategies to stay in tune with the business.
Some functions support the strategies only indirectly: they add little production value, but they help the strategic view of the primary and secondary processes. Business analytics first prioritizes the tasks that give more production value rather than working on long-term strategic views.