In this case study we’ll be predicting bike sharing numbers based on weather and other factors.
Read Dataset
bs=read.csv("bike_sharing_hours.csv",stringsAsFactors = F)
# CreateDummies: creates 0/1 dummy columns for a categorical variable.
# Categories with frequency <= freq_cutoff are ignored, and the least
# frequent remaining category is dropped to avoid a redundant dummy.
CreateDummies=function(data,var,freq_cutoff=0){
  t=table(data[,var])
  t=t[t>freq_cutoff]
  t=sort(t)
  categories=names(t)[-1]
  for(cat in categories){
    name=paste(var,cat,sep="_")
    # clean up characters that are awkward in column names
    name=gsub(" ","",name)
    name=gsub("-","_",name)
    name=gsub("\\?","Q",name)
    name=gsub("<","LT_",name)
    name=gsub("\\+","",name)
    name=gsub("\\/","_",name)
    name=gsub(">","GT_",name)
    name=gsub("=","EQ_",name)
    name=gsub(",","",name)
    data[,name]=as.numeric(data[,var]==cat)
  }
  # drop the original column once its dummies have been created
  data[,var]=NULL
  return(data)
}
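Before we use it, here is a tiny made-up example of what CreateDummies does (the data frame d below is purely illustrative):
d=data.frame(x=c("a","a","b","b","b","c"),stringsAsFactors=FALSE)
CreateDummies(d,"x",freq_cutoff=1)
# the category with frequency <= 1 ("c") is ignored, the least frequent
# remaining category ("a") is dropped, and we are left with a single dummy x_b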
Here cnt is simply the sum of casual and registered. Since cnt is our response, we drop the date column along with casual and registered (keeping those two would leak the response into the predictors).
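As a quick sanity check (an optional step, not part of the original pipeline), we can confirm that cnt is indeed the sum of those two columns before dropping them:
all(bs$cnt == bs$casual + bs$registered)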
library(dplyr)
bs=bs %>% select(-dteday,-casual,-registered)
Next we'll make dummy variables for the appropriate columns.
char_cols=c("season","mnth","hr","holiday","weekday","workingday","weathersit")
for(col in char_cols){
bs=CreateDummies(bs,col,500)
}
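To see what this produced, we can list the dummy columns generated for one of the variables ('season' here is just an illustrative choice):
grep("^season_",names(bs),value = TRUE)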
You'll see that we follow the same procedure here for parameter tuning as we did for randomForest; only the names and the usual values of the parameters to be tried are different.
library(gbm)
library(cvTools)
param=list(interaction.depth=c(1:7),
n.trees=c(50,100,200,500,700),
shrinkage=c(.1,.01,.001),
n.minobsinnode=c(1,2,5,10))
Here interaction.depth controls how weak our learners are. A value of 1 means the weak learners are decision trees with a single split (stumps); very high values will lead to overfitting.
n.trees means the same thing as ntree in randomForest, and n.minobsinnode is similar to nodesize in randomForest. Shrinkage is the learning-rate fraction that we talked about earlier; the values to try should ideally be less than 1. Very small values of shrinkage generally require high values of n.trees to compensate.
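As a rough illustration of that trade-off (a side experiment with made-up settings, not part of the tuning below), fitting two boosted models with the same number of trees but very different shrinkage shows the low-shrinkage model remaining badly under-fit; training RMSE is used here only for illustration:
set.seed(1)
fit_fast=gbm(cnt~.,data=bs,distribution="gaussian",
             n.trees=100,shrinkage=0.1,interaction.depth=3)
fit_slow=gbm(cnt~.,data=bs,distribution="gaussian",
             n.trees=100,shrinkage=0.001,interaction.depth=3)
# training RMSE for each; the 0.001 model would need far more trees to catch up
sqrt(mean((predict(fit_fast,bs,n.trees=100)-bs$cnt)^2))
sqrt(mean((predict(fit_slow,bs,n.trees=100)-bs$cnt)^2))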
# subset_paras: randomly picks n parameter combinations from the full grid
subset_paras=function(full_list_para,n=10){
  all_comb=expand.grid(full_list_para)
  s=sample(1:nrow(all_comb),n)
  subset_para=all_comb[s,]
  return(subset_para)
}
num_trials=10
my_params=subset_paras(param,num_trials)
# Note: A good value for num_trials is around 10-20% of the total possible
# combinations. It doesn't always have to be 10
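To put that note in context, the full grid here is fairly large (a quick check using the same expand.grid call that subset_paras relies on):
nrow(expand.grid(param))
## [1] 420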
# start with a very large error so the first combination tried gets accepted
myerror=9999999
for(i in 1:num_trials){
  # print(paste0('starting iteration:',i))
  # uncomment the line above to keep track of progress
  params=my_params[i,]
  k=cvTuning(gbm,cnt~.,
             data =bs,
             tuning =params,
             args = list(distribution="gaussian"),
             folds = cvFolds(nrow(bs), K=10, type = "random"),
             seed =2,
             predictArgs = list(n.trees=params$n.trees)
             )
  score.this=k$cv[,2]
  if(score.this<myerror){
    # print(params)
    # uncomment the line above to keep track of progress
    myerror=score.this
    # print(myerror)
    # uncomment the line above to keep track of progress
    best_params=params
  }
  # print('DONE')
  # uncomment the line above to keep track of progress
}
myerror
## [1] 52.29379
This is a tentative measure of performance, based on the cross-validation errors.
best_params
## interaction.depth n.trees shrinkage n.minobsinnode
## 1 6 500 0.1 10
This is the best combination of parameter values as per the cv errors. Let's build our final model using these values.
bs.gbm.final=gbm(cnt~.,data=bs,
n.trees = best_params$n.trees,
n.minobsinnode = best_params$n.minobsinnode,
shrinkage = best_params$shrinkage,
interaction.depth = best_params$interaction.depth,
distribution = "gaussian")
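If you want to see which predictors the final boosted model relies on most, gbm's summary method reports relative influence (an optional inspection step, not required for making predictions):
summary(bs.gbm.final,n.trees = best_params$n.trees,plotit = FALSE)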
# bs_test is assumed to be the test data, prepared with the same
# CreateDummies steps that were applied to bs above
test.pred=predict(bs.gbm.final,newdata=bs_test,n.trees = best_params$n.trees)
write.csv(test.pred,"mysubmission.csv",row.names = F)