In this case study we’ll be predicting bike sharing numbers based on weather and other factors.
Read Dataset
bs=read.csv("bike_sharing_hours.csv",stringsAsFactors = F)
# CreateDummies: creates 0/1 dummy columns for a categorical variable.
# Categories with frequency <= freq_cutoff are ignored, and the least
# frequent remaining category is dropped to avoid a redundant dummy.
CreateDummies=function(data,var,freq_cutoff=0){
  t=table(data[,var])
  t=t[t>freq_cutoff]
  t=sort(t)
  categories=names(t)[-1]
  for(cat in categories){
    name=paste(var,cat,sep="_")
    # clean up characters that are awkward in column names
    name=gsub(" ","",name)
    name=gsub("-","_",name)
    name=gsub("\\?","Q",name)
    name=gsub("<","LT_",name)
    name=gsub("\\+","",name)
    name=gsub("\\/","_",name)
    name=gsub(">","GT_",name)
    name=gsub("=","EQ_",name)
    name=gsub(",","",name)
    data[,name]=as.numeric(data[,var]==cat)
  }
  # drop the original column once its dummies have been created
  data[,var]=NULL
  return(data)
}
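Before we use it, here is a tiny made-up example of what CreateDummies does (the data frame d below is purely illustrative):
d=data.frame(x=c("a","a","b","b","b","c"),stringsAsFactors=FALSE)
CreateDummies(d,"x",freq_cutoff=1)
# the category with frequency <= 1 ("c") is ignored, the least frequent
# remaining category ("a") is dropped, and we are left with a single dummy x_b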
Here cnt is simply the sum of casual and registered. Since cnt is our response, we drop the date column along with casual and registered (keeping those two would leak the response into the predictors).
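As a quick sanity check (an optional step, not part of the original pipeline), we can confirm that cnt is indeed the sum of those two columns before dropping them:
all(bs$cnt == bs$casual + bs$registered)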
library(dplyr)
bs=bs %>% select(-dteday,-casual,-registered)
Next we'll make dummy variables for the appropriate columns.
char_cols=c("season","mnth","hr","holiday","weekday","workingday","weathersit")
for(col in char_cols){
bs=CreateDummies(bs,col,500)
}
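To see what this produced, we can list the dummy columns generated for one of the variables ('season' here is just an illustrative choice):
grep("^season_",names(bs),value = TRUE)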
You'll see that we follow the same procedure here for parameter tuning as we did for randomForest; only the names and the usual values of the parameters to be tried are different.
library(gbm)
library(cvTools)
param=list(interaction.depth=c(1:7),
n.trees=c(50,100,200,500,700),
shrinkage=c(.1,.01,.001),
n.minobsinnode=c(1,2,5,10))
Here interaction.depth controls how weak our learners are. A value of 1 means the weak learners are decision trees with a single split (stumps); very high values will lead to overfitting.
n.trees means the same thing as ntree in randomForest, and n.minobsinnode is similar to nodesize in randomForest. Shrinkage is the learning-rate fraction that we talked about earlier; the values to try should ideally be less than 1. Very small values of shrinkage generally require high values of n.trees to compensate.
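As a rough illustration of that trade-off (a side experiment with made-up settings, not part of the tuning below), fitting two boosted models with the same number of trees but very different shrinkage shows the low-shrinkage model remaining badly under-fit; training RMSE is used here only for illustration:
set.seed(1)
fit_fast=gbm(cnt~.,data=bs,distribution="gaussian",
             n.trees=100,shrinkage=0.1,interaction.depth=3)
fit_slow=gbm(cnt~.,data=bs,distribution="gaussian",
             n.trees=100,shrinkage=0.001,interaction.depth=3)
# training RMSE for each; the 0.001 model would need far more trees to catch up
sqrt(mean((predict(fit_fast,bs,n.trees=100)-bs$cnt)^2))
sqrt(mean((predict(fit_slow,bs,n.trees=100)-bs$cnt)^2))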
# subset_paras: randomly picks n parameter combinations from the full grid
subset_paras=function(full_list_para,n=10){
  all_comb=expand.grid(full_list_para)
  s=sample(1:nrow(all_comb),n)
  subset_para=all_comb[s,]
  return(subset_para)
}
num_trials=10
my_params=subset_paras(param,num_trials)
# Note: A good value for num_trials is around 10-20% of the total possible
# combinations. It doesn't always have to be 10
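To put that note in context, the full grid here is fairly large (a quick check using the same expand.grid call that subset_paras relies on):
nrow(expand.grid(param))
## [1] 420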
# start with a very large error so the first combination tried gets accepted
myerror=9999999
for(i in 1:num_trials){
  # print(paste0('starting iteration:',i))
  # uncomment the line above to keep track of progress
  params=my_params[i,]
  k=cvTuning(gbm,cnt~.,
             data =bs,
             tuning =params,
             args = list(distribution="gaussian"),
             folds = cvFolds(nrow(bs), K=10, type = "random"),
             seed =2,
             predictArgs = list(n.trees=params$n.trees)
             )
  score.this=k$cv[,2]
  if(score.this<myerror){
    # print(params)
    # uncomment the line above to keep track of progress
    myerror=score.this
    # print(myerror)
    # uncomment the line above to keep track of progress
    best_params=params
  }
  # print('DONE')
  # uncomment the line above to keep track of progress
}
myerror
## [1] 52.29379
This is a tentative measure of performance, based on the cross-validation errors.
best_params
## interaction.depth n.trees shrinkage n.minobsinnode
## 1 6 500 0.1 10
This is the best combination of parameter values as per the cv errors. Let's build our final model using these values.
bs.gbm.final=gbm(cnt~.,data=bs,
n.trees = best_params$n.trees,
n.minobsinnode = best_params$n.minobsinnode,
shrinkage = best_params$shrinkage,
interaction.depth = best_params$interaction.depth,
distribution = "gaussian")
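If you want to see which predictors the final boosted model relies on most, gbm's summary method reports relative influence (an optional inspection step, not required for making predictions):
summary(bs.gbm.final,n.trees = best_params$n.trees,plotit = FALSE)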
# bs_test is assumed to be the test data, prepared with the same
# CreateDummies steps that were applied to bs above
test.pred=predict(bs.gbm.final,newdata=bs_test,n.trees = best_params$n.trees)
write.csv(test.pred,"mysubmission.csv",row.names = F)