Data Analysis Annotated Report
This is a sample of what you might do as you write your own data analysis report. We want to draw your attention to two things in this sample report. You will see:
examples of completed sections (written in paragraphs) for parts I - II
the annotated process a student might go through in order to write each section
# Run this code to load the required packages
suppressMessages(suppressWarnings(suppressPackageStartupMessages({
library(mosaic)
library(supernova)
library(Lock5withR)
})))
# To make slightly smaller plots
options(repr.plot.width = 5, repr.plot.height = 3)
About this Project
Dear 100A-ers,
For your project, you will be writing data analysis reports.
Your Goal
Census at School is a classroom project that engages students in collecting and analyzing data from themselves and other students. Students complete an online questionnaire (here is a link to the questions) and contribute their data to the project. They can also use these data to answer questions that interest them.
Imagine you have been hired by an online magazine that would like to publish about American high school students. You have been contracted to uncover some interesting findings about American students using the Census at School data. They’d like to publish an article about high school students with your findings incorporated into it, and what those interesting facts are is up to you!
Use what you have learned in this course to analyze and interpret the data and communicate your findings to this magazine.
Instructions
Your task is to use R to explore variation in the data, model the variation, evaluate your models, and then write up your methods and findings in a report to the magazine (they are your “clients”).
Your complete data analysis report will have 5 sections: Introduction, Explore Variation, Model Variation, Evaluate Models, Discussion/Conclusions.
It is up to you to decide which questions you would like to ask and answer using these data, and which models you ultimately pursue and discuss in your final report.
CensusSchool <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSVaWnM4odSxy0mlnhWvvGbeLtiKoZmsbqC6KLzXtBOjQfrF9EVKuX4RVh3XbP3iw/pub?gid=2100178416&single=true&output=csv", header = TRUE)
str(CensusSchool)
Output:
Part I: Intro/Overview of the Problem or Question
Sample section I
One very popular extracurricular activity for high schoolers today is playing video games. You’ve probably heard of the stereotype of a “gamer” who spends all day on their game console. You’ve also probably heard that boys play more video games than girls. But do these stereotypes hold true for today’s high schoolers? Are there gender differences in time spent playing video games? This is an important question to answer in order to understand how America’s high school students choose to spend their time. If they are spending a lot of time on video games, and boys spend more time than girls, we should know this in order to examine the impact video games might have on their development.
In order to investigate the relationship between playing video games and gender, I will use data from the Census at School classroom project (CensusSchool) collected between 2010 and 2021. In this data set, 10,113 American high school students completed a survey assignment asking a variety of questions about their preferences, habits, and characteristics. In all, 60 variables are captured in the data set. Using these data, I will look at the number of hours a week students report playing video games (Video_Games_Hours) and students’ self-reported gender (Gender).
My hypothesis is that gender will explain some of the variation in the hours of video games played per week: Video_Games_Hours = Gender + other stuff. Specifically, I predict that males will report spending more time playing video games than females.
Part II: Explore Variation
Sample section II
As I explored the data, some cleaning was necessary. An initial visualization of my outcome variable, Video_Games_Hours, showed that some students reported impossibly high hours, approaching 100,000 a week! I decided to remove students who reported unrealistically high values. After accounting for a minimum of 42 weekly hours sleeping and 40 hours of school, I determined 86 to be the maximum possible gaming hours a week. Sixty students reported spending more than 86 hours a week playing games; these cases were removed from the data (see R code for filtering). I decided to only keep complete observations, so students with missing observations (NAs) were also removed. My clean data consisted of 9,999 observations.
I visualized Video_Games_Hours again using the clean data. The histogram (shown below in an R cell) shows a high peak around zero and a long right tail, so the distribution is right-skewed. This suggests that most students play very few hours, but some students play many hours a week. Running favstats() showed that the mean hours of weekly gaming in my sample is 5.14 (SD = 9.61). The range is from 0 to 84 hours.
I want to use Gender to explore variation in Video_Games_Hours. Because Gender is a categorical variable, I made a bar graph to look at the distribution (see R code). I also ran tally() to get the proportion of students in each group. In my sample, 52% of students are female.
Next, I explored the relationship between the two variables. I created a faceted histogram. Looking at the histogram, it appears that females have more observations around zero than males. The within-group variation for males looks greater than for females. I suspect my hypothesis to be correct: males likely spend more hours a week playing video games than females.
You can see the steps I went through to write this exploring variation section below:
First, I want to select only those variables of interest for my analysis and store them in a new data frame called gamedata. I am only going to keep two variables, Gender and Video_Games_Hours.
If I ever want to change my research question or look at another variable, I could go back and change the variables.
gamedata <- select(CensusSchool, Gender, Video_Games_Hours)
I want to know more about how much time students spend playing video games. What does the distribution of video games look like in the census data?
gf_histogram(~Video_Games_Hours, data = gamedata)
Warning message:
“Removed 39 rows containing non-finite values (stat_bin).”
Why did I get that crazy long x-axis? Let’s run favstats and see what’s happening... Students can’t play 100,000 hours of games. There are only 168 hours in an entire week.
favstats(~Video_Games_Hours, data = gamedata)
output:
Hmm something looks off! Someone said they played 99999 hours of games a week. Now I’m wondering: what is a reasonable number of hours of video games one could play in a week?
Assuming students spend 40 hours a week in school and sleep at least 6 hours a night (6 hours x 7 nights a week = 42 hours), how much free time could a kid have?
168 hours in a week - 40 hours at school - 42 hours sleeping
168 - 40 - 42
# I'll consider 86 a reasonable cut-off for hours of video game playing a week
86
How many students say they spend more than 86 hours a week gaming?
tally(~Video_Games_Hours > 86, data = gamedata)
Video_Games_Hours > 86
TRUE FALSE <NA>
60 10014 39
I am going to remove these students with hours above 86 from my data
gamedata_clean <- filter(gamedata, Video_Games_Hours <= 86 )
tally(~Video_Games_Hours > 86, data = gamedata_clean)
Video_Games_Hours > 86
TRUE FALSE
0 10014
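Notice that the <NA> column is gone after filtering, not just the TRUE column. Assuming filter() here is the dplyr function, it keeps only rows where the condition is TRUE, and comparing a missing value with 86 gives NA rather than TRUE. Here is a minimal sketch of that behavior on a made-up data frame (the toy values are hypothetical, not taken from CensusSchool), using base R's subset(), which also drops rows where the condition is NA:

```r
# Hypothetical toy data: two plausible values, one impossible value, one missing value
toy <- data.frame(hours = c(2, 10, 99999, NA))

# Comparing NA with a number gives NA, not TRUE or FALSE
toy$hours <= 86   # TRUE TRUE FALSE NA

# subset() keeps only rows where the condition is TRUE,
# so the impossible value and the missing value are both dropped
toy_clean <- subset(toy, hours <= 86)
toy_clean
nrow(toy_clean)   # 2
```

So one filtering step can remove both the out-of-range answers and the students who left the question blank.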
I only want to keep complete observations, so I will also remove missing data (NAs)
## I should omit missing data too
gamedata_clean <- na.omit(gamedata_clean)
head(gamedata_clean)
output:
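To see what na.omit() does to a data frame, here is a small sketch on made-up data (the toy data frame is hypothetical). It drops every row that has a missing value in any column, which is the same set of rows that complete.cases() flags as incomplete:

```r
# Hypothetical toy data with a missing value in each column
toy <- data.frame(
  Gender = c("Female", NA, "Male", "Male"),
  Video_Games_Hours = c(3, 5, NA, 12)
)

# complete.cases() flags rows that have no missing values in any column
complete.cases(toy)   # TRUE FALSE FALSE TRUE

# na.omit() keeps only those complete rows
toy_complete <- na.omit(toy)
nrow(toy_complete)    # 2
```

Comparing nrow() before and after this step is an easy way to report how many observations were dropped.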
Now I am going to visualize my outcome variable - Video_Games_Hours - again. This time it should be easier to interpret because I removed the mistakes. I also want to know what the mean is, and the standard deviation, of my outcome variable. I should run favstats().
## visualize again
gf_histogram(~Video_Games_Hours, data = gamedata_clean)
favstats( ~Video_Games_Hours, data = gamedata_clean)
output:
Hmm it looks like most people play few hours a week, but some people play a LOT of video games! The mean is around 5 but the maximum is 84. This is a right skew because the distribution looks like it has a tail that goes off to the right.
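One quick numeric check on the right-skew reading is to compare the mean with the median: with a long right tail, a few very large values typically pull the mean above the median. A minimal sketch with made-up values (the hours vector is hypothetical, not the survey data):

```r
# Hypothetical hours: mostly small values plus a couple of heavy gamers
hours <- c(0, 0, 0, 1, 1, 2, 3, 5, 20, 84)

mean(hours)     # 11.6, pulled up by the two large values
median(hours)   # 1.5, the middle of the sorted values

# TRUE here is consistent with a right-skewed distribution
mean(hours) > median(hours)
```

In the real data, favstats() reports both the mean and the median, so the same comparison can be read straight off its output.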
I want to use gender to explain variation in gaming hours. I will look at the distribution of Gender to make sure everything looks right. I wonder how many females and males I have in my sample. Because Gender is categorical, I will run tally() and ask for proportions so I can report the percentage in my report.
gf_bar(~Gender, data = gamedata_clean)
tally(~Gender, data = gamedata_clean, format = "proportion")
output:
Gender
Female Male
0.5213521 0.4786479
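If I wanted to cross-check these proportions without the mosaic package, base R's table() and prop.table() do the same job. This is a sketch on a made-up vector (the gender values are hypothetical), not the actual survey data:

```r
# Hypothetical sample of five students
gender <- c("Female", "Male", "Female", "Female", "Male")

# table() counts each category; prop.table() turns counts into proportions
counts <- table(gender)
props <- prop.table(counts)
props                 # Female 0.6, Male 0.4

# Multiply by 100 and round to get a percentage for the write-up
round(100 * props)
```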
If I want to show the relationship between the two variables, I need to visualize them together. I’m going to make a faceted histogram so I can see how Video_Games_Hours differs between males and females. The within-group variation is the variation I see within only the female group and within only the male group. It looks like males have more within-group variation. Their mean hours might be higher. Females have a lot more observations near zero.
gf_dhistogram(~Video_Games_Hours , data = gamedata_clean) %>%
gf_facet_grid(Gender ~.)
output:
Based on this visualization, I think my hypothesis that males play more games than females is right. I will have to fit a model to the data to be sure, though.
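As a preview of what fitting that model could look like, here is a minimal sketch on made-up data (the toy data frame and its values are hypothetical, and this is only one way the model might be set up). With a two-group explanatory variable, lm() estimates the mean of the first group as the intercept and the difference between the group means as the second coefficient:

```r
# Hypothetical toy data: three females and three males
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Video_Games_Hours = c(0, 2, 4, 3, 6, 9)
)

# Group means computed directly
group_means <- tapply(toy$Video_Games_Hours, toy$Gender, mean)
group_means   # Female 2, Male 6

# Fit the two-group model; Female is the reference group (alphabetical order)
model <- lm(Video_Games_Hours ~ Gender, data = toy)
coef(model)   # (Intercept) 2, GenderMale 4
```

Here the GenderMale coefficient (4) is exactly the Male mean minus the Female mean, so a positive value would point in the direction of my hypothesis.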