Introduction:
As an expert on Life Science Informatics, you are required to analyse the US weekly Nationally Notifiable Disease Surveillance Data from 1888 to 2013. The data can be used to estimate seasonal and long-term transmission trends, build models for predicting infectious disease outbreaks, and conduct scientific research.
Based on the data, you are expected to prepare a 5-minute presentation (PowerPoint slides) with an Exploratory Data Analysis, plus one R notebook containing a scientifically oriented analysis.
Data to be Analysed:
Project Tycho (Source: https://www.tycho.pitt.edu)
Resources:
VisGuides host a community of people working with the same data: https://visguides.org/search?q=tycho
Criteria:
- Good research question and result
- Good contextualisation of the statistical analysis in the selected region
- Quality and diversity of the graphics used
- Quality of the presentation: engaging capacity and clarity of the message.
- Using statistical key figures correctly
- Use and explanation of (advanced) statistical and visualization techniques not covered in the course
Tools:
R and Rstudio, any other additional dataset you may consider relevant.
Code Implementation
---
title: " "
author: " "
date: ' '
output:
  html_document:
    number_sections: true
    toc: true
    toc_depth: 2
    theme: readable
    df_print: paged
  word_document: default
  pdf_document: default
header-includes:
  - \usepackage{titling}
  - \pretitle{\begin{center} \includegraphics[width=5in,height=5in]{THD-Logo_grau.png}\LARGE\\}
  - \posttitle{\end{center}}
---
1.Introduction
A disease is a particular abnormal condition that negatively affects the structure or function of all or part of an organism and that is not immediately due to any external injury. Diseases are often understood as medical conditions associated with specific signs and symptoms. A disease may be caused by external factors such as pathogens or by internal dysfunctions. For example, internal dysfunctions of the immune system can produce a variety of different diseases, including various forms of immunodeficiency, hypersensitivity, allergies and autoimmune disorders (source: Wikipedia).
In humans, disease is often used more broadly to refer to any condition that causes pain, dysfunction, distress, social problems, or death to the person affected, or similar problems for those in contact with the person. In this broader sense, it sometimes includes injuries, disabilities, disorders, syndromes, infections, isolated symptoms, deviant behaviors, and atypical variations of structure and function, while in other contexts and for other purposes these may be considered distinguishable categories. Diseases can affect people not only physically, but also mentally, as contracting and living with a disease can alter the affected person's perspective on life.
Death due to disease is called death by natural causes. There are four main types of disease: infectious diseases, deficiency diseases, hereditary diseases (including both genetic diseases and non-genetic hereditary diseases), and physiological diseases. Diseases can also be classified in other ways, such as communicable versus non-communicable diseases. The deadliest diseases in humans are coronary artery disease (blood flow obstruction), followed by cerebrovascular disease and lower respiratory infections. In developed countries, the diseases that cause the most sickness overall are neuropsychiatric conditions, such as depression and anxiety.
Life expectancy is one of the most commonly used measures for international health comparison. In 2007, the United States ranked 27th and 26th out of 33 countries within its peer group of Organization for Economic Co-operation and Development (OECD) countries for life expectancy at birth for females and males, respectively[1](https://www.healthypeople.gov/2020/about/foundation-health-measures/General-Health-Status).
In this project we analyse a large data set that records the diseases reported in all states of the U.S. between 1888 and 2014. The data come from Project Tycho, which works with national and global health institutes and researchers to make data easier to use and thereby improve global health.
Given the enormous amount of data produced every day, data visualization is one of the most effective ways to interpret it. Rather than scanning huge Excel sheets, it is better to visualize the data through charts and graphs to gain meaningful insights.
2.Objectives
We study how different types of diseases have spread across the states of the United States.
For that we use the Project Tycho data mentioned above. The first thing to do with a data set of this size is to go through it and understand the basic information it contains. Only with that understanding can we move on and filter the data for our specific research question.
From this data we produce graphs of the number of cases and the number of deaths caused by all diseases in all states of the U.S. We then single out the disease with the greatest impact and follow its activity over the years; the plots show in which years the disease was most widespread and which states were hit hardest by it.
3.Problem Definition
Now that we have the data, the next step is to decide what to analyse. The data set offers numerous research opportunities, so we have to narrow our interest to a few areas in order to obtain an analysis of good quality.
To that end, we identify the disease that was most abundant in the U.S. between 1888 and 2014 according to our data, and examine how it varied across states over those years.
4.Methods
The R programming language provides quick and easy tools for converting data into visual elements such as graphs, which make the data easier to interpret and understand.
The types of graphs used in this work are listed below.
Tree Map : A treemap displays hierarchical data as a set of nested rectangles. Each group is represented by a rectangle whose area is proportional to its value.
Geom Map : Displays the data on a map (in our case, of the U.S.).
Bar plot : A bar chart presents categorical data as rectangular bars with heights or lengths proportional to the values they represent.
Box plot : In a box plot, a box is drawn from the first quartile to the third quartile, and a line goes through the box at the median.
Heat Map : A heat map is a two-dimensional representation of data in which values are represented by colors.
Animation Plot : Animates the extracted data over time, which makes trends easier to follow.
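To make the box plot description concrete, here is a minimal base R sketch that computes the three lines of the box and flags outliers with the usual 1.5 × IQR rule (the default rule in `geom_boxplot()`). The vector `weekly_cases` is invented for illustration and is not taken from the Tycho data.

```r
# Hypothetical weekly case counts, invented purely for illustration
weekly_cases <- c(2, 4, 4, 5, 7, 9, 10, 12, 40)

# First quartile, median and third quartile: the three lines of the box
box <- quantile(weekly_cases, probs = c(0.25, 0.5, 0.75))

# Interquartile range and the 1.5 * IQR fences used to flag outliers
iqr <- unname(box["75%"] - box["25%"])
lower_fence <- unname(box["25%"]) - 1.5 * iqr
upper_fence <- unname(box["75%"]) + 1.5 * iqr
outliers <- weekly_cases[weekly_cases < lower_fence | weekly_cases > upper_fence]

print(box)
print(outliers)
```

Values outside the fences are the points a box plot draws individually.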
5.Data Analysis
The analysis proceeds in several steps, each explained below.
5.1 Loading the libraries
The following libraries are used in this work.
```{r}
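library(ggplot2)    # plotting
library(ggthemes)   # theme_igray(), theme_tufte(), theme_stata()
library(readr)
library(dplyr)      # data manipulation
library(tidyverse)  # pivot_wider(), pivot_longer(), add_column()
library(ggpubr)     # xscale()
library(treemap)    # treemap()
library(usmap)      # plot_usmap()
library(gapminder)
library(gganimate)  # transition_time()
library(ggridges)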
```
```{r}
options(scipen=9999)
```
5.2 Loading the Data
The code below sets the working directory and loads the file that contains the data to be analysed.
```{r}
setwd("D:/LSI/Life Science Info. SS 2022/Data Visulization") # set the working directory
Tycho <- read.csv("ProjectTycho_Level2_v1.1.0.csv", header = TRUE, stringsAsFactors = TRUE) # load the data from the CSV file
```
5.3 Checking the Data, rearranging and clean ups
The next step is to look into the data and study its dimensions and contents. The data are then reordered and cleaned so that they are ready for the analysis.
```{r}
dim(Tycho) # Dimension of the data in rows followed by columns
```
```{r}
Tycho <- Tycho %>% arrange(epi_week) # arrange the data as a time series from 1888 to 2014
head(Tycho)
```
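Sorting by `epi_week` gives a chronological order only if the column encodes year and week as one integer. The sketch below assumes the form YYYYWW (for example 192801 for week 1 of 1928); that encoding is an assumption to verify against `head(Tycho)`, and the values are invented.

```r
# Hypothetical epi_week values, assumed to be encoded as YYYYWW
epi_week <- c(192801L, 192852L, 196615L)

epi_year <- epi_week %/% 100L   # integer division keeps the year
week_no  <- epi_week %% 100L    # remainder keeps the week of the year

print(data.frame(epi_week, epi_year, week_no))
```

Under that assumption, ordering by the raw integer is the same as ordering by year first and week second.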
```{r}
Tycho <- Tycho[ , -11] # Deletion of column num. 11 url
Tycho <- Tycho[ , -2] # Deletion of column num. 2 US
```
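Dropping columns by position works, but it silently removes the wrong column if the file layout ever changes. A minimal sketch of the same clean-up by name is shown below on a toy data frame; the column names `country` and `url` are assumptions based on the comments above and should be checked with `names(Tycho)`.

```r
# Toy data frame; the column names `country` and `url` are assumptions
toy <- data.frame(
  epi_week = c(192801L, 192802L),
  country  = c("US", "US"),
  state    = c("NY", "CA"),
  number   = c(12L, 7L),
  url      = c("http://example.org/a", "http://example.org/b"),
  stringsAsFactors = FALSE
)

# Keep every column except the two that carry no information for the analysis
toy_clean <- toy[, !(names(toy) %in% c("country", "url")), drop = FALSE]

print(names(toy_clean))
```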
```{r}
Tycho <- Tycho %>% filter(number > 0) # drop rows with zero cases or deaths: 3,659,360 - 2,678,605 = 980,755 rows removed
```
```{r}
str(Tycho)# checking the data type of each column
```
```{r}
Tycho$to_date <- as.Date(Tycho$to_date, format = "%Y-%m-%d") # change the format from factor to date
Tycho$from_date <- as.Date(Tycho$from_date, format = "%Y-%m-%d")
```
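Because the file was read with `stringsAsFactors = TRUE`, the date columns arrive as factors. The small sketch below, on invented values, illustrates that `as.Date()` converts the factor labels (not the internal integer codes), so the result can be used for date arithmetic.

```r
# A factor column as produced by read.csv(..., stringsAsFactors = TRUE)
date_factor <- factor(c("1928-01-01", "1928-01-08"))

# as.Date() goes through the factor labels, not the integer codes
date_value <- as.Date(date_factor, format = "%Y-%m-%d")

print(class(date_value))
print(diff(date_value))   # difference in days between the two dates
```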
```{r}
levels(Tycho$disease) # list of all 50 diseases
```
```{r}
unique(Tycho$state) # 57 distinct values in state, although the USA currently has only 50 states
```
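One way to see which entries account for the difference is to compare the codes with R's built-in vector `state.abb`, which holds the 50 state abbreviations. The sketch assumes that `state` contains two-letter postal codes; the `codes` vector is invented, and in the notebook it would be `as.character(unique(Tycho$state))`.

```r
# Hypothetical set of location codes, standing in for unique(Tycho$state)
codes <- c("NY", "CA", "TX", "DC", "PR", "GU")

# state.abb is a built-in vector holding the 50 state abbreviations
not_a_state <- setdiff(codes, state.abb)

print(not_a_state)
```

Codes returned here are locations that are not among the 50 states, such as a federal district or a territory.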
```{r}
Disease <- Tycho %>% count(disease, event, wt = number) # total cases and deaths for each disease
```
Here is another problem: for some diseases the data have no records of cases, only deaths, for example "CHOLERA"; conversely, "CHLAMYDIA" has 4928414 cases but no recorded deaths. For reference, deaths from chlamydia in the USA can be checked here:
https://www.getargon.io/posts/health/conditions/us-std/death-rate-chlamydia-us/
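Diseases with only one type of event can be listed systematically with a cross-tabulation. The following base R sketch uses invented records; in the notebook, the same idea would apply to the `disease`, `event` and `number` columns of `Tycho`.

```r
# Toy records standing in for Tycho; the numbers are invented
toy <- data.frame(
  disease = c("CHOLERA", "CHOLERA", "MEASLES", "MEASLES", "PELLAGRA", "PELLAGRA"),
  event   = c("DEATHS", "DEATHS", "CASES", "CASES", "CASES", "DEATHS"),
  number  = c(3, 2, 40, 60, 10, 4),
  stringsAsFactors = FALSE
)

# Sum `number` for every disease/event pair; absent pairs become 0
totals <- xtabs(number ~ disease + event, data = toy)

# Diseases for which one of the two event types was never recorded
incomplete <- rownames(totals)[totals[, "CASES"] == 0 | totals[, "DEATHS"] == 0]

print(totals)
print(incomplete)
```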
5.4 Plotting the Data
In this step we plot the trimmed and reordered data to obtain plots that explain it.
The code below generates a tree map of all diseases recorded over the years, from which the most abundant disease affecting the U.S. population is easy to identify.
```{r fig.align='center'}
# Tree map of the total number of cases and deaths per disease
treemap(Disease, index = c("disease", "event"),
        vSize = "n", vColor = "event", type = "index", bg.labels = 0,
        title = "Treemap of all 50 diseases", border.col = c("black", "white"),
        border.lwds = c(0, 0), palette = "Set3",
        align.labels = list(c("center", "center"), c("right", "bottom")))
```
Next, cases and deaths are aggregated by state. Note that loc_type sometimes refers to a city rather than a state, but this is not the case for most records, so all data are treated here as state-level data.
```{r}
State <- Tycho %>% count(state, event, wt = number) # total cases and deaths in each state
State <- State %>% pivot_wider(
  names_from = event,
  values_from = n,
  values_fill = 0) # one row per state, with separate CASES and DEATHS columns
```
Now let's look at the total cases over the whole period, for all states and all diseases.
Below, the geographic map of the U.S. states shows the total number of cases per state from 1888 to 2014. States with darker colors are found mainly in the Northeast, for instance New York; Texas in the South and California in the West are also dark. In contrast, states in the central part of the U.S. have light colors, which means fewer recorded cases between 1888 and 2014.
```{r fig.align='center'}
# Geographic map of the US states showing the total number of cases per state
plot_usmap(data = State, values = "CASES", labels = TRUE) +
  scale_fill_stepsn(n.breaks = 9, colors = c("white", "cyan4")) + # fill colors
  ggtitle("Total Cases In States, Years 1888 - 2014") +
  theme(plot.title = element_text(size = 15, hjust = 0.5), legend.position = c(0.87, 0.1)) # title and legend placement
```
Similarly, the code below generates a geographic map showing the total number of deaths per state from 1888 to 2014. The pattern resembles that of the cases, which is plausible: states that reported more cases also tended to report more deaths.
```{r fig.align='center'}
# Geographic map of the US states showing the total number of deaths per state
plot_usmap(data = State, values = "DEATHS", labels = TRUE) +
  scale_fill_stepsn(n.breaks = 9, colors = c("white", "darkred")) + # fill colors
  ggtitle("Total Deaths In States, Years 1888 - 2014") +
  theme(plot.title = element_text(size = 15, hjust = 0.5), legend.position = c(0.87, 0.1)) # title and legend placement
```
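The visual similarity of the two maps can be backed by a key figure. A rank correlation between total cases and total deaths per state is one option; the sketch below uses invented totals, and in the notebook the inputs would be `State$CASES` and `State$DEATHS`.

```r
# Invented per-state totals, standing in for State$CASES and State$DEATHS
toy_state <- data.frame(
  state  = c("A", "B", "C", "D", "E"),
  CASES  = c(1000, 5000, 20000, 80000, 300000),
  DEATHS = c(50, 40, 900, 2500, 7000),
  stringsAsFactors = FALSE
)

# Rank-based correlation: less sensitive to the heavy skew of count data
rho <- cor(toy_state$CASES, toy_state$DEATHS, method = "spearman")

print(rho)
```

Since several diseases are recorded only as cases or only as deaths, a high coefficient would describe the similarity of the maps but would not by itself show that more cases led to more deaths.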
To identify the specific diseases with a major impact in the U.S., we need some further filtering to gain deeper insight into the data.
The code below creates separate columns for cases and deaths and splits the diseases by size so that the plots stay readable. The result is two data frames: Disease1 holds the diseases with at least 55,000 cases or more than 60,000 deaths, and all remaining diseases go into Disease2.
```{r}
# Diseases with at least 55,000 cases or more than 60,000 deaths
Disease1 <- Disease %>% pivot_wider(names_from = event, values_from = n, values_fill = 0) # separate CASES and DEATHS columns
Disease1 <- Disease1 %>% filter(CASES >= 55000 | DEATHS > 60000) # keep the large diseases
Disease1 <- Disease1 %>% pivot_longer(cols = c(DEATHS, CASES), names_to = "event", values_to = "number") # back to long format
```
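A split like the one above loses no disease only if the second group is the exact logical complement of the first. The base R sketch below illustrates this on invented totals; the disease names `D1` to `D4` are placeholders.

```r
# Invented totals per disease (wide format: one row per disease)
toy <- data.frame(
  disease = c("D1", "D2", "D3", "D4"),
  CASES   = c(2000000, 52000, 300, 0),
  DEATHS  = c(0, 10, 5, 90000),
  stringsAsFactors = FALSE
)

# Group 1: at least 55,000 cases or more than 60,000 deaths
high <- toy$CASES >= 55000 | toy$DEATHS > 60000

large_group <- toy$disease[high]
small_group <- toy$disease[!high]   # the exact complement, so no disease is lost

print(large_group)
print(small_group)
```

Negating the first condition avoids excluding individual diseases by name and keeps the two groups consistent if the thresholds change.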
The plot below shows the total number of deaths and cases between 1888 and 2014 for the diseases in Disease1. It gives a clear picture of which diseases had a serious impact. For example, in terms of the number of cases, measles exceeds all other diseases.
```{r warning=FALSE, fig.align='center'}
## Bar plot comparing cases and deaths for the diseases in Disease1
ggplot(Disease1, aes(x = reorder(disease, number), y = number, fill = event)) + # y is the numeric total, so bar length is proportional to the value
  geom_col(position = "dodge") + coord_flip() + theme_igray() +
  scale_fill_manual(values = c("slategray3", "slategray4"), labels = c("CASES", "DEATHS")) +
  theme(text = element_text(size = 9), line = element_line(size = 1), legend.position = c(-0.3, 0.0), axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Cases greater than 55,000")
```
The code below selects the remaining diseases, those with fewer than 55,000 cases.
```{r}
# Remaining diseases: fewer than 55,000 cases and at most 60,000 deaths
Disease2 <- Disease %>% pivot_wider(
  names_from = event,
  values_from = n,
  values_fill = 0) # separate CASES and DEATHS columns
Disease2 <- Disease2 %>% filter(CASES < 55000 & DEATHS <= 60000) # exact complement of Disease1, so "PNEUMONIA AND INFLUENZA" (already in Disease1) is excluded
Disease2 <- Disease2 %>% pivot_longer(cols = c(DEATHS, CASES), names_to = "event", values_to = "number") # back to long format
```
When we plot the diseases with fewer than 55,000 cases, as in the plot below, we find that botulism had the least impact in the country compared to the other diseases.
We also see that data are missing for certain diseases, which have records of either deaths or cases but not both. These will not help us in the further investigation, so the data need to be filtered further.
```{r warning=FALSE, fig.align='center'}
# Bar plot comparing cases and deaths for the diseases in Disease2
ggplot(Disease2, aes(x = reorder(disease, number), y = number, fill = event)) + # y is the numeric total, so bar length is proportional to the value
  geom_col(position = "dodge") + coord_flip() + theme_igray() +
  scale_fill_manual(values = c("lemonchiffon3", "lemonchiffon4"), labels = c("CASES", "DEATHS")) +
  theme(text = element_text(size = 10), line = element_line(size = 1), legend.position = c(-0.4, 0.0), axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Cases lower than 55,000")
```
This filtering is done in the following code chunk.
```{r}
# Keep only the diseases that have records of both cases and deaths
Case_n_Death <- Disease %>% pivot_wider(names_from = event, values_from = n, values_fill = 0)
Case_n_Death <- Case_n_Death %>% filter(CASES >= 1 & DEATHS >= 1)
Case_n_Death <- Case_n_Death %>% pivot_longer(cols = c(DEATHS, CASES), names_to = "event", values_to = "number")
```
The code below generates a box plot of the total number of cases and deaths for the 15 diseases that have records of both cases and deaths in the U.S. A box plot summarizes the main characteristics of the data and helps to identify outliers.
Among these 15 diseases, the box plot shows that tuberculosis, pneumonia, pellagra and meningitis have high mortality compared to the other diseases.
```{r fig.align='center'}
# Box plot of the total cases and deaths for each disease, on a log2 scale
ggplot(Case_n_Death, aes(x = number, y = disease, fill = disease)) + theme_light() +
  geom_boxplot() + xscale("log2", .format = FALSE) +
  theme(legend.position = "none") +
  theme(text = element_text(size = 11), line = element_line(size = 1)) +
  ggtitle("Deaths and Cases distribution")
```
The code below keeps only the records where deaths are reported and then computes the total number of deaths caused by each disease in each state of the U.S.
```{r}
State_death_count <- filter(Tycho, event == "DEATHS") # keep only the death records
State_death_count <- State_death_count %>% group_by(state, disease) %>% summarise(number = sum(number), .groups = "drop") # total deaths per state and disease
```
The code below does the same for the records where cases are reported.
```{r}
State_cases_count <- filter(Tycho, event == "CASES") # keep only the case records
State_cases_count <- State_cases_count %>% group_by(state, disease) %>% summarise(number = sum(number), .groups = "drop") # total cases per state and disease
```
The next two code chunks bin the totals into intervals so that they can be shown as a heat map, first for deaths and then for cases.
```{r}
State_death_count$groups <- cut(State_death_count$number, breaks = c(0,10,100,1000,10000,100000,500000)) # bin the death totals into the ranges used as the heat map scale
```
```{r}
State_cases_count$groups <- cut(State_cases_count$number, breaks = c(0,100,1000,10000,100000,1000000,5000000)) # bin the case totals into the ranges used as the heat map scale
```
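The binning step can be tried on a handful of values to see which interval each total falls into. The sketch below uses invented totals with the same break points as the deaths scale. By default `cut()` writes large break points in scientific notation in its labels; the `dig.lab` argument is one way to get plain numbers in the legend.

```r
# Invented death totals to be binned
deaths <- c(5, 50, 500, 50000)
breaks <- c(0, 10, 100, 1000, 10000, 100000, 500000)

# Default labels use scientific notation for large break points
default_groups <- cut(deaths, breaks = breaks)

# dig.lab sets how many digits may be used before that happens
plain_groups <- cut(deaths, breaks = breaks, dig.lab = 7)

print(levels(default_groups))
print(levels(plain_groups))
print(as.integer(plain_groups))   # index of the interval each value falls into
```

Values above the last break point become `NA`, so the largest break has to cover the maximum total in the data.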
```{r fig.align='center'}
# Heat map of deaths, binned into ranges (0-10, 10-100, 100-1000, ...), showing the concentration of deaths per state
ggplot(State_death_count, aes(state, disease)) +
  geom_tile(aes(fill = groups)) + # one tile per state and disease
  scale_fill_manual(breaks = levels(State_death_count$groups), values = c("cornsilk", "gold", "orange", "darkorange2", "red2", "darkred")) + # tile colors, low to high
  theme_tufte() + # plot theme
  theme(text = element_text(size = 9), axis.text.x = element_text(angle = 90, hjust = 1)) + # text size and axis label rotation
  theme(legend.key.width = unit(0.3, "cm"), legend.key.height = unit(0.2, "cm"), legend.direction = 'horizontal', legend.position = c(0.65, 1.05)) +
  ggtitle("Heatmap of Deaths in State")
```
```{r fig.align='center'}
# Heat map of cases, binned into ranges (0-100, 100-1000, 1000-10000, ...), showing the concentration of cases per state
ggplot(State_cases_count, aes(state, disease)) +
  geom_tile(aes(fill = groups)) + # one tile per state and disease
  scale_fill_manual(breaks = levels(State_cases_count$groups), values = c("khaki", "yellow3", "olivedrab3", "turquoise3", "deepskyblue4", "darkblue")) + # six colors for the six ranges, low to high
  theme_tufte() + # plot theme
  theme(text = element_text(size = 7), axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(legend.key.width = unit(0.9, "cm"), legend.key.height = unit(0.1, "cm"), legend.direction = 'horizontal', legend.position = c(0.7, 1.05)) +
  ggtitle("Heatmap of Cases of Disease in State") # plot title
# White space means that there are no cases of that disease in that state
```
```{r}
# Total measles counts per state and decade (1905-1914, 1915-1924, ..., 1995-2004);
# `year` holds the first year of each decade
Measles_decade <- Tycho %>%
  filter(disease == "MEASLES",
         from_date >= as.Date("1905-01-01"),
         from_date < as.Date("2005-01-01")) %>%
  mutate(year = as.integer(format(from_date, "%Y")),
         year = as.integer(5 + 10 * ((year - 5) %/% 10))) %>% # e.g. 1905-1914 -> 1905
  count(state, year, wt = number)
```
```{r fig.align='center'}
# Animation plot to catch the trend of disease per each state
ggplot(Measles_decade, aes( x = state, y = n, size = n, color = state)) +
geom_point(alpha = 1.0, show.legend = FALSE ) +
scale_color_viridis_d() + theme_stata() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 9),axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(legend.position = "none") + scale_size(range = c(2, 12)) +
labs(title = 'Year: {frame_time}', x = 'State', y = 'Cases') + transition_time(year)
```
6.Results
As explained in the problem definition, we set out to find the most abundant disease affecting the states of the U.S. between 1888 and 2014. We analysed the data and filtered it to produce graphs that show the information relevant to our research question. In the course of analysing and plotting the data, a clear picture emerged: the most abundant disease in the U.S. during the years 1888 to 2014, which was our research subject, was measles.
7.Discussion
The results of this work show how R can be used to present a huge data set, and the different types of plots make the findings easy to check. That is also the beauty of graphical representation: one can reach a conclusion far more easily by looking at a plot than by reading a huge table with numerous attributes.
On the other hand, although plain data are hard to understand, they are still better than a plot that makes no sense. It is therefore important to be very specific when producing plots and graphs, so that they give the reader meaningful insights into the data.