Here we cover exploratory data analysis (EDA) in Python for machine learning.
Topic: To visualise how honey production has changed over the years (1998-2016) in the United States.
Background:
In 2006, global concern was raised over the rapid decline in the honeybee population, an integral component of American honey agriculture. Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon in which worker bees disappear and the remaining hive colony collapses. Speculation as to the cause of this disorder points to hive diseases and pesticides harming the pollinators, though no overall consensus has been reached. The U.S. used to locally produce over half the honey it consumes per year. Now, honey mostly comes from overseas, with 350 of the 400 million pounds of honey consumed every year originating from imports. This dataset provides insight into honey production supply and demand in America from 1998 to 2016.
Objective:
To visualise how honey production has changed over the years (1998-2016) in the United States.
Key questions to be answered:
How has honey production yield changed from 1998 to 2016?
Over time, what have the major production trends been across the states?
Are there any patterns that can be observed between total honey production and value of production every year? How has value of production, which in some sense could be tied to demand, changed every year?
Dataset:
state: Various states of U.S.
numcol: Number of honey-producing colonies. Honey producing colonies are the maximum number of colonies from which honey was taken during the year. It is possible to take honey from colonies that did not survive the entire year
yieldpercol: Honey yield per colony. Unit is pounds
totalprod: Total production (numcol x yieldpercol). Unit is pounds
stocks: Refers to stocks held by producers. Unit is pounds
priceperlb: Refers to average price per pound based on expanded sales. The unit is dollars.
prodvalue: Value of production (totalprod x priceperlb). The unit is dollars.
year: Year of production
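The data dictionary defines two derived columns, so a quick consistency check is possible before any plotting. The sketch below is only an illustration: the rows are made-up numbers (not taken from the dataset), and the helper name `derived_column_mismatches` and the 1% tolerance are assumptions introduced here.

```python
import pandas as pd

# Illustrative rows only (made-up numbers), using the column names from the data dictionary.
sample = pd.DataFrame({
    "state": ["A", "B"],
    "numcol": [16000.0, 55000.0],
    "yieldpercol": [71, 60],
    "totalprod": [1136000.0, 3300000.0],
    "priceperlb": [0.72, 0.64],
    "prodvalue": [818000.0, 2112000.0],
})

def derived_column_mismatches(df, rel_tol=0.01):
    """Return rows whose derived columns differ from their definitions by more than rel_tol."""
    prod_expected = df["numcol"] * df["yieldpercol"]        # totalprod = numcol x yieldpercol
    value_expected = df["totalprod"] * df["priceperlb"]     # prodvalue = totalprod x priceperlb
    prod_off = (df["totalprod"] - prod_expected).abs() > rel_tol * prod_expected
    value_off = (df["prodvalue"] - value_expected).abs() > rel_tol * value_expected
    return df[prod_off | value_off]

mismatches = derived_column_mismatches(sample)
print(len(mismatches))
```

A relative tolerance is used because published values may be rounded; on the real data the same function could be called on the loaded DataFrame.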
Import the necessary packages - pandas, numpy, seaborn, matplotlib.pyplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
pd.set_option('display.float_format', lambda x: '%.5f' % x) # To suppress scientific notation when displaying numbers
Read in the dataset
honeyprod = pd.read_csv("honeyproduction1998-2016.csv")
View the first few rows of the dataset
honeyprod.head(10)
Output:
Observations: The dataset looks clean and consistent with the description provided in the Data Dictionary.
Check the shape of the dataset
honeyprod.shape
Output:
(785, 8)
Observations: We have 785 observations (rows) and 8 columns.
Check the datatype of the variables to make sure that the data is read in properly
honeyprod.dtypes
Output:
state           object
numcol         float64
yieldpercol      int64
totalprod      float64
stocks         float64
priceperlb     float64
prodvalue      float64
year             int64
dtype: object
Observations:
state is object data type
year is currently an integer type. Since year is a categorical variable here, let us convert it to the category data type in pandas.
All the other variables are numerical, and therefore their data types (float64 and int64) are fine.
honeyprod.year = honeyprod.year.astype('category') # To convert year into categories
# Uncomment the following code to learn more about the astype function and its arguments
# help(honeyprod.astype)
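To see what the conversion changes, here is a small sketch on a toy DataFrame (the values are invented for illustration). It shows that a category column is no longer treated as numeric, which matters for later numeric summaries.

```python
import pandas as pd

# Toy data, not the honey dataset.
toy = pd.DataFrame({"year": [1998, 1999, 1998], "totalprod": [10.0, 12.0, 8.0]})
toy["year"] = toy["year"].astype("category")

# The distinct years become the categories of the column.
year_categories = list(toy["year"].cat.categories)

# Only truly numeric columns remain when selecting by numeric dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(toy.dtypes)
print(year_categories, numeric_cols)
```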
Let us analyse the quantitative variables in the dataset
honeyprod.describe()
Output:
Observations:
The number of colonies per state is spread over a huge range, from 2,000 to 510,000.
The mean numcol is close to the 75th percentile of the data, indicating a right skew.
As expected, the standard deviation of numcol is very high.
yieldpercol - Yield per colony also has a huge spread, ranging from 19 pounds to 136 pounds.
In fact, all the variables seem to have a huge range. We will have to investigate further whether this spread is mainly across different states or varies within the same state over the years.
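One way to approach the question of whether the spread is mainly across states or within a state over time is to compare two summary numbers. The following is a minimal sketch on invented data; the names `between_state_std` and `within_state_std` are introduced here for illustration.

```python
import pandas as pd

# Toy data: two states observed over three years (invented numbers).
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "year": [1998, 1999, 2000] * 2,
    "numcol": [10000.0, 11000.0, 9000.0, 300000.0, 310000.0, 290000.0],
})

# Spread between states: standard deviation of each state's mean.
between_state_std = toy.groupby("state")["numcol"].mean().std()

# Spread within a state: average of the per-state standard deviations over the years.
within_state_std = toy.groupby("state")["numcol"].std().mean()

print(between_state_std, within_state_std)
```

If the between-state figure dominates, as in this toy example, most of the range seen in describe() would come from differences in state size rather than from year-to-year changes.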
Looking at the relationship between numerical variables using pair plots and correlation plots
sns.pairplot(honeyprod, diag_kind="kde")
Output:
correlation = honeyprod.corr(numeric_only=True) # correlation matrix of the numeric columns only (state and year are non-numeric; needed in recent pandas versions)
correlation
Output:
# Uncomment the following code for information of the arguments
# help(sns.heatmap)
plt.figure(figsize=(15, 7))
sns.heatmap(correlation, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Output:
Observations:
The number of colonies has a high positive correlation with total production, stocks and the value of production. As expected, all these values are highly correlated with each other.
Yield per colony does not have a high correlation with any of the features in our dataset.
The same is true of priceperlb.
Determining the factors that influence yield per colony and price per pound of honey would need further investigation.
Let us now explore the categorical features - state and year
print(honeyprod.state.nunique())
print(honeyprod.year.nunique())
Output:
44
19
We have honey production data for 44 US states over a span of 19 years, from 1998 to 2016.
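Note that 44 states over 19 years would give 44 x 19 = 836 rows if every state reported every year, while the dataset has 785 rows, so some state-year combinations appear to be missing. A sketch of how one might find the states with incomplete coverage, shown on toy data (the real DataFrame would replace `toy`):

```python
import pandas as pd

# Toy data: state B is missing the year 1999 (invented example).
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "year": [1998, 1999, 2000, 1998, 2000],
})

total_years = toy["year"].nunique()
years_per_state = toy.groupby("state")["year"].nunique()

# States that report fewer years than the full span.
incomplete = years_per_state[years_per_state < total_years]
print(incomplete)
```

This matters for the plots that use estimator=sum, because a yearly sum can drop simply when fewer states report.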
Let us look at the overall trend of honey production in the US over the years
plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='totalprod', data=honeyprod, estimator=sum, ci=None)
plt.xticks(rotation=90) # To rotate the x-axis labels
plt.show()
# Uncomment the following code to check the actual values
# honeyprod.groupby(['year'])['totalprod'].sum().reset_index()
Output:
Observations:
The overall honey production in the US has been decreasing over the years.
Total honey production = number of colonies * average yield per colony. Let us check if the honey production is decreasing due to one of these factors or both.
Variation in the number of colonies over the years
plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='numcol', data=honeyprod, ci=None, estimator=sum)
plt.xticks(rotation=90) # To rotate the x-axis labels
plt.show()
Output:
Observations:
The number of colonies across the country shows a declining trend from 1998 to 2008 but has seen an uptick since 2008.
It is possible that there was some intervention in 2008 that helped increase the number of honey bee colonies across the country.
Variation of yield per colony over the years
plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='yieldpercol', data=honeyprod, estimator=sum, ci=None)
plt.xticks(rotation=90) # To rotate the x-axis labels
plt.show()
Output:
Observations:
In contrast to the number of colonies, the yield per colony has been decreasing since 1998.
This indicates that it is not the number of colonies but the yield per colony that is causing the decline in total honey production.
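The plot above sums yieldpercol across states, so its level depends on how many states report in a given year. As a cross-check, one could compute a national yield per colony as total production divided by total colonies for each year. The sketch below uses invented numbers, and the column names `national_yield` and `mean_of_state_yields` are introduced here for illustration.

```python
import pandas as pd

# Toy data: a small high-yield state and a large low-yield state (invented numbers).
toy = pd.DataFrame({
    "state": ["A", "B", "A", "B"],
    "year": [1998, 1998, 1999, 1999],
    "numcol": [1000.0, 9000.0, 1000.0, 9000.0],
    "yieldpercol": [100, 50, 90, 50],
})
toy["totalprod"] = toy["numcol"] * toy["yieldpercol"]

yearly = toy.groupby("year")[["totalprod", "numcol"]].sum()
# Colony-weighted national yield: pounds produced per colony across all states.
yearly["national_yield"] = yearly["totalprod"] / yearly["numcol"]
# Unweighted mean of state yields, for comparison.
yearly["mean_of_state_yields"] = toy.groupby("year")["yieldpercol"].mean()
print(yearly)
```

The weighted figure gives large producers proportionally more influence, which is usually what a national trend is meant to capture.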
Let us look at the production trend at state level
# Add hue parameter to the pointplot to plot for each state
plt.figure(figsize=(15, 7)) # To resize the plot
sns.pointplot(x='year', y='totalprod', data=honeyprod, estimator=sum, ci=None, hue = 'state')
plt.legend(bbox_to_anchor=(1, 1))
plt.xticks(rotation=90) # To rotate the x-axis labels
plt.show()
Output:
Observations: Some states have much higher production than others, but this plot is a little hard to read. Let us try plotting each state separately for a better understanding.
Catplot:
sns.catplot(x='year', y='totalprod', data=honeyprod,
estimator=sum, col='state', kind="point",
height=3,col_wrap = 5)
plt.show()
Output:
Observations:
The most prominent honey-producing states in the US are California, Florida, North Dakota, South Dakota and Montana.
Unfortunately, honey production in California has seen a steep decline over the years.
Florida's total production has also been on a decline.
South Dakota has more or less maintained its level of production.
North Dakota has actually seen an impressive increase in honey production.
Let us look at the yearly trend in number of colonies and yield per colony in these 5 states
cplot1=sns.catplot(x='year', y='numcol',
data=honeyprod[honeyprod["state"].isin(["North Dakota","California","South Dakota","Florida","Montana"])],
estimator=sum, col='state', kind="point",
height=3,col_wrap = 5)
cplot1.set_xticklabels(rotation=90)
plt.show()
Output:
cplot2=sns.catplot(x='year', y='yieldpercol',
data=honeyprod[honeyprod["state"].isin(["North Dakota","California","South Dakota","Florida","Montana"])],
estimator=sum, col='state', kind="point",
height=3,col_wrap = 5)
cplot2.set_xticklabels(rotation=90)
plt.show()
Output:
Observation:
In North Dakota, the number of colonies has increased significantly over the years as compared to the other 4 states
The yield per colony, however, shows an overall decreasing trend in all 5 states over the years.
Let us see what effect the declining production trend has had on the value of production
sns.pointplot(x="year", y="prodvalue", data=honeyprod, ci=None)
plt.xticks(rotation=90) # To rotate the x-axis labels
plt.show()
Output:
Observations:
This is an interesting trend. As total production has declined over the years, the value of production has increased over time.
As the supply declined, demand appears to have pushed up the value of honey.
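To look at value per pound directly rather than total value, one option is to divide the yearly total value of production by the yearly total production. This is a sketch on invented numbers; `value_per_lb` is a name introduced here, not a column of the dataset.

```python
import pandas as pd

# Toy data: production falls while the value of production rises (invented numbers).
toy = pd.DataFrame({
    "year": [1998, 1998, 2016, 2016],
    "totalprod": [1000.0, 3000.0, 800.0, 2200.0],
    "prodvalue": [700.0, 1900.0, 1800.0, 4500.0],
})

yearly = toy.groupby("year")[["totalprod", "prodvalue"]].sum()
# Dollars of production value per pound produced, aggregated over all states.
yearly["value_per_lb"] = yearly["prodvalue"] / yearly["totalprod"]
print(yearly)
```

On the real data, plotting `value_per_lb` against year would show whether the rise in value comes from price rather than volume.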
Let us check which of the states have been capitalising on this trend. We can compare the total production with the stocks held by the producers
plt.figure(figsize = (20,20)) # To resize the plot
# Plot total production per state
sns.barplot(x="totalprod", y="state", data=honeyprod.sort_values("totalprod", ascending=False),
label="Total Production", color="b", ci=None)
# Plot stocks per state
sns.barplot(x="stocks", y="state", data=honeyprod.sort_values("totalprod", ascending=False),
label="Stocks", color="r", ci=None)
# Add a legend
plt.legend(ncol=2, loc="lower right", frameon=True)
plt.show()
Output:
Observations:
North Dakota has been able to sell more honey as compared to South Dakota despite having the highest production value.
Florida has the highest efficiency among the major honey producing states
Michigan is more efficient than Wisconsin in selling honey.
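The text does not define "efficiency" numerically. Assuming it means the share of production that is not left in producers' stocks, a stocks-to-production ratio per state gives a number to compare (lower ratio, more of the honey sold). The sketch below uses invented data, and `stock_ratio` is a hypothetical column name.

```python
import pandas as pd

# Toy data: two states with equal total production but different stocks (invented numbers).
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "totalprod": [1000.0, 3000.0, 2000.0, 2000.0],
    "stocks": [100.0, 300.0, 1000.0, 1000.0],
})

per_state = toy.groupby("state")[["totalprod", "stocks"]].sum()
# Fraction of production still held as stock; an assumed proxy for selling efficiency.
per_state["stock_ratio"] = per_state["stocks"] / per_state["totalprod"]
ranked = per_state.sort_values("stock_ratio")
print(ranked)
```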
Let us look at the spread of average price of a pound of honey
plt.figure(figsize=(15, 7))
sns.histplot(honeyprod.priceperlb)
plt.show()
Output:
Box Plot:
sns.boxplot(data = honeyprod, x = 'priceperlb')
plt.show()
Output:
Observations:
Price per pound of honey has a right-skewed distribution with a lot of outliers towards the higher end.
The median price per pound of honey is about 1.5 dollars.
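The visual reading of the histogram and box plot can be backed with numbers. A minimal sketch on an invented price series: a mean above the median and positive skewness suggest a right skew, and the usual box-plot rule (values above Q3 + 1.5 x IQR) flags high-end outliers.

```python
import pandas as pd

# Invented prices with one very high value, standing in for honeyprod.priceperlb.
prices = pd.Series([1.0, 1.2, 1.4, 1.5, 1.6, 1.8, 2.0, 7.0], name="priceperlb")

median_price = prices.median()
mean_price = prices.mean()
skewness = prices.skew()  # positive for a right-skewed distribution

# Upper whisker limit used by a standard box plot.
q1, q3 = prices.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
high_outliers = prices[prices > upper_fence]

print(median_price, mean_price, skewness)
print(high_outliers.tolist())
```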
Let us look at the average price per pound of honey across states
plt.figure(figsize=(15, 7)) # To resize the plot
sns.barplot(data = honeyprod, x = "state", y = "priceperlb", ci=None, color = "coral",
order=honeyprod.groupby('state').priceperlb.mean().sort_values(ascending = False).index)
plt.xticks(rotation=90) # To rotate the x-axis labels
plt.show()
Output:
Observations:
Virginia has the highest price per pound of honey.
The average price per pound of honey in the major honey producing states is towards the lower end.
Conclusion
We can conclude that the total honey production has declined over the years whereas the value of production per pound has increased.
The decline in honey production appears to be driven by the decrease in yield per colony over the years.
The major honey producing states are California, Florida, North Dakota, South Dakota and Montana.
Among these, Florida has been very efficient in selling honey.