In this article we analyse a company's supply-chain data using Python and big-data tools. Every company keeps records of what it sells and purchases, and analysing those records helps it predict profit and loss.
Our main input file for this assessment is called DataCoSupplyChainDataset.csv, a subset of a dataset from Kaggle, which contains supply chains used by a company called DataCo Global. There is a second file provided, called DescriptionDataCoSupplyChain.csv, which describes the columns in the main dataset.
Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns; sns.set(style="ticks", color_codes=True)
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.express as px
%matplotlib inline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected = True)
Load dataset:
df=pd.read_csv('/content/DataCoSupplyChainDataset.csv')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
df.head()
Selecting Useful Features and Dropping Unnecessary Ones
data=df.copy()
FeatureList=['Type', 'Benefit per order', 'Sales per customer',
'Delivery Status', 'Late_delivery_risk', 'Category Name', 'Customer City', 'Customer Country',
'Customer Id', 'Customer Segment',
'Customer State', 'Customer Zipcode', 'Department Name', 'Latitude', 'Longitude',
'Market', 'Order City', 'Order Country', 'Order Customer Id', 'order date (DateOrders)', 'Order Id',
'Order Item Cardprod Id', 'Order Item Discount', 'Order Item Discount Rate', 'Order Item Id',
'Order Item Product Price', 'Order Item Profit Ratio', 'Order Item Quantity', 'Sales', 'Order Item Total',
'Order Profit Per Order', 'Order Region', 'Order State', 'Order Status', 'Order Zipcode', 'Product Card Id',
'Product Category Id', 'Product Description', 'Product Image', 'Product Name', 'Product Price', 'Product Status',
'shipping date (DateOrders)', 'Shipping Mode']
df1 = df[FeatureList].copy()  # .copy() lets us add columns later without a SettingWithCopyWarning
df1.head()
Exploratory Data Analysis
Delivery status
Bar Plot:
data_delivery_status=df1.groupby(['Delivery Status'])['Order Id'].count().reset_index(name='Number of Orders').sort_values(by= 'Number of Orders', ascending= False)
px.bar(data_delivery_status, x='Delivery Status', y='Number of Orders',
       color='Number of Orders')
data_delivery_status_region=df1.groupby(['Delivery Status', 'Order Region'])['Order Id'].count().reset_index(name='Number of Orders').sort_values(by= 'Number of Orders', ascending= False)
px.bar(data_delivery_status_region, x='Delivery Status', y='Number of Orders', color='Order Region')
Top 20 Customers by Number of Orders
df1['Customer_ID_STR']=df1['Customer Id'].astype(str)
data_customers=df1.groupby(['Customer_ID_STR'])['Order Id'].count().reset_index(name='Number of Orders').sort_values(by= 'Number of Orders', ascending= False)
px.bar(data_customers.head(20),x='Number of Orders', y='Customer_ID_STR' , color='Number of Orders' )
Top 20 Customers by Total Order Profit
data_customers_profit=df1.groupby(['Customer_ID_STR'])['Order Profit Per Order'].sum().reset_index(name='Profit of Orders').sort_values(by= 'Profit of Orders', ascending= False)
px.bar(data_customers_profit.head(20),x='Profit of Orders', y='Customer_ID_STR' , color='Profit of Orders')
Customer Segment
#Customer Segments
data_Customer_Segment=df1.groupby(['Customer Segment'])['Order Id'].count().reset_index(name='Number of Orders').sort_values(by= 'Number of Orders', ascending= False)
px.pie(data_Customer_Segment, values='Number of Orders', names= 'Customer Segment' , title= 'Number of Orders of different Customer Segments',
width=600 , height=600 , color_discrete_sequence = px.colors.sequential.RdBu)
Category
#Category Name
data_Category_Name=df1.groupby(['Category Name'])['Order Id'].count().reset_index(name='Number of Orders').sort_values(by= 'Number of Orders', ascending= True)
px.bar(data_Category_Name, x='Number of Orders',y = 'Category Name',color ='Number of Orders')
Geo Features
data_Region=df1.groupby(['Order Region'])['Order Id'].count().reset_index(name='Number of Orders').sort_values(by= 'Number of Orders', ascending= True)
px.bar(data_Region, x='Number of Orders',y = 'Order Region',color ='Number of Orders')
data_countries=df1.groupby(['Order Country'])['Order Id'].count().reset_index(name='Number of Orders').sort_values(by= 'Number of Orders', ascending= True)
px.bar(data_countries.head(20), x='Number of Orders',y = 'Order Country',color ='Number of Orders')
df_geo=df1.groupby([ 'Order Country', 'Order City'])['Order Profit Per Order'].sum().reset_index(name='Profit of Orders').sort_values(by= 'Profit of Orders', ascending= False)
df_geo
# A choropleth needs one row per country, so re-aggregate the city-level profits
df_geo_country = df_geo.groupby('Order Country')['Profit of Orders'].sum().reset_index()
fig = px.choropleth(df_geo_country, locationmode='country names', locations='Order Country',
                    color='Profit of Orders',  # colour each country by its total order profit
                    hover_name='Order Country',
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
Sales Analysis
#Order Country
df_sales_country=df1.groupby([ 'Order Country'])['Sales'].sum().reset_index(name='Sales of Orders').sort_values(by= 'Sales of Orders', ascending= False)
px.bar(df_sales_country.head(10), x='Sales of Orders',y = 'Order Country',color ='Sales of Orders')
#Product
df_sales_country=df1.groupby([ 'Product Name'])['Sales'].sum().reset_index(name='Sales of Orders').sort_values(by= 'Sales of Orders', ascending= False)
px.bar(df_sales_country.head(10), x='Sales of Orders',y = 'Product Name',color ='Sales of Orders')
#Product and delivery status
df_sales_pd=df1.groupby([ 'Product Name', 'Delivery Status'])['Sales'].sum().reset_index(name='Sales of Orders').sort_values(by= 'Sales of Orders', ascending= False)
px.bar(df_sales_pd.head(10), x='Sales of Orders',y = 'Product Name',color ='Delivery Status')
#Product and order region
df_sales_pr=df1.groupby([ 'Product Name', 'Order Region'])['Sales'].sum().reset_index(name='Sales of Orders').sort_values(by= 'Sales of Orders', ascending= False)
px.bar(df_sales_pr.head(10), x='Sales of Orders',y = 'Product Name',color ='Order Region')
# Type of payment
df_sales_type=df1.groupby(['Type'])['Sales'].sum().reset_index(name='Sales of Orders').sort_values(by= 'Sales of Orders', ascending= False)
px.bar(df_sales_type, x='Sales of Orders',y = 'Type',color ='Sales of Orders')
Date and sales analysis
data_orderdate = df[['order date (DateOrders)', 'Sales']].copy()  # .copy() avoids a SettingWithCopyWarning
data_orderdate['order_date'] = pd.to_datetime(data_orderdate['order date (DateOrders)'])
data_orderdate['Quarter'] = data_orderdate['order_date'].dt.quarter
data_orderdate['Month'] = data_orderdate['order_date'].dt.month
data_orderdate['Year'] = data_orderdate['order_date'].dt.year
data_orderdate['YearStr'] = data_orderdate['Year'].astype(str)
df_sales_year=data_orderdate.groupby([ 'YearStr'])['Sales'].sum().reset_index(name='Sales of Orders').sort_values(by= 'Sales of Orders', ascending= False)
px.bar(df_sales_year, x='Sales of Orders',y = 'YearStr',color ='Sales of Orders')
Forecasting
Predicting if an order is fraud or not
data=df1.copy()
data['SUSPECTED_FRAUD'] = np.where(data['Order Status'] == 'SUSPECTED_FRAUD', 1, 0)
from sklearn.preprocessing import LabelEncoder

def Labelencoder_feature(x):
    # Fit a fresh encoder and return the integer-encoded column
    le = LabelEncoder()
    return le.fit_transform(x)

features = data.drop(columns=['SUSPECTED_FRAUD', 'Order Status'])
target = data['SUSPECTED_FRAUD']
features.isnull().sum()
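The notebook stops before any model is actually fit. One way to finish the pipeline is sketched below; the tiny synthetic stand-in frame, the RandomForestClassifier choice and the 70/30 split are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Tiny synthetic stand-in for df1 so the sketch runs without the real CSV
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    'Type': rng.choice(['DEBIT', 'TRANSFER', 'PAYMENT'], n),
    'Sales': rng.uniform(10, 500, n),
    'Order Status': rng.choice(['COMPLETE', 'SUSPECTED_FRAUD'], n, p=[0.9, 0.1]),
})
data['SUSPECTED_FRAUD'] = np.where(data['Order Status'] == 'SUSPECTED_FRAUD', 1, 0)

features = data.drop(columns=['SUSPECTED_FRAUD', 'Order Status'])
target = data['SUSPECTED_FRAUD']

# Label-encode every remaining categorical column
for col in features.select_dtypes(include='object').columns:
    features[col] = LabelEncoder().fit_transform(features[col])

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42, stratify=target)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print('Test accuracy:', accuracy_score(y_test, clf.predict(X_test)))
```

On the real dataset the same loop would encode every object-typed column of `features` before the split; with heavily imbalanced fraud labels, a stratified split (as above) keeps both classes in the test set.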
Implement Using PySpark
Requirement Details
In this assessment you are required to answer five questions using Spark. For each question you will need to write code which uses the appropriate transformations and actions.
Our main input file for this assessment is called DataCoSupplyChainDataset.csv, a subset of a dataset from Kaggle, which contains supply chains used by a company called DataCo Global. There is a second file provided, called DescriptionDataCoSupplyChain.csv, which describes the columns in the main dataset.
You should use the following template file to write your code: test3_solutions.py. See the video instructions provided with the assessment instructions for an example of how to use the template.
Q1. Load the data, convert to dataframe and apply appropriate column names and variable types.
Q2. Determine what proportion of all transactions is attributed to each customer segment in the dataset i.e. Consumer = x%, Corporate = y% etc. This question uses the Customer Segment field.
Q3. Determine which three products had the most amount of sales. This question uses the Order Item Total and Product Name fields.
Q4. For each transaction type, determine the average item cost. This question uses the Type, Order Item Product Price and Order Item Quantity fields.
Q5. What is the first name of the most regular customer in Puerto Rico? (Repeat transactions by the same customer should not count as separate customers.) This question uses the Customer Country, Customer Fname and Customer Id fields.
If you need solutions to the above questions using PySpark, contact us and get instant help at an affordable price.
Contact Us or Send Your Requirement Details at:
realcode4you@gmail.com