Introduction
Hospitals are constantly challenged to provide timely patient care while maintaining high resource utilization. While this challenge has been around for many years, the recent COVID-19 pandemic has increased its prominence. For a hospital, the ability to predict a patient's length of stay (LOS) as early as possible (at the admission stage) is very useful in managing its resources.
In this task, you will develop an ML model to predict whether a patient will be discharged from hospital early or will stay for an extended period (see the task below for the exact definition), based on several attributes (features) related to patient characteristics, diagnoses, treatments, services, hospital charges and the patient's socio-economic background.
The machine learning task we are interested in is: “Predict if a given patient (i.e. newborn child) will be discharged from the hospital within 3 days (class 0) or will stay in hospital beyond that - 4 days or more (class 1)”.
The data set for developing your models is provided on Canvas. Note that you need to transform the target column (“LengthOfStay”) to match the two classes mentioned in the above task: class 0 if LengthOfStay < 4 and class 1 otherwise.
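The class rule above can be sketched on a few made-up values. The toy series below is hypothetical; in the assignment the column comes from the training CSV.

```python
import pandas as pd

# Hypothetical toy values standing in for the real LengthOfStay column.
los = pd.Series([1, 2, 3, 4, 7], name="LengthOfStay")

# Class 0 if LengthOfStay < 4, class 1 otherwise.
target = (los >= 4).astype(int)

print(target.tolist())
# Share of each class, useful for spotting class imbalance early.
print(target.value_counts(normalize=True).sort_index())
```

Checking the class proportions straight after the transformation is worthwhile, because an imbalanced target affects which performance measures are meaningful.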
You need to come up with an approach (that follows the restrictions in 3.2), where each element of the system is justified using data analysis, performance analysis and/or knowledge from relevant literature.
As one of the aims of the assignment is to become familiar with the machine learning paradigm, you should evaluate multiple different models (only use techniques taught in class up to week 5 - inclusive) to determine which one is most appropriate for this task.
Set up an evaluation framework, including selecting appropriate performance measures and determining how to split the data.
Finally, you need to analyse the model and its results using appropriate techniques, establish how adequate your model is for performing the task in the real world, and discuss any limitations (ultimate judgement).
Predict the result for the test set.
Dataset
The data set for this assignment is available on Canvas. There are the following files:
“README.md”: Description of dataset.
“train data.csv”: Contains the training set, with attributes and target for each patient. This data is to be used in developing the models. Use this for your own exploration and evaluation of which approach you think is “best” for this prediction task.
“test data.csv”: Contains the test set, with attributes for each patient. You need to make predictions for this data and submit the predictions via Canvas. The teaching team will use this data to evaluate the performance of the model you have developed.
“s1234567 predictions.csv”: Shows the expected format for your predictions on the unseen test data. You should organize your predictions in this format. Any deviation from this format will result in zero marks for the results part. Change the number in the filename to your student ID.
Implementation
Import Libraries
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
Load Dataset
train_df=pd.read_csv('train_data.csv')
train_df
Output Result
Describe the Dataset
display(train_df.describe())
Output:
Check whether any null values are present
#check for missing values
train_df.isnull().values.any()
Output
False
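A result of False only says that no cell is NaN. The value counts later in this post show a CCSProcedureCode of -1 and a PaymentTypology of "Unknown", which may act as placeholders for missing information. The sketch below counts both kinds on a hypothetical toy frame; treating -1 and "Unknown" as missing markers is an assumption, not something the dataset description confirms here.

```python
import pandas as pd

# Hypothetical toy rows; column names match the dataset, values are made up.
df = pd.DataFrame({
    "CCSProcedureCode": [228, -1, 0, 115],
    "PaymentTypology": ["Medicaid", "Unknown", "Self-Pay", "Medicaid"],
})

# True NaN values per column.
null_counts = df.isnull().sum()

# Assumed placeholder values that isnull() does not detect.
sentinel_counts = {
    "CCSProcedureCode": int((df["CCSProcedureCode"] == -1).sum()),
    "PaymentTypology": int((df["PaymentTypology"] == "Unknown").sum()),
}

print(null_counts)
print(sentinel_counts)
```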
Data Visualization
# Distribution of patients by Health Service Area
count_classes = train_df['HealthServiceArea'].value_counts(sort=True)
count_classes.plot(kind = 'pie', shadow=True, legend=True)
plt.title("Health Service Area distribution")
Output:
# Distribution of patients by Gender
count_classes = train_df['Gender'].value_counts(sort=True)
count_classes.plot(kind = 'pie', shadow=True, legend=True,autopct='%1.2f')
plt.title("Gender distribution")
Output:
# Distribution of patients by Race
count_classes = train_df['Race'].value_counts(sort=True)
count_classes.plot(kind = 'pie', shadow=True, legend=True,autopct='%1.2f')
plt.title("Based on Race distribution")
Output:
# Distribution of patients by TypeOfAdmission
count_classes = train_df['TypeOfAdmission'].value_counts(sort=True)
count_classes.plot(kind = 'pie', shadow=True, legend=True,autopct='%1.2f')
plt.title("Based on TypeOfAdmission distribution")
Output:
# Distribution of patients by PaymentTypology
count_classes = train_df['PaymentTypology'].value_counts(sort=True)
count_classes.plot(kind = 'pie', shadow=True, legend=True,autopct='%1.2f')
plt.title("Based on PaymentTypology distribution")
Output:
Histograms
plt.hist(train_df.AverageCostInFacility, label='Cost In Facility')
plt.legend(loc='upper right')
plt.xlabel('Average Cost In Facility')
plt.ylabel('Number of Patients')
plt.show()
Output:
Data preprocessing
Convert length of stay to 0 and 1
train_df['LengthOfStay'] = train_df['LengthOfStay'].apply(lambda x: 1 if x > 3 else 0)
print(train_df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59966 entries, 0 to 59965
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 59966 non-null int64
1 HealthServiceArea 59966 non-null object
2 Gender 59966 non-null object
3 Race 59966 non-null object
4 TypeOfAdmission 59966 non-null object
5 CCSProcedureCode 59966 non-null int64
6 APRSeverityOfIllnessCode 59966 non-null int64
7 PaymentTypology 59966 non-null object
8 BirthWeight 59966 non-null int64
9 EmergencyDepartmentIndicator 59966 non-null object
10 AverageCostInCounty 59966 non-null int64
11 AverageChargesInCounty 59966 non-null int64
12 AverageCostInFacility 59966 non-null int64
13 AverageChargesInFacility 59966 non-null int64
14 AverageIncomeInZipCode 59966 non-null int64
15 LengthOfStay 59966 non-null int64
dtypes: int64(10), object(6)
memory usage: 7.3+ MB
None
Remove ID and HealthServiceArea as per requirement
remove_col=[]
cat_col=[]
train_df.drop(['ID', 'HealthServiceArea'], axis=1, inplace=True)
remove_col.append('ID')
remove_col.append('HealthServiceArea')
Transform Gender with label encoder to convert categorical data to numerical
# creating instance of labelencoder
Gender_labelencoder = LabelEncoder()
# Encode the categories as integers, overwriting the original column
train_df['Gender'] = Gender_labelencoder.fit_transform(train_df['Gender'])
cat_col.append('Gender')
Convert categorical data to numerical
Transform Race with label encoder
# creating instance of labelencoder
Race_labelencoder = LabelEncoder()
# Encode the categories as integers, overwriting the original column
train_df['Race'] = Race_labelencoder.fit_transform(train_df['Race'])
cat_col.append('Race')
train_df['TypeOfAdmission'].value_counts()
Output:
Newborn 58741
Emergency 659
Urgent 412
Elective 154
Name: TypeOfAdmission, dtype: int64
Transform TypeOfAdmission with label encoder
# creating instance of labelencoder
TypeOfAdmission_labelencoder = LabelEncoder()
# Encode the categories as integers, overwriting the original column
train_df['TypeOfAdmission'] = TypeOfAdmission_labelencoder.fit_transform(train_df['TypeOfAdmission'])
cat_col.append('TypeOfAdmission')
train_df['CCSProcedureCode'].value_counts()
Output:
228 19886
115 13628
0 11189
220 10773
231 2981
-1 769
216 740
Name: CCSProcedureCode, dtype: int64
Transform CCSProcedureCode with label encoder
# creating instance of labelencoder
CCSProcedureCode_labelencoder = LabelEncoder()
# Encode the procedure codes as consecutive integers, overwriting the original column
train_df['CCSProcedureCode'] = CCSProcedureCode_labelencoder.fit_transform(train_df['CCSProcedureCode'])
cat_col.append('CCSProcedureCode')
train_df['APRSeverityOfIllnessCode'].value_counts()
Output:
1 47953
2 8760
3 3252
4 1
Name: APRSeverityOfIllnessCode, dtype: int64
train_df['PaymentTypology'].value_counts()
Output:
Medicaid 28723
Private Health Insurance 15608
Blue Cross/Blue Shield 12073
Self-Pay 1984
Federal/State/Local/VA 849
Managed Care, Unspecified 545
Miscellaneous/Other 118
Medicare 44
Unknown 22
Name: PaymentTypology, dtype: int64
Transform PaymentTypology with label encoder
# creating instance of labelencoder
PaymentTypology_labelencoder = LabelEncoder()
# Encode the categories as integers, overwriting the original column
train_df['PaymentTypology'] = PaymentTypology_labelencoder.fit_transform(train_df['PaymentTypology'])
cat_col.append('PaymentTypology')
train_df['EmergencyDepartmentIndicator'].value_counts()
Output:
N 59453
Y 513
Name: EmergencyDepartmentIndicator, dtype: int64
Transform EmergencyDepartmentIndicator with label encoder
# creating instance of labelencoder
EmergencyDepartmentIndicator_labelencoder = LabelEncoder()
# Encode the categories as integers, overwriting the original column
train_df['EmergencyDepartmentIndicator'] = EmergencyDepartmentIndicator_labelencoder.fit_transform(train_df['EmergencyDepartmentIndicator'])
cat_col.append('EmergencyDepartmentIndicator')
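Label encoding assigns each category an integer, which linear models such as logistic regression read as an ordering even when the categories (for example admission types) have none. One alternative is one-hot encoding with the OneHotEncoder imported earlier. The sketch below is a minimal illustration on hypothetical toy frames, not the approach used in this post. It also shows fitting on training data only and reusing the fitted encoder on test data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames; the real data has many more rows and columns.
train = pd.DataFrame({"TypeOfAdmission": ["Newborn", "Emergency", "Urgent", "Newborn"]})
test = pd.DataFrame({"TypeOfAdmission": ["Newborn", "Elective"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["TypeOfAdmission"]]).toarray()
test_encoded = encoder.transform(test[["TypeOfAdmission"]]).toarray()

print(encoder.categories_[0])  # categories learned from the training data
print(test_encoded)
```

The same fit-on-train, transform-on-test pattern applies to the LabelEncoder instances above when preparing the test set.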
Separate Features and Target
x=train_df.drop('LengthOfStay', axis=1)
y_train=train_df.LengthOfStay
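The assignment asks for a decision on how to split the data, and train_test_split is imported but not used in this post. A minimal sketch of a stratified hold-out split follows. The toy data and the 80/20 ratio are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 patients, 20% in class 1.
rng = np.random.default_rng(0)
X = pd.DataFrame({"BirthWeight": rng.integers(2000, 4500, size=100)})
y = pd.Series([1] * 20 + [0] * 80, name="LengthOfStay")

# stratify=y keeps the class proportions the same in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_tr), len(X_val))
print(y_tr.mean(), y_val.mean())
```

Scores computed on the held-out part give a less optimistic picture than scores computed on the rows the model was fitted on.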
Find which features are important
XGBClassifier
model = XGBClassifier()
model.fit(x, y_train)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()
Output:
Based on the feature importance graph above, features 0, 5 and 7 contribute very little.
These correspond to Gender, PaymentTypology and EmergencyDepartmentIndicator, so we remove them.
x
x_train = x.drop(['Gender','PaymentTypology','EmergencyDepartmentIndicator'], axis=1)
remove_col.append('Gender')
remove_col.append('PaymentTypology')
remove_col.append('EmergencyDepartmentIndicator')
model = XGBClassifier()
model.fit(x_train, y_train)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()
Output:
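Reading importances by bar position is error-prone, so it can help to pair each value with its column name. The sketch below does that with a pandas Series. It uses a scikit-learn DecisionTreeClassifier in place of XGBClassifier only so the example has no extra dependency, and the toy data and labelling rule are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label depends on BirthWeight only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Gender": rng.integers(0, 2, size=200),
    "BirthWeight": rng.integers(1500, 4500, size=200),
})
y = (X["BirthWeight"] < 2500).astype(int)

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Attach column names so the ranking is readable without counting bars.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The same `pd.Series(model.feature_importances_, index=x.columns)` pattern should work with the fitted XGBClassifier above, since it also exposes `feature_importances_`.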
Logistic Regression
# Define the LR model and tune the regularisation strength C with a grid search
#weights = np.linspace(0.05, 0.95, 20)
params = {'C': [10**-4, 10**-3, 10**-2, 10**-1, 1, 10**1, 10**2, 10**3],
          'penalty': ['l2']  #,'class_weight': [{0: x, 1: 1.0-x} for x in weights]
          }
clf = LogisticRegression(n_jobs=-1, random_state=42)
# GridSearchCV clones and fits the estimator itself, so no separate clf.fit() call is needed
model = GridSearchCV(estimator=clf, cv=5, n_jobs=-1, param_grid=params, scoring='f1', verbose=2)
model.fit(x_train, y_train)
print("Best estimator is", model.best_params_)
Output:
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 19.2s
[Parallel(n_jobs=-1)]: Done 40 out of 40 | elapsed: 20.4s finished
Best estimator is {'C': 10, 'penalty': 'l2'}
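The grid above searches only C; the class-weight option is commented out. Since the two classes are imbalanced, it may be worth including class_weight in the search. The sketch below shows one way to do that on synthetic data generated with make_classification. The data, grid values and max_iter setting are assumptions for illustration, not results from this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (roughly 83% class 0).
X, y = make_classification(n_samples=400, n_features=6, weights=[0.83], random_state=0)

# Search over regularisation strength and class weighting together.
params = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid=params,
    cv=5,
    scoring="f1",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```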
%%time
# Model fitting using the best parameter (%%time must be the first line of the cell).
log_clf = LogisticRegression(n_jobs= -1,random_state=42,C= model.best_params_['C'],penalty= 'l2')
log_clf.fit(x_train,y_train)
# Note: predictions are made on the training data, so the scores below are optimistic
y_pred = log_clf.predict(x_train)
con_mat =confusion_matrix (y_train, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)
#Print the classification report
print(classification_report(y_train,y_pred))
Output:
---------------------------------------------------------------------------------------------------------------------
Confusion Matrix:
[[49120 775]
[ 8893 1178]]
---------------------------------------------------------------------------------------------------------------------
Type 1 error (False Positive) = 775
Type 2 error (False Negative) = 8893
---------------------------------------------------------------------------------------------------------------------
Total cost = 4454250
---------------------------------------------------------------------------------------------------------------------
precision recall f1-score support
0 0.85 0.98 0.91 49895
1 0.60 0.12 0.20 10071
accuracy 0.84 59966
macro avg 0.72 0.55 0.55 59966
weighted avg 0.81 0.84 0.79 59966
CPU times: user 252 ms, sys: 111 ms, total: 364 ms
Wall time: 2.04 s
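The figures in this output can be reproduced by hand from the confusion matrix. The sketch below does the arithmetic for the matrix printed above. The weights 10 and 500 are taken from the code in this post, which does not say where they come from.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class).
con_mat = np.array([[49120, 775],
                    [8893, 1178]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = con_mat.ravel()

# Cost weights as used in the post's code: 10 per false positive, 500 per false negative.
total_cost = fp * 10 + fn * 500

recall_1 = tp / (tp + fn)      # share of long-stay patients that were found
precision_1 = tp / (tp + fp)   # share of long-stay predictions that were right

print(total_cost)
print(round(recall_1, 2), round(precision_1, 2))
```

The recall of about 0.12 for class 1 means that most patients who stay four days or more are predicted as early discharges, which is what drives the total cost.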
SVM
# Model fitting and hyperparameter tuning to find the best parameters.
x_cfl = SVC()
params = {
    'max_iter': [5, 10, 20, 30, -1],
    # 'degree' only affects the 'poly' kernel; SVC defaults to 'rbf', so it has no effect here
    'degree': [3],
}
model = GridSearchCV(x_cfl, param_grid=params, verbose=10, n_jobs=-1, scoring='f1', cv=5)
model.fit(x_train, y_train)
print("Best estimator is", model.best_params_)
Output:
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.1s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1349s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 1.5s
[Parallel(n_jobs=-1)]: Done 22 out of 25 | elapsed: 3.9min remaining: 32.0s
[Parallel(n_jobs=-1)]: Done 25 out of 25 | elapsed: 5.0min remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 25 out of 25 | elapsed: 5.0min finished
Best estimator is {'degree': 3, 'max_iter': 20}
/usr/local/lib/python3.7/dist-packages/sklearn/svm/_base.py:231: ConvergenceWarning: Solver terminated early (max_iter=20). Consider pre-processing your data with StandardScaler or MinMaxScaler. % self.max_iter, ConvergenceWarning)
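The ConvergenceWarning suggests scaling the features first. One common way to do that is to put StandardScaler and SVC in a single scikit-learn pipeline, so the scaler is fitted only on the data the model is trained on. The sketch below uses synthetic data with deliberately mismatched feature scales; it is an illustration of the idea, not a result for this dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; columns are rescaled to mimic mixed units
# (for example small codes next to large cost values).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = X * np.array([1, 10, 100, 1000, 10000])

# The pipeline standardises each feature before the SVM sees it.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X, y)
pred = scaled_svm.predict(X)

print("Training accuracy:", scaled_svm.score(X, y))
```

A pipeline like this can be passed to GridSearchCV directly, with parameter names prefixed by the step name (for example `svc__C`).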
%%time
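# Model fitting: degree comes from the grid search, while max_iter=-1 removes the iteration cap (the grid-search winner, max_iter=20, stopped before converging).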
svm_clf = SVC(max_iter= -1, degree= model.best_params_['degree'])
svm_clf.fit(x_train,y_train)
y_pred = svm_clf.predict(x_train)
con_mat =confusion_matrix (y_train, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)
#Print the classification report
print(classification_report(y_train,y_pred))
Output:
---------------------------------------------------------------------------------------------------------------------
Confusion Matrix:
[[49895 0]
[10071 0]]
---------------------------------------------------------------------------------------------------------------------
Type 1 error (False Positive) = 0
Type 2 error (False Negative) = 10071
---------------------------------------------------------------------------------------------------------------------
Total cost = 5035500
---------------------------------------------------------------------------------------------------------------------
precision recall f1-score support
0 0.83 1.00 0.91 49895
1 0.00 0.00 0.00 10071
accuracy 0.83 59966
macro avg 0.42 0.50 0.45 59966
weighted avg 0.69 0.83 0.76 59966
CPU times: user 2min 53s, sys: 273 ms, total: 2min 53s
Wall time: 2min 53s
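The SVM output above corresponds to a model that predicts class 0 for every patient. The sketch below rebuilds that situation from the class counts in the report and shows why accuracy alone looks acceptable while F1 and balanced accuracy do not. Only the class counts are taken from this post; the all-zero prediction vector is constructed to match the confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Class counts from the report above: 49895 in class 0, 10071 in class 1.
y_true = np.concatenate([np.zeros(49895, dtype=int), np.ones(10071, dtype=int)])

# A classifier that always predicts class 0, as in the confusion matrix above.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(round(acc, 2), f1, bal_acc)
```

An accuracy of about 0.83 here is simply the share of class 0 in the data, so a useful model needs to beat this baseline on a measure that accounts for class 1.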