
Predict Length Of Stay (LOS) Of a Patient During the Recent COVID-19 Pandemic



Introduction

Hospitals are constantly challenged to provide timely patient care while maintaining high resource utilization. While this challenge has been around for many years, the recent COVID-19 pandemic has increased its prominence. For a hospital, the ability to predict the length of stay (LOS) of a patient as early as possible (at the admission stage) is very useful in managing its resources.


In this task, you will develop an ML model to predict whether a patient will be discharged from a hospital early or will stay in hospital for an extended period (see the task below for the exact definition), based on several attributes (features) related to: patient characteristics, diagnoses, treatments, services, hospital charges and patients' socio-economic background.


The machine learning task we are interested in is: “Predict if a given patient (i.e., a newborn child) will be discharged from the hospital within 3 days (class 0) or will stay in hospital beyond that, i.e. 4 days or more (class 1)”.


The data set to develop your models is given to you on Canvas. Note that you need to transform the target column (“LengthOfStay”) to match the two classes mentioned in the above task: class 0 if LengthOfStay < 4, and class 1 otherwise.


  • You need to come up with an approach (that follows the restrictions in 3.2), where each element of the system is justified using data analysis, performance analysis and/or knowledge from relevant literature.

  • As one of the aims of the assignment is to become familiar with the machine learning paradigm, you should evaluate multiple different models (only use techniques taught in class up to week 5 - inclusive) to determine which one is most appropriate for this task.

  • Set up an evaluation framework, including selecting appropriate performance measures and determining how to split the data (a minimal sketch follows this list).

  • Finally, you need to analyse the model and its results using appropriate techniques, establish how adequate your model is for performing the task in the real world, and discuss limitations, if any (ultimate judgement).

  • Predict the result for the test set.
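
As a starting point for the evaluation framework mentioned above, here is a minimal sketch using a stratified train/validation split and the F1 score. It is illustrative only: the logistic regression stand-in, the 80/20 split and the metric are assumptions, not the prescribed solution, and the column name follows the dataset description below.

# Minimal evaluation-framework sketch (illustrative assumptions throughout).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

df = pd.read_csv('train_data.csv')
df['LengthOfStay'] = (df['LengthOfStay'] >= 4).astype(int)  # class 1: 4 days or more

X = df.drop('LengthOfStay', axis=1).select_dtypes('number')  # numeric features only, for brevity
y = df['LengthOfStay']

# A stratified split preserves the class ratio in both partitions
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Validation F1:", f1_score(y_val, clf.predict(X_val)))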


Dataset

The data set for this assignment is available on Canvas. It consists of the following files:

  • “README.md”: Description of dataset.

  • “train data.csv”: Contains the training set: attributes and target for each patient. This data is to be used in developing the models. Use this for your own exploration and evaluation of which approach you think is “best” for this prediction task.

  • “test data.csv”: Contains the test set: attributes for each patient. You need to make predictions for this data and submit them via Canvas. The teaching team will use this data to evaluate the performance of the model you have developed.

  • “s1234567 predictions.csv”: Shows the expected format for your predictions on the unseen test data. You should organize your predictions in this format (see the sketch below); any deviation from this format will result in zero marks for the results part. Change the number in the filename to your student ID.

You can download the dataset from here.
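
For reference, here is a sketch of producing the submission file in the expected two-column shape. The column headers ('ID', 'LengthOfStay') and the dummy stand-in model are assumptions, so verify the format against the provided “s1234567 predictions.csv”.

# Sketch of writing predictions in the sample-file format (assumed column names).
import pandas as pd
from sklearn.dummy import DummyClassifier

train_df = pd.read_csv('train_data.csv')
test_df = pd.read_csv('test_data.csv')

# Stand-in model so the sketch runs end to end; replace with your real pipeline.
y = (train_df['LengthOfStay'] >= 4).astype(int)
X = train_df.drop('LengthOfStay', axis=1).select_dtypes('number')
model = DummyClassifier(strategy='most_frequent').fit(X, y)

preds = model.predict(test_df[X.columns])
pd.DataFrame({'ID': test_df['ID'], 'LengthOfStay': preds}).to_csv(
    's1234567 predictions.csv', index=False)  # rename to your own student ID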



Implementation


Import Libraries

# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC
from xgboost import XGBClassifier

Load Dataset


train_df=pd.read_csv('train_data.csv')
train_df

Output: [first rows of the training dataframe]

Describe the Dataset

display(train_df.describe())

Output: [summary statistics of the numeric columns]

Check whether any null values are present


#check for missing values
train_df.isnull().values.any()

Output

False
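
Had this returned True, a per-column count would locate the gaps:

# Count missing values per column (all zeros for this dataset)
print(train_df.isnull().sum())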


Data Visualization

# Distribution of records across health service areas
count_classes = train_df['HealthServiceArea'].value_counts()
count_classes.plot(kind = 'pie', shadow=True, legend=True)
plt.title("Health Service Area distribution")

Output: [pie chart of the Health Service Area distribution]

# Gender distribution of data
count_classes = train_df['Gender'].value_counts()
count_classes.plot(kind = 'pie', shadow=True, legend=True, autopct='%1.2f')
plt.title("Gender distribution")

Output: [pie chart of the Gender distribution]

# Race distribution of data
count_classes = train_df['Race'].value_counts()
count_classes.plot(kind = 'pie', shadow=True, legend=True, autopct='%1.2f')
plt.title("Based on Race distribution")

Output: [pie chart of the Race distribution]

# TypeOfAdmission distribution of data
count_classes = train_df['TypeOfAdmission'].value_counts()
count_classes.plot(kind = 'pie', shadow=True, legend=True, autopct='%1.2f')
plt.title("Based on TypeOfAdmission distribution")

Output: [pie chart of the TypeOfAdmission distribution]

# PaymentTypology distribution of data
count_classes = train_df['PaymentTypology'].value_counts()
count_classes.plot(kind = 'pie', shadow=True, legend=True, autopct='%1.2f')
plt.title("Based on PaymentTypology distribution")

Output: [pie chart of the PaymentTypology distribution]

Histograms

plt.hist(train_df.AverageCostInFacility, label='Cost In Facility')
plt.legend(loc='upper right')
plt.xlabel('Average Cost In Facility')
plt.ylabel('Number of Patients')
plt.show()

Output: [histogram of AverageCostInFacility]

Data preprocessing


Convert LengthOfStay to the two classes: 0 for stays of less than 4 days, 1 otherwise


train_df['LengthOfStay'] = train_df['LengthOfStay'].apply(lambda x: 1 if x > 3 else 0)
print(train_df.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59966 entries, 0 to 59965
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   ID                            59966 non-null  int64 
 1   HealthServiceArea             59966 non-null  object
 2   Gender                        59966 non-null  object
 3   Race                          59966 non-null  object
 4   TypeOfAdmission               59966 non-null  object
 5   CCSProcedureCode              59966 non-null  int64 
 6   APRSeverityOfIllnessCode      59966 non-null  int64 
 7   PaymentTypology               59966 non-null  object
 8   BirthWeight                   59966 non-null  int64 
 9   EmergencyDepartmentIndicator  59966 non-null  object
 10  AverageCostInCounty           59966 non-null  int64 
 11  AverageChargesInCounty        59966 non-null  int64 
 12  AverageCostInFacility         59966 non-null  int64 
 13  AverageChargesInFacility      59966 non-null  int64 
 14  AverageIncomeInZipCode        59966 non-null  int64 
 15  LengthOfStay                  59966 non-null  int64 
dtypes: int64(10), object(6)
memory usage: 7.3+ MB
None


Remove ID and HealthServiceArea as per the requirement


remove_col = []  # track columns dropped from the feature set
cat_col = []     # track categorical columns that get label-encoded
train_df.drop(['ID', 'HealthServiceArea'], axis=1, inplace=True)
remove_col.append('ID')
remove_col.append('HealthServiceArea')

Transform Gender with a label encoder to convert the categorical values to numerical codes

# creating instance of labelencoder
Gender_labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
train_df['Gender'] = Gender_labelencoder.fit_transform(train_df['Gender'])
cat_col.append('Gender')

Convert categorical data to numerical


Transform Race with label encoder

# creating instance of labelencoder
Race_labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
train_df['Race'] = Race_labelencoder.fit_transform(train_df['Race'])
cat_col.append('Race')
train_df['TypeOfAdmission'].value_counts()

Output:

Newborn      58741
Emergency      659
Urgent         412
Elective       154
Name: TypeOfAdmission, dtype: int64

Transform TypeOfAdmission with label encoder

# creating instance of labelencoder
TypeOfAdmission_labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
train_df['TypeOfAdmission'] = TypeOfAdmission_labelencoder.fit_transform(train_df['TypeOfAdmission'])
cat_col.append('TypeOfAdmission')
train_df['CCSProcedureCode'].value_counts()

Output:

228    19886
 115    13628
 0      11189
 220    10773
 231     2981
-1        769
 216      740
Name: CCSProcedureCode, dtype: int64

Transform CCSProcedureCode with label encoder

# creating instance of labelencoder
CCSProcedureCode_labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
train_df['CCSProcedureCode'] = CCSProcedureCode_labelencoder.fit_transform(train_df['CCSProcedureCode'])
cat_col.append('CCSProcedureCode')
train_df['APRSeverityOfIllnessCode'].value_counts()

Output:

1    47953
2     8760
3     3252
4        1
Name: APRSeverityOfIllnessCode, dtype: int64

train_df['PaymentTypology'].value_counts()

Output:

Medicaid                     28723
Private Health Insurance     15608
Blue Cross/Blue Shield       12073
Self-Pay                      1984
Federal/State/Local/VA         849
Managed Care, Unspecified      545
Miscellaneous/Other            118
Medicare                        44
Unknown                         22
Name: PaymentTypology, dtype: int64

Transform PaymentTypology with label encoder

# creating instance of labelencoder
PaymentTypology_labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
train_df['PaymentTypology'] = PaymentTypology_labelencoder.fit_transform(train_df['PaymentTypology'])
cat_col.append('PaymentTypology')
train_df['EmergencyDepartmentIndicator'].value_counts()

Output:

N    59453
Y      513
Name: EmergencyDepartmentIndicator, dtype: int64

Transform EmergencyDepartmentIndicator with label encoder

# creating instance of labelencoder
EmergencyDepartmentIndicator_labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
train_df['EmergencyDepartmentIndicator'] = EmergencyDepartmentIndicator_labelencoder.fit_transform(train_df['EmergencyDepartmentIndicator'])
cat_col.append('EmergencyDepartmentIndicator')
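
The per-column cells above can be condensed into a single loop. The sketch below is equivalent to the preprocessing already performed; keeping each fitted encoder matters because the test set must later be transformed with the same mappings. For nominal columns, the imported OneHotEncoder (or pd.get_dummies) would be a reasonable alternative to label encoding.

# Compact equivalent of the per-column label encoding above (a sketch).
categorical_cols = ['Gender', 'Race', 'TypeOfAdmission', 'CCSProcedureCode',
                    'PaymentTypology', 'EmergencyDepartmentIndicator']
encoders = {}
for col in categorical_cols:
    enc = LabelEncoder()
    train_df[col] = enc.fit_transform(train_df[col])
    encoders[col] = enc  # reuse later: test_df[col] = encoders[col].transform(test_df[col])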


Separate Features and Target

x=train_df.drop('LengthOfStay', axis=1)
y_train=train_df.LengthOfStay
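
The grid searches below rely on 5-fold cross-validation over this data. If a separate held-out validation set is preferred, a stratified split could be added here; a sketch using the already-imported train_test_split:

# Optional: hold out 20% for validation, stratified to keep the class ratio
x_tr, x_val, y_tr, y_val = train_test_split(
    x, y_train, test_size=0.2, stratify=y_train, random_state=42)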


Find which features are important


XGBClassifier

model = XGBClassifier()
model.fit(x, y_train)
# feature importance
print(model.feature_importances_)

# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

Output: [bar chart of feature importances]

Based on the feature importance graph above, features 0, 5 and 7 (Gender, PaymentTypology and EmergencyDepartmentIndicator) contribute very little, so we remove them.
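
Since the bar chart indexes features by position, printing the importances against the column names makes this choice easier to verify; a small sketch:

# Pair each feature's importance with its column name, highest first
for name, imp in sorted(zip(x.columns, model.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(f"{name}: {imp:.4f}")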



x_train = x.drop(['Gender','PaymentTypology','EmergencyDepartmentIndicator'], axis=1)
remove_col.append('Gender')
remove_col.append('PaymentTypology')
remove_col.append('EmergencyDepartmentIndicator')
model = XGBClassifier()
model.fit(x_train, y_train)

# feature importance
print(model.feature_importances_)

# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

Output: [bar chart of feature importances after dropping the three columns]

Logistic Regression


# Define the LR model and perform hyperparameter tuning using grid search.
# GridSearchCV fits the estimator itself, so no separate fit is needed first.
#weights = np.linspace(0.05, 0.95, 20)
params = {'C': [10**-4, 10**-3, 10**-2, 10**-1, 1, 10**1, 10**2, 10**3],
          'penalty': ['l2']  # ,'class_weight': [{0: x, 1: 1.0-x} for x in weights]
         }

clf = LogisticRegression(n_jobs=-1, random_state=42)
model = GridSearchCV(estimator=clf, cv=5, n_jobs=-1, param_grid=params, scoring='f1', verbose=2)
model.fit(x_train, y_train)
print("Best estimator is", model.best_params_)

Output:

Fitting 5 folds for each of 8 candidates, totalling 40 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 19.2s
[Parallel(n_jobs=-1)]: Done 40 out of 40 | elapsed: 20.4s finished

Best estimator is {'C': 10, 'penalty': 'l2'}


%%time
# Model fitting using the best parameter found by the grid search.
log_clf = LogisticRegression(n_jobs=-1, random_state=42, C=model.best_params_['C'], penalty='l2')
log_clf.fit(x_train, y_train)
y_pred = log_clf.predict(x_train)  # note: predictions on the training data itself
con_mat = confusion_matrix(y_train, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n', con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
# Weighted cost: a false positive costs 10, a false negative 500
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

# Print the classification report
print(classification_report(y_train, y_pred))

Output:

---------------------------------------------------------------------------------------------------------------------
Confusion Matrix:  
 [[49120   775]
 [ 8893  1178]]
---------------------------------------------------------------------------------------------------------------------
Type 1 error (False Positive) =  775
Type 2 error (False Negative) =  8893
---------------------------------------------------------------------------------------------------------------------
Total cost =  4454250
---------------------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.85      0.98      0.91     49895
           1       0.60      0.12      0.20     10071

    accuracy                           0.84     59966
   macro avg       0.72      0.55      0.55     59966
weighted avg       0.81      0.84      0.79     59966

CPU times: user 252 ms, sys: 111 ms, total: 364 ms
Wall time: 2.04 s
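
The report shows a recall of only 0.12 for class 1, a symptom of the class imbalance (49,895 vs 10,071). One common mitigation, hinted at by the commented-out class_weight line in the grid search above, is to reweight the classes; a hedged sketch, not part of the original pipeline:

# Hypothetical variant: weight classes inversely to their frequency
log_clf_bal = LogisticRegression(n_jobs=-1, random_state=42,
                                 C=model.best_params_['C'], penalty='l2',
                                 class_weight='balanced')
log_clf_bal.fit(x_train, y_train)
print(classification_report(y_train, log_clf_bal.predict(x_train)))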


SVM


# Model fitting and hyperparameter tuning to find the best parameter.
x_cfl = SVC()

params = {
    # 'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
    # 'n_estimators':[50,75,100,150,250],
    'max_iter': [5, 10, 20, 30, -1],
    'degree': [3],
    # 'colsample_bytree':[0.1,0.3,0.5,1],
    # 'subsample':[0.1,0.3,0.5,1]
}

model = GridSearchCV(x_cfl, param_grid=params, verbose=10, n_jobs=-1, scoring='f1', cv=5)
model.fit(x_train, y_train)
print("Best estimator is", model.best_params_)

Output:

Fitting 5 folds for each of 5 candidates, totalling 25 fits 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1349s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done  22 out of  25 | elapsed:  3.9min remaining:   32.0s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:  5.0min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:  5.0min finished


Best estimator is {'degree': 3, 'max_iter': 20} 


/usr/local/lib/python3.7/dist-packages/sklearn/svm/_base.py:231: ConvergenceWarning: Solver terminated early (max_iter=20).  Consider pre-processing your data with StandardScaler or MinMaxScaler.   % self.max_iter, ConvergenceWarning)
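
The warning suggests standardising the features before fitting the SVM. A minimal sketch (not applied in the run below):

# Scale features to zero mean, unit variance, as the warning recommends
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
# svm_clf.fit(x_train_scaled, y_train)  # the same scaler must also transform any test data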


%%time
# Model fitting with the best parameter.
svm_clf = SVC(max_iter=-1, degree=model.best_params_['degree'])
svm_clf.fit(x_train, y_train)
y_pred = svm_clf.predict(x_train)
con_mat = confusion_matrix(y_train, y_pred)
print("-"*117)
print('Confusion Matrix: ', '\n',con_mat)
print("-"*117)
print("Type 1 error (False Positive) = ", con_mat[0][1])
print("Type 2 error (False Negative) = ", con_mat[1][0])
print("-"*117)
print("Total cost = ", con_mat[0][1] * 10 + con_mat[1][0] * 500)
print("-"*117)

#Print the classification report
print(classification_report(y_train,y_pred))

Output:

---------------------------------------------------------------------------------------------------------------------
Confusion Matrix:  
 [[49895     0]
 [10071     0]]
---------------------------------------------------------------------------------------------------------------------
Type 1 error (False Positive) =  0
Type 2 error (False Negative) =  10071
---------------------------------------------------------------------------------------------------------------------
Total cost =  5035500
---------------------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.83      1.00      0.91     49895
           1       0.00      0.00      0.00     10071

    accuracy                           0.83     59966
   macro avg       0.42      0.50      0.45     59966
weighted avg       0.69      0.83      0.76     59966

CPU times: user 2min 53s, sys: 273 ms, total: 2min 53s
Wall time: 2min 53s
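
Both models above are evaluated on the same data they were trained on, which flatters the scores; in particular, the SVM's confusion matrix shows it never predicts class 1. For a fairer comparison, the already-imported cross_val_score could be used; a sketch (the SVM fold fits will be slow):

# Cross-validated F1 gives a less optimistic model comparison
for name, est in [('LogisticRegression', log_clf), ('SVM', svm_clf)]:
    scores = cross_val_score(est, x_train, y_train, cv=5, scoring='f1', n_jobs=-1)
    print(name, 'mean F1 =', round(scores.mean(), 3))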



Thanks For Visiting Here!



If you need any programming assignment help in Machine Learning, a Machine Learning project, or Machine Learning homework, then we are ready to help you.

Send your request to realcode4you@gmail.com and get instant help at an affordable price.

We always focus on delivering unique, plagiarism-free code, written by our highly educated professionals, well structured and within your given time frame. If you are looking for help with other programming languages like C, C++, Java, Python, PHP, Asp.Net, NodeJs, ReactJs, etc., or with different types of databases like MySQL, MongoDB, SQL Server, Oracle, etc., then also contact us.
