“A new baby's gender, name, time of birth, and birth weight are nice information for a birth announcement, but birth weight is especially important for an obstetrician. A large size at delivery has long been associated with an increased risk of injuries to a newborn and its mom. So the better a doctor can predict birth weight, the easier the delivery may be.” Ultrasound is a popular way of doing it. But, aha! You are a Data Scientist (or going to be one). You can amaze people by predicting the birth weight way earlier than ultrasound, right? In this assignment, let’s do this.
Dataset
Log in to Canvas > Assignments > Programming Assignment 1
You will get the following 3 files:
* baby-weights-dataset2.csv
* It has 101400 rows (samples) with 37 columns (variables).
Each sample represents one newborn. It contains 37 variables
(just mentioned! Haha) about it. The very last column is “BWEIGHT”,
the true weight of the newborn (in lbs). This is the target variable here.
* data-description.txt
* You will see that the names of the 37 variables are actually
contracted forms of some sort. The source of the dataset did not offer me a description of every single one of them, and after studying them I could elaborate on only a few. Please pardon my laziness. Okay, this file contains a few descriptions for the variables. The rest mostly describe the mother’s medical history. No big deal, I guess, for you to work with these variables without knowing their meaning.
* judge-without-label.csv
* This is an interesting file. It contains new samples:
an additional 2001 rows with 36 columns (without the BWEIGHT target column). Of course, this should not be part of the training, as there are no ground truth target labels, right? Once the training is complete with the dataset provided above, you must apply your prediction algorithm to predict BWEIGHT for these 2001 samples, and submit the result as part of your assignment submission.
Tasks
Download the datasets from here
Task 1:
Import all the necessary packages here
Task 2:
Load the dataset into memory so that you can play with it here
Task 3:
Compute mean, stdev, min, max, 25th percentile, median and 75th percentile of the dataset (BWEIGHT variable)
Task 4:
Also, draw the histogram plot for the BWEIGHT variable
Task 5:
Present the skewness and kurtosis of the BWEIGHT target variable
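A minimal sketch of how the skewness and kurtosis could be computed by hand, assuming the population (biased) moment definitions and excess kurtosis (a normal distribution scores 0). The function name and the toy values are hypothetical; on the real data you would pass df['BWEIGHT'] instead. Note that pandas' built-in Series.skew() and Series.kurt() apply a bias correction, so their values can differ slightly from this sketch.

```python
import numpy as np

def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from population moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    m4 = np.mean(centered ** 4)  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # subtract 3 so a normal gives 0
    return skew, excess_kurt

# Hypothetical toy values standing in for df['BWEIGHT']
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
skew, kurt = skewness_and_kurtosis(sample)
print(f"skewness={skew:.4f}, excess kurtosis={kurt:.4f}")
```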
Task 6:
Do variable selection from the pool of 36 variables based on correlation score with the target variable BWEIGHT
Please report all the variables you kept for training.
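One possible way to do the correlation-based selection is sketched below. It assumes Pearson correlation and an absolute-value cutoff of 0.3; the cutoff, the function name, and the toy column names feat_a and feat_b are all hypothetical choices, not values given by the assignment.

```python
import pandas as pd

def select_by_correlation(frame, target, threshold):
    """Names of numeric columns with |Pearson r| >= threshold vs. target."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)  # correlation of each column with target
    return corr[corr.abs() >= threshold].index.tolist()

# Hypothetical toy frame; the real call would use the loaded dataset df
toy = pd.DataFrame({
    "feat_a": [36, 38, 39, 40, 41],      # strongly related to the target
    "feat_b": [1, -1, 1, -1, 1],         # almost unrelated to the target
    "BWEIGHT": [5.5, 6.5, 7.0, 7.5, 8.0],
})
kept = select_by_correlation(toy, "BWEIGHT", threshold=0.3)
print(kept)
```

Sorting the correlation scores before picking a cutoff is a reasonable alternative when no obvious threshold exists.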
Task 7:
Check for missing data, and apply a "good" strategy to tackle it
Task 8:
Tackle the categorical variables by introducing dummy variables
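A small sketch of dummy-variable encoding with pandas.get_dummies. The column names cat_col and num_col are hypothetical stand-ins for whichever categorical columns you identify in the dataset. Dropping the first level (drop_first=True) is an assumption made here so that the dummy columns are not perfectly collinear with an intercept term.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "num_col": [1.0, 2.0, 3.0, 4.0],
    "cat_col": ["a", "b", "c", "a"],
})

# One 0/1 column per category level, minus the first level ("a")
encoded = pd.get_dummies(toy, columns=["cat_col"], drop_first=True, dtype=float)
print(encoded)
```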
Task 9.1:
Randomly split the dataset into training, Tr (80%) and testing, Te (20%)
Task 9.2:
On the training dataset, apply a normalization technique
Task 9.3: Apply the training data statistics to normalize the testing data as well.
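Tasks 9.1 to 9.3 could be sketched as follows, assuming z-score standardization as the normalization technique (min-max scaling would be an equally valid choice). The function names, the seed parameter, and the toy arrays are hypothetical. The key point is that the mean and standard deviation come from the training split only and are then reused on the testing split.

```python
import numpy as np

def split_80_20(X, y, seed):
    """Randomly split rows into 80% training and 20% testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def standardize(train, test):
    """Z-score both splits using statistics from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero on constant columns
    return (train - mean) / std, (test - mean) / std

# Hypothetical toy data: 10 samples, 2 features
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)
X_tr, X_te, y_tr, y_te = split_80_20(X_demo, y_demo, seed=0)
X_tr_n, X_te_n = standardize(X_tr, X_te)
print(X_tr_n.shape, X_te_n.shape)
```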
Task 10:
Find the linear regression function describing the training dataset using a technique you recently learned in class: choose between the CLOSED-FORM solution and Gradient Descent (batch, stochastic, or mini-batch).
PLEASE DO NOT CALL ANY LIBRARY FUNCTION THAT MIGHT DO THE TASK FOR YOU. If you do, you will most likely get a ZERO for this assignment.
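As an illustration of the closed-form option, the sketch below solves the normal equation (X^T X) w = X^T y with a bias column prepended to X. It assumes that numpy's generic linear solver is acceptable under the "no library function" rule (check with your instructor); no regression-specific routine is called. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def add_bias(X):
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_closed_form(X, y):
    """Solve the normal equation (Xb^T Xb) w = Xb^T y for w."""
    Xb = add_bias(X)
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

def predict(X, w):
    """Apply the learned weights to new samples."""
    return add_bias(X) @ w

# Synthetic, noise-free data generated from y = 2 + 3*x1 - x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
w = fit_closed_form(X_demo, y_demo)
print(w)
```

If X^T X turns out to be singular (for example, when dummy columns are perfectly collinear), the solver raises an error; dropping a redundant column or switching to gradient descent are two possible ways around that.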
Task 11:
Predict the BWEIGHT target variable for each sample of the testing dataset using the regression line you learned in Task 10, and report RMSE(testing) (Root Mean Squared Error)
Repeat Task 10 four additional times: run linear regression training again
After each run, report RMSE(testing)
Task 12: Finally, report RMSE(testing) = Average(RMSE_test) ± Stdev(RMSE_test)
Here Average(RMSE_test) = average of all the 5 RMSE(testing) scores you got above.
And, stdev(RMSE_test) = standard deviation of all the 5 RMSE(testing) scores above.
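The RMSE and the final "average ± stdev" report could be computed along these lines. The five scores below are made-up placeholders, not results from the dataset, and the use of the sample standard deviation (ddof=1) is an assumption, since the assignment does not say which variant to use.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def summarize(scores):
    """Average and sample standard deviation of a list of RMSE scores."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=1))

# Placeholder scores standing in for the 5 RMSE(testing) values
placeholder_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
avg, sd = summarize(placeholder_scores)
print(f"RMSE(testing) = {avg:.4f} ± {sd:.4f}")
```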
Task 13: Run linear regression one last time on the whole dataset (i.e., training + testing, as preprocessed by you above).
Task 14: Preprocess the judge-without-label.csv file according to the strategy you applied above on the whole dataset (task 13)
Task 15: Predict BWEIGHT for each of the samples from the judge-without-label.csv file, and save the results in judge-submission-run-1.csv in the format below. Please change the run number and report what changes you have made in a corresponding file run-1.txt.
Task 16: Repeat tasks 9-12 three times, and report the ultimate RMSE_test average ± ultimate RMSE_test stdev
Task 17: Make an entry in the Kaggle challenge below:
[https://www.kaggle.com/c/csci-ml-s19-pa1/]
Please join the challenge and submit a judge-submission-run1.csv file, and please report your Kaggle handle here too. There is a limit of 5 entries each day until the deadline. For each of the runs you submit, please report here the RMSE you got (as reported by the Kaggle platform).
Solution:
Task 1: Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
Task 2: Load Dataset
df = pd.read_csv('baby-weights-dataset2.csv')
Task 3: Compute mean, stdev, min, max, 25th percentile, median and 75th percentile of the dataset (BWEIGHT variable)
df['BWEIGHT'].describe()
Output:
count    101400.000000
mean          7.258066
std           1.329461
min           0.187500
25%           6.625000
50%           7.375000
75%           8.062500
max          13.062500
Name: BWEIGHT, dtype: float64
Task 4: Draw the histogram plot for the BWEIGHT variable
df['BWEIGHT'].hist()
plt.show()