Task 1
Q1: EDA
Create a Spark DataFrame from /databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet/. Visualize and explore the data. Note anything you find interesting. This dataset is a slightly cleansed form of the Inside Airbnb dataset for San Francisco.
http://insideairbnb.com/get-the-data.html
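A minimal starting sketch for loading and inspecting the data might look like the following (display() assumes a Databricks notebook, and "price" is referenced because it is the label predicted in Q2; other columns should be taken from the printed schema):

file_path = "/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet/"

# Load the cleansed listings into a Spark DataFrame
airbnb_df = spark.read.parquet(file_path)

# Inspect schema, row count, and summary statistics before plotting anything
airbnb_df.printSchema()
print(airbnb_df.count())
display(airbnb_df.summary())

# Example exploration: look at the distribution of the column we will later predict
display(airbnb_df.select("price"))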
Q2: Model Development and Tracking
Split the data into an 80/20 train-test split using SparkML APIs.
Build a model using SparkML to predict price given the other input features (or subset of them).
Mention why you chose this model, how it works, and other models that you considered.
Compute the loss metric on the test dataset and explain your choice of loss metric.
Log your model/hyperparameters/metrics to MLflow.
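A hedged sketch of one way to cover these points, assuming a simple linear-regression baseline, RMSE as the loss metric, and a small set of numeric feature columns (the feature names are assumptions and should be adjusted to the real schema; this is an illustrative choice, not the required model):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
import mlflow
import mlflow.spark

# 80/20 split with a fixed seed for reproducibility
train_df, test_df = airbnb_df.randomSplit([0.8, 0.2], seed=42)

# Assemble an assumed subset of numeric features into a single vector column
feature_cols = ["bedrooms", "bathrooms", "number_of_reviews"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="price")
pipeline = Pipeline(stages=[assembler, lr])

with mlflow.start_run():
    model = pipeline.fit(train_df)
    preds = model.transform(test_df)

    # RMSE keeps the error in the same units as price, which makes it easy to interpret
    evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
    rmse = evaluator.evaluate(preds)

    # Log hyperparameters, the test metric, and the fitted model to MLflow
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_param("features", ",".join(feature_cols))
    mlflow.log_metric("rmse_test", rmse)
    mlflow.spark.log_model(model, "model")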
Task 2
Question 1
Part 1: Code analysis and documentation
The following cells contain code that generates a synthetic data set. At each point marked by a commenting block ('#', '"""', or "'''"), fill in appropriate comments that explain the functionality of that part of the subsequent code, in standard Python code style.
import collections

"""
"""
DataStructure = collections.namedtuple('DataStructure', 'value1 value2 value3 value4 value5 value6')
ModuloResult = collections.namedtuple('ModuloResult', 'factor remain')

from pyspark.sql.types import DoubleType, StructType
from pyspark.sql.functions import lit, col
from pyspark.sql import DataFrame
import random
import numpy
from functools import reduce
from typing import List
import math

DISTINCT_NOMINAL = 5
STDDEV_NAME = "_std"


class DataGenerator:

    def __init__(self, DISTINCT_NOMINAL, STDDEV_NAME):
        self.DISTINCT_NOMINAL = DISTINCT_NOMINAL
        self.STDDEV_NAME = STDDEV_NAME

    def modeFlag(self, mode: str):
        """ comments
        """
        modeVal = {
            "ascending": False,
            "descending": True
        }
        return modeVal.get(mode)

    def lfold(self, func, nums, exp):
        """ comments
        """
        acc = []
        for i in range(len(nums)):
            result = reduce(func, nums[:i + 1], exp)
            acc.append(result)
        return acc

    def generateDoublesData(self, targetCount: int, start: float, step: float, mode: str):
        """ comments
        """
        stoppingPoint = (targetCount * step) + start
        doubleArray = list(numpy.arange(start, stoppingPoint, step))
        try:
            doubleArray = sorted(doubleArray, reverse=self.modeFlag(mode))
        except:
            if (mode == 'random'):
                random.shuffle(doubleArray)
            else:
                raise Exception(mode, " is not supported.")
        return doubleArray

    def generateDoublesMod(self, targetCount: int, start: float, step: float, mode: str, exp: float):
        """ comments
        """
        doubles = self.generateDoublesData(targetCount, start, step, mode)
        res = (lambda x, y: x + ((x + y) / x))
        return self.lfold(res, doubles, exp)

    def generateDoublesMod2(self, targetCount: int, start: float, step: float, mode: str):
        """ comments
        """
        doubles = self.generateDoublesData(targetCount, start, step, mode)
        func = (lambda x, y: (math.pow((x - y) / math.sqrt(y), 2)))
        sequenceEval = reduce(func, doubles, 0)
        res = (lambda x, y: (x + (x / y)) / x)
        return self.lfold(res, doubles, sequenceEval)

    def generateIntData(self, targetCount: int, start: int, step: int, mode: str):
        """ comments
        """
        stoppingPoint = (targetCount * step) + start
        intArray = list(range(start, stoppingPoint, step))
        try:
            intArray = sorted(intArray, reverse=self.modeFlag(mode))
        except:
            if (mode == 'random'):
                random.shuffle(intArray)
            else:
                raise Exception(mode, " is not supported.")
        return intArray

    def generateRepeatingIntData(self, targetCount: int, start: int, step: int, mode: str, distinctValues: int):
        """ comments
        """
        subStopPoint = (distinctValues * step) + start - 1
        distinctArray = list(range(start, subStopPoint, step))
        try:
            sortedArray = sorted(distinctArray, reverse=self.modeFlag(mode))
        except:
            if (mode != 'random'):
                raise Exception(mode, " is not supported.")
        outputArray = numpy.full((int(targetCount / (len(sortedArray) - 1)), len(sortedArray)),
                                 sortedArray).flatten().tolist()[:targetCount]
        if (mode == 'random'):
            random.shuffle(outputArray)
        return outputArray

    def getDoubleCols(self, schema: StructType):
        """ comments
        """
        return [s.name for s in schema if s.dataType == DoubleType()]

    def normalizeDoubleTypes(self, df: DataFrame):
        """ comments
        """
        doubleTypes = self.getDoubleCols(df.schema)
        stddevValues = df.select(doubleTypes).summary("stddev").first()
        for indx in range(0, len(doubleTypes)):
            df = df.withColumn(doubleTypes[indx] + STDDEV_NAME,
                               col(doubleTypes[indx]) / stddevValues[indx + 1])
        return df

    def generateData(self, targetCount: int):
        """ comments
        """
        seq1 = self.generateIntData(targetCount, 1, 1, "ascending")
        seq2 = self.generateDoublesData(targetCount, 1.0, 1.0, "descending")
        seq3 = self.generateDoublesMod(targetCount, 1.0, 1.0, "ascending", 2.0)
        seq4 = list(map(lambda x: x * -10, self.generateDoublesMod2(targetCount, 1.0, 1.0, "ascending")))
        seq5 = self.generateRepeatingIntData(targetCount, 0, 5, "ascending", DISTINCT_NOMINAL)
        seq6 = self.generateDoublesMod2(targetCount, 1.0, 1.0, "descending")
        seqData: List[DataStructure] = []
        for i in range(0, targetCount):
            seqData.append(DataStructure(value1=seq1[i], value2=seq2[i].item(),
                                         value3=seq3[i].item(), value4=seq4[i].item(),
                                         value5=seq5[i], value6=seq6[i].item()))
        return self.normalizeDoubleTypes(spark.createDataFrame(seqData))

    def generateCoordData(self, targetCount: int):
        """ comments
        """
        coordData = (self.generateData(targetCount)
                     .withColumnRenamed("value2_std", "x1")
                     .withColumnRenamed("value3_std", "x2")
                     .withColumnRenamed("value4_std", "y1")
                     .withColumnRenamed("value6_std", "y2")
                     .select("x1", "x2", "y1", "y2"))
        return coordData
Part 2: Data Normalcy and Filtering
Many data manipulation tasks require the identification and handling of outlier data. In this section, examine the data set that is generated and write a function that will determine the distribution type of a collection of column names passed in. The only distribution types that are required to be detected are:
Normal Distribution
Left Tailed
Right Tailed
The return type of this function should be a Dictionary of (ColumnName -> Distribution Type).
dataGenerator = DataGenerator(DISTINCT_NOMINAL, STDDEV_NAME)
data = dataGenerator.generateData(1000)
columnsToCheck = ["value2_std", "value3_std", "value4_std", "value6_std"]
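One possible sketch for this check uses sample skewness as a proxy for tail direction; the 0.5 cutoff below is an assumed threshold, not something the assignment specifies:

from pyspark.sql.functions import skewness

def classify_distributions(df, column_names, skew_threshold=0.5):
    # Compute the sample skewness of every requested column in a single Spark pass
    skew_row = df.select([skewness(c).alias(c) for c in column_names]).first()
    results = {}
    for name in column_names:
        s = skew_row[name]
        if s > skew_threshold:
            results[name] = "Right Tailed"          # long tail to the right (positive skew)
        elif s < -skew_threshold:
            results[name] = "Left Tailed"           # long tail to the left (negative skew)
        else:
            results[name] = "Normal Distribution"   # approximately symmetric
    return results

print(classify_distributions(data, columnsToCheck))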
Part 3: Testing
In order to validate that the function that you have written performs as intended, write a simple test that could be placed in a unit testing framework.
Demonstrate that the test passes while validating proper classification of at most one type of distribution.
Demonstrate the test failing to classify correctly, but ensure that the application continues to run (handle the exception and report the failure to stdout).
(Hint: Distribution characteristics may change with the number of rows generated based on the data generator's equations)
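A minimal sketch of such a test, assuming the classify_distributions helper above; the expected labels used here are assumptions about how the generated data happens to behave at 1000 rows:

def assert_distribution(df, column_name, expected_type):
    # A unittest-style assertion that a single column gets the expected classification
    observed = classify_distributions(df, [column_name])[column_name]
    assert observed == expected_type, \
        f"{column_name}: expected {expected_type}, got {observed}"

# Passing case: validate at most one distribution type
try:
    assert_distribution(data, "value2_std", "Normal Distribution")
    print("PASS: value2_std classified as expected")
except AssertionError as err:
    print(f"FAIL: {err}")

# Failing case: deliberately expect the wrong label, handle the exception, keep running
try:
    assert_distribution(data, "value3_std", "Left Tailed")
    print("PASS: value3_std classified as expected")
except AssertionError as err:
    print(f"FAIL: {err}")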
Part 4: Efficient Calculations
In this section, create a function that calculates the Euclidean distance between the pairs (x1, y1) and (x2, y2). Choose the approach that scales best to extremely large data sizes.
Once complete, determine the distribution type of your derived distance column using the function you developed above in Part 2.
Show a plot of the distribution to ensure that the distribution type is correct.
coordData = dataGenerator.generateCoordData(1000)
display(coordData)
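A sketch that keeps the calculation in native Spark column expressions (rather than a Python UDF), which avoids per-row Python serialization and therefore scales better; the output column name "distance" is an assumption:

from pyspark.sql.functions import sqrt, pow, col

def with_euclidean_distance(df, out_col="distance"):
    # sqrt((x2 - x1)^2 + (y2 - y1)^2), evaluated entirely by Spark's expression engine
    return df.withColumn(
        out_col,
        sqrt(pow(col("x2") - col("x1"), 2) + pow(col("y2") - col("y1"), 2))
    )

distanceData = with_euclidean_distance(coordData)
print(classify_distributions(distanceData, ["distance"]))
display(distanceData.select("distance"))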
Part 5: Complex DataTypes
In this section, create a new column that shows the mid-point coordinates between the (x1, y1) and (x2, y2) values in each row.
After the new column has been created, write a function that will calculate the distance from each pair (x1, y1) and (x2, y2) to the mid-point value.
Once the distances have been calculated, run a validation check to ensure that the expected result is achieved.
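One way to sketch this uses a struct (complex) column for the midpoint; because the midpoint sits exactly halfway between the two points, each derived distance should equal half of the full distance, which gives a natural validation check (the column names below are assumptions):

from pyspark.sql.functions import struct, sqrt, pow, col

# Midpoint as a struct column: ((x1 + x2) / 2, (y1 + y2) / 2)
midData = distanceData.withColumn(
    "midpoint",
    struct(((col("x1") + col("x2")) / 2).alias("x"),
           ((col("y1") + col("y2")) / 2).alias("y"))
)

def distance_to_midpoint(x_col, y_col):
    # Distance from the point (x_col, y_col) to the midpoint struct
    return sqrt(pow(col("midpoint.x") - col(x_col), 2) +
                pow(col("midpoint.y") - col(y_col), 2))

midData = (midData
           .withColumn("dist_p1_mid", distance_to_midpoint("x1", "y1"))
           .withColumn("dist_p2_mid", distance_to_midpoint("x2", "y2")))

# Validation: each half-distance should equal the full distance divided by two
validation = midData.withColumn(
    "matches",
    (col("dist_p1_mid") == col("distance") / 2) &
    (col("dist_p2_mid") == col("distance") / 2)
)
display(validation)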
Part 6: Precision
How many rows of data do not match?
Why would they / wouldn't they match?
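These questions turn on floating-point precision: the half-distances are mathematically equal to distance / 2, but the two quantities are computed through different sequences of double operations, so exact equality can fail on some rows while a tolerance-based comparison succeeds. A sketch for counting mismatches both ways (the epsilon value is an arbitrary assumption):

from pyspark.sql.functions import abs as sql_abs, col

# Exact comparison: rounding error in the derived columns can produce mismatches
exact_mismatches = validation.filter(~col("matches")).count()

# Tolerance-based comparison: treat values within a small epsilon as equal
epsilon = 1e-9
tolerant_mismatches = midData.filter(
    (sql_abs(col("dist_p1_mid") - col("distance") / 2) > epsilon) |
    (sql_abs(col("dist_p2_mid") - col("distance") / 2) > epsilon)
).count()

print(f"exact mismatches: {exact_mismatches}, within-epsilon mismatches: {tolerant_mismatches}")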