Task 1.1: Data Retrieving
Load the CSV data from the file. You need to use an appropriate pandas function to
load the csv data, and make use of the correct arguments including sep, decimal,
header, names, if needed.
Task 1.2: Check data types
Check whether the loaded data is equivalent to the data in the source (CSV) file. That is, ensure that the loaded data has appropriate data types assigned, or take steps to ensure that the appropriate types are used.
Task 1.3: Typos
Check whether there are typos in the data. If there are any typos, correct them by using masks.
Task 1.4: Extra-whitespaces
Check whether there are instances of extra whitespaces in the data, and if so, demonstrate how to remove them by calling on an appropriate function.
Task 1.5: Upper/Lower-case
Cast all text data to upper-case by using an appropriate function.
Task 1.6: Sanity checks
Design and run a small test-suite consisting of a series of sanity checks that test for the presence of impossible values in each attribute.
Task 1.7: Missing values
Check whether the loaded data has any missing values. If so, use an appropriate function to handle them in one of the following ways:
- replace them with a fixed value
- replace them with the column-wise median value
- replace them with the column-wise mean value
- ignore all observations containing missing values
Task 2.1: Explore a survey question
Explore the survey question: [Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)], then analyse how people rate the Star Wars movies.
Task 2.2: Relationships between columns
Explore the relationships between columns. You may choose which pairs of columns to focus on, but you need to generate 3 visualisations for this subtask. These should address a plausible hypothesis for the data concerned. Please also format the graph as required in Task 2.1.
Task 2.3: Explore a specific relationship
Explore whether there is a relationship between people's demographics (Gender, Age, Household Income, Education, Location) and their attitude to Star Wars characters.
Solution:
#import all libraries
import pandas as pd
import numpy as np
import string
import os
import matplotlib.pyplot as plt
#Task 1.1: Data Retrieving
df = pd.read_csv('StarWars.csv', encoding="cp1252")
#After loading the data I check that I have the correct number of rows and columns.
df.shape
#Since pandas provides a unique index for each row, I drop the RespondentID column.
df.drop('RespondentID', axis=1, inplace=True)
#The original csv file header has 2 rows. I combine the 2 rows into 1 and replace 'Unnamed' with edited values from the second row.
#To avoid using chained indexing I save the 2nd row to a variable header2.
header2 = df.iloc[0]
#After extracting the needed column names from the first row I drop it.
df.drop(df.index[0], inplace=True)
#Since I have dropped row 0 I reset the index to avoid possible complications in the future.
df.reset_index(drop=True, inplace=True)
#I remove the question 'Which of the following Star Wars films have you seen? Please select all that apply'.
#Instead I name the columns "Seen", which stands for "Have you seen?", followed by the movie number and name.
#For example: "Seen I The Phantom Menace".
for i in range(2,8):
    df.rename(columns={df.columns[i]: 'Seen ' + header2.iloc[i][19:]}, inplace=True)
#For the question 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.'
#I name the columns "Rank", which stands for "Please rank", followed by the movie number and name.
for i in range(8,14):
    df.rename(columns={df.columns[i]: 'Rank ' + header2.iloc[i][19:]}, inplace=True)
#For the question "Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her."
#I name the columns with the characters' names.
for i in range(14,28):
    df.rename(columns={df.columns[i]: header2.iloc[i]}, inplace=True)
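The task statement also mentions the sep, decimal, header and names arguments, which StarWars.csv does not need. As a minimal sketch of what they do, here is a small in-memory file; the data, the delimiter and the column names are all made up for illustration and are not part of the survey data.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data with decimal commas and no header row.
raw = io.StringIO("1;3,5;Yes\n2;4,25;No\n")

sample = pd.read_csv(
    raw,
    sep=";",                        # field delimiter used in the file
    decimal=",",                    # character that marks the decimal point
    header=None,                    # the file has no header row
    names=["id", "score", "seen"],  # column names supplied by hand
)

# The score column is parsed as float because decimal="," was given.
print(sample.dtypes)
```

Without decimal=",", the score column would be loaded as text, which is the kind of mismatch Task 1.2 asks you to look for.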
#Task 1.2: Check data types
df.dtypes
#All the data types are strings (object), as in the original document.
#I convert the ranking values in columns 9-14 to a numeric data type.
for i in range(8,14):
    df[df.columns[i]] = pd.to_numeric(df[df.columns[i]])
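pd.to_numeric raises an error by default if a value cannot be parsed. If the ranking columns might contain stray text, one option is errors="coerce", which turns unparseable entries into NaN so they can be handled later as missing values. This is only a sketch on a made-up Series, not something the survey data necessarily requires.

```python
import pandas as pd

# Made-up ranking answers, one of which is not a number.
answers = pd.Series(["1", "6", "3.0", "n/a"])

# errors="coerce" replaces anything that cannot be parsed with NaN.
converted = pd.to_numeric(answers, errors="coerce")

print(converted.dtype)         # numeric dtype instead of object
print(converted.isna().sum())  # number of entries that could not be parsed
```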
#Task 1.4: Extra-whitespaces
#I perform this task before Task 1.3 to reduce the number of values I need to replace.
#I strip leading and trailing whitespace from all string values.
for i in df.columns:
    if df[i].dtype == object:
        df[i] = df[i].str.strip()
#Task 1.3: Typos
#To identify which columns contain typos I look through the unique values for each column.
for i in range(37):
    print(df[df.columns[i]].unique())
#Since the number of typos is small, I don't need to use conditional logic to choose all the wrong values.
#I copy all the values that are wrong from the list of unique values and replace them.
df.replace(to_replace=['Yess'], value="Yes", inplace=True)
df.replace(to_replace=['Noo'], value="No", inplace=True)
df.replace(to_replace=['Female','female'], value="F", inplace=True)
df.replace(to_replace=['Male','male'], value="M", inplace=True)
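The task asks for the typos to be corrected with masks, while the solution here uses replace. As a rough sketch of the mask-based alternative, the same corrections can be written with boolean masks and .loc. The toy frame and its column names are hypothetical and only mimic the typos found above.

```python
import pandas as pd

# Toy data that mimics the typos found in the survey.
toy = pd.DataFrame({
    "Seen": ["Yes", "Yess", "No", "Noo"],
    "Gender": ["Male", "F", "female", "M"],
})

# A mask is a boolean Series marking the rows to fix.
yes_mask = toy["Seen"] == "Yess"
toy.loc[yes_mask, "Seen"] = "Yes"
toy.loc[toy["Seen"] == "Noo", "Seen"] = "No"

# isin builds one mask for several wrong spellings at once.
toy.loc[toy["Gender"].isin(["Male", "male"]), "Gender"] = "M"
toy.loc[toy["Gender"].isin(["Female", "female"]), "Gender"] = "F"

print(toy)
```

Unlike DataFrame.replace, which scans every column, a mask limits each correction to one named column.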
#Task 1.5: Upper/Lower-case
#I convert all strings in the data frame to upper case.
for i in df.columns:
    if df[i].dtype == object:
        df[i] = df[i].str.upper()
#Task 1.6: Sanity checks
#Looking through the unique values I notice '500' in the Age column.
#I assume '500' was a typo and the person meant they are 50 years old, which falls into the range '45-60'.
df['Age'] = df['Age'].replace('500', '45-60')
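The task asks for a small test-suite rather than a manual look at the unique values. One possible shape for it is a helper that compares each column against a set of allowed values and reports anything outside it. The helper name impossible_values, the toy frame and the allowed sets below are assumptions for illustration; the real allowed values should be taken from the survey itself.

```python
import pandas as pd

def impossible_values(frame, allowed):
    """Return {column: values found outside the allowed set}, ignoring NaN."""
    problems = {}
    for column, valid in allowed.items():
        observed = frame[column].dropna()
        bad = observed[~observed.isin(valid)].unique().tolist()
        if bad:
            problems[column] = sorted(bad, key=str)
    return problems

# Toy data with one impossible value per column.
toy = pd.DataFrame({
    "Age": ["18-29", "500", "45-60", None],
    "Rank": [1, 7, 3, 2],
    "Gender": ["M", "F", "X", "M"],
})

# Assumed valid values for each attribute.
allowed = {
    "Age": {"18-29", "30-44", "45-60", "> 60"},
    "Rank": set(range(1, 7)),
    "Gender": {"M", "F"},
}

report = impossible_values(toy, allowed)
print(report)
```

An empty report means every checked column passed, so the same call can be repeated after each cleaning step.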
#Task 1.7: Missing values
#To start with I look at how many missing values each column has.
df.isnull().sum()
#For the columns 'Which of the following Star Wars films have you seen? Please select all that apply'...
#...I fill NaN with 'NOT' for "Haven't seen" and I replace the movie names with 'SEEN' for 'I have seen it'.
df.replace(to_replace=['STAR WARS: EPISODE I THE PHANTOM MENACE',
                       'STAR WARS: EPISODE II ATTACK OF THE CLONES',
                       'STAR WARS: EPISODE III REVENGE OF THE SITH',
                       'STAR WARS: EPISODE IV A NEW HOPE',
                       'STAR WARS: EPISODE V THE EMPIRE STRIKES BACK',
                       'STAR WARS: EPISODE VI RETURN OF THE JEDI'],
           value="SEEN", inplace=True)
for i in range(2,8):
    df[df.columns[i]] = df[df.columns[i]].fillna('NOT')
#For movie rankings I replace missing values with 0.
for i in range(8,14):
    df[df.columns[i]] = df[df.columns[i]].fillna(0)
#For character ratings I replace missing values with 'UNFAMILIAR (N/A)'.
for i in range(14,28):
    df[df.columns[i]] = df[df.columns[i]].fillna('UNFAMILIAR (N/A)')
#For household income and education I replace missing values with the mode of the column.
#mode() returns a Series, so I take its first element as the fill value.
df['Household Income'] = df['Household Income'].fillna(df['Household Income'].mode()[0])
df['Education'] = df['Education'].fillna(df['Education'].mode()[0])
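The task lists four ways of dealing with missing values: a fixed value, the column-wise median, the column-wise mean, or dropping the incomplete observations. As a small sketch, here they are side by side on a made-up frame, together with the mode that I use for the categorical columns. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: one numeric and one categorical column with gaps.
toy = pd.DataFrame({
    "rank": [1.0, np.nan, 3.0, 6.0],
    "income": ["LOW", None, "LOW", "HIGH"],
})

fixed = toy["rank"].fillna(0)                             # fixed value
median_filled = toy["rank"].fillna(toy["rank"].median())  # column-wise median
mean_filled = toy["rank"].fillna(toy["rank"].mean())      # column-wise mean
# mode() returns a Series, so [0] picks the most frequent value.
mode_filled = toy["income"].fillna(toy["income"].mode()[0])
dropped = toy.dropna()                                    # ignore incomplete rows

print(median_filled.tolist())
print(len(toy), len(dropped))
```

The mean and median only make sense for numeric columns, which is why the text columns in the survey get a fixed label or the mode instead.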
#Changing data types to categorical:
#After I removed errors I can change the rest of the variables to categorical data.
#Yes No answers
for i in range(2):
    df[df.columns[i]] = pd.Categorical(df[df.columns[i]], ['YES', 'NO'])
for i in range(30,32):
    df[df.columns[i]] = pd.Categorical(df[df.columns[i]], ['YES', 'NO'])
#Seen or Not Star Wars movies
for i in range(2,8):
    df[df.columns[i]] = pd.Categorical(df[df.columns[i]], ['NOT','SEEN'])
#Movie rankings
#for i in range (8,14):
#    df[df.columns[i]] = pd.Categorical(df[df.columns[i]], [0,1,2,3,4,5,6], ordered = True)
#Gender
df['Gender'] = pd.Categorical(df['Gender'], ['M', 'F'])
#Character ratings (columns 14-27, the same range as in the missing-value step)
for i in range(14,28):
    df[df.columns[i]] = pd.Categorical(df[df.columns[i]],
                                       ['UNFAMILIAR (N/A)','VERY UNFAVORABLY','SOMEWHAT UNFAVORABLY',
                                        'NEITHER FAVORABLY NOR UNFAVORABLY (NEUTRAL)',
                                        'SOMEWHAT FAVORABLY','VERY FAVORABLY'], ordered=True)
#Education
df['Education'] = pd.Categorical(df['Education'],
                                 ['LESS THAN HIGH SCHOOL DEGREE','HIGH SCHOOL DEGREE',
                                  'SOME COLLEGE OR ASSOCIATE DEGREE','BACHELOR DEGREE','GRADUATE DEGREE'],
                                 ordered=True)
Data visualization:
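For Task 2.1, one simple way to see how people rate the films is to compare the average rank of each film, where a lower average means a more preferred film. The sketch below assumes that missing rankings were filled with 0 as in Task 1.7, so zeros are excluded before averaging. The helper names mean_ranks and plot_mean_ranks and the toy column names are hypothetical.

```python
import pandas as pd

def mean_ranks(frame, rank_columns):
    """Average rank per film, treating 0 as a missing answer."""
    ranks = frame[rank_columns]
    ranks = ranks.where(ranks > 0)  # 0 becomes NaN and is skipped by mean()
    return ranks.mean().sort_values()

def plot_mean_ranks(summary):
    """Draw the averages as a labelled bar chart and return the axes."""
    import matplotlib.pyplot as plt  # imported here so the summary works without it
    ax = summary.plot(kind="bar")
    ax.set_title("Average rank per Star Wars film (lower is better)")
    ax.set_xlabel("Film")
    ax.set_ylabel("Mean rank")
    plt.tight_layout()
    return ax

# Toy rankings; the 0 stands for a respondent who gave no rank.
toy = pd.DataFrame({
    "Rank I": [6, 5, 0],
    "Rank IV": [2, 1, 3],
    "Rank V": [1, 2, 1],
})

summary = mean_ranks(toy, list(toy.columns))
print(summary)
```

Calling plot_mean_ranks(summary) followed by plt.show() would display the chart. On the real data the rank columns are the ones at positions 8 to 13.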
I hope this helps you understand the basic flow of a data science task. If you face any other issue or need any assignment-related help, you can send your request directly so we can help you as soon as we can.
You can send your request to the following email address:
"realcode4you@gmail.com"