What is machine learning?
1- A set of tools and methods that attempt to extract insight from a record of the observable world and infer patterns.
2- Studying and understanding a phenomenon
Make observations and collect the relevant data
Model the underlying patterns
Use the model to inform our understanding of the phenomenon
Make predictions!!!
3- An important feature of any ML method is its ability to learn and improve with experience, i.e. both existing and new data.
ML attempts to answer:
How does learning performance vary with the number of training examples?
Which learning algorithms are most appropriate for various types of learning tasks?
ML draws on concepts and results from:
Statistics
Artificial intelligence
Philosophy
Information theory
Biology
Cognitive science
Control theory
Introduction to Machine Learning

Supervised Learning
Infers a function that maps a set of inputs (features, predictors, covariates, independent variables) to an output (response, target, outcome, dependent variable) from input/output pairs.
The function is inferred from training examples and then applied to new, unseen examples.
Goals:
Accurately predict unseen cases, i.e. test cases (primary)
Understand the relationship between inputs and output (secondary)
Two sub-categories:
Regression – a continuous outcome
Classification – a categorical/qualitative outcome
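As a rough illustration of inferring a function from input/output pairs, the sketch below fits a straight line by least squares (a regression task) and applies it to an unseen input. Python is used here for illustration only; the data and the names fit_line and predict are made up for this example and are not from the slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x from input/output pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(model, x):
    """Apply the inferred function to a new input."""
    intercept, slope = model
    return intercept + slope * x

# Hypothetical training examples: hours studied (input), exam score (output).
train_x = [1, 2, 3, 4, 5]
train_y = [52, 55, 61, 64, 68]

model = fit_line(train_x, train_y)
print(predict(model, 6))  # prediction for an unseen case, about 72.3
```

For a classification task the output would be a category rather than a number, and the fitted function would return a class label instead.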
Unsupervised Learning
No distinction is made between input(s) and output within a data set.
Attempts to uncover the underlying structure or pattern within a data set.
Can lead to testable hypotheses.
Difficult to know how well you have done.
Two sub-categories:
Dimension reduction – representation of multi-dimensional data in fewer dimensions, typically 2-D or 3-D for visualisation.
Clustering – grouping of objects based on a similarity measure.
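To make the idea of grouping by similarity concrete, here is a minimal sketch of a two-cluster, one-dimensional k-means-style procedure, where similarity is simply closeness to a cluster centre. The function name two_means_1d, the starting centres and the data are assumptions made for illustration; this is not a method prescribed by the slides.

```python
from statistics import mean

def two_means_1d(values, iterations=10):
    """Group 1-D values into two clusters by distance to the cluster centre."""
    centres = [min(values), max(values)]  # simple deterministic starting point
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to the nearest centre.
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            clusters[nearest].append(v)
        # Move each centre to the mean of the values assigned to it.
        centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
    return clusters, centres

heights_cm = [150, 152, 155, 181, 183, 186]  # hypothetical data, no labels
clusters, centres = two_means_1d(heights_cm)
print(clusters)  # [[150, 152, 155], [181, 183, 186]]
```

No output variable is used anywhere: the two groups emerge from the structure of the data alone, which is also why it is hard to say how well the grouping has done.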
Introduction to Data Analysis
What is statistics?
Statistics allows us to learn from our data.
Data are numbers with context.
Data contain information about some group of individuals.
A characteristic of an individual is referred to as a variable.
Data Types
Two main types of data:
Qualitative (categorical) – variables that represent qualities and cannot be measured numerically.
Nominal – characteristics have no order, e.g. eye colour, gender (male/female).
Ordinal – characteristics that are intrinsically ordered, e.g. educational attainment (primary, secondary, tertiary).
Quantitative (numerical)
Discrete – able to take only certain distinct values within an allowable range. The allowable range may be finite or infinite. For example, the outcome of a die roll or the number of students.
Continuous – data measured on a scale, able to take on any value within an allowable range, which may be finite or infinite, e.g. body mass, height.
Exploratory Data Analysis (EDA)
EDA is the process of describing the data and summarising the main characteristics.
Provides some insight into the behaviour of the data.
A critical aspect of EDA is outlier detection.
Describe graphically and numerically.
Describing Qualitative and Discrete Data
Qualitative and discrete (finite) data are typically expressed as count data.
For a better perspective, counts are often expressed as a percentage of the total.
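A minimal sketch of a count table with percentages, using made-up eye-colour data (the variable names are illustrative only):

```python
from collections import Counter

# Hypothetical nominal data.
eye_colour = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]

counts = Counter(eye_colour)
total = sum(counts.values())
percentages = {level: 100 * n / total for level, n in counts.items()}

# Print each level with its count and its share of the total.
for level, n in counts.most_common():
    print(f"{level:<6} {n:>3} {percentages[level]:5.1f}%")
```

Here brown accounts for 4 of 8 observations (50.0%), blue for 3 (37.5%) and green for 1 (12.5%).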

Describing Quantitative Data
Three aspects are addressed:
Measure of Centre: describes how data cluster around a particular value.
Measure of Spread: describes the dispersion/variability of data.
Measure of Shape: describes the distribution (or pattern) of data.
Measure of Centre (Central Tendency)
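The content of this slide is not available in the text. Commonly used measures of centre are the mean, median and mode; the sketch below computes them with Python's statistics module on made-up body-mass data.

```python
from statistics import mean, median, mode

body_mass_kg = [58, 61, 61, 64, 67, 70, 95]  # hypothetical data

print(mean(body_mass_kg))    # 68, the arithmetic mean
print(median(body_mass_kg))  # 64, the middle value of the sorted data
print(mode(body_mass_kg))    # 61, the most frequent value
```

The single large value (95) pulls the mean above the median, which is why the median is often preferred for skewed data or data with outliers.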

Measure of Spread (Dispersion)
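The content of this slide is not available in the text. Commonly used measures of spread are the range, variance, standard deviation and interquartile range (IQR); a minimal sketch on made-up data follows. Note that quartile conventions differ between tools, so other software may report slightly different values of Q_1 and Q_3.

```python
from statistics import quantiles, stdev, variance

body_mass_kg = [58, 61, 61, 64, 67, 70, 95]  # hypothetical data

value_range = max(body_mass_kg) - min(body_mass_kg)  # 37
sample_var = variance(body_mass_kg)                  # 158, uses the n - 1 denominator
sample_sd = stdev(body_mass_kg)                      # square root of the sample variance

# quantiles(..., n=4) returns the three quartile cut points (default method).
q1, q2, q3 = quantiles(body_mass_kg, n=4)            # 61.0, 64.0, 70.0
iqr = q3 - q1                                        # 9.0

print(value_range, sample_var, round(sample_sd, 2), iqr)
```

The range and the standard deviation are both inflated by the value 95, whereas the IQR depends only on the middle half of the data.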


Percentiles and 5-Number Summary
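The content of this slide is not available in the text. The 5-number summary is conventionally the minimum, first quartile (Q_1), median, third quartile (Q_3) and maximum. The helper five_number_summary below is a hypothetical name for a minimal sketch of it, again on made-up data and using the default quartile method of Python's statistics.quantiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)  # the 25th, 50th and 75th percentiles
    return min(values), q1, q2, q3, max(values)

body_mass_kg = [58, 61, 61, 64, 67, 70, 95]  # hypothetical data
print(five_number_summary(body_mass_kg))     # (58, 61.0, 64.0, 70.0, 95)
```

These five numbers are what a boxplot draws.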


Measure of Shape
Two measures:
1. Skewness – a measure of symmetry or, more precisely, the lack of symmetry. A distribution is symmetric if it looks the same to the left and right of the centre point.
2. Kurtosis – a measure of tailedness relative to a normal distribution.
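Both measures can be sketched from central moments. The version below uses the simple moment-based (population) definitions, which is an assumption: statistical packages often apply small-sample corrections, so their values may differ slightly. Excess kurtosis subtracts 3, the kurtosis of a normal distribution.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: the average of (x - mean)^k."""
    m = fmean(values)
    return fmean([(v - m) ** k for v in values])

def skewness(values):
    """Third central moment scaled by the 1.5th power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by the squared second, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical data
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical data with a long right tail

print(skewness(symmetric))        # 0.0, no asymmetry
print(skewness(right_skewed))     # positive, long right tail
print(excess_kurtosis(symmetric)) # negative, lighter tails than a normal distribution
```

A positive skewness indicates a longer right tail and a negative one a longer left tail; positive excess kurtosis indicates heavier tails than the normal distribution.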

Skewed Distributions

Excess Kurtosis


Data Quality Issues
What are the issues to consider?
Data entry errors
Values outside of expected range(s)
Missing values
Denoted NA in R
More than 20% missing is generally considered problematic.
Outliers
Certain descriptive statistics and modelling are sensitive to them
Can lead to biased estimates and potentially incorrect findings
Handling Missing Values
Approach 1: Drop any features with missing values
Typically not recommended
Depends on the % of missing values
The is.na(.) command in R can be used to check for missing values
Approach 2: Analyse complete cases only
Use the na.omit(.) command in R
Important to note % of cases removed
Approach 3: Impute the missing values
Mean imputation, regression imputation, k-NN imputation, etc.
Recommended only for continuous data
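The slides refer to R's is.na(.) and na.omit(.). The sketch below illustrates the same three ideas in plain Python instead: checking the percentage missing, keeping complete cases only, and mean imputation. It assumes None stands in for NA; the data and the helper names percent_missing and impute_mean are made up for illustration.

```python
from statistics import mean

# Hypothetical records; None marks a missing value (NA in R).
records = [
    {"height": 170.0, "mass": 65.0},
    {"height": None,  "mass": 72.0},
    {"height": 182.0, "mass": None},
    {"height": 165.0, "mass": 58.0},
]

def percent_missing(rows, feature):
    """Percentage of rows in which the feature is missing."""
    return 100 * sum(r[feature] is None for r in rows) / len(rows)

def impute_mean(rows, feature):
    """Replace missing values of a continuous feature by the observed mean."""
    fill = mean(r[feature] for r in rows if r[feature] is not None)
    return [{**r, feature: fill if r[feature] is None else r[feature]} for r in rows]

print(percent_missing(records, "height"))  # 25.0

# Approach 2: complete cases only, noting the percentage of cases removed.
complete = [r for r in records if all(v is not None for v in r.values())]
removed_pct = 100 * (len(records) - len(complete)) / len(records)
print(len(complete), removed_pct)          # 2 50.0

# Approach 3: mean imputation of one continuous feature.
imputed = impute_mean(records, "height")
print(imputed[1]["height"])                # mean of the three observed heights
```

Even with only one missing value per feature, complete-case analysis discards half of this small data set, which is why the percentage of cases removed should always be reported.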
Detecting Outliers
Check the range, i.e. min to max
Visualise the data, e.g. histograms, boxplots, etc.
Use thresholds, e.g. values below Q_1 - 1.5 × IQR or above Q_3 + 1.5 × IQR, z-scores outside of ±3, etc.
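The two threshold rules can be sketched as follows. The function names and data are hypothetical, and the quartiles use the default method of Python's statistics.quantiles, so other tools may place the fences slightly differently.

```python
from statistics import mean, quantiles, stdev

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k * IQR or above Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def z_score_outliers(values, threshold=3.0):
    """Values whose z-score lies outside +/- threshold."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs((v - m) / s) > threshold]

body_mass_kg = [58, 61, 61, 64, 67, 70, 95]  # hypothetical data

print(iqr_outliers(body_mass_kg))      # [95], fences are 47.5 and 83.5
print(z_score_outliers(body_mass_kg))  # [], the z-score of 95 is about 2.15
```

The two rules can disagree: in a small sample an extreme value inflates the standard deviation, so its z-score may never reach 3 even though the IQR rule flags it.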
Handling Outliers
Approach 1: Remove them
Typically not recommended, in particular with smaller datasets
Somewhat acceptable for large datasets
Approach 2: Investigate the source to find out why they occurred
Approach 3: Non-linear data transformation
Square-root and log-transformation for right-skewed data.
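As a rough illustration of how these transformations reduce right skew, the sketch below compares a simple moment-based skewness before and after transforming made-up data that doubles at each step. The skewness helper is an assumption for this example, not a definition taken from the slides.

```python
import math
from statistics import fmean

def skewness(values):
    """Moment-based skewness: third central moment / second central moment^1.5."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

raw = [1, 2, 4, 8, 16, 32, 64]             # hypothetical right-skewed data
sqrt_values = [math.sqrt(v) for v in raw]
log_values = [math.log(v) for v in raw]    # requires strictly positive values

print(round(skewness(raw), 2))             # clearly positive
print(round(skewness(sqrt_values), 2))     # smaller, but still positive
print(round(skewness(log_values), 2))      # essentially zero for this data
```

The log transformation is the stronger of the two. It is undefined for zero or negative values, so such data need a shift or a different transformation.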