Introduction to Machine Learning and Data Analysis

What is machine learning?

1- A set of tools and methods that attempt to extract insight from a record of the observable world and infer patterns.

2- Studying and understanding a phenomenon

Make observations and collect the relevant data
Model the underlying patterns
Use the model to inform our understanding of the phenomenon
Make predictions

3- An important feature of any ML method is its ability to learn and improve with experience, i.e. as it is exposed to both existing and new data.

ML attempts to answer questions such as:

How does learning performance vary with the number of training examples?

Which learning algorithms are most appropriate for various types of learning tasks?

ML draws on concepts and results from:

Statistics
Artificial intelligence
Philosophy
Information theory
Biology
Cognitive science
Control theory

Introduction to Machine Learning

Supervised Learning

Infers a function that maps a set of inputs (features, predictors, covariates, independent variables) to an output (response, target, outcome, dependent variable) from input/output pairs.

The function is inferred from training examples and is then used to map new, unseen examples to outputs.

Goals:

Accurately predict unseen cases, i.e. test cases (primary)
Understand the relationship between inputs and output (secondary)

Two sub-categories:

Regression – a continuous outcome
Classification – a categorical/qualitative outcome
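
As a minimal sketch of the two sub-categories, the Python below infers a function from input/output pairs and applies it to an unseen input. The data values and the helper names (fit_line, predict, nearest_label) are made up for illustration; least squares and 1-nearest-neighbour are just two simple choices of method, not ones the source prescribes.

```python
# Regression: infer a straight line from hypothetical training pairs.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.0, 5.0, 7.0, 9.0, 11.0]  # made-up values lying on y = 2x + 1


def fit_line(xs, ys):
    """Return (slope, intercept) estimated by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply the inferred function to a new input."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(train_x, train_y)
print(predict(model, 6.0))  # continuous outcome for an unseen case

# Classification: label a new input with the label of its closest
# training input (1-nearest-neighbour).
labelled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]


def nearest_label(train, x):
    """Return the label of the training input closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


print(nearest_label(labelled, 7.5))  # categorical outcome
```

In practice the model would be evaluated on held-out test cases rather than on a single new input.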

Unsupervised Learning

No distinction is made between input(s) and output within a data set.
Attempts to uncover the underlying structure or pattern within a data set.
Can lead to testable hypotheses.
Difficult to assess how well the method has done, since there is no known output to compare against.

Two sub-categories:

Dimension reduction – representation of multi-dimensional data in fewer dimensions, e.g. 2-D or 3-D for visualisation.
Clustering – grouping of objects based on a similarity measure.
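
To make clustering concrete, here is a small sketch of k-means on one-dimensional data, using absolute distance as the similarity measure. The function name k_means_1d, the data and the starting centres are all hypothetical; k-means is only one of many clustering methods and is not named in the source.

```python
def k_means_1d(values, centres, iterations=10):
    """Group 1-D values around the given starting centres (k-means sketch)."""
    groups = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: each value joins its closest centre.
        groups = [[] for _ in centres]
        for v in values:
            idx = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[idx].append(v)
        # Update step: each centre moves to the mean of its group
        # (an empty group keeps its previous centre).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]  # made-up observations
centres, groups = k_means_1d(values, centres=[0.0, 5.0])
print(centres)
print(groups)
```

No output labels are used anywhere: the grouping comes only from how close the values are to each other.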

Introduction to Data Analysis

What is statistics?

Statistics allows us to learn from data.
Data are numbers with context.
Data contain information about some group of individuals.
A characteristic of an individual is referred to as a variable.

Data Types

Two main types of data:

Qualitative (categorical) – variables that represent qualities or categories and cannot be measured on a numerical scale.

Nominal – characteristics that have no intrinsic order, e.g. eye colour, gender (male/female).
Ordinal – characteristics that are intrinsically ordered, e.g. educational attainment (primary, secondary, tertiary).

Quantitative (numerical)

Discrete – able to take only certain distinct values within an allowable range. The allowable range may be finite or infinite. Examples include the outcome of a die roll and the number of students.
Continuous – data measured on a scale, able to take on any value within an allowable range, which may be finite or infinite, e.g. body mass, height.

Exploratory Data Analysis (EDA)

EDA is the process of describing the data and summarising the main characteristics.
Provides some insight into the behaviour of the data.
A critical aspect of EDA is outlier detection.
The data are described both graphically and numerically.

Describing Qualitative and Discrete Data

Qualitative and discrete (finite) data are typically expressed as count data.
For a better perspective, counts are often expressed as a percentage of the total.
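
A short sketch of this summary in Python, using a made-up list of eye colours: the counts are tallied first and then expressed as a percentage of the total.

```python
from collections import Counter

# Hypothetical nominal variable observed on eight individuals.
eye_colour = ["brown", "blue", "brown", "green",
              "brown", "blue", "brown", "blue"]

counts = Counter(eye_colour)          # frequency of each category
total = sum(counts.values())
percentages = {category: 100 * n / total for category, n in counts.items()}

for category, n in counts.most_common():
    print(f"{category}: {n} ({percentages[category]:.1f}%)")
```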

Describing Quantitative Data

Three aspects are addressed:

Measure of Centre: describes how data cluster around a particular value.
Measure of Spread: describes the dispersion/variability of data.
Measure of Shape: describes the distribution (or pattern) of data.

Measure of Centre (Central Tendency)
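
The source does not list the individual measures here, so the sketch below assumes the usual three (mean, median and mode) and computes them with Python's statistics module on made-up body-mass values.

```python
import statistics

# Hypothetical body masses in kg; one large value pulls the mean upwards.
body_mass = [58.0, 61.5, 61.5, 64.0, 70.0, 95.0]

mean = statistics.mean(body_mass)      # arithmetic average
median = statistics.median(body_mass)  # middle value of the sorted data
mode = statistics.mode(body_mass)      # most frequent value

print(round(mean, 2), median, mode)  # 68.33 62.75 61.5
```

The mean sits well above the median here because of the single value of 95.0, which illustrates why the median is often preferred when extreme values are present.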

Measure of Spread (Dispersion)
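
The individual measures are not listed in the source, so this sketch assumes the common ones: range, variance and standard deviation. The data are made up, and both the population (divide by n) and sample (divide by n - 1) versions are shown because tools differ in which one they report by default.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up observations

value_range = max(data) - min(data)

# Population versions divide the squared deviations by n.
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

# Sample versions divide by n - 1.
sample_var = statistics.variance(data)
sample_sd = statistics.stdev(data)

print(value_range, pop_var, pop_sd)
print(round(sample_var, 3), round(sample_sd, 3))
```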

Percentiles and 5-Number Summary
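
As an illustrative sketch, the 5-number summary (minimum, Q_1, median, Q_3, maximum) and an arbitrary percentile can be computed with statistics.quantiles. The data are made up, and the "inclusive" interpolation method is an assumption: several quantile conventions exist and they can give slightly different answers on small samples.

```python
import statistics

data = [7.0, 1.0, 9.0, 3.0, 5.0, 2.0, 8.0, 4.0, 6.0]  # made-up observations

# Quartiles: the three cut points that split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, median, q3, max(data))
iqr = q3 - q1  # interquartile range

# Percentiles: 99 cut points; index p - 1 holds the p-th percentile.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p90 = percentiles[90 - 1]

print(five_number)  # (1.0, 3.0, 5.0, 7.0, 9.0)
print(iqr, p90)
```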

Measure of Shape

Two measures:

1. Skewness – a measure of symmetry, or more precisely, the lack of symmetry. A distribution is symmetric if it looks the same to the left and right of the centre point.

2. Kurtosis – a measure of tailedness relative to a normal distribution.
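
Both measures can be sketched from central moments. The version below uses the simple moment-based definitions (dividing by n), which is an assumption: statistical packages often apply small-sample corrections, so their values may differ slightly. Excess kurtosis subtracts 3, the kurtosis of a normal distribution.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^(3/2); 0 for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (the value for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]           # made-up symmetric data
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]  # made-up data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```

A positive skewness indicates a longer right tail, a negative one a longer left tail; a negative excess kurtosis indicates lighter tails than a normal distribution.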

Skewed Distributions

Excess Kurtosis

Data Quality Issues

What are the issues to consider?

Data entry errors

Values outside of expected range(s)

Missing values

Denoted as NA in R
More than 20% missing values is generally a cause for concern.

Outliers

Certain descriptive statistics and models are sensitive to them
Can lead to biased estimates and potentially incorrect findings

Handling Missing Values

Approach 1: Drop any features with missing values

Typically not recommended
Depends on the % of missing values
The is.na(.) function in R can be used to check for missing values

Approach 2: Analyse complete cases only

Use the na.omit(.) function in R
Important to note % of cases removed

Approach 3: Impute the missing values

Mean imputation, regression imputation, k-NN imputation, etc.
Recommended only for continuous data
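
The source refers to R's is.na(.) and na.omit(.); the sketch below illustrates the same three ideas in plain Python, with None standing in for NA. The records and the helper names (percent_missing, complete_cases, impute_mean) are hypothetical, and mean imputation is shown only because it is the simplest of the listed options.

```python
import statistics

# Made-up records; None marks a missing value.
records = [
    {"height": 170.0, "mass": 65.0},
    {"height": None, "mass": 72.0},
    {"height": 182.0, "mass": None},
    {"height": 165.0, "mass": 58.0},
    {"height": 174.0, "mass": 80.0},
]


def percent_missing(rows, column):
    """Percentage of rows whose value in `column` is missing."""
    missing = sum(1 for row in rows if row[column] is None)
    return 100 * missing / len(rows)


def complete_cases(rows):
    """Keep only rows with no missing values (complete-case analysis)."""
    return [row for row in rows if all(v is not None for v in row.values())]


def impute_mean(rows, column):
    """Return copies of the rows with missing `column` values set to the observed mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = statistics.mean(observed)
    return [{**row, column: fill if row[column] is None else row[column]}
            for row in rows]


print(percent_missing(records, "height"))  # check before dropping a feature
kept = complete_cases(records)
print(100 * (len(records) - len(kept)) / len(records))  # % of cases removed
print(impute_mean(records, "height")[1]["height"])
```

Note that two missing values in different rows remove 40% of these cases, which is why the percentage of cases removed is worth reporting.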

Detecting Outliers

Check the range, i.e. min to max
Visualise the data, e.g. histograms, boxplots, etc.
Use thresholds, e.g. values below Q_1 − 1.5 × IQR or above Q_3 + 1.5 × IQR, z-scores outside of ±3, etc.
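
The two threshold rules can be sketched as follows. The function names and data are made up, and the quartiles use the "inclusive" interpolation method of statistics.quantiles, which is an assumption since quantile conventions vary between tools.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def z_score_outliers(values, threshold=3.0):
    """Values whose z-score (using the sample standard deviation) exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs((v - mean) / sd) > threshold]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # made-up data with one extreme value

print(iqr_outliers(data))      # [50]
print(z_score_outliers(data))  # []
```

The z-score rule misses the extreme value here because the value itself inflates the standard deviation; in a sample this small, no z-score can reach 3, so the IQR rule is the more useful check.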

Handling Outliers

Approach 1: Remove them

Typically not recommended, in particular with smaller datasets
Somewhat acceptable for large datasets

Approach 2: Investigate the source to find out why the outlier occurred

Approach 3: Non-linear data transformation

Square-root or log transformation for right-skewed data.
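
As a rough illustration of why these transformations help, the sketch below compares moment-based skewness before and after transforming made-up right-skewed data. The skewness helper is a simple assumption (dividing by n), and the log transformation requires strictly positive values.

```python
import math


def skewness(values):
    """Moment-based skewness (dividing by n); positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]  # made-up right-skewed data

sqrt_transformed = [math.sqrt(v) for v in raw]
log_transformed = [math.log(v) for v in raw]  # values must be > 0

print(round(skewness(raw), 2))
print(round(skewness(sqrt_transformed), 2))
print(round(skewness(log_transformed), 2))
```

Both transformations compress large values more than small ones, so the right tail is pulled in; the log is the stronger of the two and makes this particular data set symmetric.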
