MACHINE LEARNING OVERVIEW
AI, ML, and DL
What is Artificial Intelligence?
What is Machine Learning?
What is Deep Learning?
Artificial intelligence
The term was first introduced in 1956 at a conference where researchers wanted to digitize how the human brain works
AI is the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence ~ Andrew Moore
AI is a moving target, based on capabilities that humans possess but machines do not, e.g., emotion
AI encompasses technology advances in different fields such as Machine Learning, Human-Computer Interaction, etc.
Examples of AI: Deep Blue, and to some extent Google Home, Siri, and Alexa
Machine Learning and Deep Learning
Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience ~ Tom Mitchell
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E ~ Tom Mitchell
The goal of ML is never to make “perfect” guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.
Deep Learning
It is a class of machine learning algorithms inspired by the structure of the human brain.
Deep learning algorithms use complex multi-layered neural networks, where the level of abstraction increases gradually by non-linear transformations of input data.
How can a machine learn?
Dataset
The samples need to be representative
The samples can include numbers, images, text, etc
Features
Important pieces of data that serve as the key to solving the task
Tell the machine/program what to pay attention to
Algorithm
The same task can be solved using different algorithms
The accuracy or speed of getting results can be different
If the dataset quality is high and the features are chosen well, an ML-powered system can perform better than a human for a given task
Machine Learning Problem
Many real-world problems are complex. Inventing specialized algorithms to solve them perfectly every time is not practical
Some examples:
– How can we predict future traffic patterns at an intersection?
– Is it cancer?
– What is the market value of this house five years from now?
– Which of these candidates is the best one for the job?
– Which of these people could be my best friend/partner?
– Will a certain person like this movie or not?
– How can I slice the banana to make a perfect peanut butter banana sandwich? (https://www.ethanrosenthal.com/2020/08/25/optimal-peanut-butter-and-banana-sandwiches/)
Machine Learning Algorithm
Generally divided into supervised and unsupervised learning (plus reinforcement learning), based on whether they are trained with human supervision and whether the training data is labeled or not
Whether or not they can learn incrementally on the fly (online versus batch learning)
Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)
Supervised Learning
The training data fed to the algorithm includes the desired solutions, called labels
It models the relationship between the target prediction output and the input features, such that we can predict the output values for new data based on those relationships learned from past data
The goal is to develop a finely tuned predictor function h(x) (sometimes called the “hypothesis”) so that, given input data x about a certain domain (e.g., the square footage of a house), it will predict an interesting value h(x) (e.g., the market price of the house).
Two major categories are regression and classification
Some of the most important supervised learning algorithms include (a short regression sketch follows the list):
k-Nearest Neighbors
Linear Regression
Logistic Regression
Support Vector Machines (SVMs)
Decision Trees and Random Forests
Neural networks
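As a minimal illustration of a supervised regression predictor h(x), the following sketch fits a linear model to made-up square-footage/price pairs (the numbers are invented for illustration, not real housing data):

```python
# Supervised learning: labeled training data in, a predictor h(x) out.
import numpy as np
from sklearn.linear_model import LinearRegression

# Labeled training data: input feature x (square footage) and label y (price).
X_train = np.array([[800], [1200], [1500], [2000], [2500]])
y_train = np.array([150_000, 200_000, 240_000, 310_000, 370_000])

# Fit the hypothesis h(x); for linear regression, h(x) = w*x + b.
h = LinearRegression().fit(X_train, y_train)

# Predict the "interesting value" for a new, unseen input.
print(h.predict([[1800]]))  # estimated market price for an 1800 sq ft house
```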
If it quacks like a duck, waddles like a duck, and swims like a duck, then….
It can be a mallard, which is a species of duck
If it has a flat beak to catch worms and has webbed feet…
It does not need to be a duck. It can be a platypus
If it walks on four legs and has a long nose….
We need more information. It could be an elephant, but it could also be some small mouse-like animal
Unsupervised Learning
In unsupervised learning, the training data is unlabeled
Unsupervised machine learning is typically tasked with finding relationships and correlations within data.
– Used mostly for pattern detection and descriptive modeling
Some of the most important unsupervised learning algorithms include (a clustering sketch follows the list):
– Clustering
– Visualization and dimensionality reduction
– Association rule learning
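A minimal clustering sketch, assuming scikit-learn is available; the 2-D points are invented so the two groups are easy to see:

```python
# Unsupervised learning: no labels are given; k-means groups the points
# by similarity on its own.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one blob
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])  # another blob

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                 # cluster assignment for each point
print(kmeans.predict([[0.9, 1.0]]))   # cluster for a new point
```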
Supervised vs Unsupervised
Instance-based and Model-Based Learning
Instance-Based Learning
System generalizes to new cases based on a similarity measure to known cases.
Model-Based Learning
System builds a model from the training examples and then uses that model to make predictions on new cases.
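The contrast can be made concrete with a small sketch: k-NN (instance-based) keeps the training instances and compares new points to them, while logistic regression (model-based) fits parameters and then no longer needs the data. The toy data is invented for illustration:

```python
# Instance-based vs model-based learning on the same toy data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # memorizes instances
model = LogisticRegression().fit(X, y)                # learns parameters

print(knn.predict([[1.6]]), model.predict([[1.6]]))
```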
Batch Learning
The system is incapable of learning incrementally: it must be trained using all the available data - offline.
First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned.
When new data comes, you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one.
Online Learning
The system is trained incrementally by feeding it data instances sequentially, individually or in mini-batches.
The system can learn about new data on the fly, as it arrives
– Great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously
– Also good for limited computing resources, and huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning)
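A rough online-learning sketch using scikit-learn's SGDRegressor, whose partial_fit method supports incremental updates; the mini-batches below are synthetic stand-ins for a real data stream:

```python
# Online learning: the model is updated one mini-batch at a time instead of
# being retrained from scratch on the full dataset.
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
rng = np.random.default_rng(0)

for _ in range(100):                      # each iteration = one mini-batch
    X_batch = rng.uniform(0, 10, size=(32, 1))
    y_batch = 3 * X_batch.ravel() + rng.normal(0, 0.5, size=32)
    model.partial_fit(X_batch, y_batch)   # incremental update

print(model.coef_, model.intercept_)      # should move toward 3 and 0
```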
Main Challenges of Machine Learning
Machine learning involves selecting a learning algorithm and training it on some data.
"Bad data" and "bad algorithms" are the two things that can go wrong
Data Challenges
Insufficient quantity of training data – it takes a lot of data for most machine learning algorithms to work properly
Non-representative training data - In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to.
Poor-quality data – outliers, missing values, etc.
Irrelevant features - A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves (a feature-selection sketch follows this list):
– Feature selection: selecting the most useful features to train on among existing features.
– Feature extraction: combining existing features to produce a more useful one.
– Feature generation: creating new features by gathering new data.
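A small feature-selection sketch, assuming scikit-learn; the synthetic data has one informative feature and one pure-noise feature, so SelectKBest should keep only the first:

```python
# Feature selection: keep the k features most correlated with the target.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
informative = rng.uniform(0, 1, size=(100, 1))
noise = rng.uniform(0, 1, size=(100, 1))
X = np.hstack([informative, noise])
y = 2 * informative.ravel() + rng.normal(0, 0.1, size=100)

selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
print(selector.get_support())  # [ True False]: keeps the informative feature
```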
Bad Algorithms
Overfitting the Training Data - the model performs well on the training data, but it does not generalize well (both failure modes are illustrated in the sketch after this list).
Underfitting the Training Data - occurs when your model is too simple to learn the underlying structure of the data
– The main options to fix this problem are:
• Selecting a more powerful model, with more parameters
• Feeding better features to the learning algorithm (feature engineering)
• Reducing the constraints on the model
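Both failure modes can be seen by varying model capacity on the same data. In this sketch (synthetic cubic data, invented for illustration), degree 1 underfits, degree 15 overfits, and degree 3 is about right:

```python
# Under- vs overfitting: compare training and test error as capacity grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(0, 1, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
```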
Model Testing and Validation
Helps to determine how well the model generalizes to new cases
Achieved by splitting the data to training set and validation/test set.
Evaluating the model on the test set helps to assess how well it will perform on new instances of data
Cross-validation is used to evaluate several models
– The training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts.
– The selected model is then trained on the full training set, and the generalization error is measured on the test set.
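A minimal cross-validation sketch using scikit-learn's cross_val_score; the bundled iris dataset is used purely for convenience:

```python
# Cross-validation: train and validate on complementary folds of the data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # accuracy on each fold and the average
```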
Machine Learning Steps
The main steps of a machine learning project include:
1. Look at the big picture (problem definition).
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.
Machine Learning Pipeline
Essential Python Libraries for Data Science
Python data ecosystem libraries commonly used in data science include:
NumPy
Pandas
Matplotlib (and its cousins)
IPython and Jupyter
SciPy
Scikit-learn
Statsmodels
Keras
Tensorflow
NumPy: Short for Numerical Python
Provides the data structures, algorithms, and library glue needed for numerical computing in Python
Acts as a container for data to be passed between algorithms and libraries.
NumPy contains, among other things:
– A fast and efficient multidimensional array object ndarray
– Functions for performing element-wise computations with arrays or mathematical operations between arrays
– Tools for reading and writing array-based datasets to disk
– Linear algebra operations, Fourier transform, and random number generation
– A mature C API to enable Python extensions and native C or C++ code to access NumPy’s data structures and computational facilities
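A few NumPy basics matching the bullets above (the ndarray object, element-wise math, linear algebra, random numbers); a minimal sketch, not a tour of the full API:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2x2 ndarray

print(a * 10)            # element-wise computation
print(a @ a)             # matrix multiplication (linear algebra)
print(np.linalg.inv(a))  # matrix inverse
print(np.random.default_rng(0).normal(size=3))  # random number generation
```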
Pandas
Provide high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive
Blend the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases
Provide sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data.
Two of the most important data structures of pandas are:
– DataFrame - a tabular, column-oriented data structure with both row and column labels
– Series - a one-dimensional labeled array object
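A short sketch of both structures; the values are made up:

```python
import pandas as pd

# DataFrame: tabular data with row and column labels.
df = pd.DataFrame(
    {"sqft": [800, 1200, 2000], "price": [150_000, 200_000, 310_000]},
    index=["house_a", "house_b", "house_c"],
)

prices = df["price"]                  # a Series: 1-D labeled array
print(df.loc["house_b"])              # label-based row selection
print(df[df["sqft"] > 1000])          # slice-and-dice by condition
print(df["price"].mean())             # simple aggregation
```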
Matplotlib and Other Data Visualization Libraries
Matplotlib is the most popular Python library for producing publication-quality plots and other two-dimensional data visualizations.
Matplotlib is like the mother of all Python visualization libraries. It serves as an excellent base, enabling coders to “wrap” other tools over it.
Seaborn supports some more complex visualization approaches, but it still requires matplotlib knowledge to fine-tune things.
Bokeh is a robust tool for setting up your own visualization server but maybe a bit overkill when creating simple scenarios.
Geoplotlib will get the job done if you need to visualize geographic data.
Ggplot shows a lot of promise but still has a lot of growing up to do.
Plot.ly generates the most interactive graphs, which can be saved offline to create vivid web-based visualizations.
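A minimal matplotlib sketch of the kind of two-dimensional plot the library is best known for:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("sine.png")   # or plt.show() in an interactive session
```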
IPython and Jupyter
IPython is designed from the ground up to maximize your productivity in both interactive computing and software development.
Component of the much broader Jupyter open source project
Designed to accelerate the writing, testing, and debugging of Python code.
Jupyter Notebook is an interactive web-based code “notebook” offering support for dozens of programming languages.
SciPy
Collection of packages addressing a number of different standard problem domains in scientific computing, such as:
scipy.linalg (linear algebra routines)
scipy.optimize (function optimizers)
scipy.sparse (sparse matrices and sparse linear system solvers)
scipy.special (a wrapper around the Fortran SPECFUN library, implementing many math functions), and
scipy.stats (probability distributions, statistical tests, and descriptive statistics)
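A small scipy.stats sketch touching two of the listed domains (probability distributions and statistical tests); the sample is synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100)

print(stats.norm.pdf(0.0))                  # density of N(0, 1) at x = 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(t_stat, p_value)                      # one-sample t-test
```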
Scikit-learn
General-purpose machine learning toolkit for Python
Along with pandas, statsmodels, and IPython, scikit-learn has been critical for enabling Python to be a productive data science programming language
It includes submodules for such models as:
Classification: SVM, nearest neighbors, random forest, logistic regression, etc.
Regression: Lasso, ridge regression, etc.
Clustering: k-means, spectral clustering, etc.
Dimensionality reduction: PCA, feature selection, matrix factorization, etc.
Model selection: Grid search, cross-validation, metrics
Preprocessing: Feature extraction, normalization
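A sketch combining several of these submodules (preprocessing, classification, and model selection via grid search with cross-validation); the iris dataset is used for convenience:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# Grid search over the SVM's regularization strength C, with 5-fold CV.
search = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```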
Statsmodels
Statsmodels is more focused on statistical inference, providing uncertainty estimates and p-values for parameters. Scikit-learn, by contrast, is more prediction-focused.
It contains algorithms for classical statistics and econometrics and includes submodules such as:
– Regression models: Linear regression, generalized linear models, robust linear models, linear mixed effects models, etc.
– Analysis of variance (ANOVA)
– Time series analysis: AR, ARMA, ARIMA, VAR, and other models
– Nonparametric methods: Kernel density estimation, kernel regression
– Visualization of statistical model results.
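A minimal statsmodels sketch: the OLS summary reports the inference-oriented outputs mentioned above (parameter estimates, standard errors, p-values). The data is synthetic:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)

X = sm.add_constant(x)          # add intercept column
results = sm.OLS(y, X).fit()
print(results.summary())        # coefficients with p-values and CIs
```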
TensorFlow & Keras
TensorFlow is an end-to-end open source platform for machine learning, most especially for neural networks and deep learning.
TensorFlow contains a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications. However, it is not that easy to use.
Keras is a high-level API built on TensorFlow that is easy to use
TensorFlow + Keras integration means that you can:
– Define your model using the easy-to-use interface of Keras
– And then drop down into TensorFlow if you need: (1) specific TensorFlow functionality, or (2) to implement a custom feature that Keras does not support but TensorFlow does.
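A minimal sketch of the Keras-on-TensorFlow workflow; the layer sizes and input shape are arbitrary choices for illustration:

```python
import tensorflow as tf
from tensorflow import keras

# A small binary classifier defined with the high-level Sequential API.
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),              # 20 input features
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=5)  # train once real data is available
```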