A group of predictors is called an ensemble. Thus, using multiple predictors (or learners) to make a single decision/prediction is called ensemble learning, and an ensemble learning algorithm is called an ensemble method.
An ensemble of Decision Trees is called a random forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.
In this module we will discuss random forests and the most popular ensemble methods, including bagging, boosting, and stacking.
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "ensembles"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
Voting Classifiers
Suppose you have trained a few classifiers, each one achieving about 80% accuracy. A simple way to create an even better classifier is to aggregate the predictions of all the classifiers and predict the class that gets the most votes. This majority-vote approach is called hard voting.
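For instance (a hypothetical toy example, not from the original text), if three classifiers predict the classes below for a single instance, hard voting simply returns the most common prediction:
# Hypothetical illustration of hard (majority) voting on a single instance
from collections import Counter
predictions = ["class_A", "class_B", "class_A"]  # outputs of three hypothetical classifiers
majority_class = Counter(predictions).most_common(1)[0][0]
print(majority_class)  # "class_A" wins with 2 votes out of 3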
The main idea behind ensemble methods is that many weak learners (each predictor doing only slightly better than random guessing) can be combined into a strong learner (one achieving high accuracy), provided the weak learners are sufficiently numerous and sufficiently diverse. The coin-toss simulation below illustrates why: a coin biased to land heads only 51% of the time will still show a clear majority of heads once it has been tossed enough times.
heads_proba = 0.51
coin_tosses = (np.random.rand(10000, 10) < heads_proba).astype(np.int32)
cumulative_heads_ratio = np.cumsum(coin_tosses, axis=0) / np.arange(1, 10001).reshape(-1, 1)
plt.figure(figsize=(8,3.5))
plt.plot(cumulative_heads_ratio)
plt.plot([0, 10000], [0.51, 0.51], "k--", linewidth=2, label="51%")
plt.plot([0, 10000], [0.5, 0.5], "k-", label="50%")
plt.xlabel("Number of coin tosses")
plt.ylabel("Heads ratio")
plt.legend(loc="lower right")
plt.axis([0, 10000, 0.42, 0.58])
save_fig("law_of_large_numbers_plot")
plt.show()
Output: a plot of the cumulative heads ratio for ten independent runs, each converging toward 51% as the number of tosses grows (the law of large numbers).
The following code creates and trains a voting classifier in Scikit-Learn composed of three diverse classifiers. Let's start by getting some data: we will work with the moons dataset generated by make_moons.
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Create objects of the three classifiers above
log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", random_state=42)
# Put these objects inside a VotingClassifier using hard (majority) voting
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
# Now fit the ensemble
voting_clf.fit(X_train, y_train)
Output:
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('svc', SVC(random_state=42))])
# Look at each classifier's accuracy on the test set
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
Output:
LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912
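To see what voting='hard' does under the hood, here is a minimal sketch (our addition, not part of the original listing) that collects each fitted classifier's predictions and takes a per-sample majority vote; for this binary problem the result should match voting_clf.predict(X_test). The names individual_preds and majority_vote are ours.
# Sketch (our addition): reproduce hard voting by hand for this binary problem
individual_preds = np.stack([clf.predict(X_test) for clf in (log_clf, rnd_clf, svm_clf)])
# With labels 0/1 and three voters, the majority class is 1 when at least two classifiers predict 1
majority_vote = (individual_preds.sum(axis=0) >= 2).astype(np.int64)
print("Matches VotingClassifier:", np.array_equal(majority_vote, voting_clf.predict(X_test)))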
If all classifiers can estimate class probabilities (that is, they all have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting, and it often outperforms hard voting because it gives more weight to highly confident votes. All you need to change is voting='soft', and for SVC you must set probability=True so that it exposes predict_proba().
log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", probability=True, random_state=42)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)
Output:
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
('rf', RandomForestClassifier(random_state=42)),
('svc', SVC(probability=True, random_state=42))],
voting='soft')
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
Output:
LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92
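Similarly, as a sanity check on what voting='soft' computes, the sketch below (our addition, not from the original listing) averages the three classifiers' predict_proba() outputs and picks the class with the highest mean probability; its predictions should agree with voting_clf.predict(X_test). The names avg_proba and soft_vote are ours.
# Sketch (our addition): reproduce soft voting by averaging class probabilities
probas = [clf.predict_proba(X_test) for clf in (log_clf, rnd_clf, svm_clf)]
avg_proba = np.mean(probas, axis=0)  # shape: (n_samples, n_classes)
soft_vote = np.argmax(avg_proba, axis=1)
print("Matches VotingClassifier:", np.array_equal(soft_vote, voting_clf.predict(X_test)))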