# Ensemble Learning

## ensemble

\ɑ̃.sɑ̃bl\
1. set, e.g. $\;\mathbb{R}, \{ 0, 1 \}$
2. together

## Terminology

### Features, Labels, etc.

Example: Iris dataset

### Conventions

• data: X
• labels/targets: y
• instance/sample: one row of X, i.e. one data point

### In Python's Scikit-Learn


In [1]: from sklearn.datasets import load_iris

In [2]: iris = load_iris()

In [3]: iris.target_names
Out[3]: ["setosa", "versicolor", "virginica"]

In [4]: X, y = iris.data, iris.target

In [5]: X.shape
Out[5]: (150, 4)

In [6]: n_instances, n_features = X.shape

In [7]: n_instances
Out[7]: 150

In [8]: n_features
Out[8]: 4

In [9]: y
Out[9]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [10]: iris.feature_names
Out[10]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [11]: X[:3]
Out[11]:
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2]])


### Performance Metrics

There are many different kinds of performance metrics: accuracy, precision, and recall, to name a few.

### Accuracy

Accuracy is particularly convenient because it is easy to understand and often applicable.

$$\text{accuracy} = \frac{\text{\# correctly predicted instances}}{\text{\# instances}}$$

By contrast, precision and recall are defined with respect to a single positive class, so they are awkward to use directly when classifying, say, three species of iris flowers.
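The accuracy formula can be sketched in a few lines of plain Python. The function name `accuracy` and the six labels below are made up for illustration; they are not taken from the Iris session above.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one instance")
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    return n_correct / len(y_true)


# Hypothetical labels for six flowers (0, 1, 2 stand for the three species).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(accuracy(y_true, y_pred))  # 5 of the 6 predictions are correct
```

The same definition works unchanged for any number of classes, which is part of why accuracy is so often applicable.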

## Voting

### Better Explained by Examples

$\;\texttt{04.02.3weak1strong.ipynb}$

### A Biased Coin Toss

$51\%$ Head, $49\%$ Tail

$$\begin{aligned} P(\text{\# Head} > \text{\# Tail}, \; 1000\;\text{tosses}) &\approx 75\% \\ P(\text{\# Head} > \text{\# Tail}, \; 10000\;\text{tosses}) &\approx 97\% \end{aligned}$$

### Or, Is It Even True?

$\;\texttt{04.05.coin\_toss\_monte\_carlo.ipynb}$
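A Monte Carlo check of this kind might look like the sketch below. It is not the content of the notebook; the function name, the number of trials, and the seed are all assumptions chosen for illustration, and only the standard library is used.

```python
import random


def estimate_majority_probability(p_head, n_tosses, n_trials, seed=0):
    """Estimate P(# Head > # Tail) by simulating n_trials runs of n_tosses each."""
    rng = random.Random(seed)
    n_wins = 0
    for _ in range(n_trials):
        n_heads = sum(rng.random() < p_head for _ in range(n_tosses))
        if n_heads > n_tosses - n_heads:  # strict majority of heads
            n_wins += 1
    return n_wins / n_trials


estimate = estimate_majority_probability(p_head=0.51, n_tosses=1000, n_trials=2000)
print(f"estimated P(# Head > # Tail, 1000 tosses) = {estimate:.3f}")
```

With only a few thousand trials the estimate fluctuates by about a percentage point from seed to seed, so it can confirm the order of magnitude but not the second digit.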

### Remember Back from Our High School?

$$\begin{aligned} &\left(x + y\right)^{n} \\ &= \binom{n}{0} x^{0} y^{n} + \binom{n}{1} x^{1} y^{n-1} + \cdots + \binom{n}{n} x^{n} y^{0} \\ &= \sum_{k=0}^{n} \binom{n}{k} x^{k} y^{n-k} \end{aligned}$$

### In Our Case

$$\begin{aligned} 1 &= \left(p + (1-p)\right)^{n} \\ &= \sum_{k=0}^{n} \binom{n}{k} p^{k} (1-p)^{n-k} \end{aligned}$$

### More Precisely

$$\begin{aligned} &P(\text{\# Head} > \text{\# Tail}, \; n\;\text{tosses}) \\ &= \sum_{k=\lfloor \frac{n}{2} \rfloor + 1}^{n} \binom{n}{k} p^{k} (1-p)^{n-k} \end{aligned}$$
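The sum can be evaluated directly. The sketch below is one possible way to do it with the standard library only: it works with logarithms (via `math.lgamma`) so that the huge binomial coefficients never overflow. The function names are assumptions, and it requires $0 < p < 1$.

```python
import math


def log_binomial_pmf(n, k, p):
    """log of C(n, k) * p**k * (1 - p)**(n - k), computed via lgamma to avoid overflow."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1 - p)


def majority_probability(n, p):
    """P(# Head > # Tail) in n tosses: the sum over k = floor(n/2) + 1, ..., n."""
    return math.fsum(
        math.exp(log_binomial_pmf(n, k, p)) for k in range(n // 2 + 1, n + 1)
    )


for n in (1_000, 10_000):
    print(n, round(majority_probability(n, 0.51), 4))
```

Note that the sum starts at $\lfloor n/2 \rfloor + 1$, so a tie does not count as a win for heads. For $n = 1000$ a tie is likely enough that including or excluding it shifts the result by a couple of percentage points.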

## 👉 👈

### Let's Calculate Something Simpler

When $p = 0.5$, we have

$$P(\text{\# Head} > \text{\# Tail}, \; n\;\text{tosses}) = \begin{cases} \frac{1}{2} &\text{if } n \text{ is odd} \\ \frac{1-\varepsilon}{2} &\text{if } n \text{ is even} \end{cases}$$ where $\;\varepsilon = \binom{n}{\lfloor \frac{n}{2} \rfloor} \left(\frac{1}{2}\right)^{n}$ is the probability of a tie.
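The closed form for $p = 0.5$ can be checked against the term-by-term sum using exact rational arithmetic. This is a small sketch with hypothetical function names, using only the standard library.

```python
from fractions import Fraction
from math import comb


def majority_probability_fair(n):
    """Exact P(# Head > # Tail) for a fair coin, summed term by term."""
    return sum(Fraction(comb(n, k), 2**n) for k in range(n // 2 + 1, n + 1))


def epsilon(n):
    """Probability of an exact tie in n tosses (n even): C(n, n/2) / 2**n."""
    return Fraction(comb(n, n // 2), 2**n)


for n in (5, 6, 101, 100):
    closed_form = Fraction(1, 2) if n % 2 else (1 - epsilon(n)) / 2
    # The brute-force sum and the closed form agree exactly.
    assert majority_probability_fair(n) == closed_form
    print(n, float(closed_form))
```

Because `Fraction` keeps every value exact, the equality check is a genuine identity test rather than a floating-point comparison.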

### Having Boosted Our Confidence

Let's go back to the case $p=0.51\;.$

At least we can see that

• $P(\text{\# Head} > \text{\# Tail}, \; n\;\text{tosses})$ equals the latter half of the binomial sum (the terms with $k > n/2$)
• $(\text{the latter half}) > (\text{the former half})$, because $p > 1-p$ makes each term with $k > n/2$ heads larger than its mirror term with $n-k$ heads
• $\lim_{n\to\infty} \varepsilon = 0$
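One way to see the last point, for even $n$, is Stirling's approximation $n! \approx \sqrt{2\pi n}\,(n/e)^{n}$. The following is a sketch of the estimate, not a rigorous proof:

$$\begin{aligned} \varepsilon = \binom{n}{n/2} \left(\frac{1}{2}\right)^{n} &= \frac{n!}{\left((n/2)!\right)^{2}\, 2^{n}} \\ &\approx \frac{\sqrt{2\pi n}\,(n/e)^{n}}{\pi n\,\left(n/(2e)\right)^{n}\, 2^{n}} \\ &= \sqrt{\frac{2}{\pi n}} \;\xrightarrow{\,n\to\infty\,}\; 0 \end{aligned}$$

So the tie probability shrinks like $1/\sqrt{n}$: slowly, but it does go to zero.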

### But Still...

I still don't see how this could become $97\%$.

### Let's Go to

$\;\texttt{04.12.half\_binomial\_sum.ipynb}$

### Two Flavors: Hard and Soft

Consider the famous Dog/Cat binary classification problem.

### Hard Voting

| | instance 1 | instance 2 | instance 3 |
|---|---|---|---|
| estimator 1 | Dog | Cat | Cat |
| estimator 2 | Dog | Dog | Cat |
| estimator 3 | Cat | Cat | Dog |
| decision | Dog | Cat | Cat |
| ground truth | Dog | Cat | Dog |
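The decision row of the table above can be reproduced with a plain majority vote. This is a minimal sketch: the function name `hard_vote` is hypothetical, and it assumes an odd number of estimators and two classes, so ties cannot occur.

```python
from collections import Counter

# Predictions from the table above: one row per estimator, one column per instance.
predictions = [
    ["Dog", "Cat", "Cat"],  # estimator 1
    ["Dog", "Dog", "Cat"],  # estimator 2
    ["Cat", "Cat", "Dog"],  # estimator 3
]


def hard_vote(predictions):
    """Return the most common label per instance (majority vote across estimators)."""
    decisions = []
    for votes in zip(*predictions):  # all estimators' votes for a single instance
        label, _count = Counter(votes).most_common(1)[0]
        decisions.append(label)
    return decisions


print(hard_vote(predictions))
```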

### Soft Voting $P(X_{i} = \text{Dog})$

| | instance 1 | instance 2 | instance 3 |
|---|---|---|---|
| estimator 1 | 0.87 (Dog) | 0.30 (Cat) | 0.47 (Cat) |
| estimator 2 | 0.90 (Dog) | 0.51 (Dog) | 0.48 (Cat) |
| estimator 3 | 0.93 (Dog) | 0.12 (Cat) | 0.58 (Dog) |
| average | 0.90 (Dog) | 0.31 (Cat) | 0.51 (Dog) |
| ground truth | Dog | Cat | Dog |
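Soft voting averages the predicted probabilities before deciding. The sketch below recomputes the average row from the table above; the function name `soft_vote` and the `threshold` parameter are assumptions made for illustration.

```python
# P(instance is Dog) from the table above: one row per estimator.
dog_probabilities = [
    [0.87, 0.30, 0.47],  # estimator 1
    [0.90, 0.51, 0.48],  # estimator 2
    [0.93, 0.12, 0.58],  # estimator 3
]


def soft_vote(dog_probabilities, threshold=0.5):
    """Average P(Dog) across estimators, then label each instance by the average."""
    n_estimators = len(dog_probabilities)
    averages = [sum(column) / n_estimators for column in zip(*dog_probabilities)]
    labels = ["Dog" if avg > threshold else "Cat" for avg in averages]
    return averages, labels


averages, labels = soft_vote(dog_probabilities)
print([round(avg, 2) for avg in averages], labels)
```

On instance 3 the single confident estimator (0.58) outweighs the two that lean only slightly towards Cat, which is exactly where soft voting departs from hard voting.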

### Which Is Better? Soft or Hard?

$\;\texttt{04.16.agerons\_example.ipynb}$

## Bagging and Pasting

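i.e. sampling the training set with replacement (bagging) or without replacement (pasting)

Statisticians call sampling with replacement bootstrapping; bagging is short for bootstrap aggregating.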

### With Replacement? How?

Can the same sample appear more than once in a training set?

Wouldn't it be over-represented?

### sklearn Implementation

"Yes, just let them appear more than once."

### Proof?

Look at the source code in, e.g.,


~/miniconda3/envs/ensemble_learning/lib/python3.10/site-packages/sklearn/ensemble/_bagging.py

if you use Miniconda like me.

def _generate_indices(random_state, bootstrap, n_population, n_samples):
    """Draw randomly sampled indices."""
    # Draw sample indices
    if bootstrap:
        indices = random_state.randint(0, n_population, n_samples)
    else:
        indices = sample_without_replacement(
            n_population, n_samples, random_state=random_state
        )

    return indices

# (a separate excerpt, from further down the same file)
# Draw samples, using sample weights, and then fit
if support_sample_weight:
    if sample_weight is None:
        curr_sample_weight = np.ones((n_samples,))
    else:
        curr_sample_weight = sample_weight.copy()

    if bootstrap:
        sample_counts = np.bincount(indices, minlength=n_samples)
        curr_sample_weight *= sample_counts
    else:
        ...  # branch body not shown in this excerpt

    estimator_fit(X[:, features], y, sample_weight=curr_sample_weight)

else:
    estimator_fit(X[indices][:, features], y[indices])
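To make the bootstrap branch concrete, here is a small standard-library sketch of what happens to the drawn indices. It is not sklearn code: `random.Random` and `Counter` stand in for `random_state.randint` and `np.bincount`, and the population size and seed are arbitrary.

```python
import random
from collections import Counter

n_samples = 10
rng = random.Random(42)

# Bootstrap: draw n_samples indices with replacement, so repeats are expected.
indices = [rng.randrange(n_samples) for _ in range(n_samples)]

# Counting how often each index was drawn gives one integer weight per instance,
# playing the role of np.bincount(indices, minlength=n_samples) in the excerpt.
counts = Counter(indices)
sample_weight = [counts[i] for i in range(n_samples)]

print("indices:      ", sorted(indices))
print("sample_weight:", sample_weight)
```

An instance drawn three times gets weight 3, and an instance never drawn gets weight 0. When the estimator does not support sample weights, the excerpt indexes with the repeated indices instead, so the duplicates appear as literal repeated rows.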


### BaggingClassifier from sklearn

class BaggingClassifier(ClassifierMixin, BaseBagging):
    def __init__(
        self,
        base_estimator=None,
        n_estimators=10,
        *,
        max_samples=1.0,
        max_features=1.0,
        bootstrap=True,
        bootstrap_features=False,
        oob_score=False,
        warm_start=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
    ):


### Out-of-Bag, aka oob, Instances

"Only about $63\%$ of the training instances are sampled on average"

$63\%$? Where does this number come from?

### Ambiguity

The quote about $63\%$ seems a bit unclear.

Sampling involves two numbers:

1. Population size $m$: the number of (distinct) objects from which we sample
2. Sample size $k$: the number of draws

The oob ratio clearly depends on the sample size $k$. So a plausible origin for the $63\%$ is the case $k = m$, writing "noob" for "not out-of-bag": $$k = m \implies \lim_{m\to \infty} \left(\text{noob ratio}\right) = 1 - \frac{1}{e} \approx 63\%$$

### A Quick Python Experiment

In [1]: import numpy as np

In [2]: def get_noob_ratio(m=100, k=100):
...:     """
...:     noob stands for "Not Out-Of-Bag"
...:     """
...:     population = np.arange(m)
...:     sampled = np.random.choice(population, size=k, replace=True)
...:     ratio = len(np.unique(sampled)) / m
...:     return ratio
...:

In [3]: get_noob_ratio()
Out[3]: 0.62

In [4]: get_noob_ratio(999_999, 999_999)
Out[4]: 0.6321746321746322

In [5]: 1 - 1/np.e
Out[5]: 0.6321205588285577


### Mathematically

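With $m$ instances, a single draw misses a given instance with probability $1 - \frac{1}{m}$, so after $k$ independent draws that instance is still out of bag with probability $\left( 1 - \frac{1}{m} \right)^{k}$. In particular, when $k = m$, we have $$\lim_{m\to\infty} \left( 1 - \frac{1}{m} \right)^{m} = \frac{1}{\lim_{m\to\infty} \left( 1 + \frac{1}{m} \right)^{m}} = \frac{1}{\;e\;} \approx 37\%\,,$$ which leaves about $63\%$ of the instances in the bag.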
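The middle equality is not immediate. One way to justify it (a sketch) is to rewrite the base and split off a single factor:

$$\left( 1 - \frac{1}{m} \right)^{m} = \left( \frac{m-1}{m} \right)^{m} = \frac{1}{\left( 1 + \frac{1}{m-1} \right)^{m-1} \left( 1 + \frac{1}{m-1} \right)} \;\xrightarrow{\,m\to\infty\,}\; \frac{1}{e \cdot 1}$$

The first factor in the denominator tends to $e$ and the second tends to $1$.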

### oob, What for?

These oob instances can be used, e.g., to evaluate model performance without having to set aside a separate validation set beforehand.
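The bookkeeping behind this idea can be sketched with the standard library alone. The function name `bootstrap_split` is hypothetical; in sklearn the corresponding switch is the `oob_score` argument shown in the `BaggingClassifier` signature.

```python
import random


def bootstrap_split(n_instances, rng):
    """Draw n_instances indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n_instances) for _ in range(n_instances)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_instances) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(1000, rng)
# An estimator would be trained on in_bag and scored on out_of_bag,
# i.e. on instances it has never seen during training.
print(len(set(in_bag)), "distinct in-bag instances,", len(out_of_bag), "out-of-bag")
```

Each estimator in the ensemble gets its own split, so every estimator has its own private set of unseen instances to be scored on.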

### Random Subspaces and Random Patches

What we did before to instances can equally be applied to features!

That is, we can also train each estimator on only a sampled subset of the features, and that sampling can be bootstrapped or not. Sampling features while keeping all training instances is called random subspaces; sampling both instances and features is called random patches.
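Feature sampling can be sketched the same way as instance sampling. The helper below is hypothetical; its parameter names mirror `max_features` and `bootstrap_features` from the `BaggingClassifier` signature, but here `max_features` is simply an integer count.

```python
import random


def sample_feature_indices(n_features, max_features, bootstrap_features, rng):
    """Pick max_features column indices, with or without replacement."""
    if bootstrap_features:
        return [rng.randrange(n_features) for _ in range(max_features)]
    return rng.sample(range(n_features), max_features)


rng = random.Random(0)
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]
for _ in range(3):  # one feature subset per estimator
    chosen = sample_feature_indices(len(feature_names), 2, False, rng)
    print([feature_names[j] for j in chosen])
```

Each estimator then sees only its own columns of X, which makes the estimators more diverse.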

## Q&A

1. (Slide 5.7) Why are there two "class args"?

   class BaggingClassifier(ClassifierMixin, BaseBagging)

2. (Slide 5.12) Why $\lim_{m\to\infty} \left( 1 - \frac{1}{m} \right)^{m} \overset{?}{=} \frac{1}{\lim_{m\to\infty} \left( 1 + \frac{1}{m} \right)^{m}}$
3. (Slide 4.10) Why $\lim_{n\to\infty} \varepsilon \overset{?}{=} 0$
4. ($\texttt{04.02.3weak1strong.ipynb}$) The predict method of the Weak class runs very, very slowly. Is there any way to reduce its time complexity?

## References

• Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow