Ensemble Learning

ensemble

\ɑ̃.sɑ̃bl\
  1. set, e.g. $\;\mathbb{R}, \{ 0, 1 \}$
  2. together

Part 1

Terminology

Features, Labels, etc.

Example: Iris dataset

Conventions

  • data: X
  • labels/targets: y
  • instance/sample: one piece of data

In Python's Scikit-Learn


In [1]: from sklearn.datasets import load_iris

In [2]: iris = load_iris()

In [3]: iris.target_names
Out[3]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [4]: X, y = iris.data, iris.target

In [5]: X.shape
Out[5]: (150, 4)

In [6]: n_instances, n_features = X.shape

In [7]: n_instances
Out[7]: 150

In [8]: n_features
Out[8]: 4

In [9]: y
Out[9]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [10]: iris.feature_names
Out[10]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [11]: X[:3]
Out[11]: 
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2]])
            

Performance Metrics

There are many different performance metrics.

Accuracy, precision, and recall, to name a few.

Accuracy

Accuracy is particularly convenient because it is easy to understand and often applicable.

$$\text{accuracy} = \frac{\text{\# correctly predicted instances}}{\text{\# instances}}$$

For example, precision and recall are defined per class, which makes them awkward for, say, the three-class iris problem, whereas accuracy applies directly.
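
As a quick sketch of how accuracy is usually computed in practice (the train/test split and the choice of a decision tree below are illustrative assumptions, not part of the lecture):

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Any classifier would do here; a decision tree is an arbitrary choice.
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# accuracy = (# correctly predicted instances) / (# instances)
print(accuracy_score(y_test, clf.predict(X_test)))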

Voting

Better Explained by Examples

$\;\texttt{04.02.3weak1strong.ipynb}$

A Biased Coin Toss

$51\%$ Head, $49\%$ Tail

\[\begin{aligned} P(\text{\# Head} > \text{\# Tail}, \; 1000\;\text{tosses}) &\approx 75\% \\ P(\text{\# Head} > \text{\# Tail}, \; 10000\;\text{tosses}) &\approx 97\% \end{aligned}\]

Why Is That?

Or, Is It Even True?

$\;\texttt{04.05.coin\_toss\_monte\_carlo.ipynb}$
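
For reference, a minimal Monte Carlo sketch of the claim, written with numpy; this is only a guess at the spirit of the notebook, not its actual contents:

import numpy as np

rng = np.random.default_rng(0)

def head_majority_ratio(n_tosses, p=0.51, n_trials=10_000):
    """Estimate P(# Head > # Tail) for n_tosses of a p-biased coin."""
    # Each trial draws the total number of heads in n_tosses Bernoulli(p) tosses.
    heads = rng.binomial(n=n_tosses, p=p, size=n_trials)
    return np.mean(heads > n_tosses - heads)

print(head_majority_ratio(1_000))    # compare with the ~75% figure above
print(head_majority_ratio(10_000))   # compare with the ~97% figure above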

Remember This from High School?

\[\newcommand{\nchoosek}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}} \begin{aligned} &\left(x + y\right)^{n} \\ &= \nchoosek{n}{0} x^{0} y^{n} + \nchoosek{n}{1} x^{1} y^{n-1} + \cdots + \nchoosek{n}{n} x^{n} y^{0} \\ &= \sum_{k=0}^{n} \nchoosek{n}{k} x^{k} y^{n-k} \\ \end{aligned}\]

In Our Case

\[\newcommand{\nchoosek}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}} \begin{aligned} 1 &= \left(p + (1-p)\right)^{n} \\ &= \sum_{k=0}^{n} \nchoosek{n}{k} p^{k} (1-p)^{n-k} \\ \end{aligned}\]

More Precisely

\[\newcommand{\nchoosek}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}} \begin{aligned} P(\text{\# Head} > \text{\# Tail}, \; n\;\text{tosses}) \\ = \sum_{k=\lfloor \frac{n}{2} \rfloor + 1}^{n} \nchoosek{n}{k} p^{k} (1-p)^{n-k} \\ \end{aligned}\]

But This Is Hard to Calculate

👉 👈

Let's Calculate Something Simpler

When $p = 0.5$, we have

\[\begin{aligned} &P(\text{\# Head} > \text{\# Tail}, \; n\;\text{tosses}) \\ &= \begin{cases} \frac{1}{2} &\text{if $n$ is odd} \\ \frac{1-\varepsilon}{2} &\text{if $n$ is even} \end{cases} \end{aligned}\] where $\;\varepsilon = \begin{pmatrix}n \\ \lfloor \frac{n}{2} \rfloor\end{pmatrix} \left(\frac{1}{2}\right)^{n}\,.$

Having Boosted Our Confidence

Let's go back to the case $p=0.51\;.$

At least we kind of see that

  • $P(\text{\# Head} > \text{\# Tail}, \; n\;\text{tosses})$ equals the latter half of the binomial sum
  • It is obvious that $$(\text{the latter half}) > (\text{the former half})$$ (see the quick check after this list)
  • $\lim_{n\to\infty} \varepsilon = 0$
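
A quick check for the second bullet, for odd $n$ where the two halves pair up term by term: comparing the term at $k$ with its mirror term at $n - k$,

\[\newcommand{\nchoosek}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}} \frac{\nchoosek{n}{k} p^{k} (1-p)^{n-k}}{\nchoosek{n}{n-k} p^{n-k} (1-p)^{k}} = \left(\frac{p}{1-p}\right)^{2k-n} > 1 \quad \text{when } k > \tfrac{n}{2} \text{ and } p > \tfrac{1}{2}\,.\]

Since the two halves sum to $1$ and every term of the latter half dominates its mirror image, the latter half must exceed $\frac{1}{2}$.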

But Still...

I still don't see how this could become $97\%$

Let's Go to

$\;\texttt{04.12.half\_binomial\_sum.ipynb}$
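
For reference, the sum can also be evaluated directly with scipy's binomial distribution; a minimal sketch, assuming scipy is available (not necessarily what the notebook does):

from scipy.stats import binom

def p_head_majority(n, p=0.51):
    """P(# Head > # Tail) = sum over k = floor(n/2)+1 .. n of C(n,k) p^k (1-p)^(n-k)."""
    # The survival function sf(k) = P(X > k) is exactly the tail sum we want.
    return binom.sf(n // 2, n, p)

print(p_head_majority(1_000))    # compare with the ~75% figure above
print(p_head_majority(10_000))   # compare with the ~97% figure above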

Two Flavors: Hard and Soft

Consider the famous Dog/Cat binary classification

Hard Voting

              instance 1   instance 2   instance 3
estimator 1   Dog          Cat          Cat
estimator 2   Dog          Dog          Cat
estimator 3   Cat          Cat          Dog
decision      Dog          Cat          Cat
ground truth  Dog          Cat          Dog

Soft Voting $P(X_{i} = \text{Dog})$

              instance 1   instance 2   instance 3
estimator 1   0.87 (Dog)   0.30 (Cat)   0.47 (Cat)
estimator 2   0.90 (Dog)   0.51 (Dog)   0.48 (Cat)
estimator 3   0.93 (Dog)   0.12 (Cat)   0.58 (Dog)
average       0.90 (Dog)   0.31 (Cat)   0.51 (Dog)
ground truth  Dog          Cat          Dog

Which Is Better? Soft or Hard?

$\;\texttt{04.16.agerons\_example.ipynb}$
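
For reference, here is a minimal sketch of both flavors with scikit-learn's VotingClassifier; the base estimators are arbitrary choices and not necessarily those used in the notebook:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

estimators = [
    ("lr", LogisticRegression(max_iter=1_000)),
    ("svc", SVC(probability=True)),  # soft voting needs predict_proba
    ("tree", DecisionTreeClassifier(random_state=42)),
]

# Hard voting takes the majority label; soft voting averages class probabilities,
# exactly as in the two tables above.
hard_clf = VotingClassifier(estimators, voting="hard").fit(X, y)
soft_clf = VotingClassifier(estimators, voting="soft").fit(X, y)

print(hard_clf.predict(X[:3]))
print(soft_clf.predict(X[:3]))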

Bagging and Pasting

i.e. sampling with replacement (bagging) or without replacement (pasting)

Statisticians call sampling with replacement the bootstrap.

With Replacement? How?

So the same instance can appear more than once in a training set?

Wouldn't it then be over-represented?

sklearn Implementation

"Yes, just let them appear more than once."

Proof?

The source code in, e.g.


~/miniconda3/envs/ensemble_learning/lib/python3.10/site-packages/sklearn/ensemble/_bagging.py
            
if you use Miniconda like me.

def _generate_indices(random_state, bootstrap, n_population, n_samples):
    """Draw randomly sampled indices."""
    # Draw sample indices
    if bootstrap:
        indices = random_state.randint(0, n_population, n_samples)
    else:
        indices = sample_without_replacement(
            n_population, n_samples, random_state=random_state
        )

    return indices
            
# Draw samples, using sample weights, and then fit
if support_sample_weight:
    if sample_weight is None:
        curr_sample_weight = np.ones((n_samples,))
    else:
        curr_sample_weight = sample_weight.copy()

    if bootstrap:
        sample_counts = np.bincount(indices, minlength=n_samples)
        curr_sample_weight *= sample_counts
    else:
        not_indices_mask = ~indices_to_mask(indices, n_samples)
        curr_sample_weight[not_indices_mask] = 0

    estimator_fit(X[:, features], y, sample_weight=curr_sample_weight)

else:
    estimator_fit(X[indices][:, features], y[indices])
            

BaggingClassifier from sklearn

class BaggingClassifier(ClassifierMixin, BaseBagging):
    def __init__(
        self,
        base_estimator=None,
        n_estimators=10,
        *,
        max_samples=1.0,
        max_features=1.0,
        bootstrap=True,
        bootstrap_features=False,
        oob_score=False,
        warm_start=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
    ):
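
As a minimal usage sketch (the base estimator and parameter values below are arbitrary): bootstrap=True gives bagging, bootstrap=False gives pasting.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# bootstrap=True  -> bagging  (sampling with replacement)
# bootstrap=False -> pasting  (sampling without replacement)
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    random_state=42,
).fit(X, y)

print(bagging.predict(X[:3]))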
            

Out-of-Bag, aka oob, Instances

"Only about $63\%$ of the training instances are sampled on average"

$63\%$? Where does this number come from?

Ambiguity

The quote about $63\%$ is a bit ambiguous.

If we think about it for a moment, sampling involves two numbers:

  1. Population $m$: the number of (distinct) objects from which we do our sampling
  2. Sampling size $k$

One Plausible Answer

Obviously the oob ratio will depend on the sampling size $k$. So a plausible answer for the origin of $63\%$ is $$k = m \implies \lim_{m\to \infty} \left(\text{noob ratio}\right) = 63\%$$ where "noob" (not-out-of-bag) denotes the fraction of distinct instances that do get sampled.

A Quick Python Experiment

In [1]: import numpy as np

In [2]: def get_noob_ratio(m=100, k=100):
   ...:     """
   ...:     noob stands for "Not Out-Of-Bag"
   ...:     """
   ...:     population = np.arange(m)
   ...:     sampled = np.random.choice(population, size=k, replace=True)
   ...:     ratio = len(np.unique(sampled)) / m
   ...:     return ratio
   ...: 

In [3]: get_noob_ratio()
Out[3]: 0.62

In [4]: get_noob_ratio(999_999, 999_999)
Out[4]: 0.6321746321746322

In [5]: 1 - 1/np.e
Out[5]: 0.6321205588285577
            

Mathematically

Of course, people on the Internet have been curious about this before you: https://math.stackexchange.com/questions/489772/probability-of-sampling-with-and-without-replacement

In particular, when $k = m$, the probability that a given instance is never drawn in the $m$ draws is $\left(1 - \frac{1}{m}\right)^{m}$, and \[ \lim_{m\to\infty} \left( 1 - \frac{1}{m} \right)^{m} = \frac{1}{\lim_{m\to\infty} \left( 1 + \frac{1}{m} \right)^{m}} = \frac{1}{\;e\;} \approx 37\% \] so the expected fraction of distinct instances that do get sampled tends to $1 - \frac{1}{e} \approx 63\%$.

oob, What for?

These oob instances can, e.g., be used to evaluate model performance without having to set aside a separate validation set beforehand.
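
A minimal sketch, again with arbitrary parameters: when oob_score=True (and bootstrap=True), scikit-learn evaluates each instance only with the estimators that never saw it and exposes the aggregate result as oob_score_.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # oob instances only exist when sampling with replacement
    oob_score=True,
    random_state=42,
).fit(X, y)

print(clf.oob_score_)  # an accuracy estimate, no separate validation set needed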

Random Subspaces and Random Patches

If you think about it, you might realize that what we did before to instances could be equally applied to features!

That is, we could also sample only a subset of the features to train each estimator, and we could again choose to (or not to) bootstrap that sampling. Sampling features only (keeping all instances) is called the random subspaces method; sampling both instances and features is called random patches.
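
In BaggingClassifier terms this is controlled by max_features and bootstrap_features; a minimal sketch with arbitrary parameter values:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random subspaces: keep all instances, sample only the features.
subspaces = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=False, max_samples=1.0,           # every instance, exactly once
    bootstrap_features=True, max_features=0.5,  # half of the features, resampled
    random_state=42,
).fit(X, y)

# Random patches: sample instances and features alike.
patches = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True, max_samples=0.8,
    bootstrap_features=True, max_features=0.5,
    random_state=42,
).fit(X, y)

print(subspaces.predict(X[:3]))
print(patches.predict(X[:3]))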

Random Forests

Boosting

AdaBoost

Gradient Boosting

XGBoost

https://xgboost.readthedocs.io/en/stable/

LightGBM

https://lightgbm.readthedocs.io/en/v3.3.2/

Stacking

Part 2

Q&A

  1. (Slide 5.7) Why are there two "class args"?

    class BaggingClassifier(ClassifierMixin, BaseBagging)

  2. (Slide 5.12) Why \[ \lim_{m\to\infty} \left( 1 - \frac{1}{m} \right)^{m} \overset{?}{=} \frac{1}{\lim_{m\to\infty} \left( 1 + \frac{1}{m} \right)^{m}} \]
  3. (Slide 4.10) Why \[ \lim_{n\to\infty} \varepsilon \overset{?}{=} 0 \]
  4. ($\texttt{04.02.3weak1strong.ipynb}$) The predict method of the Weak class runs very, very slowly. Is there any way to reduce its time complexity?

References

  • Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow

Thank You!