PAC results based on Hoeffding's and Bernstein's inequalities are the bread and butter of machine learning theory. In this post, we'll see how to use Hoeffding's inequality to derive agnostic PAC bounds with an error rate that decreases as $O(1/\sqrt{n})$ in the number of training samples $n$. Then, we'll see that in the realizable setting we can use Bernstein's inequality to improve this to $O(1/n)$.
Supervised Learning
Consider the task of regression with input space $\mathcal{X}$ and output space $\mathcal{Y}$. We have access to a model class (also known as a hypothesis family) $\mathcal{F} \subseteq \{f : \mathcal{X} \rightarrow \mathcal{Y}\}$. We will assume $\mathcal{F}$ to be finite for simplicity. There exists some unknown data distribution $D$ over $\mathcal{X} \times \mathcal{Y}$ that is generating examples. Of course, we don't have access to this data distribution, but we do have access to some samples from it. Let $S = \{(x_i, y_i)\}_{i=1}^n$ be $n$ independent, identically distributed (i.i.d.) samples from the data distribution, i.e., $(x_i, y_i) \sim D$, and the probability of observing a sample $(x, y)$ is given by $D(x, y)$.
In order to evaluate the performance of a model, we will need a loss function. Let $\ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0, 1]$ be a bounded loss function. Given a datapoint $(x, y)$ and a model $f \in \mathcal{F}$, the loss associated with our prediction is given by $\ell(f(x), y)$. For now, we will not make any further assumptions about $\ell$.
Our aim is to find a function that is as good as possible. What is a good function? Well, one that has the least expected loss over the data distribution, i.e., its generalization error $L(f) := \mathbb{E}_{(x, y) \sim D}\left[\ell(f(x), y)\right]$ is as low as possible. The only signal we have for finding such an $f$ is the training data $S$. We will define the training error as $\hat{L}_S(f) := \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$. The aim of machine learning theory is to (i) allow us to find a "good" $f$, and (ii) bound the performance of the predicted model in terms of $n$, $|\mathcal{F}|$, and other relevant parameters. The first result we will look at is the Occam's Razor bound, and then we will see how to improve this result.
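To make this setup concrete, here is a minimal sketch in Python. The specific distribution, the threshold predictor, and the zero-one loss are illustrative assumptions of mine, not part of the setup above; the only point is the distinction between the training error $\hat{L}_S(f)$ and the generalization error $L(f)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    """Draw n i.i.d. samples from an illustrative distribution D:
    x ~ Uniform[0, 1] and y = 1{x >= 0.3}, with labels flipped w.p. 0.1."""
    x = rng.uniform(size=n)
    y = (x >= 0.3).astype(float)
    flip = rng.uniform(size=n) < 0.1
    y[flip] = 1.0 - y[flip]
    return x, y

def loss(y_pred, y):
    """Zero-one loss, bounded in [0, 1]."""
    return (y_pred != y).astype(float)

def f(x, threshold=0.5):
    """One member of a finite class of threshold predictors."""
    return (x >= threshold).astype(float)

# Training error L_hat_S(f) on a small sample, and a Monte Carlo
# approximation of the generalization error L(f) from a large fresh sample.
x_tr, y_tr = sample_data(100)
x_big, y_big = sample_data(500_000)
print(f"training error        : {loss(f(x_tr), y_tr).mean():.3f}")
print(f"approx. generalization: {loss(f(x_big), y_big).mean():.3f}")
```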
Occam’s Razor Bound
One of the first results taught in a machine learning theory class is the Occam's Razor bound. The formal result is given in the theorem below. The Occam's Razor bound suggests using the model that minimizes the error on the observed sample, i.e., the solution of the following optimization: $\hat{f}_S := \arg\min_{f \in \mathcal{F}} \hat{L}_S(f)$. This solution is called the empirical risk minimizer (ERM) or the ERM solution. We will drop the subscript $S$ from the notation when it is clear from the context.
Theorem (Occam's Razor Bound): Let $(\mathcal{X}, \mathcal{Y}, D, \ell)$ be a supervised learning problem with input space $\mathcal{X}$, output space $\mathcal{Y}$, data distribution $D$, and loss function $\ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0, 1]$. We are given $n$ i.i.d. samples $S = \{(x_i, y_i)\}_{i=1}^n$ from the data distribution $D$. Let $\mathcal{F}$ be a finite collection of models from $\mathcal{X}$ to $\mathcal{Y}$. Fix any $\delta \in (0, 1)$; then with probability at least $1 - \delta$ over samples $S$ drawn i.i.d. from $D$, we have:

$$\forall f \in \mathcal{F}, \qquad \left|L(f) - \hat{L}_S(f)\right| \le \sqrt{\frac{\ln\left(2|\mathcal{F}|/\delta\right)}{2n}},$$

and

$$L(\hat{f}_S) \le \min_{f \in \mathcal{F}} L(f) + 2\sqrt{\frac{\ln\left(2|\mathcal{F}|/\delta\right)}{2n}}.$$
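The deviation term in the theorem is straightforward to evaluate. Below is a small helper (the function name is my own) that computes it under the assumptions of the theorem, namely a loss bounded in $[0, 1]$ and a finite model class:

```python
import math

def occam_bound(n, num_models, delta):
    """Uniform deviation sqrt(ln(2|F|/delta) / (2n)) from the Occam's Razor bound."""
    return math.sqrt(math.log(2 * num_models / delta) / (2 * n))

# With n = 10,000 samples, |F| = 1,000 models, and delta = 0.05, every model's
# training error is within about 0.023 of its generalization error.
print(round(occam_bound(n=10_000, num_models=1_000, delta=0.05), 4))
```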
Proof: Fix any model $f$ in the model family $\mathcal{F}$. Let $Z_i := \ell(f(x_i), y_i)$. Then observe that $Z_1, \cdots, Z_n$ are i.i.d. random variables. They are random variables with respect to the sample $S$, which is randomly chosen. Further, $\hat{L}_S(f) = \frac{1}{n}\sum_{i=1}^n Z_i$ and

$$\mathbb{E}\left[\hat{L}_S(f)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[Z_i\right] = L(f),$$

where the first equality follows from linearity of expectation and the second equality follows from the i.i.d. assumption. Then using Hoeffding's inequality:

$$P\left(\left|\hat{L}_S(f) - L(f)\right| \ge \varepsilon\right) \le 2\exp\left(-2n\varepsilon^2\right).$$

Setting $\delta = 2\exp(-2n\varepsilon^2)$ and solving for $\varepsilon$, we get:

$$\left|\hat{L}_S(f) - L(f)\right| \le \sqrt{\frac{\ln(2/\delta)}{2n}}$$

with probability at least $1 - \delta$. Let $E_f$ be the event that the above inequality holds; then we have $P(E_f) \ge 1 - \delta$, or in other words, $P(E_f^c) \le \delta$ ($E_f^c$ represents the complement of the measurable set $E_f$). Even though we derived this result for a fixed $f$, we never used any property of $f$, and therefore it holds for every $f \in \mathcal{F}$. Therefore, we can bound the probability of $\bigcup_{f \in \mathcal{F}} E_f^c$ using the union bound:

$$P\left(\bigcup_{f \in \mathcal{F}} E_f^c\right) \le \sum_{f \in \mathcal{F}} P\left(E_f^c\right) \le |\mathcal{F}|\,\delta.$$
This means with probability at least $1 - |\mathcal{F}|\delta$ we have:

$$\forall f \in \mathcal{F}, \qquad \left|\hat{L}_S(f) - L(f)\right| \le \sqrt{\frac{\ln(2/\delta)}{2n}}.$$

Replacing $\delta$ by $\delta/|\mathcal{F}|$ proves the first result. Let $\hat{f}$ be the ERM solution and fix any $f$ in $\mathcal{F}$. Then:

$$L(\hat{f}) \le \hat{L}_S(\hat{f}) + \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2n}} \qquad \text{(from Hoeffding's inequality)}$$

$$\le \hat{L}_S(f) + \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2n}} \qquad \text{(from the definition of ERM)}$$

$$\le L(f) + 2\sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2n}} \qquad \text{(from Hoeffding's inequality)}$$

This implies, for any $f \in \mathcal{F}$, with probability at least $1 - \delta$:

$$L(\hat{f}) \le L(f) + 2\sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2n}}.$$

As this holds for any $f \in \mathcal{F}$, we can pick the $f$ with the smallest value of $L(f)$. This proves the second result. $\blacksquare$
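As a sanity check on the first step of the proof, here is a small simulation sketch for a single fixed model: the per-sample losses are i.i.d. values in $[0, 1]$, and the deviation $\sqrt{\ln(2/\delta)/(2n)}$ should be violated on at most a $\delta$ fraction of resampled training sets. The Bernoulli losses and the value of $L(f)$ are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 500, 0.05, 20_000
true_risk = 0.3                                  # L(f): mean of the per-sample loss
eps = np.sqrt(np.log(2 / delta) / (2 * n))       # Hoeffding deviation

# Each trial draws a fresh sample S and checks whether |L_hat_S(f) - L(f)| > eps.
losses = rng.binomial(1, true_risk, size=(trials, n))   # Z_i in {0, 1}
empirical_risk = losses.mean(axis=1)                    # L_hat_S(f) per trial
violation_rate = (np.abs(empirical_risk - true_risk) > eps).mean()

# Hoeffding's inequality guarantees the violation rate is at most delta = 0.05.
print(f"eps = {eps:.4f}, observed violation rate = {violation_rate:.4f}")
```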
There are a couple of things that stand out in this result (see the short numerical sketch after this list):
- More training data helps: Formally, the bound improves as $O(1/\sqrt{n})$ in the sample size $n$. Informally, this means that as we get more training data, we learn a model whose generalization error has a better upper bound.
- A complex model class can be undesirable: The bound becomes weaker as the model family $\mathcal{F}$ becomes larger. This is an important lesson in model regularization: if we have two model classes $\mathcal{F}_1$ and $\mathcal{F}_2$ that have the same training error on a sample $S$, and $\mathcal{F}_2$ is a more complex class than $\mathcal{F}_1$, i.e., $|\mathcal{F}_2| > |\mathcal{F}_1|$, then the bound on the generalization error for the latter will be weaker. Note that the result does not say that the larger model class is always bad. We can get better bounds with a more complex model class provided we can appropriately improve the training error.
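A quick numerical illustration of both observations; the bound helper is restated so the snippet is self-contained, and the particular values of $n$, $|\mathcal{F}|$, and $\delta$ are arbitrary:

```python
import math

def occam_bound(n, num_models, delta=0.05):
    """Uniform deviation sqrt(ln(2|F|/delta) / (2n))."""
    return math.sqrt(math.log(2 * num_models / delta) / (2 * n))

# More data helps: the bound shrinks as O(1/sqrt(n)).
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7}, |F| = 1000      -> bound = {occam_bound(n, 1_000):.4f}")

# A larger class hurts only logarithmically in |F|.
for size in (10, 10_000, 10**8):
    print(f"n = 10000,  |F| = {size:>9} -> bound = {occam_bound(10_000, size):.4f}")
```

Note that the dependence on $|\mathcal{F}|$ is only through $\ln|\mathcal{F}|$, which is why increasing the class size by several orders of magnitude changes the bound far less than a tenfold increase in data.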
Improving Occam’s Razor Bound with Bernstein’s Inequality
Bernstein's inequality allows us to incorporate the variance of the random variables into the bound, thereby allowing us to get tighter bounds in low-variance settings.
Theorem (Bernstein's Inequality): Let $Z_1, \cdots, Z_n$ be $n$ i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Further, let $|Z_i - \mu| \le M$ almost surely, for all $i$. Then:

$$P\left(\left|\frac{1}{n}\sum_{i=1}^n Z_i - \mu\right| \ge \varepsilon\right) \le 2\exp\left(-\frac{n\varepsilon^2}{2\sigma^2 + \frac{2}{3}M\varepsilon}\right).$$
If the variance were negligible, we would be left with roughly $2\exp\left(-\frac{3n\varepsilon}{2M}\right)$ on the right-hand side, so $\varepsilon$ would scale as $O(1/n)$ rather than $O(1/\sqrt{n})$. This bound, if we repeat what we did for the Occam's Razor bound, will give us an $O\left(\frac{\ln(|\mathcal{F}|/\delta)}{n}\right)$ bound.
Before we proceed to derive an $O(1/n)$ bound in the realizable setting, we will first state Bernstein's inequality in a more familiar form. Setting the right-hand side to $\delta$ and solving the resulting quadratic equation in $\varepsilon$ gives us: with probability at least $1 - \delta$ we have:

$$\left|\frac{1}{n}\sum_{i=1}^n Z_i - \mu\right| \le \frac{\frac{2}{3}M\ln(2/\delta) + \sqrt{\frac{4}{9}M^2\ln^2(2/\delta) + 8n\sigma^2\ln(2/\delta)}}{2n}.$$

We now apply the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ to get:

$$\left|\frac{1}{n}\sum_{i=1}^n Z_i - \mu\right| \le \sqrt{\frac{2\sigma^2\ln(2/\delta)}{n}} + \frac{2M\ln(2/\delta)}{3n}.$$
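To see numerically why the variance term matters, the sketch below compares the Hoeffding deviation with this Bernstein deviation for variables bounded in $[0, 1]$ (so $M \le 1$); the specific values of $n$, $\delta$, and $\sigma^2$ are arbitrary:

```python
import math

def hoeffding_dev(n, delta):
    """Hoeffding deviation for i.i.d. variables bounded in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

def bernstein_dev(n, delta, var, M=1.0):
    """Bernstein deviation sqrt(2*var*ln(2/delta)/n) + 2*M*ln(2/delta)/(3n)."""
    log_term = math.log(2 / delta)
    return math.sqrt(2 * var * log_term / n) + 2 * M * log_term / (3 * n)

n, delta = 10_000, 0.05
for var in (0.25, 0.01, 0.0001):   # 0.25 is the largest possible variance in [0, 1]
    print(f"variance = {var:<6} -> Hoeffding = {hoeffding_dev(n, delta):.5f}, "
          f"Bernstein = {bernstein_dev(n, delta, var):.5f}")
```

For the worst-case variance the two deviations are comparable, but as the variance shrinks the Bernstein deviation approaches its $O(1/n)$ lower-order term.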
Theorem (Realizable Setting): Let $(\mathcal{X}, \mathcal{Y}, D, \ell)$ be a supervised learning problem with input space $\mathcal{X}$, output space $\mathcal{Y}$, data distribution $D$, and loss function $\ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0, 1]$. Let there be an $f^\star \in \mathcal{F}$ such that $L(f^\star) = 0$. We are given $n$ i.i.d. samples $S = \{(x_i, y_i)\}_{i=1}^n$ from the data distribution $D$. Let $\mathcal{F}$ be a finite collection of models from $\mathcal{X}$ to $\mathcal{Y}$. Fix any $\delta \in (0, 1)$; then with probability at least $1 - \delta$ we have:

$$L(\hat{f}_S) \le \frac{20\ln\left(2|\mathcal{F}|/\delta\right)}{3n} = O\left(\frac{\ln(|\mathcal{F}|/\delta)}{n}\right).$$
Proof: Fix an $f \in \mathcal{F}$ and define $Z_i := \ell(f(x_i), y_i)$ as before. This time we will apply Bernstein's inequality. We have $M \le 1$ as the loss function is bounded in $[0, 1]$. Bernstein's inequality gives us:

$$\left|\hat{L}_S(f) - L(f)\right| \le \sqrt{\frac{2\sigma^2\ln(2/\delta)}{n}} + \frac{2\ln(2/\delta)}{3n},$$

with probability at least $1 - \delta$. All we need to do is bound the variance $\sigma^2$. As the loss function is bounded in $[0, 1]$, we have $Z_i^2 \le Z_i$, and therefore $\sigma^2 \le \mathbb{E}\left[Z_i^2\right] \le \mathbb{E}\left[Z_i\right] = L(f)$. This gives us $\sigma^2 \le L(f)$. Plugging this in we get:

$$\left|\hat{L}_S(f) - L(f)\right| \le \sqrt{\frac{2L(f)\ln(2/\delta)}{n}} + \frac{2\ln(2/\delta)}{3n}.$$

Applying the AM-GM inequality, we have $\sqrt{\frac{2L(f)\ln(2/\delta)}{n}} \le \frac{L(f)}{2} + \frac{\ln(2/\delta)}{n}$, giving us

$$\left|\hat{L}_S(f) - L(f)\right| \le \frac{L(f)}{2} + \frac{5\ln(2/\delta)}{3n},$$

with probability at least $1 - \delta$. Applying the union bound over all hypotheses (and replacing $\delta$ by $\delta/|\mathcal{F}|$ as before) we get:

$$\forall f \in \mathcal{F}, \qquad \left|\hat{L}_S(f) - L(f)\right| \le \frac{L(f)}{2} + \frac{5\ln(2|\mathcal{F}|/\delta)}{3n},$$

with probability at least $1 - \delta$. Let $\hat{f}$ be the ERM solution; then:

$$L(\hat{f}) \le \hat{L}_S(\hat{f}) + \frac{L(\hat{f})}{2} + \frac{5\ln(2|\mathcal{F}|/\delta)}{3n} \qquad \text{(Bernstein's inequality)}$$

$$\le \hat{L}_S(f^\star) + \frac{L(\hat{f})}{2} + \frac{5\ln(2|\mathcal{F}|/\delta)}{3n} \qquad \text{(ERM solution)}$$

$$\le \frac{3L(f^\star)}{2} + \frac{L(\hat{f})}{2} + \frac{10\ln(2|\mathcal{F}|/\delta)}{3n} \qquad \text{(Bernstein's inequality)}$$

$$= \frac{L(\hat{f})}{2} + \frac{10\ln(2|\mathcal{F}|/\delta)}{3n} \qquad \text{(realizability condition, $L(f^\star) = 0$)}$$

Combining these, we get $L(\hat{f}) \le \frac{20\ln(2|\mathcal{F}|/\delta)}{3n}$ with probability at least $1 - \delta$. Hence proved. $\blacksquare$
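To get a feel for when the realizable-case bound wins, the sketch below evaluates both final bounds side by side, using the constants derived above (the values of $|\mathcal{F}|$ and $\delta$ are arbitrary):

```python
import math

def agnostic_excess_risk_bound(n, num_models, delta):
    """Occam's Razor: L(f_hat) <= min_f L(f) + 2*sqrt(ln(2|F|/delta) / (2n))."""
    return 2 * math.sqrt(math.log(2 * num_models / delta) / (2 * n))

def realizable_risk_bound(n, num_models, delta):
    """Realizable case via Bernstein: L(f_hat) <= 20*ln(2|F|/delta) / (3n)."""
    return 20 * math.log(2 * num_models / delta) / (3 * n)

num_models, delta = 1_000, 0.05
for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}: O(1/sqrt(n)) bound = "
          f"{agnostic_excess_risk_bound(n, num_models, delta):.4f}, "
          f"O(1/n) bound = {realizable_risk_bound(n, num_models, delta):.4f}")
```

For small $n$ the larger constant makes the $O(1/n)$ bound looser, but it overtakes the $O(1/\sqrt{n})$ bound quickly as $n$ grows.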
So what is happening? We are making a realizability assumption, i.e., there is a function $f^\star \in \mathcal{F}$ such that $L(f^\star) = 0$. As the loss is lower bounded by 0, the loss $\ell(f^\star(x), y)$ is 0 almost surely, and therefore the variance of samples of its loss is also 0. Now, as our ERM solver becomes better with more data, it converges towards this $f^\star$ whose loss variance is 0. Therefore, the variance of the ERM solution's loss (which is bounded by its mean $L(\hat{f})$) becomes smaller. This helps in improving the rate of convergence.
Summary: We saw how Hoeffding's inequality gives us an $O(1/\sqrt{n})$ bound in the number of samples $n$. When we are in the realizable case, we can get an $O(1/n)$ bound using Bernstein's inequality.