This blog is a summary of the first part of the course ECE543@UIUC. More materials can be found here.
Problem setup
In this blog, we will focus on the binary classification problem, where the input is , and the label is . A set of i.i.d. training samples are drawn from a joint distribution defined over . Sometimes, we may also denote and . A hypothesis is a mapping . Moreover, denotes the hypotheses class, which consists of all the possible hypotheses that our model (e.g. a neural network) can represent. The learning algorithm is denoted by , which maps a dataset to a hypothesis .
Given a hypothesis , we define its true risk as and the empirical risk on as
PAC-learnability
After applying the learning algorithm to the dataset , we get a hypothesis which depends on and minimizes the empirical risk . We care about the difference between the true risk of and that of . Here, is the best hypothesis in the sense of minimizing the true risk, i.e. . In other words, we want to get some guarantees on .
However, is a random variable since depends on which is random. Therefore, it is not reasonable to look for a uniform upper bound on . In the PAC (Probably Approximately Correct) learning framework, instead of a uniform upper bound, we care about guarantees of the form Here, “Probably” is characterized by , and “Approximately” is characterized by . Next, we give the definition of PAC-learnability of a hypothesis class.
Definition 1 (PAC-learnability): A hypothesis class is PAC-learnable if there exists an algorithm such that for all data distribution and all , such that given any training set which contains at least samples, satisfies
Theorem 1 (No free lunch): This theorem shows that we need some restrictions on the hypothesis class. We cannot learn an arbitrary set of hypotheses: there is no free lunch. Please see these lecture notes 1 and 2 for details.
A trick
Using a small trick, can be bounded as follows. By doing this, and are eliminated.
where is denoted by .
Thus,
Theorem 2: Finite hypothesis classes are PAC-learnable.
Proof: Since , we can first bound for each and then apply the union bound. For all and all sets with samples, We have that where in the last step we use the Hoeffding’s inequality. Thus, .
Theorem 1 is not very useful since almost all hypothesis classes we care about contain infinite hypotheses.
McDiarmid’s inequality
In order to derive PAC bounds for infinite hypothesis classes, we need some more sophisticated concentration inequalities.
Theorem 3 (McDiarmid’s inequality): Let be i.i.d. random variables. Let be a function of variables. If , then , Such a function is said to have bounded differences.
Proof: TODO.
It can be shown that has bounded differences, and thus McDiarmid’s inequality can be applied. Next, we have to derive an upper bound for .
Rademacher complexity
Definition 2 (Rademacher complexity of a set): Let , the Rademacher complexity of set is defined as follows. where ’s are i.i.d. random variables and , .
Intuitively, Rademacher complexity characterizes the diversity of a set. If for every direction vector , we can always find an element such that the inner product of and is large, then would be large.
Theorem 4: The expectation of can be bounded as follows. where
Proof: To simplify the notation, the loss term is denoted by . Moreover, the set of all ’s is denoted by . The training set is the source of randomness. Thus, where we introduce a virtual validation set or a set of ghost samples , which is independent of . This is known as the symmertrization trick. Thus, Since , and are i.i.d., we have that the following three random variables have the same distribution, Thus,
Some properties of Rademacher complexity
Rademacher calculus:
The following theorem is useful when dealing with the composition of a function and a set.
Theorem 5 (contraction principle): Let , , , and be -Lipschitz continuous. Applying to the set leads to a new set . Then, the Rademacher complexity of is bounded as follows.
VC-dimension
Theorem 6 (finite class lemma): Suppose is a finite subset of for and , then the Radmacher complexity of is bounded as follows,
- finite class lemma
- definition of VC-dimension
- VC’s lemma
- relationship of Rademacher complexity and VC-dimension
Others
- surrogate loss function
- contraction principle
Some examples
Now, we consider the separators on . Given and a data distribution over . A set of i.i.d. samples is sampled from conforming the distribution . Denoting as the hypothesis class, the learning alorithm maps to a hypothesis . Given a hypothesis , define its true risk as and the empirical risk on as
Then using the VC-theory, we can derive a PAC bound for this learning problem. Here, we assume this case is realizable, i.e. .
Using the Radmacher average to bound ,
Then, we use the finite class lemma to bound the Radmacher average. Let , then
Next, we use the VC-dimension and the Sauer–Shelah lemma to bound . The Sauer-Shelah lemma, Thus, we finally have
Finally we use the McDiarmid’s inequality to derive a concentration property. We have which with the above results implies that with probability higher than ,
With Dudley’s theorem, we can construct the so-called Dudley classes whose VC-dimension is known. Specifically, we can choose a basis of linearly independent functions to construct a -dimensional Hilbert space of which the VC-dimension is . For example, is a valid basis.