# The PAC learning framework

This blog post is a summary of the first part of the ECE543 course at UIUC. More materials used in this course can be found here.

## Problem setup

In this blog, we focus on the binary classification problem, where the input is $X \in \mathbb{R}^D$ and the label is $Y \in \{0, 1\}$. A set of training samples $S = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ is drawn i.i.d. from the joint distribution $P(X, Y)$. Sometimes we also write $Z = (X, Y)$ and $Z_i = (X_i, Y_i)$. A classifier is a mapping $h: \mathbb{R}^D \to \{0, 1\}$; in the PAC learning framework, such a function is called a hypothesis. Given the model we use to classify the data, we define the hypothesis class $\mathcal{H}$ as the set of all hypotheses the model can represent. We also have a learning algorithm $\mathcal{A}$, which maps a dataset $S$ to a hypothesis $h_S \in \mathcal{H}$.

Given a hypothesis $h$, we define its true risk as $$L_P(h) = \mathbb{E}_{(X, Y) \sim P}~\mathbb{I}_{h(X) \neq Y},$$ and the empirical risk on $S$ as $$L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{h(X_i) \neq Y_i}.$$
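As a quick numerical illustration of these two definitions (with an assumed toy distribution of my choosing: $X$ uniform on $[0,1]$, labels flipped with probability 0.1), the empirical risk on a small sample fluctuates around the true risk, which we can estimate by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy distribution P(X, Y): X ~ Uniform[0, 1],
# Y = 1{X > 0.5}, with the label flipped with probability 0.1.
def sample(n):
    X = rng.uniform(0, 1, n)
    Y = (X > 0.5).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    Y[flip] = 1 - Y[flip]
    return X, Y

# A fixed hypothesis h: threshold at 0.5 (the Bayes classifier here,
# so its true risk is exactly the flip probability 0.1).
def h(X):
    return (X > 0.5).astype(int)

def empirical_risk(h, X, Y):
    return np.mean(h(X) != Y)          # L_S(h)

X_small, Y_small = sample(50)          # L_S(h): noisy at n = 50
X_big, Y_big = sample(1_000_000)       # Monte Carlo estimate of L_P(h)
print(empirical_risk(h, X_small, Y_small))
print(empirical_risk(h, X_big, Y_big))  # close to 0.1
```

The gap between the two printed numbers is exactly the kind of fluctuation the concentration inequalities below control.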

## PAC-learnability

After applying the learning algorithm $\mathcal{A}$ to the dataset $S$, we get a hypothesis $h_S$ that depends on $S$. We care about the gap between the true risk of $h_S$ and that of $h^*$, where $h^*$ is the best hypothesis in the sense of minimizing the true risk, i.e. $h^* = \arg\min\limits_{h \in \mathcal{H}} L_P(h)$. In other words, we want guarantees on $L_P(h_S)-L_P(h^*)$.

However, $L_P(h_S)-L_P(h^*)$ is a random variable, since $h_S$ depends on the random sample $S$. Therefore, it is not reasonable to look for a deterministic upper bound on $L_P(h_S)-L_P(h^*)$. In the PAC (Probably Approximately Correct) learning framework, we instead look for guarantees of the form $$\Pr(L_P(h_S)-L_P(h^*) > \epsilon) < \delta.$$ Here, “Probably” is characterized by $\delta$, and “Approximately” is characterized by $\epsilon$. Next, we give the definition of PAC-learnability of a hypothesis class.

Definition 1 (PAC-learnability): A hypothesis class $\mathcal{H}$ is PAC-learnable if there exists an algorithm $\mathcal{A}$ such that for any data distribution $P_{X,Y}$ and any $\epsilon,\delta \in \mathbb{R}^+$, $\exists n=n(\epsilon,\delta)$ such that given any training set $S$ which contains at least $n$ samples, $h_S=\mathcal{A}(S)$ satisfies $$\Pr(L_P(h_S)-L_P(h^*) > \epsilon) < \delta.$$

Theorem 1: Finite hypothesis classes are PAC-learnable.

The proof of Theorem 1 combines Hoeffding's inequality with a union bound over the finitely many hypotheses. However, Theorem 1 is not very useful on its own, since almost all hypothesis classes we care about contain infinitely many hypotheses.
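For concreteness, the standard Hoeffding-plus-union-bound argument behind Theorem 1 yields, in the agnostic case, a sample complexity of roughly $n(\epsilon,\delta) = \lceil 2\log(2|\mathcal{H}|/\delta)/\epsilon^2 \rceil$. A small sketch (the exact constants vary between textbooks):

```python
import math

# Sample complexity from the Hoeffding + union-bound proof of Theorem 1
# (agnostic case; constants vary by textbook). With probability >= 1 - delta,
# ERM over a finite class of size M has excess risk <= epsilon once
#   n >= 2 * log(2 * M / delta) / epsilon**2.
def sample_complexity(M, epsilon, delta):
    return math.ceil(2 * math.log(2 * M / delta) / epsilon ** 2)

print(sample_complexity(M=1000, epsilon=0.05, delta=0.01))  # 9765
# The dependence on the class size M is only logarithmic:
print(sample_complexity(M=10 ** 6, epsilon=0.05, delta=0.01))
```

Note how mild the dependence on $|\mathcal{H}|$ and $\delta$ is compared to the $1/\epsilon^2$ factor.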

Theorem 2 (No free lunch): Roughly speaking, there is no universal learner: for any learning algorithm, there exists a data distribution on which it performs poorly. In particular, the class of all classifiers is not PAC-learnable.

## McDiarmid’s inequality

In order to derive PAC bound for infinite hypothesis classes, we need some more sophisticated concentration inequalities.

Theorem 3 (McDiarmid’s inequality): Let $X_1, \dots, X_n$ be independent random variables. Let $f$ be a function of $n$ variables satisfying the bounded-difference condition: $\forall x_i, x_i'$, $$|f(x_1,\dots,x_i,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n)| \leq c_i.$$ Then, $\forall x>0$, $$\Pr(f(X_1,\dots,X_n)-\mathbb{E}\left[f(X_1,\dots,X_n)\right] \geq x) \leq \exp\left(\frac{-2x^2}{\sum_{i=1}^{n}c_i^2}\right).$$
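A quick sanity check: taking $f$ to be the sample mean of $n$ variables in $[0,1]$ gives $c_i = 1/n$, and McDiarmid's inequality reduces to Hoeffding's inequality. A simulation (with uniform variables, my choice) confirms that the empirical tail stays below the bound:

```python
import numpy as np

rng = np.random.default_rng(1)

# Special case: f = sample mean of n variables in [0, 1], so c_i = 1/n and
# McDiarmid's bound reads P(mean - E[mean] >= x) <= exp(-2 n x**2)
# (Hoeffding's inequality). Simulated with Uniform[0, 1] variables.
n, trials, x = 100, 50_000, 0.1
means = rng.uniform(0, 1, (trials, n)).mean(axis=1)   # E[mean] = 0.5
tail = np.mean(means - 0.5 >= x)                      # empirical tail prob.
bound = np.exp(-2 * n * x ** 2)                       # exp(-2) ≈ 0.135
print(tail, bound)
assert tail <= bound
```

The bound is loose here (the true tail is far smaller), which is typical of such worst-case inequalities.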

Proof: omitted here.

The route from McDiarmid’s inequality to a PAC bound proceeds in three steps:

• use McDiarmid’s inequality to show that $\sup\limits_{h \in \mathcal{H}} \left|L_P(h)-L_S(h)\right|$ concentrates around its expectation;
• apply the symmetrization trick to that expectation;
• bound the resulting quantity by the Rademacher complexity.
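As a sketch of the quantity these steps control, here is a Monte Carlo estimate of the empirical Rademacher average $R(\mathcal{H}(z^n))$ for an assumed toy class of my choosing, threshold classifiers on the line:

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo estimate of the empirical Rademacher average
#   R(H(z^n)) = E_sigma [ sup_{h in H} (1/n) sum_i sigma_i h(x_i) ]
# for threshold classifiers h_t(x) = 1{x >= t} (an assumed toy class).
n = 50
x = np.sort(rng.uniform(0, 1, n))

# Restricted to n fixed points, this class realizes at most n + 1 distinct
# label vectors; thresholds at each point and at +-infinity cover them all.
thresholds = np.concatenate((x, [-np.inf, np.inf]))
behaviors = np.array([[1 if xi >= t else 0 for xi in x] for t in thresholds])

sigmas = rng.choice([-1, 1], size=(20_000, n))
sups = (sigmas @ behaviors.T / n).max(axis=1)  # sup over the class, per sigma
print(sups.mean())  # the finite class lemma bounds this by O(sqrt(log n / n))
```

Enumerating the finitely many behaviors on the fixed sample is exactly what makes the finite class lemma below applicable to an infinite class.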

## VC-dimension

The finite class lemma: suppose $A = \{a^{(1)},\dots,a^{(N)}\}$ is a finite subset of $\mathbb{R}^n$ with $N>2$ and $\|a^{(i)}\| \leq L,\ \forall i$. Then the Rademacher average of $A$ satisfies $$R_n(A) \leq \frac{2L\sqrt{\log N}}{n}.$$
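A numeric spot check of the lemma, with a randomly generated finite set $A$ (my choice):

```python
import numpy as np

rng = np.random.default_rng(3)

# Spot check of the finite class lemma on a random A = {a^(1), ..., a^(N)}:
# the Rademacher average R_n(A) = E_sigma[max_a <sigma, a> / n]
# should stay below 2 L sqrt(log N) / n.
n, N = 30, 100
A = rng.normal(size=(N, n))
L = np.linalg.norm(A, axis=1).max()          # max Euclidean norm over A

sigmas = rng.choice([-1, 1], size=(50_000, n))
R = (sigmas @ A.T / n).max(axis=1).mean()    # Monte Carlo estimate of R_n(A)
bound = 2 * L * np.sqrt(np.log(N)) / n
print(R, bound)
assert R <= bound
```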

• the finite class lemma
• definition of VC-dimension
• the Sauer–Shelah (VC) lemma
• relationship between Rademacher complexity and VC-dimension
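The definition of VC-dimension (the largest number of points the class can shatter) can be checked by brute force on small examples. Here is a sketch for an assumed toy class of my choosing, indicators of intervals $[a,b]$ on the line, whose VC-dimension is 2:

```python
# Brute-force shattering check for an assumed toy class: indicators of
# closed intervals [a, b] on the real line (VC-dimension 2).
def realizable_labelings(points, eps=0.25):
    # Interval endpoints slightly left/right of each point are enough,
    # assuming the points are spaced more than 2 * eps apart.
    # (p > q yields an empty interval, i.e. the all-zeros labeling.)
    cands = [(p - eps, q + eps) for p in points for q in points]
    return {tuple(1 if a <= x <= b else 0 for x in points) for a, b in cands}

def shatters(points):
    return len(realizable_labelings(points)) == 2 ** len(points)

print(shatters([1.0, 2.0]))       # True: two points are shattered
print(shatters([1.0, 2.0, 3.0]))  # False: (1, 0, 1) is not realizable
# Hence VC(intervals) = 2.
```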

## Others

• surrogate loss function
• contraction principle

## Some examples

Now, we consider the separators on $\mathbb{R}^n$. Given $\Gamma \subseteq \mathbb{R}^n$ and a data distribution $\mathcal{D}$ over $\Gamma$, a set of i.i.d. samples $S=\{Z_1,\dots,Z_n\}$ is drawn from $\Gamma$ according to $\mathcal{D}$. Denoting the hypothesis class by $\mathcal{H}$, the learning algorithm maps $S$ to a hypothesis $h_S \in \mathcal{H}$. Given a hypothesis $h$, define its true risk as $$L_P(h) = \mathbb{E}_{Z \sim \mathcal{D}}~\mathbb{I}_{h(Z) \neq 1},$$ and the empirical risk on $S$ as $$L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{h(Z_i) \neq 1}.$$

Then using the VC-theory, we can derive a PAC bound for this learning problem. Here, we assume this case is realizable, i.e. $L_P(h^*) = 0$.

$$\begin{aligned} L_P(h_S) - L_P(h^*) &= [L_P(h_S) - L_S(h_S)] + [L_S(h^*) - L_P(h^*)] + \underbrace{[L_S(h_S) - L_S(h^*)]}_{\leq 0} \\ &\leq [L_P(h_S) - L_S(h_S)] + [L_S(h^*) - L_P(h^*)] \\ &\leq 2\sup\limits_{h \in \mathcal{H}} \left|L_P(h)-L_S(h)\right| = 2\,\text{SUP}, \end{aligned}$$ where the underbraced term is non-positive because $h_S$ minimizes the empirical risk.

Using the Rademacher average to bound $\mathbb{E}[\text{SUP}]$ (the symmetrization step), $$\mathbb{E}[\text{SUP}] \leq 2\mathbb{E}[R(\mathcal{H}(Z^n))].$$

Then, we use the finite class lemma to bound the Rademacher average. Let $S_n(\mathcal{H}) = \sup\limits_{z^n}\left|\mathcal{H}(z^n)\right|$ (the growth function); then $$\mathbb{E}[R(\mathcal{H}(Z^n))] \leq \mathbb{E}\left[2\sqrt{\frac{\log S_n(\mathcal{H})}{n}}\right] = 2\sqrt{\frac{\log S_n(\mathcal{H})}{n}}.$$

Next, we use the VC-dimension and the Sauer–Shelah lemma to bound $S_n(\mathcal{H})$. Writing $d = VC(\mathcal{H})$, the Sauer–Shelah lemma gives $$S_n(\mathcal{H}) \leq \sum_{i=0}^{d} {n \choose i} \leq \left(\frac{ne}{d}\right)^{d}.$$ Thus, we finally have $$\mathbb{E} \left[L_P(h_S)\right] = \mathbb{E} \left[L_P(h_S) - L_P(h^*)\right] \leq 8\sqrt{\frac{d (1 + \log \frac{n}{d})}{n}}.$$
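Both inequalities in the Sauer–Shelah bound can be spot-checked numerically:

```python
import math

# Spot check of both inequalities in the Sauer–Shelah bound,
#   S_n(H) <= sum_{i=0}^{d} C(n, i) <= (n e / d)^d,  where d = VC(H).
def growth_bound(n, d):
    return sum(math.comb(n, i) for i in range(d + 1))

for n, d in [(10, 2), (100, 3), (1000, 5)]:
    poly = growth_bound(n, d)
    loose = (n * math.e / d) ** d
    assert poly <= loose
    print(n, d, poly, round(loose))
```

The key point is that the growth function is polynomial in $n$ once the VC-dimension is finite, so $\log S_n(\mathcal{H})/n \to 0$.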

Finally, we use McDiarmid’s inequality to derive a concentration property. We have $$\Pr\left(L_P(h_S) - \mathbb{E} \left[L_P(h_S)\right] \leq \sqrt{\frac{\log \frac{1}{\delta}}{2n}}\right) \geq 1-\delta,$$ which, combined with the above results, implies that with probability at least $1-\delta$, $$L_P(h_S) \leq 8\sqrt{\frac{VC(\mathcal{H}) (1 + \log \frac{n}{d})}{n}} + \sqrt{\frac{\log \frac{1}{\delta}}{2n}}.$$
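To see the rate, we can evaluate the resulting bound numerically ($d$ and $\delta$ below are arbitrary choices of mine, and I write the McDiarmid deviation term as $\sqrt{\log(1/\delta)/(2n)}$, the standard form). Note that the bound is vacuous until $n$ is fairly large:

```python
import math

# Evaluating the final high-probability bound numerically:
#   L_P(h_S) <= 8 sqrt(d (1 + log(n/d)) / n) + sqrt(log(1/delta) / (2n)).
def risk_bound(n, d, delta):
    return (8 * math.sqrt(d * (1 + math.log(n / d)) / n)
            + math.sqrt(math.log(1 / delta) / (2 * n)))

d, delta = 10, 0.05           # arbitrary illustrative choices
for n in [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    print(n, risk_bound(n, d, delta))   # vacuous (> 1) for small n
```

The first term, of order $\sqrt{d \log n / n}$, dominates; the confidence term contributes only $O(1/\sqrt{n})$.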

With Dudley’s theorem, we can construct the so-called Dudley classes, whose VC-dimension is known. Specifically, choosing $n$ linearly independent functions as a basis yields an $n$-dimensional function space whose induced class of sign classifiers has VC-dimension $n$. For example, $\{1,x,\dots,x^{n-1}\}$ spans the polynomials of degree less than $n$.
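A brute-force illustration of the Dudley-class idea: sign classifiers of polynomials of degree at most $n$ come from an $(n+1)$-dimensional function space, and they indeed shatter any $n+1$ distinct points on the line (checked here for $n = 3$ by interpolation):

```python
import numpy as np

# Sign classifiers of polynomials of degree <= n form an (n + 1)-dimensional
# Dudley class; check by interpolation that they shatter n + 1 points.
n = 3
points = np.array([0.0, 1.0, 2.0, 3.0])   # n + 1 = 4 distinct points

ok = True
for labeling in range(2 ** len(points)):
    target = np.array([1.0 if (labeling >> i) & 1 else -1.0
                       for i in range(len(points))])
    # A polynomial of degree <= n interpolating +-1 values at n + 1 points
    # always exists; its sign then realizes the labeling.
    coeffs = np.polyfit(points, target, deg=n)
    ok &= bool(np.all(np.sign(np.polyval(coeffs, points)) == target))
print(ok)   # True: all 2^4 = 16 labelings are realized
```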