This blog is a summary of the first part of the course [email protected]. More materials can be found here.

## Problem setup

In this blog, we will focus on the binary classification problem, where the input is $X \in \mathbb{R}^D$, and the label is $Y \in \{0, 1\}$. A set of i.i.d. training samples $S = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ are drawn from a joint distribution $P$ defined over $\mathbb{R}^D \times \{0,1\}$. Sometimes, we may also denote $Z = (X, Y)$ and $Z_i = (X_i, Y_i)$. A *hypothesis* is a mapping $h: \mathbb{R}^D \mapsto \{0, 1\}$. Moreover, $\mathcal{H}$ denotes the *hypotheses class*, which consists of all the possible hypotheses that our model (e.g. a neural network) can represent. The learning algorithm is denoted by $\mathcal{A}$, which maps a dataset $S$ to a hypothesis $h_S \in \mathcal{H}$.

Given a hypothesis $h$, we define its *true risk* as $$L_P(h) = \mathbb{E}_{(X, Y) \sim P}~\mathbb{I}_{h(X) \neq Y},$$ and the *empirical risk* on $S$ as $$L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{h(X_i) \neq Y_i}.$$

## PAC-learnability

After applying the learning algorithm $\mathcal{A}$ to the dataset $S$, we get a hypothesis $h_S$ which depends on $S$ and minimizes the empirical risk $L_S(h)$. We care about the difference between the true risk of $h_S$ and that of $h^*$. Here, $h^*$ is the best hypothesis in the sense of minimizing the true risk, i.e. $h^* = \text{arg} \min\limits_{h \in \mathcal{H}} L_P(h)$. In other words, we want to get some guarantees on $L_P(h_S)-L_P(h^*)$.

However, $L_P(h_S)-L_P(h^*)$ is a random variable since $h_S$ depends on $S$ which is random. Therefore, it is not reasonable to look for a uniform upper bound on $L_P(h_S)-L_P(h^*)$. In the PAC (Probably Approximately Correct) learning framework, instead of a uniform upper bound, we care about guarantees of the form $$\Pr(L_P(h_S)-L_P(h^*) \geq \epsilon) \leq \delta.$$ Here, “Probably” is characterized by $\delta$, and “Approximately” is characterized by $\epsilon$. Next, we give the definition of PAC-learnability of a hypothesis class.

* Definition 1* (PAC-learnability): A hypothesis class $\mathcal{H}$ is PAC-learnable if there exists an algorithm $\mathcal{A}$ such that for

*all*data distribution $P_{X,Y}$ and

*all*$\epsilon,\delta \in \mathbb{R}^+$, $\exists n=n(\epsilon,\delta)$ such that given any training set $S$ which contains at least $n$ samples, $h_S=\mathcal{A}(S)$ satisfies $$\Pr(L_P(h_S)-L_P(h^*) \geq \epsilon) \leq \delta.$$

* Theorem *1 (No free lunch): This theorem shows that we need some restrictions on the hypothesis class. We cannot learn an arbitrary set of hypotheses: there is no free lunch. Please see these lecture notes 1 and 2 for details.

## A trick

Using a small trick, $L_P(h_S) – L_P(h^*)$ can be bounded as follows. By doing this, $h_S$ and $h^*$ are eliminated.

$$\begin{multline} L_P(h_S) – L_P(h^*) = [L_P(h_S) – L_S(h_S)] + [L_S(h^*) – L_P(h^*)] \\+ [L_S(h_S) – L_S(h^*)]_{(\text{this term}\leq 0)} \\\leq [L_P(h_S) – L_S(h_S)] + [L_S(h^*) – L_P(h^*)] \\\leq 2\sup\limits_{h \in \mathcal{H}} \left|L_P(h)-L_S(h)\right| = 2 \mathcal{V},\end{multline}$$ where $\sup\limits_{h \in \mathcal{H}} \left|L_P(h)-L_S(h)\right|$ is denoted by $\mathcal{V}$.

Thus, $$\Pr(L_P(h_S)-L_P(h^*) \geq \epsilon) \leq \Pr(\mathcal{V} \geq \frac{\epsilon}{2}).$$

* Theorem 2:* Finite hypothesis classes are PAC-learnable.

*Proof:* Since $|\mathcal{H}|<\infty$, we can first bound $\Pr\left(\left|L_S(h)-L_P(h)\right|\geq\frac{\epsilon}{2}\right)$ for each $h$ and then apply the union bound. For all $\epsilon > 0$ and all sets $S$ with $n$ samples, We have that $$\begin{align} &\Pr\left(L_P(h_S)-L_P(h^*) \geq \epsilon\right)\\ \leq &\Pr\left(\max\limits_{h \in \mathcal{H}} \left|L_S(h)-L_P(h)\right| \geq \frac{\epsilon}{2}\right)\\= &\Pr\left(\cup_{h \in \mathcal{H}} \left\{\left|L_S(h)-L_P(h) \right| \geq \frac{\epsilon}{2}\right\}\right)\\\leq &\sum_{h\in\mathcal{H}}\Pr\left(\left|L_S(h) – L_P(h) \right|\geq\frac{\epsilon}{2}\right) \\= &\sum_{h\in\mathcal{H}}\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}_{h(x_i)\neq y_i} – L_P(h) \right|\geq\frac{\epsilon}{2}\right) \\\leq &\sum_{h\in\mathcal{H}} \exp\left(-\frac{2n\epsilon^2}{4}\right) = |\mathcal{H}|\exp\left(-\frac{n\epsilon^2}{2}\right) = \delta, \end{align}$$ where in the last step we use the Hoeffding’s inequality. Thus, $n(\epsilon, \delta) = \frac{2}{\epsilon^2}\log(\frac{|\mathcal{H}|}{\delta})$.

Theorem 1 is not very useful since almost all hypothesis classes we care about contain infinite hypotheses.

## McDiarmid’s inequality

In order to derive PAC bounds for infinite hypothesis classes, we need some more sophisticated concentration inequalities.

* Theorem 3* (McDiarmid’s inequality): Let $X_1, \dots, X_n$ be i.i.d. random variables. Let $f$ be a function of $n$ variables. If $\forall x_i, x_i’$, $$|f(x_1,\dots,x_i,\dots,x_n) – f(x_1,\dots,x_i’,\dots,x_n)| \leq c_i,$$ then $\forall \epsilon>0$, $$\Pr(f(X_1,\dots,X_n)-\mathbb{E}\left[f(X_1,\dots,X_n)\right] \geq \epsilon) \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{n}c_i}\right).$$ Such a function $f$ is said to have

*bounded differences*.

*Proof:* TODO.

It can be shown that $\mathcal{V}$ has bounded differences, and thus McDiarmid’s inequality can be applied. Next, we have to derive an upper bound for $\mathbb{E}\left[\mathcal{V}\right]$.

## Rademacher complexity

* Definition 2* (Rademacher complexity of a set): Let $A \subseteq \mathbb{R}^n$, the Rademacher complexity $R(A)$ of set $A$ is defined as follows. $$R(A) = \mathbb{E}_{\sigma}\left[\sup\limits_{a \in A} \left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i a_i\right|\right],$$ where $\sigma_i$’s are i.i.d. random variables and $\forall i$, $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 0.5$.

Intuitively, Rademacher complexity characterizes the diversity of a set. If for every direction vector $\sigma = [\sigma_1, \cdots, \sigma_n]$, we can always find an element $a \in A$ such that the inner product of $\sigma$ and $a$ is large, then $R(A)$ would be large.

* Theorem 4:* The expectation of $\mathcal{V}$ can be bounded as follows. $$\mathbb{E}_S\left[\mathcal{V}\right] \leq 2\mathbb{E}_S \left[R(\mathcal{F}(S))\right],$$ where $\mathcal{F}(S) = \left\{\left(\mathbb{I}_{h(x_1) \neq y_1}, \cdots, \mathbb{I}_{h(x_n) \neq y_n}\right), h \in \mathcal{H}\right\}.$

*Proof:* To simplify the notation, the loss term $\mathbb{I}_{h(x_i)\neq y_i}$ is denoted by $f(z_i)$. Moreover, the set of all $f$’s is denoted by $\mathcal{F}$. The training set $S = \{Z_i\}_{i=1}^{n}$ is the source of randomness. Thus, $$\begin{align} &\mathbb{E}_S\left[\mathcal{V}\right]\\ = &\mathbb{E}_S\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}f(Z_i) – \mathbb{E}_Z\left[f(Z)\right]\right|\right]\\= &\mathbb{E}_S\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}f(Z_i) – \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{Z_i’}\left[f(Z_i’)\right]\right|\right],\end{align}$$ where we introduce a *virtual validation set* or a set of *ghost samples* $S’ := \{Z_i’\}_{i=1}^{n}$, which is independent of $S$. This is known as the *symmertrization* trick. Thus, $$\begin{align}&\mathbb{E}_S\left[\mathcal{V}\right]\\= &\mathbb{E}_S\left[\sup\limits_{f\in\mathcal{F}}\left|\mathbb{E}_{S’}\left[\frac{1}{n}\sum_{i=1}^{n}\left(f(Z_i) – f(Z_i’)\right)\right]\right|\right] \\ \underset{\text{Jensen’s}}{\leq} &\mathbb{E}_S\left[\sup\limits_{f\in\mathcal{F}}\mathbb{E}_{S’}\left[\left|\frac{1}{n}\sum_{i=1}^{n}\left(f(Z_i) – f(Z_i’)\right)\right|\right]\right]\\ \leq &\mathbb{E}_{S,S’}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\left(f(Z_i) – f(Z_i’)\right)\right|\right].\end{align}$$ Since $\forall i$, $Z_i$ and $Z_i’$ are i.i.d., we have that the following three random variables have the same distribution, $$f(Z_i)-f(Z_i’)\stackrel{\mbox{d}}{=}f(Z_i’)-f(Z_i)\stackrel{\mbox{d}}{=}\sigma_i(f(Z_i)-f(Z_i’)).$$ Thus, $$\begin{align}&\mathbb{E}_S\left[\mathcal{V}\right]\\\leq&\mathbb{E}_{S,S’,\sigma}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\left(f(Z_i) – f(Z_i’)\right)\right|\right]\\\leq&2\mathbb{E}_{S,\sigma}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_if(Z_i)\right|\right]= 2\mathbb{E}_S\left[R(\mathcal{F}(S)\right].\end{align}$$

## Some properties of Rademacher complexity

Rademacher calculus:

The following theorem is useful when dealing with the composition of a function and a set.

* Theorem 5* (contraction principle): Let $A\subseteq \mathbb{R}^n$, $\phi: \mathbb{R}\mapsto\mathbb{R}$, $\phi(0) = 0$, and $\phi(\cdot)$ be $M$-Lipschitz continuous. Applying $\phi$ to the set $A$ leads to a new set $\phi \circ A = \left\{\left(\phi(a_1), \cdots, \phi(a_n)\right), a \in A\right\}$. Then, the Rademacher complexity of $\phi \circ A$ is bounded as follows. $$R\left(\phi \circ A\right) \leq 2MR(A).$$

## VC-dimension

** Theorem 6** (finite class lemma): Suppose $A = \{a^{(1)},\dots,a^{(N)}\}$ is a

*finite*subset of $\mathbb{R}^n$ for $N>2$ and $||a^{(i)}|| \leq L,\,\forall i$, then the Radmacher complexity of $A$ is bounded as follows, $$R_n(A) \leq \frac{2L\sqrt{\log N}}{n}.$$

- finite class lemma
- definition of VC-dimension
- VC’s lemma
- relationship of Rademacher complexity and VC-dimension

## Others

- surrogate loss function
- contraction principle

## Some examples

Now, we consider the separators on $\mathbb{R}^d$. Given $\Gamma \subseteq \mathbb{R}^d$ and a data distribution $\mathcal{D}$ over $\Gamma$. A set of i.i.d. samples $S=\{Z_1,\dots,Z_n\}$ is sampled from $\Gamma$ conforming the distribution $\mathcal{D}$. Denoting $\mathcal{H}$ as the hypothesis class, the learning alorithm maps $S$ to a hypothesis $h_S \in \mathcal{H}$. Given a hypothesis $h$, define its *true risk* as $$L_P(h) = \mathbb{E}_{Z \sim \mathcal{D}}~\mathbb{I}_{h(Z) \neq 1},$$ and the *empirical risk* on $S$ as $$L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{h(Z_i) \neq 1}.$$

Then using the VC-theory, we can derive a PAC bound for this learning problem. Here, we assume this case is realizable, i.e. $L_P(h^*) = 0$.

$\begin{multline} L_P(h_S) – L_P(h^*) = [L_P(h_S) – L_S(h_S)] + [L_S(h^*) – L_P(h^*)] \\+ [L_S(h_S) – L_S(h^*)]_{\leq 0} \\\leq [L_P(h_S) – L_S(h_S)] + [L_S(h^*) – L_P(h^*)] \\\leq 2\sup\limits_{h \in \mathcal{H}} \left|L_P(h)-L_S(h)\right| = 2 \mathcal{V}\end{multline}$

Using the Radmacher average to bound $\mathbb{E}[\mathcal{V}]$, $$\mathbb{E}[\mathcal{V}] \leq 2\mathbb{E}[R(\mathcal{H}(Z^n))].$$

Then, we use the finite class lemma to bound the Radmacher average. Let $S_n(\mathcal{H}) = \sup\limits_{z^n}\left|\mathcal{H}(z^n)\right|$, then $$\mathbb{E}[R(\mathcal{H}(Z^n))] \leq \mathbb{E}\left[2\sqrt{\frac{\log S_n(\mathcal{H})}{n}}\right] = 2\sqrt{\frac{\log S_n(\mathcal{H})}{n}}.$$

Next, we use the VC-dimension and the Sauerâ€“Shelah lemma to bound $S_n(\mathcal{H})$. The Sauer-Shelah lemma, $$S_n(\mathcal{H}) \leq \sum_{i=1}^{VC(\mathcal{H})} {n \choose i} \leq \left(\frac{ne}{d}\right)^{VC(\mathcal{H})}.$$ Thus, we finally have $$\mathbb{E} \left[L_P(h_S)\right] = \mathbb{E} \left[L_P(h_S) – L_P(h^*)\right] \leq 8\sqrt{\frac{VC(\mathcal{H}) (1 + \log \frac{n}{d})}{n}}.$$

Finally we use the McDiarmid’s inequality to derive a concentration property. We have $$\Pr\left(L_P(h_S) – \mathbb{E} \left[L_P(h_S)\right] \leq \frac{1}{2n}\sqrt{\log \frac{1}{\delta}}\right) \geq 1-\delta$$ which with the above results implies that with probability higher than $1-\delta$, $$L_P(h_S) \leq 8\sqrt{\frac{VC(\mathcal{H}) (1 + \log \frac{n}{d})}{n}} + \frac{1}{2n}\sqrt{\log \frac{1}{\delta}}.$$

With Dudley’s theorem, we can construct the so-called Dudley classes whose VC-dimension is known. Specifically, we can choose a basis of $n$ linearly independent functions to construct a $n$-dimensional Hilbert space of which the VC-dimension is $n$. For example, $\{1,x,\dots,x^n\}$ is a valid basis.