The PAC learning framework

This blog is a summary of the first part of the course ECE543@UIUC. More materials can be found here.

Problem setup

In this blog, we will focus on the binary classification problem, where the input is $X \in \mathbb{R}^D$ and the label is $Y \in \{0,1\}$. A set of i.i.d. training samples $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is drawn from a joint distribution $P$ defined over $\mathbb{R}^D \times \{0,1\}$. Sometimes, we may also denote $Z = (X, Y)$ and $Z_i = (X_i, Y_i)$. A hypothesis is a mapping $h: \mathbb{R}^D \to \{0,1\}$. Moreover, $\mathcal{H}$ denotes the hypothesis class, which consists of all the possible hypotheses that our model (e.g. a neural network) can represent. The learning algorithm is denoted by $A$, and it maps a dataset $S$ to a hypothesis $h_S \in \mathcal{H}$.

Given a hypothesis $h$, we define its true risk as $L_P(h) = \mathbb{E}_{(X,Y)\sim P}\,\mathbb{I}_{h(X)\neq Y}$, and the empirical risk on $S$ as $L_S(h) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{h(X_i)\neq Y_i}$.
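
To make these definitions concrete, here is a minimal numerical sketch (the threshold hypothesis and the synthetic distribution below are arbitrary illustrative choices) comparing the empirical risk on a small sample with a Monte Carlo estimate of the true risk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic distribution P: X ~ Uniform(0, 1), Y = 1{X > 0.3} with 10% label noise.
def sample(n):
    X = rng.uniform(0, 1, size=n)
    Y = (X > 0.3).astype(int)
    flip = rng.uniform(0, 1, size=n) < 0.1
    return X, np.where(flip, 1 - Y, Y)

# A single hypothesis: the threshold classifier h(x) = 1{x > 0.5}.
def h(x):
    return (x > 0.5).astype(int)

# Empirical risk L_S(h) on a training set of n = 100 samples.
X_train, Y_train = sample(100)
L_S = np.mean(h(X_train) != Y_train)

# Monte Carlo estimate of the true risk L_P(h) using a very large fresh sample.
X_big, Y_big = sample(1_000_000)
L_P = np.mean(h(X_big) != Y_big)

print(f"empirical risk L_S(h) ~= {L_S:.3f}")
print(f"true risk      L_P(h) ~= {L_P:.3f}")
```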

PAC-learnability

After applying the learning algorithm $A$ to the dataset $S$, we get a hypothesis $h_S$ which depends on $S$ and minimizes the empirical risk $L_S(h)$. We care about the difference between the true risk of $h_S$ and that of $h^*$. Here, $h^*$ is the best hypothesis in the sense of minimizing the true risk, i.e. $h^* = \arg\min_{h\in\mathcal{H}} L_P(h)$. In other words, we want to get some guarantees on $L_P(h_S) - L_P(h^*)$.

However, $L_P(h_S) - L_P(h^*)$ is a random variable, since $h_S$ depends on $S$, which is random. Therefore, it is not reasonable to look for a deterministic upper bound on $L_P(h_S) - L_P(h^*)$. In the PAC (Probably Approximately Correct) learning framework, we instead look for guarantees of the form $\Pr\left(L_P(h_S) - L_P(h^*) \geq \epsilon\right) \leq \delta$. Here, “Probably” is characterized by $\delta$, and “Approximately” is characterized by $\epsilon$. Next, we give the definition of PAC-learnability of a hypothesis class.

Definition 1 (PAC-learnability): A hypothesis class $\mathcal{H}$ is PAC-learnable if there exists an algorithm $A$ such that for every data distribution $P_{X,Y}$ and all $\epsilon, \delta \in \mathbb{R}^+$, there exists $n = n(\epsilon, \delta)$ such that, given any training set $S$ containing at least $n$ samples, $h_S = A(S)$ satisfies $\Pr\left(L_P(h_S) - L_P(h^*) \geq \epsilon\right) \leq \delta$.

Theorem 1 (No free lunch): This theorem shows that we need some restrictions on the hypothesis class. We cannot learn an arbitrary set of hypotheses: there is no free lunch. Please see these lecture notes 1 and 2 for details.

A trick

Using a small trick, $L_P(h_S) - L_P(h^*)$ can be bounded as follows; the resulting bound no longer depends on the particular hypotheses $h_S$ and $h^*$.

$$
\begin{aligned}
L_P(h_S) - L_P(h^*) &= \left[L_P(h_S) - L_S(h_S)\right] + \left[L_S(h^*) - L_P(h^*)\right] + \underbrace{\left[L_S(h_S) - L_S(h^*)\right]}_{\text{this term}\,\leq\,0} \\
&\leq \left[L_P(h_S) - L_S(h_S)\right] + \left[L_S(h^*) - L_P(h^*)\right] \\
&\leq 2\sup_{h\in\mathcal{H}}\left|L_P(h) - L_S(h)\right| = 2V,
\end{aligned}
$$
where $\sup_{h\in\mathcal{H}}\left|L_P(h) - L_S(h)\right|$ is denoted by $V$.

Thus, $\Pr\left(L_P(h_S) - L_P(h^*) \geq \epsilon\right) \leq \Pr\left(V \geq \frac{\epsilon}{2}\right)$.

Theorem 2: Finite hypothesis classes are PAC-learnable.

Proof: Since $|\mathcal{H}| < \infty$, we can first bound $\Pr\left(|L_S(h) - L_P(h)| \geq \frac{\epsilon}{2}\right)$ for each $h$ and then apply the union bound. For all $\epsilon > 0$ and all sets $S$ with $n$ samples, we have that
$$
\begin{aligned}
\Pr\left(L_P(h_S) - L_P(h^*) \geq \epsilon\right) &\leq \Pr\left(\max_{h\in\mathcal{H}} |L_S(h) - L_P(h)| \geq \frac{\epsilon}{2}\right) \\
&= \Pr\left(\bigcup_{h\in\mathcal{H}} \left\{|L_S(h) - L_P(h)| \geq \frac{\epsilon}{2}\right\}\right) \\
&\leq \sum_{h\in\mathcal{H}} \Pr\left(|L_S(h) - L_P(h)| \geq \frac{\epsilon}{2}\right) \\
&= \sum_{h\in\mathcal{H}} \Pr\left(\left|\frac{1}{n}\sum_{i=1}^n \mathbb{I}_{h(X_i)\neq Y_i} - L_P(h)\right| \geq \frac{\epsilon}{2}\right) \\
&\leq \sum_{h\in\mathcal{H}} 2\exp\left(-\frac{2n\epsilon^2}{4}\right) = 2|\mathcal{H}|\exp\left(-\frac{n\epsilon^2}{2}\right) =: \delta,
\end{aligned}
$$
where in the last step we use Hoeffding's inequality (two-sided, with deviation $\frac{\epsilon}{2}$). Thus, $n(\epsilon,\delta) = \frac{2}{\epsilon^2}\log\left(\frac{2|\mathcal{H}|}{\delta}\right)$.
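
For a rough feel of this sample complexity, here is a small calculator (the numbers plugged in are arbitrary):

```python
import math

def n_samples(H_size, eps, delta):
    """Smallest n with 2*|H|*exp(-n*eps^2/2) <= delta,
    i.e. n >= (2/eps^2) * log(2*|H|/delta)."""
    return math.ceil(2.0 / eps**2 * math.log(2.0 * H_size / delta))

# Example: |H| = 1000 hypotheses, accuracy eps = 0.05, confidence delta = 0.01.
print(n_samples(1000, 0.05, 0.01))   # -> 9765 samples
```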

Theorem 2 is not very useful by itself, since almost all hypothesis classes we care about contain infinitely many hypotheses.

McDiarmid’s inequality

In order to derive PAC bounds for infinite hypothesis classes, we need some more sophisticated concentration inequalities.

Theorem 3 (McDiarmid's inequality): Let $X_1, \ldots, X_n$ be i.i.d. random variables, and let $f$ be a function of $n$ variables. If for every $i$ and all $x_1, \ldots, x_n, x_i'$,
$$\left|f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)\right| \leq c_i,$$
then for all $\epsilon > 0$,
$$\Pr\left(f(X_1, \ldots, X_n) - \mathbb{E}\left[f(X_1, \ldots, X_n)\right] \geq \epsilon\right) \leq \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}\right).$$
Such a function $f$ is said to have bounded differences.

Proof: TODO.
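
For example, the empirical mean of $[0,1]$-valued variables, $f(x_1,\ldots,x_n) = \frac{1}{n}\sum_{i=1}^n x_i$, changes by at most $c_i = \frac{1}{n}$ when a single coordinate is replaced, so McDiarmid's inequality gives
$$\Pr\left(\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right] \geq \epsilon\right) \leq \exp\left(-\frac{2\epsilon^2}{n\cdot\frac{1}{n^2}}\right) = \exp\left(-2n\epsilon^2\right),$$
which recovers the one-sided version of Hoeffding's inequality used in the proof of Theorem 2.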

It can be shown that $V$ has bounded differences: replacing a single sample $Z_i$ changes $L_S(h)$ by at most $\frac{1}{n}$ for every $h$, and hence changes $V = \sup_{h\in\mathcal{H}}|L_P(h) - L_S(h)|$ by at most $\frac{1}{n}$. Applying McDiarmid's inequality with $c_i = \frac{1}{n}$ gives $\Pr\left(V - \mathbb{E}[V] \geq \epsilon\right) \leq \exp\left(-2n\epsilon^2\right)$. It remains to derive an upper bound for $\mathbb{E}[V]$.
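
Before bounding $\mathbb{E}[V]$, here is a small numerical sketch of the bounded-difference property (it assumes a toy class of 1-D threshold classifiers and a synthetic distribution for which the true risk is known in closed form): replacing a single sample never moves $V$ by more than $1/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
thresholds = np.linspace(0.0, 1.0, 101)   # finite class of h_t(x) = 1{x > t}

# Distribution: X ~ Uniform(0, 1), Y = 1{X > 0.3}; then L_P(h_t) = |t - 0.3|.
def true_risk(t):
    return np.abs(t - 0.3)

def V(X, Y):
    # V = sup_h |L_P(h) - L_S(h)| over the threshold class.
    preds = (X[None, :] > thresholds[:, None]).astype(int)   # shape (|H|, n)
    L_S = np.mean(preds != Y[None, :], axis=1)
    return np.max(np.abs(true_risk(thresholds) - L_S))

X = rng.uniform(0, 1, size=n)
Y = (X > 0.3).astype(int)
v0 = V(X, Y)

# Replace one coordinate at a time with a fresh sample and track the change in V.
max_change = 0.0
for i in range(n):
    Xi, Yi = X.copy(), Y.copy()
    Xi[i] = rng.uniform(0, 1)
    Yi[i] = int(Xi[i] > 0.3)
    max_change = max(max_change, abs(V(Xi, Yi) - v0))

print(f"max |V(S') - V(S)| = {max_change:.4f}  (bounded by 1/n = {1/n:.4f})")
```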

Rademacher complexity

Definition 2 (Rademacher complexity of a set): Let $A \subseteq \mathbb{R}^n$. The Rademacher complexity $R(A)$ of the set $A$ is defined as
$$R(A) = \mathbb{E}_\sigma\left[\sup_{a\in A}\left|\frac{1}{n}\sum_{i=1}^n \sigma_i a_i\right|\right],$$
where the $\sigma_i$'s are i.i.d. random variables with $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 0.5$ for every $i$.

Intuitively, the Rademacher complexity characterizes the diversity of a set: if for every sign vector $\sigma = [\sigma_1, \ldots, \sigma_n]$ we can always find an element $a \in A$ such that the inner product of $\sigma$ and $a$ is large, then $R(A)$ is large.
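
A rough Monte Carlo sketch of this intuition (the two sets below are arbitrary toy examples): a set whose elements can align with many sign vectors has a much larger Rademacher complexity than a set containing a single point.

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_complexity(A, num_draws=20000):
    """Monte Carlo estimate of R(A) = E_sigma[ sup_{a in A} |(1/n) <sigma, a>| ].
    A is an (N, n) array whose rows are the elements of the set."""
    N, n = A.shape
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    # Inner products of shape (num_draws, N); take the sup over A, then average.
    vals = np.abs(sigma @ A.T) / n
    return vals.max(axis=1).mean()

n = 20
# A "diverse" set of 256 random sign vectors versus a set containing a single point.
A_diverse = rng.choice([-1.0, 1.0], size=(256, n))
A_single = np.ones((1, n))

print("R(diverse set)  ~=", round(rademacher_complexity(A_diverse), 3))
print("R(single point) ~=", round(rademacher_complexity(A_single), 3))
```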

Theorem 4: The expectation of $V$ can be bounded as follows: $\mathbb{E}_S[V] \leq 2\,\mathbb{E}_S\left[R(\mathcal{F}(S))\right]$, where $\mathcal{F}(S) = \left\{\left(\mathbb{I}_{h(X_1)\neq Y_1}, \ldots, \mathbb{I}_{h(X_n)\neq Y_n}\right) : h \in \mathcal{H}\right\}$.

Proof: To simplify the notation, the loss term $\mathbb{I}_{h(X_i)\neq Y_i}$ is denoted by $f(Z_i)$, and the set of all such $f$'s is denoted by $\mathcal{F}$. The training set $S = \{Z_i\}_{i=1}^n$ is the source of randomness. Thus,
$$\mathbb{E}_S[V] = \mathbb{E}_S\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^n f(Z_i) - \mathbb{E}_Z[f(Z)]\right|\right] = \mathbb{E}_S\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^n f(Z_i) - \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{Z_i'}\left[f(Z_i')\right]\right|\right],$$
where we introduce a virtual validation set, or a set of ghost samples, $S' := \{Z_i'\}_{i=1}^n$, which is independent of $S$ and has the same distribution. This is known as the symmetrization trick. Thus,
$$\mathbb{E}_S[V] = \mathbb{E}_S\left[\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{S'}\left[\frac{1}{n}\sum_{i=1}^n \left(f(Z_i) - f(Z_i')\right)\right]\right|\right] \overset{\text{Jensen's}}{\leq} \mathbb{E}_S\left[\sup_{f\in\mathcal{F}} \mathbb{E}_{S'}\left[\left|\frac{1}{n}\sum_{i=1}^n \left(f(Z_i) - f(Z_i')\right)\right|\right]\right] \leq \mathbb{E}_{S,S'}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^n \left(f(Z_i) - f(Z_i')\right)\right|\right].$$
Since for each $i$, $Z_i$ and $Z_i'$ are i.i.d., the following three random variables have the same distribution:
$$f(Z_i) - f(Z_i') \overset{d}{=} f(Z_i') - f(Z_i) \overset{d}{=} \sigma_i\left(f(Z_i) - f(Z_i')\right).$$
Thus,
$$\mathbb{E}_S[V] \leq \mathbb{E}_{S,S',\sigma}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^n \sigma_i\left(f(Z_i) - f(Z_i')\right)\right|\right] \leq 2\,\mathbb{E}_{S,\sigma}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)\right|\right] = 2\,\mathbb{E}_S\left[R(\mathcal{F}(S))\right].$$

Some properties of Rademacher complexity

Rademacher calculus:

The following theorem is useful when dealing with the composition of a function and a set.

Theorem 5 (contraction principle): Let $A \subseteq \mathbb{R}^n$ and let $\phi: \mathbb{R} \to \mathbb{R}$ with $\phi(0) = 0$ be $M$-Lipschitz continuous. Applying $\phi$ elementwise to the set $A$ leads to a new set $\phi \circ A = \{(\phi(a_1), \ldots, \phi(a_n)) : a \in A\}$. Then the Rademacher complexity of $\phi \circ A$ is bounded as $R(\phi \circ A) \leq 2M R(A)$.

VC-dimension

Theorem 6 (finite class lemma): Suppose $A = \{a^{(1)}, \ldots, a^{(N)}\}$ is a finite subset of $\mathbb{R}^n$ with $N > 2$ and $\|a^{(i)}\|_2 \leq L$ for all $i$. Then the Rademacher complexity of $A$ is bounded as
$$R(A) \leq \frac{2L\sqrt{\log N}}{n}.$$
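
A quick numerical sanity check of the finite class lemma (the finite set below is random and purely illustrative), comparing a Monte Carlo estimate of $R(A)$ with the bound $\frac{2L\sqrt{\log N}}{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 50, 100
A = rng.normal(size=(N, n))                    # a finite set of N points in R^n
L = np.linalg.norm(A, axis=1).max()            # L = max_i ||a^(i)||_2

# Monte Carlo estimate of R(A).
sigma = rng.choice([-1.0, 1.0], size=(20000, n))
R_hat = (np.abs(sigma @ A.T) / n).max(axis=1).mean()

bound = 2 * L * np.sqrt(np.log(N)) / n
print(f"R(A) ~= {R_hat:.3f}  <=  2*L*sqrt(log N)/n = {bound:.3f}")
```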

  • finite class lemma
  • definition of VC-dimension
  • VC’s lemma
  • relationship of Rademacher complexity and VC-dimension

Others

  • surrogate loss function
  • contraction principle

Some examples

Now, we consider separators on $\mathbb{R}^d$. Let $\Gamma \subseteq \mathbb{R}^d$ and let $D$ be a data distribution over $\Gamma$. A set of i.i.d. samples $S = \{Z_1, \ldots, Z_n\}$ is drawn from $\Gamma$ according to the distribution $D$. Denoting the hypothesis class by $\mathcal{H}$, the learning algorithm maps $S$ to a hypothesis $h_S \in \mathcal{H}$. Given a hypothesis $h$, define its true risk as $L_P(h) = \mathbb{E}_{Z\sim D}\,\mathbb{I}_{h(Z)\neq 1}$, and the empirical risk on $S$ as $L_S(h) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{h(Z_i)\neq 1}$.

Then, using VC theory, we can derive a PAC bound for this learning problem. Here, we assume the realizable case, i.e. $L_P(h^*) = 0$.

$$
\begin{aligned}
L_P(h_S) - L_P(h^*) &= \left[L_P(h_S) - L_S(h_S)\right] + \left[L_S(h^*) - L_P(h^*)\right] + \underbrace{\left[L_S(h_S) - L_S(h^*)\right]}_{\leq\,0} \\
&\leq \left[L_P(h_S) - L_S(h_S)\right] + \left[L_S(h^*) - L_P(h^*)\right] \leq 2\sup_{h\in\mathcal{H}}\left|L_P(h) - L_S(h)\right| = 2V.
\end{aligned}
$$

Using the Rademacher average to bound $\mathbb{E}[V]$, we have $\mathbb{E}[V] \leq 2\,\mathbb{E}\left[R(\mathcal{H}(Z^n))\right]$, where $\mathcal{H}(Z^n) = \{(h(Z_1), \ldots, h(Z_n)) : h \in \mathcal{H}\}$ is the projection of $\mathcal{H}$ onto the sample.

Then, we use the finite class lemma to bound the Rademacher average. Let $S_n(\mathcal{H}) = \sup_{z^n} |\mathcal{H}(z^n)|$ denote the growth function. Then
$$\mathbb{E}\left[R(\mathcal{H}(Z^n))\right] \leq \mathbb{E}\left[2\sqrt{\frac{\log S_n(\mathcal{H})}{n}}\right] = 2\sqrt{\frac{\log S_n(\mathcal{H})}{n}}.$$

Next, we use the VC-dimension and the Sauer–Shelah lemma to bound $S_n(\mathcal{H})$. Writing $d = \mathrm{VC}(\mathcal{H})$, the Sauer–Shelah lemma gives
$$S_n(\mathcal{H}) \leq \sum_{i=0}^{d} \binom{n}{i} \leq \left(\frac{ne}{d}\right)^{d}.$$
Thus, we finally have
$$\mathbb{E}\left[L_P(h_S)\right] = \mathbb{E}\left[L_P(h_S) - L_P(h^*)\right] \leq 8\sqrt{\frac{\mathrm{VC}(\mathcal{H})\left(1 + \log\frac{n}{d}\right)}{n}}.$$
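
As a concrete check (using the toy class of 1-D threshold classifiers $h_t(x) = \mathbb{I}_{x > t}$, whose VC-dimension is 1), the sketch below enumerates the labelings realized on a sample by brute force and compares the count with the Sauer–Shelah bound:

```python
import numpy as np
from math import e

rng = np.random.default_rng(0)
n, d = 10, 1                      # sample size and VC-dimension of 1-D thresholds

x = np.sort(rng.uniform(0, 1, size=n))
# Enumerate the labelings realized by h_t(x) = 1{x > t} as t sweeps over R.
labelings = {tuple((x > t).astype(int)) for t in np.concatenate(([-1.0], x))}

growth = len(labelings)           # |H(x^n)|, here n + 1
sauer = (n * e / d) ** d          # Sauer-Shelah upper bound (ne/d)^d
print(f"|H(x^n)| = {growth}  <=  (ne/d)^d = {sauer:.1f}")
```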

Finally, we use McDiarmid's inequality to derive a concentration property. We have
$$\Pr\left(L_P(h_S) - \mathbb{E}\left[L_P(h_S)\right] \leq \sqrt{\frac{1}{2n}\log\frac{1}{\delta}}\right) \geq 1 - \delta,$$
which, combined with the above results, implies that with probability higher than $1 - \delta$,
$$L_P(h_S) \leq 8\sqrt{\frac{\mathrm{VC}(\mathcal{H})\left(1 + \log\frac{n}{d}\right)}{n}} + \sqrt{\frac{1}{2n}\log\frac{1}{\delta}}.$$
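
To get a feel for the rate, here is a small helper (the VC-dimension, $\delta$, and sample sizes plugged in are arbitrary) that evaluates the right-hand side of this bound:

```python
import math

def vc_bound(n, vc_dim, delta):
    """Right-hand side of the PAC bound:
    8*sqrt(VC(H)*(1 + log(n/d))/n) + sqrt(log(1/delta)/(2n))."""
    d = vc_dim
    return (8 * math.sqrt(d * (1 + math.log(n / d)) / n)
            + math.sqrt(math.log(1 / delta) / (2 * n)))

# Example: VC-dimension 10, delta = 0.05, increasing sample sizes.
for n in [10_000, 100_000, 1_000_000]:
    print(n, round(vc_bound(n, 10, 0.05), 3))
```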

With Dudley's theorem, we can construct the so-called Dudley classes, whose VC-dimension is known. Specifically, choosing $n$ linearly independent functions as a basis gives an $n$-dimensional vector space of functions, and the class of sets $\{x : f(x) \geq 0\}$ with $f$ ranging over this space has VC-dimension $n$. For example, $\{1, x, \ldots, x^{n-1}\}$ is a valid basis.
