Blog

Learning Certified Control using Contraction Metric

Abstract: In this paper, we solve the problem of finding a certified control policy that drives a robot from any given initial state and under any bounded disturbance to the desired reference trajectory, with guarantees on the convergence or bounds on the tracking error. Such a controller is crucial in safe motion planning. We leverage the advanced theory in Control Contraction Metric and design a learning framework based on neural networks to co-synthesize the contraction metric and the controller for control-affine systems. We further provide methods to validate the convergence and bounded error guarantees. We demonstrate the performance of our method using a suite of challenging robotic models, including models with learned dynamics as neural networks. We compare our approach with leading methods using sum-of-squares programming, reinforcement learning, and model predictive control. Results show that our methods indeed can handle a broader class of systems with less tracking error and faster execution speed. Code is available at https://github.com/sundw2014/C3M. The paper is available at https://arxiv.org/abs/2011.12569.

Problem definition

Formally, given a control-affine system $\dot{x} = f(x) + B(x)u$, we want to find a feedback controller $u(x, x^*, u^*)$ so that, for any reference trajectory $(x^*(t), u^*(t))$ solving the ODE, the closed-loop system perturbed by a disturbance $d$, $\dot{x} = f(x(t)) + B(x(t))\,u(x(t), x^*(t), u^*(t)) + d(t)$, satisfies that the tracking error $\|x(t) - x^*(t)\|$ is upper bounded, for all $\|d(t)\| \le \epsilon$ and all initial conditions $x(0) \in \mathcal{X}$.
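To make the setup concrete, here is a toy linear instance (not the paper's method): a double integrator $f(x) = Ax$, $B(x) = B$, tracked with a hypothetical linear controller $u(x, x^*, u^*) = u^* - K(x - x^*)$; the gain $K$ is an illustrative choice, and the disturbance is bounded as in the problem statement.

```python
import numpy as np

# Linear instance of the control-affine system: f(x) = A x, B(x) = B
# (a double integrator), with the hypothetical tracking controller
# u(x, x*, u*) = u* - K (x - x*).
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[2.0, 3.0]])      # places the closed-loop poles at -1 and -2

dt, steps = 0.01, 2000
x_ref, u_ref = np.zeros(2), np.zeros(1)   # reference: rest at the origin
x = np.array([1.0, 0.0])                  # perturbed initial condition

for t in range(steps):
    d = np.array([0.0, 0.01 * np.sin(0.01 * t)])  # bounded disturbance, |d(t)| <= 0.01
    u = u_ref - (K @ (x - x_ref))                 # feedback on the tracking error
    x = x + dt * (A @ x + B @ u + d)              # forward-Euler integration

tracking_error = np.linalg.norm(x - x_ref)        # remains bounded despite d(t)
```

Despite the persistent disturbance, the tracking error decays from its initial value and settles near a small disturbance-driven level, which is exactly the bounded-error behavior the certified controller is required to guarantee for general nonlinear systems.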

Contraction analysis

Contraction analysis can be viewed as a differential version of Lyapunov’s theory. It analyzes the incremental stability by considering the evolution of the distance between two neighboring trajectories of the system. Lyapunov’s theory considers whether the trajectories will finally converge to a point (equilibrium), while contraction theory considers whether all the trajectories converge to a common trajectory. Control contraction metric (CCM) theory extends the analysis to cases with control input, which enables tracking controller synthesis.

Consider a system $\dot{x} = f(x) + B(x)u$. Its virtual displacement $\delta_x$ between any pair of arbitrarily close neighboring trajectories evolves as $\dot{\delta}_x = A(x,u)\,\delta_x + B(x)\,\delta_u$, where $A(x,u) := \frac{\partial f}{\partial x} + \sum_{i=1}^{m} u_i \frac{\partial b_i}{\partial x}$ and $b_i$ denotes the $i$-th column of $B$.

We say $M : \mathcal{X} \to \mathbb{S}^n_{\succ 0}$ is a CCM if there exists a controller $u(x, x^*, u^*)$ such that for all $(x, x^*, u^*) \in \mathcal{X} \times \mathcal{X} \times \mathcal{U}$ and all $\delta_x \in T_x\mathcal{X}$, $\frac{d}{dt}\left(\delta_x^\top M(x)\,\delta_x\right) \le -\lambda\, \delta_x^\top M(x)\,\delta_x$, which implies that $\|\delta_x\|_M := \sqrt{\delta_x^\top M(x)\,\delta_x}$ converges to zero exponentially at rate $\lambda$. Such a closed-loop system is referred to as contracting (under the metric $M$), which implies good tracking performance.

The condition $\frac{d}{dt}\left(\delta_x^\top M(x)\,\delta_x\right) \le -\lambda\, \delta_x^\top M(x)\,\delta_x$ for all $\delta_x, x, x^*, u^*$, together with $\dot{\delta}_x = A(x,u)\,\delta_x + B(x)\,\delta_u$, is equivalent to: for all $x, x^*, u^*$, $\dot{M} + \mathrm{sym}\left(M(A + BK)\right) + 2\lambda M \preceq 0, \quad (\star)$ where $K := \frac{\partial u}{\partial x}$ and $\mathrm{sym}(A) = A + A^\top$.

Learning a controller with a contraction metric

The learning framework.

The loss function is designed as follows.

$L_{M,u}(\theta_M, \theta_u) := \mathbb{E}_{x, x^*, u^* \sim \mathrm{Unif}}\left[ L_{\mathrm{NSD}}\left(\dot{M} + \mathrm{sym}\left(M(A + BK)\right) + 2\lambda M\right)\right]$, where $L_{\mathrm{NSD}} : \mathbb{R}^{n \times n} \to \mathbb{R}_{\ge 0}$ and $L_{\mathrm{NSD}}(A) = 0$ iff the matrix $A$ is negative semi-definite.

Assuming the LHS of $(\star)$ is continuous, $L_{M,u} = 0$ implies that $(\star)$ holds for all $(x, x^*, u^*) \in \mathcal{X} \times \mathcal{X} \times \mathcal{U}$.
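One concrete choice of $L_{\mathrm{NSD}}$ is to hinge the eigenvalues of the symmetrized matrix; the sketch below is an illustrative instance of such a penalty, not necessarily the exact one used in the paper.

```python
import numpy as np

def l_nsd(A, margin=0.0):
    """Penalty that is zero iff sym(A) is negative semi-definite.

    An illustrative instance of L_NSD: sum of hinged eigenvalues of the
    symmetric part of A (the paper's exact penalty may differ).
    """
    sym = 0.5 * (A + A.T)               # only the symmetric part matters for definiteness
    eigvals = np.linalg.eigvalsh(sym)   # real eigenvalues of the symmetric part
    return float(np.sum(np.maximum(eigvals + margin, 0.0)))
```

Because the hinge is zero exactly when every eigenvalue is non-positive, driving this loss to zero on samples enforces $(\star)$ at those samples; a positive `margin` would additionally enforce strict negative definiteness.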

Experimental results

The above figures show the tracking error of the proposed method and several others. The proposed method outperforms all the others. It is worth mentioning that Quadrotor is a 9-dimensional system, and part of the dynamics of Neural Lander is represented as an NN.

The PAC learning framework

This blog is a summary of the first part of the course ECE543@UIUC. More materials can be found here.

Problem setup

In this blog, we will focus on the binary classification problem, where the input is $X \in \mathbb{R}^D$ and the label is $Y \in \{0, 1\}$. A set of i.i.d. training samples $S = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ is drawn from a joint distribution $P$ defined over $\mathbb{R}^D \times \{0, 1\}$. Sometimes, we may also denote $Z = (X, Y)$ and $Z_i = (X_i, Y_i)$. A hypothesis is a mapping $h : \mathbb{R}^D \to \{0, 1\}$. Moreover, $\mathcal{H}$ denotes the hypothesis class, which consists of all the possible hypotheses that our model (e.g. a neural network) can represent. The learning algorithm is denoted by $\mathcal{A}$, which maps a dataset $S$ to a hypothesis $h_S \in \mathcal{H}$.

Given a hypothesis $h$, we define its true risk as $L_P(h) = \mathbb{E}_{(X,Y) \sim P}\, \mathbb{I}_{h(X) \ne Y}$, and the empirical risk on $S$ as $L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{h(X_i) \ne Y_i}$.
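The empirical risk translates directly into code; the threshold classifier below is just a made-up example hypothesis on 1-D inputs.

```python
import numpy as np

def empirical_risk(h, X, Y):
    """L_S(h): fraction of training samples that the hypothesis h misclassifies."""
    return float(np.mean(h(X) != Y))

# Example: a hypothetical threshold hypothesis h(x) = 1[x > 0.5]
h = lambda X: (X[:, 0] > 0.5).astype(int)
X = np.array([[0.1], [0.4], [0.6], [0.9]])
Y = np.array([0, 1, 1, 1])
risk = empirical_risk(h, X, Y)   # h errs only on the second sample
```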

PAC-learnability

After applying the learning algorithm $\mathcal{A}$ to the dataset $S$, we get a hypothesis $h_S$ which depends on $S$ and minimizes the empirical risk $L_S(h)$. We care about the difference between the true risk of $h_S$ and that of $h^*$. Here, $h^*$ is the best hypothesis in the sense of minimizing the true risk, i.e. $h^* = \arg\min_{h \in \mathcal{H}} L_P(h)$. In other words, we want guarantees on $L_P(h_S) - L_P(h^*)$.

However, $L_P(h_S) - L_P(h^*)$ is a random variable, since $h_S$ depends on the random sample $S$. Therefore, it is not reasonable to look for a deterministic upper bound on $L_P(h_S) - L_P(h^*)$. In the PAC (Probably Approximately Correct) learning framework, we instead seek guarantees of the form $\Pr\left(L_P(h_S) - L_P(h^*) \ge \epsilon\right) \le \delta$. Here, “Probably” is characterized by $\delta$, and “Approximately” is characterized by $\epsilon$. Next, we give the definition of PAC-learnability of a hypothesis class.

Definition 1 (PAC-learnability): A hypothesis class $\mathcal{H}$ is PAC-learnable if there exists an algorithm $\mathcal{A}$ such that for every data distribution $P_{X,Y}$ and all $\epsilon, \delta \in \mathbb{R}^+$, there exists $n = n(\epsilon, \delta)$ such that, given any training set $S$ containing at least $n$ samples, $h_S = \mathcal{A}(S)$ satisfies $\Pr\left(L_P(h_S) - L_P(h^*) \ge \epsilon\right) \le \delta$.

Theorem 1 (No free lunch): This theorem shows that we need some restrictions on the hypothesis class. We cannot learn an arbitrary set of hypotheses: there is no free lunch. Please see these lecture notes 1 and 2 for details.

A trick

Using a small trick, $L_P(h_S) - L_P(h^*)$ can be bounded as follows; this eliminates the dependence on the particular hypotheses $h_S$ and $h^*$.

$L_P(h_S) - L_P(h^*) = \left[L_P(h_S) - L_S(h_S)\right] + \left[L_S(h^*) - L_P(h^*)\right] + \underbrace{\left[L_S(h_S) - L_S(h^*)\right]}_{\le 0} \le \left[L_P(h_S) - L_S(h_S)\right] + \left[L_S(h^*) - L_P(h^*)\right] \le 2\sup_{h \in \mathcal{H}}\left|L_P(h) - L_S(h)\right| = 2V$, where $\sup_{h \in \mathcal{H}}\left|L_P(h) - L_S(h)\right|$ is denoted by $V$. The third term is non-positive because $h_S$ minimizes the empirical risk.

Thus, $\Pr\left(L_P(h_S) - L_P(h^*) \ge \epsilon\right) \le \Pr\left(V \ge \frac{\epsilon}{2}\right)$.

Theorem 2: Finite hypothesis classes are PAC-learnable.

Proof: Since $|\mathcal{H}| < \infty$, we can first bound $\Pr\left(|L_S(h) - L_P(h)| \ge \frac{\epsilon}{2}\right)$ for each $h$ and then apply the union bound. For all $\epsilon > 0$ and all sets $S$ with $n$ samples, we have that $\Pr\left(L_P(h_S) - L_P(h^*) \ge \epsilon\right) \le \Pr\left(\max_{h \in \mathcal{H}} |L_S(h) - L_P(h)| \ge \frac{\epsilon}{2}\right) = \Pr\left(\bigcup_{h \in \mathcal{H}} \left\{|L_S(h) - L_P(h)| \ge \frac{\epsilon}{2}\right\}\right) \le \sum_{h \in \mathcal{H}} \Pr\left(|L_S(h) - L_P(h)| \ge \frac{\epsilon}{2}\right) = \sum_{h \in \mathcal{H}} \Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{h(X_i) \ne Y_i} - L_P(h)\right| \ge \frac{\epsilon}{2}\right) \le \sum_{h \in \mathcal{H}} 2\exp\left(-\frac{n\epsilon^2}{2}\right) = 2|\mathcal{H}|\exp\left(-\frac{n\epsilon^2}{2}\right) =: \delta$, where in the last step we use Hoeffding's inequality (the factor 2 comes from the two-sided bound). Thus, $n(\epsilon, \delta) = \frac{2}{\epsilon^2}\log\frac{2|\mathcal{H}|}{\delta}$.
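The resulting sample complexity is easy to evaluate numerically. The helper below evaluates the two-sided form $n(\epsilon,\delta) = \lceil \frac{2}{\epsilon^2}\log\frac{2|\mathcal{H}|}{\delta}\rceil$ (the one-sided version simply drops the factor 2 inside the log).

```python
import math

def finite_class_sample_complexity(num_hypotheses, eps, delta):
    """Samples sufficient so that Pr(L_P(h_S) - L_P(h*) >= eps) <= delta
    for a finite hypothesis class, via Hoeffding + the union bound."""
    return math.ceil(2.0 / eps**2 * math.log(2.0 * num_hypotheses / delta))
```

Note the logarithmic dependence on $|\mathcal{H}|$: a million-fold larger class costs only an additive $\frac{2}{\epsilon^2}\log 10^6$ extra samples.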

Theorem 2 is not very useful on its own, since almost all hypothesis classes we care about contain infinitely many hypotheses.

McDiarmid’s inequality

In order to derive PAC bounds for infinite hypothesis classes, we need some more sophisticated concentration inequalities.

Theorem 3 (McDiarmid's inequality): Let $X_1, \dots, X_n$ be independent random variables. Let $f$ be a function of $n$ variables. If for all $x_i, x_i'$, $\left|f(x_1, \dots, x_i, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)\right| \le c_i$, then for all $\epsilon > 0$, $\Pr\left(f(X_1, \dots, X_n) - \mathbb{E}\left[f(X_1, \dots, X_n)\right] \ge \epsilon\right) \le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2}\right)$. Such a function $f$ is said to have bounded differences.

Proof: TODO.

It can be shown that $V$ has bounded differences with $c_i = \frac{1}{n}$ (replacing one sample changes each empirical risk by at most $\frac{1}{n}$), and thus McDiarmid's inequality can be applied. Next, we have to derive an upper bound for $\mathbb{E}[V]$.

Rademacher complexity

Definition 2 (Rademacher complexity of a set): Let $A \subseteq \mathbb{R}^n$. The Rademacher complexity $R(A)$ of the set $A$ is defined as $R(A) = \mathbb{E}_\sigma\left[\sup_{a \in A}\left|\frac{1}{n}\sum_{i=1}^{n} \sigma_i a_i\right|\right]$, where the $\sigma_i$'s are i.i.d. random variables with $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 0.5$.

Intuitively, Rademacher complexity characterizes the diversity of a set. If for every direction vector σ=[σ1,,σn], we can always find an element aA such that the inner product of σ and a is large, then R(A) would be large.
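Definition 2 can be estimated by Monte Carlo for a finite set $A$; this is only a sanity-check sketch, with the rows of the input array playing the role of the elements $a \in A$.

```python
import numpy as np

def rademacher_complexity(A, num_draws=5000, seed=0):
    """Monte Carlo estimate of R(A) = E_sigma[ sup_{a in A} |(1/n) sum_i sigma_i a_i| ]."""
    A = np.asarray(A, dtype=float)     # shape (|A|, n): one row per element of A
    n = A.shape[1]
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))  # i.i.d. Rademacher signs
    sups = np.max(np.abs(sigma @ A.T), axis=1) / n        # sup over a, per sigma draw
    return float(sups.mean())
```

For example, the singleton $\{(0,\dots,0)\}$ has complexity exactly $0$, while for $A = \{e_1, e_2\} \subset \mathbb{R}^2$ every sign vector aligns perfectly with one of the two elements, giving $R(A) = \frac{1}{2}$.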

Theorem 4: The expectation of $V$ can be bounded as $\mathbb{E}_S[V] \le 2\,\mathbb{E}_S\left[R(\mathcal{F}(S))\right]$, where $\mathcal{F}(S) = \left\{\left(\mathbb{I}_{h(X_1) \ne Y_1}, \dots, \mathbb{I}_{h(X_n) \ne Y_n}\right) : h \in \mathcal{H}\right\}$.

Proof: To simplify the notation, denote the loss term $\mathbb{I}_{h(X_i) \ne Y_i}$ by $f(Z_i)$, and the set of all such $f$'s by $\mathcal{F}$. The training set $S = \{Z_i\}_{i=1}^{n}$ is the source of randomness. Thus, $\mathbb{E}_S[V] = \mathbb{E}_S\left[\sup_{f \in \mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n} f(Z_i) - \mathbb{E}_Z[f(Z)]\right|\right] = \mathbb{E}_S\left[\sup_{f \in \mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n} f(Z_i) - \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{Z_i'}[f(Z_i')]\right|\right]$, where we introduce a virtual validation set, or set of ghost samples, $S' := \{Z_i'\}_{i=1}^{n}$, which is independent of $S$ and identically distributed. This is known as the symmetrization trick. Thus, $\mathbb{E}_S[V] = \mathbb{E}_S\left[\sup_{f \in \mathcal{F}}\left|\mathbb{E}_{S'}\left[\frac{1}{n}\sum_{i=1}^{n}\left(f(Z_i) - f(Z_i')\right)\right]\right|\right] \overset{\text{Jensen's}}{\le} \mathbb{E}_S\left[\sup_{f \in \mathcal{F}} \mathbb{E}_{S'}\left[\left|\frac{1}{n}\sum_{i=1}^{n}\left(f(Z_i) - f(Z_i')\right)\right|\right]\right] \le \mathbb{E}_{S,S'}\left[\sup_{f \in \mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\left(f(Z_i) - f(Z_i')\right)\right|\right]$. Since for each $i$, $Z_i$ and $Z_i'$ are i.i.d., the following three random variables have the same distribution: $f(Z_i) - f(Z_i') \overset{d}{=} f(Z_i') - f(Z_i) \overset{d}{=} \sigma_i\left(f(Z_i) - f(Z_i')\right)$. Thus, $\mathbb{E}_S[V] \le \mathbb{E}_{S,S',\sigma}\left[\sup_{f \in \mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n} \sigma_i\left(f(Z_i) - f(Z_i')\right)\right|\right] \le 2\,\mathbb{E}_{S,\sigma}\left[\sup_{f \in \mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n} \sigma_i f(Z_i)\right|\right] = 2\,\mathbb{E}_S\left[R(\mathcal{F}(S))\right]$.

Some properties of Rademacher complexity

Rademacher calculus:

The following theorem is useful when dealing with the composition of a function and a set.

Theorem 5 (contraction principle): Let $A \subseteq \mathbb{R}^n$ and $\phi : \mathbb{R} \to \mathbb{R}$ with $\phi(0) = 0$ and $\phi$ $M$-Lipschitz continuous. Applying $\phi$ elementwise to the set $A$ gives a new set $\phi \circ A = \left\{\left(\phi(a_1), \dots, \phi(a_n)\right) : a \in A\right\}$. Then the Rademacher complexity of $\phi \circ A$ is bounded as $R(\phi \circ A) \le 2M R(A)$.

VC-dimension

Theorem 6 (finite class lemma): Suppose $A = \{a^{(1)}, \dots, a^{(N)}\}$ is a finite subset of $\mathbb{R}^n$ with $N > 2$ and $\|a^{(i)}\| \le L$ for all $i$. Then the Rademacher complexity of $A$ is bounded as $R(A) \le \frac{2L\sqrt{\log N}}{n}$.

  • finite class lemma
  • definition of VC-dimension
  • VC’s lemma
  • relationship of Rademacher complexity and VC-dimension

Others

  • surrogate loss function
  • contraction principle

Some examples

Now, we consider separators on $\mathbb{R}^d$. Given $\Gamma \subseteq \mathbb{R}^d$ and a data distribution $D$ over $\Gamma$, a set of i.i.d. samples $S = \{Z_1, \dots, Z_n\}$ is drawn from $\Gamma$ according to the distribution $D$. Denoting the hypothesis class by $\mathcal{H}$, the learning algorithm maps $S$ to a hypothesis $h_S \in \mathcal{H}$. Given a hypothesis $h$, define its true risk as $L_P(h) = \mathbb{E}_{Z \sim D}\, \mathbb{I}_{h(Z) \ne 1}$, and the empirical risk on $S$ as $L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{h(Z_i) \ne 1}$.

Then, using VC theory, we can derive a PAC bound for this learning problem. Here, we assume the problem is realizable, i.e. $L_P(h^*) = 0$.

$L_P(h_S) - L_P(h^*) = \left[L_P(h_S) - L_S(h_S)\right] + \left[L_S(h^*) - L_P(h^*)\right] + \underbrace{\left[L_S(h_S) - L_S(h^*)\right]}_{\le 0} \le \left[L_P(h_S) - L_S(h_S)\right] + \left[L_S(h^*) - L_P(h^*)\right] \le 2\sup_{h \in \mathcal{H}}\left|L_P(h) - L_S(h)\right| = 2V$

Using the Rademacher average to bound $\mathbb{E}[V]$: $\mathbb{E}[V] \le 2\,\mathbb{E}\left[R(\mathcal{H}(Z^n))\right]$.

Then, we use the finite class lemma to bound the Rademacher average. Let $S_n(\mathcal{H}) = \sup_{z^n} |\mathcal{H}(z^n)|$ (the growth function). Then $\mathbb{E}\left[R(\mathcal{H}(Z^n))\right] \le \mathbb{E}\left[\sqrt{\frac{2\log S_n(\mathcal{H})}{n}}\right] = \sqrt{\frac{2\log S_n(\mathcal{H})}{n}}$.

Next, we use the VC-dimension and the Sauer–Shelah lemma to bound $S_n(\mathcal{H})$. By the Sauer–Shelah lemma, $S_n(\mathcal{H}) \le \sum_{i=0}^{d}\binom{n}{i} \le \left(\frac{ne}{d}\right)^{d}$, where $d = \mathrm{VC}(\mathcal{H})$. Thus, we finally have $\mathbb{E}\left[L_P(h_S)\right] = \mathbb{E}\left[L_P(h_S) - L_P(h^*)\right] \le \sqrt{\frac{8\,\mathrm{VC}(\mathcal{H})\left(1 + \log\frac{n}{d}\right)}{n}}$.
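The two sides of the Sauer–Shelah bound are easy to check numerically; the small script below is a sanity check, not part of the proof.

```python
import math

def growth_bound(n, d):
    """Sum_{i=0}^{d} C(n, i): the Sauer-Shelah bound on the growth function S_n(H)."""
    return sum(math.comb(n, i) for i in range(d + 1))

def ne_over_d_bound(n, d):
    """The looser closed form (n e / d)^d, valid for n >= d."""
    return (n * math.e / d) ** d
```

For instance, `growth_bound(10, 3)` evaluates to $1 + 10 + 45 + 120 = 176$, comfortably below the closed form, and for $d = n$ the sum recovers all $2^n$ labelings.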

Finally, we use McDiarmid's inequality to derive a concentration property. We have $\Pr\left(L_P(h_S) \le \mathbb{E}\left[L_P(h_S)\right] + \sqrt{\frac{1}{2n}\log\frac{1}{\delta}}\right) \ge 1 - \delta$, which, combined with the above results, implies that with probability higher than $1 - \delta$, $L_P(h_S) \le \sqrt{\frac{8\,\mathrm{VC}(\mathcal{H})\left(1 + \log\frac{n}{d}\right)}{n}} + \sqrt{\frac{1}{2n}\log\frac{1}{\delta}}$.
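Putting the pieces together, the final high-probability bound can be evaluated directly, using the constants exactly as stated above:

```python
import math

def vc_generalization_bound(vc_dim, n, delta):
    """Evaluate sqrt(8 d (1 + log(n/d)) / n) + sqrt(log(1/delta) / (2 n))."""
    complexity = math.sqrt(8.0 * vc_dim * (1.0 + math.log(n / vc_dim)) / n)
    confidence = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return complexity + confidence
```

Both terms shrink as $n$ grows (the first like $\sqrt{\log n / n}$, the second like $\sqrt{1/n}$), so more data always tightens the guarantee.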

With Dudley’s theorem, we can construct so-called Dudley classes whose VC-dimension is known. Specifically, we can choose a basis of $n$ linearly independent functions to construct an $n$-dimensional vector space of functions whose VC-dimension is $n$. For example, $\{1, x, \dots, x^{n-1}\}$ is a valid basis.