Learning Certified Control using Contraction Metric

Abstract: In this paper, we solve the problem of finding a certified control policy that drives a robot from any given initial state and under any bounded disturbance to the desired reference trajectory, with guarantees on the convergence or bounds on the tracking error. Such a controller is crucial in safe motion planning. We leverage the advanced theory in Control Contraction Metric and design a learning framework based on neural networks to co-synthesize the contraction metric and the controller for control-affine systems. We further provide methods to validate the convergence and bounded error guarantees. We demonstrate the performance of our method using a suite of challenging robotic models, including models with learned dynamics as neural networks. We compare our approach with leading methods using sum-of-squares programming, reinforcement learning, and model predictive control. Results show that our methods indeed can handle a broader class of systems with less tracking error and faster execution speed. Code is available at The paper is available at

Problem definition

Formally, given a control-affine system $\dot{x} = f(x) + B(x)u$, we want to find a feedback controller $u(\cdot,\cdot,\cdot)$, so that for any reference $(x^*(t), u^*(t))$ solving the ODE, the closed-loop system perturbed by $d$ $$ \dot{x} = f(x(t)) + B(x(t))u(x(t),x^*(t),u^*(t)) + d(t)$$ satisfies that the tracking error $|x(t) – x^*(t)|$ is upper bounded, for all $|d(t)| \leq \epsilon$ and all initial condition $x(0) \in \mathcal{X}$.

Contraction analysis

Contraction analysis can be viewed as a differential version of Lyapunov’s theory. It analyzes the incremental stability by considering the evolution of the distance between two neighboring trajectories of the system. Lyapunov’s theory considers whether the trajectories will finally converge to a point (equilibrium), while contraction theory considers whether all the trajectories converge to a common trajectory. Control contraction metric (CCM) theory extends the analysis to cases with control input, which enables tracking controller synthesis.

Consider a system $\dot{x} = f(x) + B(x) u$, its virtual displacement $\delta_x$ between any pair of arbitrarily close neighboring trajectories evolves as $\dot{\delta}_x = A(x, u) \delta_x + B(x) \delta_u$, where $A(x,u) := \frac{\partial f}{\partial x} + \sum_{i=1}^{m}u^i\frac{\partial b_i}{\partial x}$.

We say $M : \mathcal{X} \mapsto \mathbb{S}_n^{\geq 0}$ is a CCM if there exists a controller $u(x,x^*,u^*)$ s.t. $\forall x, x^*, u^* \in \mathcal{X} \times \mathcal{X} \times \mathcal{U}$, $\forall \delta_x \in \mathcal{T}_{x}{\mathcal{X}}$, $$\frac{d}{dt}\left(\delta_x^\intercal M(x) \delta_x\right) \leq -\lambda \delta_x^\intercal M(x) \delta_x,$$ which implies $\|\delta_x\|_M := \sqrt{\delta_x^\intercal M(x) \delta_x}$ converges to zero exponentially at rate $\lambda$. Such a closed-loop system is referred to be contracting (under metric $M$), which implies good tracking performance.

$$\frac{d}{dt}\left(\delta_x^\intercal M(x) \delta_x\right) \leq -\lambda \delta_x^\intercal M(x) \delta_x,\, \forall \delta_x,\, \forall x,x^*,u^*,$$ $$\dot{\delta}_x = A(x, u) \delta_x + B(x) \delta_u,$$ is equivalent to $\forall x,x^*,u^*$, $$\dot{M} + \mathtt{sym}\left(M(A+BK)\right) + 2 \lambda M \preceq 0,~~~~~(*)$$ where $K := \frac{\partial u}{\partial x}$ and $\mathtt{sym}\left(A\right) = A + A^\intercal$.

Learning a controller with a controller

The learning framework.

The loss function is designed as follows.

$$\begin{multline*} \mathcal{L}_{M,u} (\theta_M, \theta_u) := \\\mathop{\mathbb{E}}_{x,x^*,u^* \sim \mathtt{Unif}}\left[L_{NSD}\left(\dot{M} + \mathtt{sym}\left(M(A+BK)\right) + 2 \lambda M\right)\right], \end{multline*}$$ where $L_{NSD} : \mathbb{R}^{n \times n} \mapsto \mathbb{R}_{\geq}$ and $L_{NSD}(A) = 0$ iff. matrix $A$ is negative semi-definite.

Obviously, by assuming the LHS of $(*)$ is continuous, $$\mathcal{L}_{M,u} = 0~~~\Longleftrightarrow~~~~~(*)\text{ holds } \forall x,x^*,u^* \in \mathcal{X} \times \mathcal{X} \times \mathcal{U}.$$

Experimental results

The above figures show the tracking error of the proposed method and several others. The proposed method outperforms all the others. It is worth mentioning that $\texttt{Quadrotor}$ is a $9$-dimensional system, and part of the dynamics of $\texttt{Neural Lander}$ is represented as an NN.

The PAC learning framework

This blog is a summary of the first part of the course ECE543@UIUC. More materials can be found here.

Problem setup

In this blog, we will focus on the binary classification problem, where the input is $X \in \mathbb{R}^D$, and the label is $Y \in \{0, 1\}$. A set of i.i.d. training samples $S = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ are drawn from a joint distribution $P$ defined over $\mathbb{R}^D \times \{0,1\}$. Sometimes, we may also denote $Z = (X, Y)$ and $Z_i = (X_i, Y_i)$. A hypothesis is a mapping $h: \mathbb{R}^D \mapsto \{0, 1\}$. Moreover, $\mathcal{H}$ denotes the hypotheses class, which consists of all the possible hypotheses that our model (e.g. a neural network) can represent. The learning algorithm is denoted by $\mathcal{A}$, which maps a dataset $S$ to a hypothesis $h_S \in \mathcal{H}$.

Given a hypothesis $h$, we define its true risk as $$L_P(h) = \mathbb{E}_{(X, Y) \sim P}~\mathbb{I}_{h(X) \neq Y},$$ and the empirical risk on $S$ as $$L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{h(X_i) \neq Y_i}.$$


After applying the learning algorithm $\mathcal{A}$ to the dataset $S$, we get a hypothesis $h_S$ which depends on $S$ and minimizes the empirical risk $L_S(h)$. We care about the difference between the true risk of $h_S$ and that of $h^*$. Here, $h^*$ is the best hypothesis in the sense of minimizing the true risk, i.e. $h^* = \text{arg} \min\limits_{h \in \mathcal{H}} L_P(h)$. In other words, we want to get some guarantees on $L_P(h_S)-L_P(h^*)$.

However, $L_P(h_S)-L_P(h^*)$ is a random variable since $h_S$ depends on $S$ which is random. Therefore, it is not reasonable to look for a uniform upper bound on $L_P(h_S)-L_P(h^*)$. In the PAC (Probably Approximately Correct) learning framework, instead of a uniform upper bound, we care about guarantees of the form $$\Pr(L_P(h_S)-L_P(h^*) \geq \epsilon) \leq \delta.$$ Here, “Probably” is characterized by $\delta$, and “Approximately” is characterized by $\epsilon$. Next, we give the definition of PAC-learnability of a hypothesis class.

Definition 1 (PAC-learnability): A hypothesis class $\mathcal{H}$ is PAC-learnable if there exists an algorithm $\mathcal{A}$ such that for all data distribution $P_{X,Y}$ and all $\epsilon,\delta \in \mathbb{R}^+$, $\exists n=n(\epsilon,\delta)$ such that given any training set $S$ which contains at least $n$ samples, $h_S=\mathcal{A}(S)$ satisfies $$\Pr(L_P(h_S)-L_P(h^*) \geq \epsilon) \leq \delta.$$

Theorem 1 (No free lunch): This theorem shows that we need some restrictions on the hypothesis class. We cannot learn an arbitrary set of hypotheses: there is no free lunch. Please see these lecture notes 1 and 2 for details.

A trick

Using a small trick, $L_P(h_S) – L_P(h^*)$ can be bounded as follows. By doing this, $h_S$ and $h^*$ are eliminated.

$$\begin{multline} L_P(h_S) – L_P(h^*) = [L_P(h_S) – L_S(h_S)] + [L_S(h^*) – L_P(h^*)] \\+ [L_S(h_S) – L_S(h^*)]_{(\text{this term}\leq 0)} \\\leq [L_P(h_S) – L_S(h_S)] + [L_S(h^*) – L_P(h^*)] \\\leq 2\sup\limits_{h \in \mathcal{H}} \left|L_P(h)-L_S(h)\right| = 2 \mathcal{V},\end{multline}$$ where $\sup\limits_{h \in \mathcal{H}} \left|L_P(h)-L_S(h)\right|$ is denoted by $\mathcal{V}$.

Thus, $$\Pr(L_P(h_S)-L_P(h^*) \geq \epsilon) \leq \Pr(\mathcal{V} \geq \frac{\epsilon}{2}).$$

Theorem 2: Finite hypothesis classes are PAC-learnable.

Proof: Since $|\mathcal{H}|<\infty$, we can first bound $\Pr\left(\left|L_S(h)-L_P(h)\right|\geq\frac{\epsilon}{2}\right)$ for each $h$ and then apply the union bound. For all $\epsilon > 0$ and all sets $S$ with $n$ samples, We have that $$\begin{align} &\Pr\left(L_P(h_S)-L_P(h^*) \geq \epsilon\right)\\ \leq &\Pr\left(\max\limits_{h \in \mathcal{H}} \left|L_S(h)-L_P(h)\right| \geq \frac{\epsilon}{2}\right)\\= &\Pr\left(\cup_{h \in \mathcal{H}} \left\{\left|L_S(h)-L_P(h) \right| \geq \frac{\epsilon}{2}\right\}\right)\\\leq &\sum_{h\in\mathcal{H}}\Pr\left(\left|L_S(h) – L_P(h) \right|\geq\frac{\epsilon}{2}\right) \\= &\sum_{h\in\mathcal{H}}\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}_{h(x_i)\neq y_i} – L_P(h) \right|\geq\frac{\epsilon}{2}\right) \\\leq &\sum_{h\in\mathcal{H}} \exp\left(-\frac{2n\epsilon^2}{4}\right) = |\mathcal{H}|\exp\left(-\frac{n\epsilon^2}{2}\right) = \delta, \end{align}$$ where in the last step we use the Hoeffding’s inequality. Thus, $n(\epsilon, \delta) = \frac{2}{\epsilon^2}\log(\frac{|\mathcal{H}|}{\delta})$.

Theorem 1 is not very useful since almost all hypothesis classes we care about contain infinite hypotheses.

McDiarmid’s inequality

In order to derive PAC bounds for infinite hypothesis classes, we need some more sophisticated concentration inequalities.

Theorem 3 (McDiarmid’s inequality): Let $X_1, \dots, X_n$ be i.i.d. random variables. Let $f$ be a function of $n$ variables. If $\forall x_i, x_i’$, $$|f(x_1,\dots,x_i,\dots,x_n) – f(x_1,\dots,x_i’,\dots,x_n)| \leq c_i,$$ then $\forall \epsilon>0$, $$\Pr(f(X_1,\dots,X_n)-\mathbb{E}\left[f(X_1,\dots,X_n)\right] \geq \epsilon) \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{n}c_i}\right).$$ Such a function $f$ is said to have bounded differences.

Proof: TODO.

It can be shown that $\mathcal{V}$ has bounded differences, and thus McDiarmid’s inequality can be applied. Next, we have to derive an upper bound for $\mathbb{E}\left[\mathcal{V}\right]$.

Rademacher complexity

Definition 2 (Rademacher complexity of a set): Let $A \subseteq \mathbb{R}^n$, the Rademacher complexity $R(A)$ of set $A$ is defined as follows. $$R(A) = \mathbb{E}_{\sigma}\left[\sup\limits_{a \in A} \left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i a_i\right|\right],$$ where $\sigma_i$’s are i.i.d. random variables and $\forall i$, $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 0.5$.

Intuitively, Rademacher complexity characterizes the diversity of a set. If for every direction vector $\sigma = [\sigma_1, \cdots, \sigma_n]$, we can always find an element $a \in A$ such that the inner product of $\sigma$ and $a$ is large, then $R(A)$ would be large.

Theorem 4: The expectation of $\mathcal{V}$ can be bounded as follows. $$\mathbb{E}_S\left[\mathcal{V}\right] \leq 2\mathbb{E}_S \left[R(\mathcal{F}(S))\right],$$ where $\mathcal{F}(S) = \left\{\left(\mathbb{I}_{h(x_1) \neq y_1}, \cdots, \mathbb{I}_{h(x_n) \neq y_n}\right), h \in \mathcal{H}\right\}.$

Proof: To simplify the notation, the loss term $\mathbb{I}_{h(x_i)\neq y_i}$ is denoted by $f(z_i)$. Moreover, the set of all $f$’s is denoted by $\mathcal{F}$. The training set $S = \{Z_i\}_{i=1}^{n}$ is the source of randomness. Thus, $$\begin{align} &\mathbb{E}_S\left[\mathcal{V}\right]\\ = &\mathbb{E}_S\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}f(Z_i) – \mathbb{E}_Z\left[f(Z)\right]\right|\right]\\= &\mathbb{E}_S\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}f(Z_i) – \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{Z_i’}\left[f(Z_i’)\right]\right|\right],\end{align}$$ where we introduce a virtual validation set or a set of ghost samples $S’ := \{Z_i’\}_{i=1}^{n}$, which is independent of $S$. This is known as the symmertrization trick. Thus, $$\begin{align}&\mathbb{E}_S\left[\mathcal{V}\right]\\= &\mathbb{E}_S\left[\sup\limits_{f\in\mathcal{F}}\left|\mathbb{E}_{S’}\left[\frac{1}{n}\sum_{i=1}^{n}\left(f(Z_i) – f(Z_i’)\right)\right]\right|\right] \\ \underset{\text{Jensen’s}}{\leq} &\mathbb{E}_S\left[\sup\limits_{f\in\mathcal{F}}\mathbb{E}_{S’}\left[\left|\frac{1}{n}\sum_{i=1}^{n}\left(f(Z_i) – f(Z_i’)\right)\right|\right]\right]\\ \leq &\mathbb{E}_{S,S’}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\left(f(Z_i) – f(Z_i’)\right)\right|\right].\end{align}$$ Since $\forall i$, $Z_i$ and $Z_i’$ are i.i.d., we have that the following three random variables have the same distribution, $$f(Z_i)-f(Z_i’)\stackrel{\mbox{d}}{=}f(Z_i’)-f(Z_i)\stackrel{\mbox{d}}{=}\sigma_i(f(Z_i)-f(Z_i’)).$$ Thus, $$\begin{align}&\mathbb{E}_S\left[\mathcal{V}\right]\\\leq&\mathbb{E}_{S,S’,\sigma}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\left(f(Z_i) – f(Z_i’)\right)\right|\right]\\\leq&2\mathbb{E}_{S,\sigma}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_if(Z_i)\right|\right]= 2\mathbb{E}_S\left[R(\mathcal{F}(S)\right].\end{align}$$

Some properties of Rademacher complexity

Rademacher calculus:

The following theorem is useful when dealing with the composition of a function and a set.

Theorem 5 (contraction principle): Let $A\subseteq \mathbb{R}^n$, $\phi: \mathbb{R}\mapsto\mathbb{R}$, $\phi(0) = 0$, and $\phi(\cdot)$ be $M$-Lipschitz continuous. Applying $\phi$ to the set $A$ leads to a new set $\phi \circ A = \left\{\left(\phi(a_1), \cdots, \phi(a_n)\right), a \in A\right\}$. Then, the Rademacher complexity of $\phi \circ A$ is bounded as follows. $$R\left(\phi \circ A\right) \leq 2MR(A).$$


Theorem 6 (finite class lemma): Suppose $A = \{a^{(1)},\dots,a^{(N)}\}$ is a finite subset of $\mathbb{R}^n$ for $N>2$ and $||a^{(i)}|| \leq L,\,\forall i$, then the Radmacher complexity of $A$ is bounded as follows, $$R_n(A) \leq \frac{2L\sqrt{\log N}}{n}.$$

  • finite class lemma
  • definition of VC-dimension
  • VC’s lemma
  • relationship of Rademacher complexity and VC-dimension


  • surrogate loss function
  • contraction principle

Some examples

Now, we consider the separators on $\mathbb{R}^d$. Given $\Gamma \subseteq \mathbb{R}^d$ and a data distribution $\mathcal{D}$ over $\Gamma$. A set of i.i.d. samples $S=\{Z_1,\dots,Z_n\}$ is sampled from $\Gamma$ conforming the distribution $\mathcal{D}$. Denoting $\mathcal{H}$ as the hypothesis class, the learning alorithm maps $S$ to a hypothesis $h_S \in \mathcal{H}$. Given a hypothesis $h$, define its true risk as $$L_P(h) = \mathbb{E}_{Z \sim \mathcal{D}}~\mathbb{I}_{h(Z) \neq 1},$$ and the empirical risk on $S$ as $$L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{h(Z_i) \neq 1}.$$

Then using the VC-theory, we can derive a PAC bound for this learning problem. Here, we assume this case is realizable, i.e. $L_P(h^*) = 0$.

$\begin{multline} L_P(h_S) – L_P(h^*) = [L_P(h_S) – L_S(h_S)] + [L_S(h^*) – L_P(h^*)] \\+ [L_S(h_S) – L_S(h^*)]_{\leq 0} \\\leq [L_P(h_S) – L_S(h_S)] + [L_S(h^*) – L_P(h^*)] \\\leq 2\sup\limits_{h \in \mathcal{H}} \left|L_P(h)-L_S(h)\right| = 2 \mathcal{V}\end{multline}$

Using the Radmacher average to bound $\mathbb{E}[\mathcal{V}]$, $$\mathbb{E}[\mathcal{V}] \leq 2\mathbb{E}[R(\mathcal{H}(Z^n))].$$

Then, we use the finite class lemma to bound the Radmacher average. Let $S_n(\mathcal{H}) = \sup\limits_{z^n}\left|\mathcal{H}(z^n)\right|$, then $$\mathbb{E}[R(\mathcal{H}(Z^n))] \leq \mathbb{E}\left[2\sqrt{\frac{\log S_n(\mathcal{H})}{n}}\right] = 2\sqrt{\frac{\log S_n(\mathcal{H})}{n}}.$$

Next, we use the VC-dimension and the Sauer–Shelah lemma to bound $S_n(\mathcal{H})$. The Sauer-Shelah lemma, $$S_n(\mathcal{H}) \leq \sum_{i=1}^{VC(\mathcal{H})} {n \choose i} \leq \left(\frac{ne}{d}\right)^{VC(\mathcal{H})}.$$ Thus, we finally have $$\mathbb{E} \left[L_P(h_S)\right] = \mathbb{E} \left[L_P(h_S) – L_P(h^*)\right] \leq 8\sqrt{\frac{VC(\mathcal{H}) (1 + \log \frac{n}{d})}{n}}.$$

Finally we use the McDiarmid’s inequality to derive a concentration property. We have $$\Pr\left(L_P(h_S) – \mathbb{E} \left[L_P(h_S)\right] \leq \frac{1}{2n}\sqrt{\log \frac{1}{\delta}}\right) \geq 1-\delta$$ which with the above results implies that with probability higher than $1-\delta$, $$L_P(h_S) \leq 8\sqrt{\frac{VC(\mathcal{H}) (1 + \log \frac{n}{d})}{n}} + \frac{1}{2n}\sqrt{\log \frac{1}{\delta}}.$$

With Dudley’s theorem, we can construct the so-called Dudley classes whose VC-dimension is known. Specifically, we can choose a basis of $n$ linearly independent functions to construct a $n$-dimensional Hilbert space of which the VC-dimension is $n$. For example, $\{1,x,\dots,x^n\}$ is a valid basis.