This is an introduction to Linear Quadratic Regulator (LQR) for people with an RL background.
I view LQR as an important class of control problems that can be solved with RL algorithms.

There are a lot of excellent tutorials online about LQR, but they mostly follow the control theory textbooks. I have not found one geared toward RL researchers/practitioners. So here you go: in this post I will try to explain LQR using RL terminology. It is intended to cover the concepts that can be related to the RL side of things, rather than a comprehensive coverage of every detail of LQR.

Motivation

Many real-world control tasks can be modeled as LQR, which has been extensively studied by the control theory community. In the RL community, LQR seems less popular, although it has been argued by Ben Recht that attempting to solve this (easy?) problem using RL would help us better understand the capabilities and limitations of RL algorithms [2]. I will refer the readers to his blog instead of repeating why that's the case.

What I want to point out here is that the RL community may have been throwing RL algorithms at LQR problems without necessarily knowing it. If we take a close look at the simulated control tasks commonly used in RL research today, many of them can be approximated as LQR problems. For instance, some control environments in OpenAI Gym like Pendulum, Reacher and FetchReach are approximately LQR. The approximation is twofold: first, the system dynamics are often nonlinear but can be approximated by a linear model; second, the original reward function does not exactly follow the LQR formulation but is very close.

I’d like to think of LQR as

a counterpart of the tabular MDP
in continuous time, with continuous state and action spaces.

What is LQR?

Let us first get some terminology straight. LQR stands for Linear Quadratic Regulator.

  • Linear refers to the linear system dynamics.
  • Quadratic refers to the quadratic cost.
  • Regulator (I guess) historically refers to regulating some signal of a system.

In total, there are four sub-types of LQR that we consider:

  • Discrete-time Deterministic LQR
  • Discrete-time Stochastic LQR
  • Continuous-time Deterministic LQR
  • Continuous-time Stochastic LQR

The key difference between the deterministic versions and the stochastic versions is that the former is noiseless whereas the latter considers noise in the dynamics. The stochastic version is what we will truly encounter in the real world. Nonetheless, we will first go over the deterministic setup since it is easier to understand and analyze.

Discrete-time Deterministic LQR

Discrete-time Deterministic LQR can be expressed as:

\begin{equation} x_{k+1} = A x_k + B u_k, \qquad c_k = x_k^\top S x_k + u_k^\top R u_k \end{equation}

where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ denote the state and action at step $k$ respectively (as opposed to $s$ and $a$ in RL). $A$ and $B$ are matrices of dimensions $n \times n$ and $n \times m$. $c_k$ denotes the cost at step $k$. $S$ and $R$ are matrices of dimensions $n \times n$ and $m \times m$, where $S$ is positive semi-definite and $R$ is positive definite: $S \succeq 0$ and $R \succ 0$. The literature typically assumes that $A$ and $B$ are unknown and $S$ and $R$ are known.

Instead of choosing actions to maximize the sum of rewards as in RL, in LQR the objective is to find a sequence of controls that minimizes the sum of costs.

There are finite horizon and infinite horizon settings, corresponding to episodic and continuing tasks in RL. And just like in RL, we can formulate the objective as either discounted or average.

Discounted Cost:

\begin{equation} J = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k c_k\right] \end{equation}

where the discount factor $\gamma \in (0, 1)$.

Average Cost:

\begin{equation} J = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\left[\sum_{k=0}^{N-1} c_k\right] \end{equation}

In both objectives, the expectation is with respect to the noise, so it is not necessary in the deterministic case, but we leave it there for completeness.

The average cost objective is studied a lot more than the discounted cost objective in the control theory literature, which is the opposite of the RL literature.

Although the average cost setting seems to make sense especially for finite sample analysis, we only cover the discounted setting in the rest of this post to align with the RL algorithms.

Some properties of LQR

What’s nice about LQR is that its structure allows for some neat properties:

* the state and action value functions are quadratic,
* the optimal policy (called the optimal control in LQR) is linear in the state,
* if we know the parameters of the problem, the optimal policy
  and the value functions for any given linear policy can be computed
  using standard routines in Python, Julia, MATLAB, etc.

Optimal Policy

The optimal control is given by

\begin{equation} u_* = K_* x = -\gamma (R + \gamma B^\top P_* B)^{-1} B^\top P_* A x \label{eq:ustar} \end{equation}

where $x$ is any given state, the subscript $*$ denotes the optimal solution, $K_*$ denotes the parameter matrix of the linear function of the state, and $P_*$ is the solution to the Algebraic Riccati Equation (ARE):

\begin{equation} P_* = S + \gamma A^\top P_* A - \gamma^2 A^\top P_* B (R + \gamma B^\top P_* B)^{-1} B^\top P_* A \end{equation}

$P_*$ is an $n \times n$ symmetric matrix. Note that with the discount factor, the equation above is not the typical Riccati equation you might find on Wikipedia, but it is easy to show that it is equivalent to a standard ARE with discounted $A$ and $B$ matrices ($\sqrt{\gamma}A$ and $\sqrt{\gamma}B$).

Let’s take a moment to understand what it implies.

  • In RL, to find the optimal policy when the dynamics are known, we would need to run some iterative algorithm like value or policy iteration. But in LQR, the optimal policy can be computed directly (see the sketch after this list).
  • The matrix $P_*$ is the steady state of a sequence $P_k$ that evolves backward in time. In other words, we can only hope to attain $P_*$ if the current time is sufficiently far away from the terminal time. In a finite horizon we typically do not attain $P_*$, so the optimal control in Eq. \eqref{eq:ustar} is in fact sub-optimal.
  • If you wonder what to do when we don’t know the true parameters: we use function approximation! Before covering that later in this post, I want to lay out the foundation.
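To make the first point concrete, here is a minimal sketch (my own illustration, not code from this post) of computing $K_*$ with SciPy. It relies on the equivalence between the discounted ARE and a standard ARE with $\sqrt{\gamma}A$ and $\sqrt{\gamma}B$; the example matrices are arbitrary.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_optimal_gain(A, B, S, R, gamma=0.99):
    """Return (K_*, P_*) for the discounted discrete-time LQR."""
    A_d, B_d = np.sqrt(gamma) * A, np.sqrt(gamma) * B   # discounted matrices
    P = solve_discrete_are(A_d, B_d, S, R)              # standard ARE solver
    # K_* = -gamma (R + gamma B^T P B)^{-1} B^T P A  (the optimal gain from above)
    K = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    return K, P

# Example on an arbitrary (made-up) system
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
S, R = np.eye(2), 0.1 * np.eye(1)
K_star, P_star = lqr_optimal_gain(A, B, S, R)
```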

State Value Function

The value function can be expressed as

\begin{equation} V_\pi(x_k) = x_k^\top P_\pi x_k \end{equation}

where $P_\pi$ is a cost matrix associated with the policy $\pi$ that maps the state space $\mathbb{R}^n$ to the action space $\mathbb{R}^m$. Since the optimal policy is contained in the family of linear policies, we can assume that the policy is always linear: $u = Kx$, where $K \in \mathbb{R}^{m \times n}$.

Similar to the cost matrix for the optimal policy discussed above, the cost matrix $P_\pi$ for an arbitrary policy also evolves backward in time and has a steady state. But instead of the ARE, we solve a different equation:

\begin{equation} -P_k + S + K^\top R K + \gamma [(A+BK)^\top P_{k+1}(A+BK)] = 0 \end{equation}

Let $M = S + K^\top R K$ and $L = \sqrt{\gamma}\,(A+BK)$. We can rewrite the equation above as \begin{equation} -P_k + M + L^\top P_{k+1} L = 0 \end{equation}

This equation, in terms of its steady-state solution $P_\pi$ (i.e., $P_k = P_{k+1} = P_\pi$), is called the Lyapunov Equation.

Note that, if the horizon is finite and not long enough to attain the steady state, we can solve for $P_k$ recursively (backward in time), where the base case is given by the cost matrix at the terminal step, where we assume no control cost: $P_N = S$.
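Continuing the sketch above (same caveats: my own illustration, not the author's code), the steady-state $P_\pi$ for a fixed linear policy $u = Kx$ can be obtained from SciPy's discrete Lyapunov solver:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def evaluate_policy(A, B, S, R, K, gamma=0.99):
    """Return P_pi so that V_pi(x) = x^T P_pi x, by solving L^T P L - P + M = 0."""
    M = S + K.T @ R @ K                   # M = S + K^T R K
    L = np.sqrt(gamma) * (A + B @ K)      # L = sqrt(gamma) (A + B K)
    # solve_discrete_lyapunov(a, q) solves a X a^T - X + q = 0, so pass a = L^T
    return solve_discrete_lyapunov(L.T, M)
```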

Action Value Function

It can be shown that the action value function is quadratic in the state and action [4]:

\begin{equation} Q_\pi(x, u) = \begin{bmatrix} x \\ u \end{bmatrix}^\top H \begin{bmatrix} x \\ u \end{bmatrix} = z^\top H z \label{eq:Q} \end{equation}

where $z$ is the column vector obtained by concatenating the vectors $x$ and $u$, and $H$ is a positive definite matrix of size $(n+m) \times (n+m)$. Writing $H$ in blocks as $H = \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix}$ and combining with Eq. \eqref{eq:ustar}, the greedy action at state $x$ with respect to the current policy is:

\begin{equation} u_g = -H_{22}^{-1} H_{21} x \label{eq:greedy} \end{equation}

What this implies is that, if we know the matrix $H$, we can perform policy improvement by taking the greedy action $u_g$. This is one of the nicest features of LQR from an RL perspective. It is well known that problems involving continuous actions are difficult for RL algorithms. But in the case of LQR, we can compute the greedy policy in closed form given an estimate of the matrix $H$ after policy evaluation, just like in the discrete-action case!

Model-based vs Model-free, and Function Approximation

In most cases, we don’t know the true parameters of the system. Model-based algorithms estimate the dynamics model and compute the optimal control based on the estimated parameters $\hat{A}, \hat{B}$. Model-free algorithms do not explicitly estimate the model. Instead, we approximate the optimal action value function by estimating the matrix $H$ in Eq. \eqref{eq:Q} and compute the optimal control by Eq. \eqref{eq:greedy}.

What functions should we choose for approximating the action value function? We could throw a neural network at it, but that might be overkill. Since we know that the action value function is quadratic, we can build quadratic basis functions from the state-action pair and approximate the action value as a linear combination of them:

\begin{equation} \hat{Q}(x, u) = w^\top \phi(x, u) \end{equation}

where $w$ is the weight vector and $\phi(x, u)$ is the quadratic basis feature vector built from $z = (x^\top, u^\top)^\top$:

\begin{equation} \phi(x, u) = (z_1^2,\; z_1 z_2,\; \ldots,\; z_1 z_{n+m},\; z_2^2,\; z_2 z_3,\; \ldots,\; z_{n+m}^2)^\top \end{equation}

There are $(n+m)^2$ entries in the matrix $H$, but we do not need to estimate that many parameters because $H$ is symmetric.

The dimensionality of $w$ and $\phi(x, u)$ is $(n+m+1)(n+m)/2$. $w$ can be viewed as a flattened version of the upper triangle of the matrix $H$ (with the off-diagonal entries doubled).
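As an illustration (my own, with arbitrarily chosen names), the feature vector $\phi(x, u)$ can be built in a few lines:

```python
import numpy as np

def quad_features(x, u):
    """Quadratic monomials z_i z_j (i <= j) of z = [x; u]; length (n+m+1)(n+m)/2."""
    z = np.concatenate([x, u])
    return np.outer(z, z)[np.triu_indices(len(z))]
```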

Stability, Controllability

Two quite useful concepts from control theory are stability and controllability [3].

Stability of a policy tells us whether following this policy would lead to divergence of the state. It is determined by $A$, $B$ and $K$. For a policy to be stable, the eigenvalues of the matrix $A + BK$ (in general complex numbers) need to be strictly within the unit circle.

Controllability of the system tells us whether, for any initial state $x_0$, there exists some control sequence that drives the state to zero. It is determined by $A$ and $B$. For a system to be controllable, the controllability matrix $[B,\ AB,\ A^2B,\ \ldots,\ A^{n-1}B]$ needs to be full rank.
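Both conditions are easy to check numerically. A minimal sketch for the discrete-time case (my own illustration, not from this post):

```python
import numpy as np

def is_stable(A, B, K):
    """Stable iff all eigenvalues of A + B K lie strictly inside the unit circle."""
    return bool(np.all(np.abs(np.linalg.eigvals(A + B @ K)) < 1.0))

def is_controllable(A, B):
    """Controllable iff [B, AB, ..., A^{n-1}B] has rank n."""
    n = A.shape[0]
    ctrb = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(n)])
    return np.linalg.matrix_rank(ctrb) == n
```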

Discrete-time Stochastic LQR

When we consider noise during the transition of the system, the problem can be formulated as

\begin{equation} x_{k+1} = A x_k + B u_k + w_k \end{equation}

where $w_k$ is the noise term, often modeled as an i.i.d. multivariate Gaussian $w_k \sim \mathcal{N}(0, \Sigma)$. The per-step cost is defined the same way as in the deterministic case.

What’s nice is that despite the noise, the optimal policy and the $P$ matrices remain the same as in the deterministic case. The value and action value functions differ from their deterministic counterparts by only a constant (time-dependent, but not state- or action-dependent). This constant is caused by the accumulation of additional cost from the noise, and it reduces over time to 0 at the terminal step.

Continuous-time Deterministic LQR

Shifting from discrete time to continuous time requires formulating the following differently:

- time index going from algorithmic time k to actual time t
- integration replaces summation
- discount changes with the actual time t
- we need to discretize the system

The continuous-time deterministic LQR can be described by

\begin{equation} \dot{x}_t = A x_t + B u_t, \qquad c_t = x_t^\top S x_t + u_t^\top R u_t \label{eq:lqr-cts} \end{equation}

where $t$ is the actual time, the vector $x_t \in \mathbb{R}^n$ is the state of the system and the vector $u_t \in \mathbb{R}^m$ is the control signal. As in the discrete-time system, $A$, $B$ are fixed matrices of dimension $n \times n$, $n \times m$, respectively.
A learner continuously updates the control signal after observing the current state of the system. The cost $c_t$ is a quadratic function of the state and control, where $S$ and $R$ are fixed, symmetric positive semi-definite and positive definite matrices respectively, of dimension $n \times n$ and $m \times m$.

Discounted Cost:

\begin{equation} J = \int_{0}^{\infty} \gamma^{t}\, c_t \, dt \end{equation}

Average Cost:

\begin{equation} J = \lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} c_t \, dt \end{equation}

Discretization of the System with Time Duration $h$:

To compute the cost, we need to discretize the system with a time duration $h$. In practice, $h$ is a parameter of the algorithm that needs to be chosen carefully. It is often neglected in the RL literature but has been shown to affect the performance of the algorithm [6,7].
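For reference, one common way to discretize a linear system with step $h$ is the zero-order-hold construction via the matrix exponential. This is a sketch of that standard recipe (not something prescribed by this post), with $A$, $B$ as defined above:

```python
import numpy as np
from scipy.linalg import expm

def discretize(A, B, h):
    """Return (A_d, B_d) with x_{k+1} = A_d x_k + B_d u_k under zero-order hold."""
    n, m = A.shape[0], B.shape[1]
    M = np.zeros((n + m, n + m))
    M[:n, :n], M[:n, n:] = A, B
    Md = expm(M * h)                  # exponentiate the block matrix [[A, B], [0, 0]]
    return Md[:n, :n], Md[:n, n:]
```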

Optimal Policy

The optimal control is again, linear in the state:

\begin{equation} u_* = K_* x = -\gamma^h R^{-1} B^\top P_* x \label{eq:ustar-cts} \end{equation}

State Value Function

Same as in the discrete-time case, the value function is quadratic in the state:

\begin{equation} V_\pi(x_t) = x_t^\top P_\pi x_t \end{equation}

To solve for the cost matrix $P_*$ for the optimal policy, we use the following equation

\begin{equation} 0 = S + \gamma^h (PA+A^\top P) - \gamma^{2h}PBR^{-1}B^\top P - \frac{1-\gamma^h}{h}P \label{eq:care} \end{equation}

which can be rearranged as:

\begin{equation} 0 = S + (P\tilde{A}+\tilde{A}^\top P) - P\tilde{B}R^{-1}\tilde{B}^\top P \end{equation} where $\tilde{A} = \gamma^h A - \frac{1-\gamma^h}{2h} I$ and $\tilde{B} = \gamma^h B$.

This equation is called the Continuous-time Algebraic Riccati Equation (CARE).

Similar to the discrete-time case, we use a different equation to solve for the cost matrix $P_\pi$ for an arbitrary policy $K_\pi$:

\begin{equation} 0 = S + K_\pi^\top R K_\pi - \frac{1-\gamma^h}{h}P + (\gamma^h (A+BK_\pi))^\top P + P (\gamma^h (A+BK_\pi)) \label{eq:cts-lyap} \end{equation}

which can be rearranged as:

\begin{equation} 0 = M + D^\top P + P D \end{equation} where $M = S + K_\pi^\top R K_\pi$ and $D = \gamma^h (A + BK_\pi) - \frac{1-\gamma^h}{2h} I$.

This is the continuous-time Lyapunov Equation.

$P_\pi$ is the solution to this Lyapunov Equation. Again, both $P_*$ and $P_\pi$ are steady-state solutions, and we need to be aware of the situation where the steady state is not attainable in a finite horizon. The above equations, Eq. \eqref{eq:care} and \eqref{eq:cts-lyap}, can be modified a bit to include the time derivative $\dot{P}$, which can be used to solve for $P$ recursively backward in time, similar to the discrete-time case.
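Assuming the substitutions above, both equations map directly onto SciPy solvers; a minimal sketch (my own illustration, not the author's code):

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

def cts_optimal_cost_matrix(A, B, S, R, gamma=0.99, h=0.01):
    """P_* from the rearranged CARE with A_tilde, B_tilde as defined above."""
    n = A.shape[0]
    A_t = gamma**h * A - (1 - gamma**h) / (2 * h) * np.eye(n)
    B_t = gamma**h * B
    return solve_continuous_are(A_t, B_t, S, R)

def cts_policy_cost_matrix(A, B, S, R, K, gamma=0.99, h=0.01):
    """P_pi from the continuous-time Lyapunov equation M + D^T P + P D = 0."""
    n = A.shape[0]
    M = S + K.T @ R @ K
    D = gamma**h * (A + B @ K) - (1 - gamma**h) / (2 * h) * np.eye(n)
    # solve_continuous_lyapunov(a, q) solves a X + X a^T = q, so pass a = D^T, q = -M
    return solve_continuous_lyapunov(D.T, -M)
```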

Action Value Function

The action value function is again quadratic and can be written as

\begin{equation} Q_\pi(x_t, u_t) = c(x_t, u_t) + \gamma^h V_\pi(x_{t+h}) \end{equation}

where $c(x_t, u_t)$ is the accumulated cost in $[t, t+h)$, approximated as $c(x_t, u_t) \approx (x_t^\top S x_t + u_t^\top R u_t)\, h$.

$x_{t+h}$ can be computed by applying a Taylor approximation based on the system dynamics.

Greedy Policy Doesn’t Change in Function Approximation

What’s remarkable is that the greedy action with respect to the current policy is the same as in Eq. \eqref{eq:greedy}: $u_g = -H_{22}^{-1} H_{21} x$. In fact, it is the same across all four subtypes of LQR. What this implies is that, if we run function approximation with model-free algorithms, the parameterization of the action value function can be the same. Once we obtain a good estimate of the weight vector, we can compute the greedy action in a unified way.

Stability and Controllability

For stability, we need to make sure the real parts of the eigenvalues of $A + BK$ are all negative.

For controllability, it is the same as in the discrete-time case.

Continuous-time Stochastic LQR

In the stochastic case, we consider process noise and have the following dynamics. A stochastic continuous-time LQR is similar to the deterministic case in Eq. \eqref{eq:lqr-cts} except that it has a disturbance in its state dynamics [1]:

\begin{equation} dx_t = (A x_t + B u_t)\, dt + \sigma\, dW_t \end{equation}

where the vector $W_t$ is the disturbance of the system and $\sigma$ denotes a constant that scales the variance of the disturbance. (A small simulation sketch follows the list of Wiener-process properties below.)

$W_t$ is an n-dimensional Wiener process (or, standard Brownian motion) that has the following properties:

  • $W_0 = 0$.
  • $W_t$ has stationary, independent increments:

    for $s < t$, the distribution of $W_t - W_s$ depends only on $t - s$ and is independent of past events.

  • $W_t - W_s \sim \mathcal{N}(0, (t-s)I)$ for any $0 \le s < t$. We have $W_t \sim \mathcal{N}(0, tI)$.
  • $W_t$ is continuous in $t$ almost surely.
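To simulate such a system in practice, one can use simple Euler-Maruyama steps; a minimal sketch (my own, with made-up parameter names):

```python
import numpy as np

def euler_maruyama_step(x, u, A, B, sigma, h, rng):
    """One step of dx = (A x + B u) dt + sigma dW; the increment dW ~ N(0, h I)."""
    dW = np.sqrt(h) * rng.normal(size=x.shape)
    return x + (A @ x + B @ u) * h + sigma * dW
```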

Differences from the deterministic case?

It’s pretty much the same story as in the discrete-time case. The optimal control $u_*$ and the matrices $P_*$, $P_\pi$ remain the same. The value and action value functions have an additional time-dependent constant term (state/action independent). This term gradually reduces in time to zero at the terminal state.

Run RL Algorithm on LQR

This section covers how we can run value-based RL algorithms on LQR.

It was mentioned earlier that once we have an approximate action value function, we know how to greedify the policy. Hence we just need some algorithm that does policy evaluation well. Standard on-policy and off-policy algorithms like Sarsa and Q-learning can be used here (a minimal sketch follows the list below).

The problem is that LQR does not put constraints on the control, so getting the algorithm to work may be more difficult than it appears. The greedy-action equation involves the inverse of an estimated matrix, which is risky.

A few details that are worth attention:

  • The initial policy needs to be stable.
  • The system needs to be controllable.
  • For model-free algorithms, we only have a guarantee of convergence to the optimal policy in the discrete-time deterministic LQR, for policy iteration.
  • The behaviour policy also needs to be exploratory, as in tabular MDPs. This is called persistent excitation in control theory.
  • The correlation in the data generated by RL makes it difficult to solve for the weight vector, which poses more risk when greedifying the policy.
  • Instead of using the greedy policy plus exploration noise as the behaviour policy, it may be better to use a fixed stable policy.
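To make this concrete, here is a minimal sketch of model-free policy iteration on a discrete-time deterministic LQR, in the spirit of Bradtke [4]. It is my own illustration (the constants, sample sizes and noise scale are arbitrary), not the unreleased code mentioned below: it evaluates $Q_\pi$ by least squares on the quadratic features and then greedifies via $u = -H_{22}^{-1} H_{21} x$.

```python
import numpy as np

def features(z):
    """Quadratic monomials z_i z_j for i <= j."""
    return np.outer(z, z)[np.triu_indices(len(z))]

def unpack_H(w, d):
    """Recover symmetric H from the weights (off-diagonals were counted twice)."""
    H = np.zeros((d, d))
    H[np.triu_indices(d)] = w
    return (H + H.T) / 2

def policy_iteration(A, B, S, R, K, gamma=0.9, iters=10, samples=500, noise=0.1, seed=0):
    """Least-squares Q evaluation + greedy improvement; K should start out stable."""
    rng = np.random.default_rng(seed)
    n, m = B.shape
    d = n + m
    for _ in range(iters):
        Phi, c = [], []
        for _ in range(samples):
            x = rng.normal(size=n)
            u = K @ x + noise * rng.normal(size=m)       # persistent excitation
            x_next = A @ x + B @ u
            u_next = K @ x_next                          # on-policy next action
            Phi.append(features(np.concatenate([x, u]))
                       - gamma * features(np.concatenate([x_next, u_next])))
            c.append(x @ S @ x + u @ R @ u)
        w, *_ = np.linalg.lstsq(np.array(Phi), np.array(c), rcond=None)
        H = unpack_H(w, d)
        K = -np.linalg.solve(H[n:, n:], H[n:, :n])       # greedy improvement
    return K
```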

Formulating an LQR on reacher tasks

  • The input: the state is the difference between the current position and the target position.
  • The reward/cost: instead of the Euclidean distance, it should be the square of the Euclidean distance, plus some quadratic cost on the action. Analysis can be performed for the case of zero cost on the action, but the optimal policy would then be bang-bang control.
  • The system dynamics can be approximated as piecewise linear, which is another layer of approximation on top of the time-invariant system discussed in this post.

Show me the code

Well, sorry, the code is not developed enough at this point to share.

You might want to check out this awesome tutorial by Stephen Tu on running LQR on Cartpole Swing-Up: https://github.com/stephentu/zero-to-cartpole/blob/master/swing-up.ipynb

Reference

[1] Wendell H. Fleming and Raymond W. Rishel. Deterministic and Stochastic Optimal Control. Springer, 1975.

[2] Blog of Ben Recht: http://www.argmin.net/2018/02/08/lqr/

[3] D. P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific, Belmont, MA, 2005.

[4] S. J. Bradtke. Reinforcement learning applied to linear quadratic regulation. NIPS 1993

[5] Stephen Boyd’s EE363 Lecture Slides

[6] Mahmood, A. Rupam, Dmytro Korenkevych, Brent J. Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. IROS 2018.

[7] Tallec, C., Blier, L. & Ollivier, Y.. Making Deep Q-learning methods robust to time discretization. ICML 2019.