An Introduction to LQR from RL perspective
- Motivation
- What is LQR?
- Discrete-time Deterministic LQR
- Discrete-time Stochastic LQR
- Continuous-time Deterministic LQR
- Continuous-time Stochastic LQR
- Run RL Algorithm on LQR
- Formulating an LQR on reacher tasks
- Show me the code
- Reference
This is an introduction to Linear Quadratic Regulator (LQR) for people with an RL background.
I view LQR as an important class of control problems that can be solved with RL algorithms.
There are many excellent tutorials online about LQR, but they mostly follow the control theory textbooks; I have not found one geared toward RL researchers/practitioners. So here you go: I will try to explain LQR using RL terminology in this post. It is intended to cover the concepts that relate to the RL side of things, rather than a comprehensive treatment of all the details of LQR.
Motivation
Many real-world control tasks can be modeled as LQR, which has been extensively studied by the control theory community. In the RL community, LQR seems less popular, although Ben Recht has argued that attempting to solve this (easy?) problem with RL would help us better understand the capabilities and limitations of RL algorithms [2]. I will refer readers to his blog posts instead of repeating why that is the case.
What I want to point out here is that the RL community may have been throwing RL algorithms at LQR problems without necessarily knowing it.
If we take a close look at the simulated control tasks commonly used in RL research today, many of them can be approximated as LQR problems. For instance, some control environments in OpenAI Gym, such as Pendulum, Reacher and FetchReach, are approximately LQR. The approximation is twofold: first, the system dynamics are often nonlinear but can be approximated by a linear model; second, the original reward function does not exactly follow the LQR formulation but is very close.
I’d like to think of LQR as a counterpart of the tabular MDP, in continuous time with continuous state and action spaces.
What is LQR?
Let us first get some terminology straight. LQR stands for Linear Quadratic Regulator.
- Linear refers to the linear system dynamics.
- Quadratic refers to the quadratic cost.
- Regulator (I guess) historically refers to regulating some signal of a system.
In total, there are four sub-types of LQR that we consider:
- Discrete-time Deterministic LQR
- Discrete-time Stochastic LQR
- Continuous-time Deterministic LQR
- Continuous-time Stochastic LQR
The key difference between the deterministic versions and the stochastic versions is that the former is noiseless whereas the latter considers noise in the dynamics. The stochastic version is what we will truly encounter in the real world. Nonetheless, we will first go over the deterministic setup since it is easier to understand and analyze.
Discrete-time Deterministic LQR
Discrete-time Deterministic LQR can be expressed as:
\begin{equation} x_{k+1} = A x_k + B u_k \end{equation}
where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ denote the state and action at step $k$ respectively (as opposed to time $t$).
$A$ and $B$ are matrices of dimensions $n \times n$ and $n \times m$.
$c_k$ denotes the cost at step $k$:
\begin{equation} c_k = x_k^\top S x_k + u_k^\top R u_k \end{equation}
$S$ and $R$ are matrices of dimensions $n \times n$ and $m \times m$, where $S$ is positive semi-definite and $R$ is positive definite: $S \succeq 0$ and $R \succ 0$.
The literature typically assumes that $A$ and $B$ are unknown and $S$ and $R$ are known.
Instead of choosing actions to maximize the sum of rewards as in RL, in LQR the objective is to find a sequence of controls that minimizes the sum of costs.
There are finite horizon and infinite horizon settings, corresponding to episodic and continuing tasks in RL. And just like in RL, we can formulate the objective as either discounted or average.
Discounted Cost:
\begin{equation} J_\gamma = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k c_k \right] \end{equation}
where the discount factor $\gamma \in (0, 1)$.
Average Cost:
\begin{equation} J_{avg} = \lim_{N \to \infty} \frac{1}{N} \, \mathbb{E}\left[ \sum_{k=0}^{N-1} c_k \right] \end{equation}
In both objectives, the expectation is with respect to the noise, so it is not necessary in the deterministic case, but we leave it there for completeness.
The average cost objective is studied a lot more than the discounted cost objective in the control theory literature, which is the opposite of the RL literature.
Although the average cost setting seems to make sense especially for finite sample analysis, we only cover the discounted setting in the rest of this post to align with the RL algorithms.
Some properties of LQR
What’s nice about LQR is that its structure allows for some neat properties:
- The state and action value functions are quadratic functions of the state (and action).
- The optimal policy (called the optimal control in LQR) is linear in the state.
- If we know the parameters of the problem, the optimal policy and the value function of any given linear policy can be computed directly using standard routines in Python, Julia, Matlab, etc.
Optimal Policy
The optimal control is given by
\begin{equation} u_* = K_* x = -\gamma (R + \gamma B^\top P_* B)^{-1} B^\top P_* A x \label{eq:ustar} \end{equation}
where $x$ is any given state, the subscript $*$ denotes the optimal solution, $K_*$ denotes the parameter matrix of the linear function of the state, and $P_*$ is the solution to the Algebraic Riccati Equation (ARE):
\begin{equation} P_* = S + \gamma A^\top P_* A - \gamma^2 A^\top P_* B (R + \gamma B^\top P_* B)^{-1} B^\top P_* A \end{equation}
$P_*$ is an $n \times n$ symmetric matrix.
Note that with the discount factor, the equation above is not the typical Riccati Equation you might find on Wikipedia, but it is easy to show that it is equivalent to a standard ARE with discounted $A$ and $B$ matrices (i.e., $\sqrt{\gamma}A$ and $\sqrt{\gamma}B$).
Let’s take a moment to understand what it implies.
- In RL, to find the optimal policy when the dynamics are known, we would need to run some iterative algorithm like value or policy iteration. But in LQR, the optimal policy can be computed directly (see the sketch below).
- The matrix $P_*$ is the steady state of a sequence $P_k$ that evolves backward in time. In other words, we can only hope to attain $P_*$ if the current step is sufficiently far from the terminal time. In the finite-horizon case we typically do not attain $P_*$, so the optimal control in Eq. \eqref{eq:ustar} is in fact sub-optimal.
- If you wonder what to do when we don't know the true parameters: we use function approximation! Before covering that later in this post, I want to lay out the foundation.
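To make the last point concrete, here is a minimal sketch (my own code, not the author's) of computing $K_*$ with SciPy by reducing the discounted ARE to a standard DARE with $\sqrt{\gamma}$-scaled $A$ and $B$; the toy system at the bottom is purely an assumption for illustration.

```python
# A minimal sketch: optimal gain K_* for discounted discrete-time LQR,
# using the equivalence with a standard DARE under sqrt(gamma)-scaled A and B.
import numpy as np
from scipy.linalg import solve_discrete_are

def optimal_gain(A, B, S, R, gamma=0.99):
    """Return (K_*, P_*) such that u = K_* x is the optimal steady-state policy."""
    A_d, B_d = np.sqrt(gamma) * A, np.sqrt(gamma) * B       # discounted system
    P = solve_discrete_are(A_d, B_d, S, R)                   # standard DARE solver
    K = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)  # Eq. (ustar)
    return K, P

# Toy double-integrator system (hypothetical, for illustration only).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
S, R = np.eye(2), 0.1 * np.eye(1)
K_star, P_star = optimal_gain(A, B, S, R)
```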
State Value Function
The value function can be expressed as
\begin{equation} V_\pi(x_k) = x_k^\top P_\pi x_k \end{equation}
where $P_\pi$ is a cost matrix associated with the policy $\pi$ that maps the state space $\mathbb{R}^n$ to the action space $\mathbb{R}^m$. Since the optimal policy is contained in the family of linear policies, we can assume that the policy is always linear: $u = K x$ where $K \in \mathbb{R}^{m \times n}$.
Similar to the cost matrix $P_*$ for the optimal policy discussed above, the cost matrix $P_\pi$ for an arbitrary policy also evolves backward in time and has a steady state. But instead of using the ARE to solve for it, we use a different equation:
\begin{equation} -P_k + S + K^\top R K + \gamma [(A+BK)^\top P_{k+1}(A+BK)] = 0 \end{equation}
Let $M = S + K^\top R K$ and $L = \sqrt{\gamma}\,(A + BK)$; we can rewrite the equation above as \begin{equation} -P_k + M + L^\top P_{k+1} L = 0 \end{equation}
In terms of its steady-state solution ($P_k = P_{k+1} = P_\pi$), this equation is called a (discrete-time) Lyapunov Equation.
Note that, if the horizon is finite and not long enough to attain the steady state, we can solve for $P_k$ recursively (backward in time), where the base case is given by the cost at the terminal step $N$, where we assume no control cost: $P_N = S$.
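As a small illustration (my own sketch under the notation above, not the post's code), the steady-state $P_\pi$ of a fixed linear policy can be obtained with SciPy's discrete Lyapunov solver:

```python
# A minimal sketch: evaluate a fixed linear policy u = K x by solving the
# discrete-time Lyapunov (Stein) equation for its cost matrix P_pi.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def policy_cost_matrix(A, B, S, R, K, gamma=0.99):
    """Return P_pi with V_pi(x) = x^T P_pi x for the policy u = K x."""
    M = S + K.T @ R @ K                    # per-step cost under the policy
    L = np.sqrt(gamma) * (A + B @ K)       # discounted closed-loop dynamics
    # solve_discrete_lyapunov(a, q) solves a X a^T - X + q = 0,
    # so passing L.T yields L^T P L - P + M = 0, as in the text.
    return solve_discrete_lyapunov(L.T, M)
```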
Action Value Function
It can be shown that the action value function is quadratic in the state and action [4]:
\begin{equation} Q_\pi(x, u) = \begin{bmatrix} x \\ u \end{bmatrix}^\top \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix} \begin{bmatrix} x \\ u \end{bmatrix} = z^\top H z \label{eq:Q} \end{equation}
where $z$ is the column vector formed by concatenating the vectors $x$ and $u$. $H$ is a positive definite matrix of size $(n+m) \times (n+m)$.
Combined with Eq. \eqref{eq:ustar}, the greedy action at state $x$ with respect to the current policy is:
\begin{equation} u_g = -H_{22}^{-1} H_{21} x \label{eq:greedy} \end{equation}
What this implies is that, if we know the matrix $H$, we can perform policy improvement by taking the greedy action $u_g$. This is one of the nicest features of LQR from an RL perspective. It is well known that problems involving continuous actions are difficult for RL algorithms. But in the case of LQR, we can compute the greedy policy from an estimate of the matrix $H$ obtained by policy evaluation, just like in the discrete-action case!
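For instance, here is a tiny sketch (mine, with hypothetical names) of this policy-improvement step, assuming we already hold an estimate `H_hat` of the matrix $H$:

```python
# A minimal sketch: greedy policy improvement from an (estimated) H matrix.
import numpy as np

def greedy_gain(H_hat, n):
    """Split H into blocks [[H11, H12], [H21, H22]] and return K = -H22^{-1} H21."""
    H21, H22 = H_hat[n:, :n], H_hat[n:, n:]
    return -np.linalg.solve(H22, H21)   # greedy policy u = K x, as in Eq. (greedy)
```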
Model-based vs Model-free, and Function Approximation
In most cases, we don’t know the true parameters of the system. Model-based algorithms estimate the dynamics model and compute the optimal control based on the estimated parameters $\hat{A}$ and $\hat{B}$. Model-free algorithms do not explicitly estimate the model. Instead, we approximate the optimal action value function by estimating the matrix $H$ in Eq. \eqref{eq:Q} and compute the optimal control by Eq. \eqref{eq:greedy}.
What functions should we choose for approximating the action value function? We could throw a neural network at it, but that might be overkill. Since we know that the action value function is quadratic, we can build quadratic basis functions from the state-action pair and approximate the action value as a linear combination of them:
\begin{equation} Q(x, u) \approx w^\top \phi(x, u) \end{equation}
where $w$ is the weight vector and $\phi(x, u)$ is the quadratic basis feature vector built from all products $z_i z_j$ (with $i \le j$) of the entries of $z = [x; u]$.
There are $(n+m)^2$ entries in the matrix $H$, but we do not need to estimate that many parameters since $H$ is symmetric.
The dimensionality of $w$ and $\phi$ is $(n+m+1)(n+m)/2$. $w$ can be viewed as a flattened version of the upper triangle of the matrix $H$.
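A possible implementation of these features (a sketch; the helper name is mine):

```python
# A minimal sketch: quadratic basis features for Q(x, u) = w^T phi(x, u),
# matching the upper triangle of H (length (n+m)(n+m+1)/2).
import numpy as np

def quad_features(x, u):
    z = np.concatenate([x, u])
    i, j = np.triu_indices(len(z))   # indices of the upper triangle, incl. diagonal
    return z[i] * z[j]

# e.g. n = 2, m = 1 gives a feature vector of length 3 * 4 / 2 = 6
phi = quad_features(np.array([0.5, -0.2]), np.array([0.1]))
```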
Stability, Controllability
Two quite useful concepts from control theory are stability and controllability [3].
Stability of a policy $K$ tells us whether following this policy would lead to divergence. It is determined by $A$, $B$ and $K$. For a policy to be stable, the eigenvalues of the closed-loop matrix $A + BK$ (often complex numbers) need to lie strictly within the unit circle.
Controllability of the system tells us whether, for any initial state $x_0$, there exists some policy that drives the state to zero. It is determined by $A$ and $B$. For a system to be controllable, the controllability matrix $[B, AB, A^2B, \ldots, A^{n-1}B]$ needs to be full rank.
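Both conditions are easy to check numerically; here is a minimal sketch (my own helpers) for the discrete-time case:

```python
# A minimal sketch: checking stability of a policy K and controllability of (A, B).
import numpy as np

def is_stable(A, B, K):
    """Closed-loop eigenvalues of A + BK must lie strictly inside the unit circle."""
    return bool(np.all(np.abs(np.linalg.eigvals(A + B @ K)) < 1.0))

def is_controllable(A, B):
    """The controllability matrix [B, AB, ..., A^{n-1}B] must have full rank n."""
    n = A.shape[0]
    ctrb = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(n)])
    return np.linalg.matrix_rank(ctrb) == n
```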
Discrete-time Stochastic LQR
When we consider noise in the transition of the system, the problem can be formulated as
\begin{equation} x_{k+1} = A x_k + B u_k + w_k \end{equation}
where $w_k$ is the noise term, often modeled as i.i.d. multivariate Gaussian: $w_k \sim \mathcal{N}(0, \Sigma)$. The per-step cost is defined in the same way as in the deterministic case.
What's nice is that, despite the noise, the optimal policy and the matrices $P$ and $H$ remain the same as in the deterministic case. The value and action value functions differ from those of the deterministic case by only a constant (time-dependent, but not state- or action-dependent). It is caused by the accumulation of additional cost from the noise and reduces over time to 0 at the terminal step.
Continuous-time Deterministic LQR
Shifting from discrete time to continuous time requires formulating the following differently:
- time index going from algorithmic time k to actual time t
- integration replaces summation
- discount changes with the actual time t
- we need to discretize the system
The continuous-time deterministic LQR can be described by
\begin{equation} \dot{x}_t = A x_t + B u_t \label{eq:lqr-cts} \end{equation}
where $t$ is the actual time, the vector $x_t \in \mathbb{R}^n$ is the state of the system and the vector $u_t \in \mathbb{R}^m$ is the control signal.
As in the discrete-time system, $A$, $B$ are fixed matrices of dimensions $n \times n$, $n \times m$ respectively.
A learner continuously updates the control signal after observing the current state of the system.
The cost is a quadratic function of the state and control,
\begin{equation} c_t = x_t^\top S x_t + u_t^\top R u_t \end{equation}
where $S$ and $R$ are fixed, symmetric positive semi-definite and positive definite matrices respectively, of dimensions $n \times n$ and $m \times m$.
Discounted Cost:
\begin{equation} J_\gamma = \mathbb{E}\left[ \int_0^\infty \gamma^t c_t \, dt \right] \end{equation}
Average Cost:
\begin{equation} J_{avg} = \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}\left[ \int_0^T c_t \, dt \right] \end{equation}
Discretization of the System with Time Duration $h$:
To compute the cost, we need to discretize the system with time duration $h$. In practice, $h$ is a parameter of the algorithm that needs to be chosen carefully. It is often neglected in the RL literature but has been shown to affect the performance of the algorithm [6,7].
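For concreteness, a zero-order-hold discretization of $(A, B)$ with duration $h$ can be obtained with a standard routine (a sketch, not the post's code; whether to use zero-order hold or a simple Euler step is a modeling choice):

```python
# A minimal sketch: zero-order-hold discretization of (A, B) with duration h.
import numpy as np
from scipy.signal import cont2discrete

def discretize(A, B, h):
    n, m = B.shape
    # C and D are placeholders; only the discretized A and B matter here.
    Ad, Bd, *_ = cont2discrete((A, B, np.eye(n), np.zeros((n, m))), h, method='zoh')
    return Ad, Bd
```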
Optimal Policy
The optimal control is, again, linear in the state:
\begin{equation} u_* = K_* x = -\gamma^h R^{-1} B^\top P_* x \label{eq:ustar-cts} \end{equation}
State Value Function
Same as in the discrete-time case, the value function is quadratic in the state:
\begin{equation} V_\pi(x_t) = x_t^\top P_\pi x_t \end{equation}
To solve for the cost matrix $P_*$ of the optimal policy, we use the following equation
\begin{equation} 0 = S + \gamma^h (PA+A^\top P) - \gamma^{2h}PBR^{-1}B^\top P - \frac{1-\gamma^h}{h}P \label{eq:care} \end{equation}
which can be rearranged as:
\begin{equation} 0 = S + (P\tilde{A}+\tilde{A}^\top P) - P\tilde{B}R^{-1}\tilde{B}^\top P \end{equation} where $\tilde{A} = \gamma^h A - \frac{1-\gamma^h}{2h} I$ and $\tilde{B} = \gamma^h B$.
This equation is called the Continuous-time Algebraic Riccati Equation (CARE).
Similar to the discrete-time case, we use a different equation to solve for the cost matrix $P_\pi$ of an arbitrary policy $\pi$:
\begin{equation} 0 = S + K_\pi^\top R K_\pi - \frac{1-\gamma^h}{h}P + (\gamma^h (A+BK_\pi))^\top P + P (\gamma^h (A+BK_\pi)) \label{eq:cts-lyap} \end{equation}
which can be rearranged as:
\begin{equation} 0 = M + D^\top P + P D \end{equation} where $M = S + K_\pi^\top R K_\pi$ and $D = \gamma^h (A + B K_\pi) - \frac{1-\gamma^h}{2h} I$.
This is the continuous-time Lyapunov Equation.
$P_\pi$ is the solution to this Lyapunov Equation. Again, both $P_*$ and $P_\pi$ are steady-state solutions, and we need to be aware of the situation where the steady state is not attainable in a finite horizon. The equations Eq. \eqref{eq:care} and \eqref{eq:cts-lyap} can be modified slightly to include the time derivative $\dot{P}_t$, which can be used to solve for $P$ recursively backward in time, similar to the discrete-time case.
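Putting the two equations into code, here is a minimal sketch (mine, under the $\tilde{A}$, $\tilde{B}$, $M$, $D$ definitions above) that uses SciPy's CARE and continuous Lyapunov solvers:

```python
# A minimal sketch: discounted continuous-time Riccati and Lyapunov equations,
# rewritten in the standard forms accepted by SciPy.
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

def cts_optimal(A, B, S, R, gamma=0.99, h=0.01):
    """Return (K_*, P_*) for the discounted continuous-time LQR."""
    c = (1.0 - gamma**h) / (2.0 * h)
    A_t = gamma**h * A - c * np.eye(A.shape[0])
    B_t = gamma**h * B
    P = solve_continuous_are(A_t, B_t, S, R)        # CARE in standard form
    K = -gamma**h * np.linalg.solve(R, B.T @ P)     # Eq. (ustar-cts)
    return K, P

def cts_policy_cost(A, B, S, R, K, gamma=0.99, h=0.01):
    """Return P_pi for a fixed linear policy u = K x."""
    c = (1.0 - gamma**h) / (2.0 * h)
    D = gamma**h * (A + B @ K) - c * np.eye(A.shape[0])
    M = S + K.T @ R @ K
    # solve_continuous_lyapunov(a, q) solves a X + X a^H = q,
    # so passing (D.T, -M) gives D^T P + P D = -M, i.e. 0 = M + D^T P + P D.
    return solve_continuous_lyapunov(D.T, -M)
```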
Action Value Function
\begin{equation} Q_\pi(x_t, u_t) = \bar{c}_t + \gamma^h V_\pi(x_{t+h}) \end{equation}
where $\bar{c}_t$ is the accumulated cost in $[t, t+h)$, approximated as $\bar{c}_t \approx h \, (x_t^\top S x_t + u_t^\top R u_t)$.
$x_{t+h}$ can be computed by applying a Taylor approximation based on the system dynamics.
Greedy Policy Doesn’t Change in Function Approximation
What’s remarkable is that the greedy action with respect to the current policy has the same form as Eq. \eqref{eq:greedy}: $u_g = -H_{22}^{-1} H_{21} x$. In fact, it is the same across all four subtypes of LQR. What this implies is that, if we run function approximation with model-free algorithms, the parameterization of the action value function can be the same. Once we obtain a good estimate of the weight vector, we can compute the greedy action in a unified way.
Stability and Controllability
For stability, we need to make sure the real parts of the eigenvalues of $A + BK$ are all negative.
For controllability, the condition is the same as in the discrete-time case.
Continuous-time Stochastic LQR
In the stochastic case, we consider process noise in the dynamics. A stochastic continuous-time LQR is similar to the deterministic case in Eq. \eqref{eq:lqr-cts} except that it has a disturbance in its state dynamics [1]:
\begin{equation} dx_t = (A x_t + B u_t) \, dt + \sigma \, dW_t \end{equation}
where the vector $W_t$ is the disturbance of the system and $\sigma$ denotes a constant that scales the variance of the disturbance.
$W_t$ is an n-dimensional Wiener process (or standard Brownian motion) that has the following properties (a simulation sketch follows the list):
- $W_0 = 0$.
- $W$ has stationary independent increments: the distribution of $W_{t+s} - W_t$ is independent of the history up to time $t$ and depends only on $s$.
- $W_{t+s} - W_t \sim \mathcal{N}(0, sI)$ for any $s, t \ge 0$; in particular $\mathbb{E}[W_t] = 0$.
- $W_t$ is continuous in $t$ almost surely.
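Although not needed for the theory, such dynamics can be simulated with a simple Euler-Maruyama step (a sketch of mine; `sigma` is the noise scale from above):

```python
# A minimal sketch: simulating dx = (A x + B u) dt + sigma dW with Euler-Maruyama.
import numpy as np

def euler_maruyama_step(x, u, A, B, sigma, h, rng):
    """One step of duration h; the Wiener increment has covariance h * I."""
    dW = np.sqrt(h) * rng.normal(size=x.shape)
    return x + h * (A @ x + B @ u) + sigma * dW
```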
Differences from the deterministic case?
It’s pretty much the same as in the discrete-time case. The optimal control, $P$ and $H$ remain the same. The value and action value functions have an additional time-dependent constant term (state/action independent). This term gradually reduces in time to zero at the terminal state.
Run RL Algorithm on LQR
This section covers how we can run value-based RL algorithms on LQR.
As mentioned earlier, once we have an approximated action value function, we know how to greedify the policy. Hence we just need an algorithm that does policy evaluation well. Standard on-policy and off-policy algorithms like Sarsa and Q-learning can be used here (a rough sketch is given after the list below).
The problem is that LQR does not put constraints on the control, so getting the algorithm to work may be more difficult than it appears. The greedy-action equation involves the inverse of an estimated matrix, which is risky.
A few details are worth attention:
- The initial policy needs to be stable.
- The system needs to be controllable.
- For model-free algorithms, we only have a guarantee of convergence to the optimal policy in the discrete-time deterministic LQR, for policy iteration.
- The behaviour policy also needs to be explorative, as in tabular MDPs. This is called persistent excitation in control theory.
- The correlation in the data generated by RL makes it difficult to solve for the weight vector, which poses more risk when greedifying the policy.
- Instead of using the greedy policy plus exploration noise as the behaviour policy, it may be better to use a fixed stable policy.
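To make this concrete, here is a rough sketch (my own code, with hypothetical names and a purely illustrative setup, not a tested implementation) of least-squares policy iteration on a discrete-time deterministic LQR: policy evaluation fits the weight vector of the quadratic features, and policy improvement greedifies via $u = -H_{22}^{-1}H_{21}x$.

```python
# A minimal sketch: model-free least-squares policy iteration on a
# discrete-time deterministic LQR, using the quadratic features from above.
import numpy as np

def features(x, u):
    z = np.concatenate([x, u])
    i, j = np.triu_indices(len(z))
    return z[i] * z[j]

def w_to_H(w, d):
    """Unflatten w (upper triangle of H) into the symmetric matrix H."""
    H = np.zeros((d, d))
    H[np.triu_indices(d)] = w
    return 0.5 * (H + H.T)   # halving off-diagonals matches z^T H z = w^T phi(z)

def lspi(A, B, S, R, K, gamma=0.99, iters=10, n_samples=2000, noise=0.5, seed=0):
    """K must be an initial stabilizing gain; returns the improved gain."""
    n, m = B.shape
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        Phi, c = [], []
        x = rng.normal(size=n)
        for _ in range(n_samples):
            u = K @ x + noise * rng.normal(size=m)        # persistent excitation
            cost = x @ S @ x + u @ R @ u
            x_next = A @ x + B @ u
            # Deterministic Bellman equation: Q(x,u) = cost + gamma * Q(x', K x')
            Phi.append(features(x, u) - gamma * features(x_next, K @ x_next))
            c.append(cost)
            x = x_next
        w, *_ = np.linalg.lstsq(np.array(Phi), np.array(c), rcond=None)
        H = w_to_H(w, n + m)
        K = -np.linalg.solve(H[n:, n:], H[n:, :n])        # greedy improvement
    return K
```

As noted in the list above, the correlation in the on-policy data and the matrix inverse make this step fragile in practice; a fixed stable behaviour policy and sufficient exploration noise help.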
Formulating an LQR on reacher tasks
- The input: the state is the distance between the current position and the target position.
- The reward/cost: instead of the Euclidean distance, it should be the square of the Euclidean distance, plus some quadratic cost on the action. The analysis can also be carried out with zero cost on the action, but then the optimal policy will be bang-bang control.
- The system dynamics can be approximated as piecewise linear, which is another layer of approximation on top of the time-invariant system discussed in this post.
Show me the code
Well, sorry, the code is not developed enough to share at this point.
You might want to check out this awesome tutorial by Stephen Tu on running LQR on Cartpole Swing-Up: https://github.com/stephentu/zero-to-cartpole/blob/master/swing-up.ipynb
Reference
[1] W. H. Fleming and R. W. Rishel. Deterministic and Stochastic Optimal Control. Springer, 1975.
[2] Blog of Ben Recht: http://www.argmin.net/2018/02/08/lqr/
[3] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 2005.
[4] S. J. Bradtke. Reinforcement learning applied to linear quadratic regulation. NIPS 1993
[5] Stephen Boyd’s EE363 Lecture Slides
[6] Mahmood, A. Rupam, Dmytro Korenkevych, Brent J. Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. IROS 2018.
[7] Tallec, C., Blier, L. & Ollivier, Y.. Making Deep Q-learning methods robust to time discretization. ICML 2019.