Cart Pole Controllers

Here we have a sad little cart-pole robot:

The cart-pole robot is sad because it can’t balance its pole. The poor thing is struggling. Look at it, just floundering around.

Let’s help it out.

Linear Control Theory

Ideally we’d just tell the robot how to balance its pole, but unfortunately we don’t exactly know how to do that. Sure, once the pole is up you kind of want to shift back and forth to counteract its sway, but that only works once it’s mostly balanced. Getting it up there requires other behavior. Seems kind of complicated.

But fear not! We can break out control theory and then construct a controller.

Let’s see. Okay, we start with the nonlinear dynamics given by OpenAI’s cart-pole problem. Our state is given by a position \(x\), a speed \(v\), an angle from vertical \(\theta\) (in radians), and an angular speed \(\omega\). The cart has mass \(m_c\), and the pole has mass \(m_p\) and length \(\ell\). The robot can use its wheels to exert a lateral force \(F\). Our transition dynamics are:

\[\zeta = \frac{F + m_p \ell \sin(\theta) \omega^2}{m_p + m_c}\]

\[\ddot{\theta} = \frac{g\sin(\theta) - \zeta \cos(\theta)}{\ell\left( \frac{4}{3} - \frac{m_p \cos(\theta)^2}{m_p + m_c} \right)}\]

\[\ddot{x} = \zeta - \frac{m_p \ell \ddot{\theta} \cos(\theta)}{m_p + m_c}\]

Once we compute these values, we can use Euler integration to update our state by a small timestep \(\Delta t\):

\[ \begin{bmatrix} x' \\ v' \\ \theta' \\ \omega' \end{bmatrix} = \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix} + \Delta t \begin{bmatrix} v \\ \ddot{x} \\ \omega \\ \ddot{\theta} \end{bmatrix} \]
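If you want to follow along in code, here is a minimal Julia sketch of these dynamics and the Euler step. The function name is my own, and the pole length value is an assumption; the other parameter values match the ones I use later in this post.

```julia
# Cart-pole parameters. These match the values used later in this post,
# except ℓ, which isn't pinned down above, so I'm assuming a value here.
g  = 9.8     # gravitational acceleration
mc = 1.0     # cart mass
mp = 0.1     # pole mass
ℓ  = 0.5     # pole length (assumed)
Δt = 0.02    # integration timestep

# One Euler step of the nonlinear cart-pole dynamics.
# The state is s = [x, v, θ, ω] and F is the lateral force.
function euler_step(s, F)
    x, v, θ, ω = s
    M = mp + mc
    ζ = (F + mp*ℓ*sin(θ)*ω^2) / M
    θacc = (g*sin(θ) - ζ*cos(θ)) / (ℓ*(4/3 - mp*cos(θ)^2/M))
    xacc = ζ - mp*ℓ*θacc*cos(θ) / M
    return s .+ Δt .* [v, xacc, ω, θacc]
end
```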

Linear control theory requires that our dynamics be linear. As such, we linearize our system about our target stable point (\(x = v = \theta = \omega = 0\)) by computing the Jacobian and forming a linear system:

\[\begin{bmatrix} \dot{x} \\ \dot{v} \\ \dot{\theta} \\ \dot{\omega} \end{bmatrix} \approx \begin{bmatrix} \frac{\partial}{\partial x} v & \frac{\partial}{\partial v} v & \frac{\partial}{\partial \theta} v & \frac{\partial}{\partial \omega} v \\ \frac{\partial}{\partial x} \ddot{x} & \frac{\partial}{\partial v} \ddot{x} & \frac{\partial}{\partial \theta} \ddot{x} & \frac{\partial}{\partial \omega} \ddot{x} \\ \frac{\partial}{\partial x} \omega & \frac{\partial}{\partial v} \omega & \frac{\partial}{\partial \theta} \omega & \frac{\partial}{\partial \omega} \omega \\ \frac{\partial}{\partial x} \ddot{\theta} & \frac{\partial}{\partial v} \ddot{\theta} & \frac{\partial}{\partial \theta} \ddot{\theta} & \frac{\partial}{\partial \omega} \ddot{\theta} \end{bmatrix} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix} + \begin{bmatrix} \frac{\partial}{\partial F} v \\ \frac{\partial}{\partial F} \ddot{x} \\ \frac{\partial}{\partial F} \omega \\ \frac{\partial}{\partial F} \ddot{\theta} \end{bmatrix} F \]

I used Symbolics.jl to do the heavy lifting for me, which resulted in:

\[\begin{bmatrix} \dot{x} \\ \dot{v} \\ \dot{\theta} \\ \dot{\omega} \end{bmatrix} \approx \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{-m_p g}{\frac{4}{3}M - m_p} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & \frac{Mg}{\ell \left(\frac{4}{3}M - m_p\right)} & 0 \end{bmatrix} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{\frac{4}{3}}{\frac{4}{3}M - m_p} \\ 0 \\ -\frac{1}{\ell\left(\frac{4}{3}M - m_p\right)}\end{bmatrix} F\]

where \(M = m_p + m_c\).
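In case you’re curious, the Symbolics.jl step looks roughly like this (a sketch, not my exact script; run it separately from the numeric snippets, since it reuses the parameter names symbolically):

```julia
using Symbolics

# Treat the state, force, and physical parameters as symbolic quantities.
@variables x v θ ω F mp mc ℓ g
M = mp + mc
ζ = (F + mp*ℓ*sin(θ)*ω^2) / M
θacc = (g*sin(θ) - ζ*cos(θ)) / (ℓ*(4/3 - mp*cos(θ)^2/M))
xacc = ζ - mp*ℓ*θacc*cos(θ) / M

ṡ = [v, xacc, ω, θacc]                    # state derivative
A = Symbolics.jacobian(ṡ, [x, v, θ, ω])   # ∂ṡ/∂s
B = Symbolics.jacobian(ṡ, [F])            # ∂ṡ/∂F

# Evaluate both Jacobians at the stable point x = v = θ = ω = F = 0.
at0 = Dict(x => 0, v => 0, θ => 0, ω => 0, F => 0)
A₀ = simplify.(substitute.(A, Ref(at0)))
B₀ = simplify.(substitute.(B, Ref(at0)))
```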

We can produce a linear controller \(F = K s\) for our linear system. Here, \(K\) is simply a row vector of feedback gains, one for each component of the state.

I am using a basic cart-pole environment, with \(\Delta t = 0.02, m_c = 1, m_p = 0.1, g = 9.8\). After some experimentation, \(K = [1.0, -0.5, 100.0, 1.0]\) worked pretty well. Here is how the little cart-pole guy is doing with that:

Much better! We’ve got the little guy balancing around his stable point. As long as he starts relatively close to stability, he’ll pop right back. And we did it with some math and tuning.
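In code, with the `euler_step` sketch from earlier, this controller is just a dot product inside the simulation loop (the starting state and horizon below are only for illustration):

```julia
using LinearAlgebra

K = [1.0, -0.5, 100.0, 1.0]    # hand-tuned feedback gains

# Roll out the closed-loop system from a slightly tilted start.
function rollout(K; s0 = [0.0, 0.0, 0.1, 0.0], steps = 500)
    s = s0
    for _ in 1:steps
        F = dot(K, s)          # linear state-feedback controller, F = K s
        s = euler_step(s, F)
    end
    return s                   # ends up near the zero state if stabilized
end

rollout(K)
```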

Automatic Tuning

Unfortunately, our gain matrix \(K\) sort of came out of nowhere. Ideally there would be a more principled way to derive one.

Fortunately for us, there is. We are going to derive the gain matrix using the optimal policy for linear quadratic regulator (LQR) problems. (These are covered in section 7.8 of Alg4DM.)

An LQR problem has linear dynamics:

\[s^{(k+1)} = T_s s^{(k)} + T_a a^{(k)} + w\]

where \(T_s\) and \(T_a\) are matrices, \(s\) is our state, \(a\) is our action, and \(w\) is zero-mean noise.

We already have the equation \(s^{(k+1)} = s^{(k)} + \dot{s}^{(k)} \Delta t\), and can get our linear dynamics that way:

\[\begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k+1)} = \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k)} + \left(\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{-m_p g}{\frac{4}{3}M - m_p} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & \frac{Mg}{\ell \left(\frac{4}{3}M - m_p\right)} & 0 \end{bmatrix} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k)} + \begin{bmatrix} 0 \\ \frac{\frac{4}{3}}{\frac{4}{3}M - m_p} \\ 0 \\ -\frac{1}{\ell\left(\frac{4}{3}M - m_p\right)}\end{bmatrix} F^{(k)}\right) \Delta t + w\]

which simplifies to:

\[\begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k+1)} = \begin{bmatrix} 1 & \Delta t & 0 & 0 \\ 0 & 1 & \frac{-m_p g \Delta t}{\frac{4}{3}M - m_p} & 0 \\ 0 & 0 & 1 & \Delta t \\ 0 & 0 & \frac{M g \Delta t}{\ell \left(\frac{4}{3}M - m_p\right)} & 1 \end{bmatrix} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k)} + \begin{bmatrix} 0 \\ \frac{\frac{4}{3} \Delta t}{\frac{4}{3}M - m_p} \\ 0 \\ -\frac{\Delta t}{\ell\left(\frac{4}{3}M - m_p\right)}\end{bmatrix} F^{(k)} + w\]
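Spelled out in Julia with the parameter values from earlier, \(T_s\) and \(T_a\) are just the continuous-time Jacobians pushed through the Euler step (again a sketch):

```julia
using LinearAlgebra

M = mp + mc
denom = 4/3*M - mp

# Continuous-time linearization (the A and B matrices from above).
A = [0.0  1.0   0.0            0.0;
     0.0  0.0  -mp*g/denom     0.0;
     0.0  0.0   0.0            1.0;
     0.0  0.0   M*g/(ℓ*denom)  0.0]
B = [0.0, (4/3)/denom, 0.0, -1/(ℓ*denom)]

# Discrete-time LQR dynamics matrices.
Ts = I + Δt*A     # state matrix
Ta = Δt*B         # action (force) matrix
```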

An LQR problem additionally has a quadratic reward:

\[R(s,a) = s^\top R_s s + a^\top R_a a\]

which is accumulated across all timesteps.

As engineers, we get to choose the reward function that best captures quality in our problem, and then the LQR machinery will produce an optimal policy with respect to that reward function. Note that the LQR process only works if \(R_s\) and \(R_a\) overall produce costs, which amounts to \(R_s\) being negative semidefinite and \(R_a\) being negative definite.

Let’s simply penalize deviations from zero, with more weight given to the linear and angular positions:

\[R_s = \begin{bmatrix} -5 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -5 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix} \qquad R_a = \begin{bmatrix} -1 \end{bmatrix}\]

and the value of \(R_a\) determines how much to penalize large magnitudes of force.
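In code, these penalty matrices and a quick definiteness sanity check might look like (again just a sketch):

```julia
using LinearAlgebra

Rs = Diagonal([-5.0, -1.0, -5.0, -1.0])   # penalize state deviations
Ra = fill(-1.0, 1, 1)                      # penalize large forces

# R_s must be negative semidefinite and R_a negative definite.
@assert all(eigvals(Matrix(Rs)) .<= 0)
@assert all(eigvals(Ra) .< 0)
```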

If we plug these into an LQR solver, crank up the horizon, and take the control matrix at the maximum horizon, we get \(K = [2.06942, 3.84145, 43.32, 15.745]\). (A sketch of such a solver appears at the end of this section.) And if we load that up into our little robot, we get:

Wow! That’s pretty darn stable.

Let’s show it with a more adverse starting state:

Very cool.
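If you want to reproduce the gain, here is a minimal sketch of the kind of LQR solver I alluded to: a backward Riccati iteration in the reward-maximization form from Alg4DM, run for a long horizon so the gain has a chance to converge. This is my own bare-bones version, not any particular library’s API, and it reuses `Ts`, `Ta`, `Rs`, and `Ra` from the sketches above.

```julia
using LinearAlgebra

# Iterate the Riccati recursion for a long horizon and return the gain K
# such that the policy is F = K s (the convention used in this post).
function lqr_gain(Ts, Ta, Rs, Ra; iters = 1_000)
    V = zeros(size(Rs))                         # quadratic value function, V(s) = s' V s
    K = zeros(size(Ta, 2), size(Ts, 1))
    for _ in 1:iters
        K = -((Ra + Ta'*V*Ta) \ (Ta'*V*Ts))     # optimal gain for the current V
        V = Rs + Ts'*V*Ts + Ts'*V*Ta*K          # Riccati value update
    end
    return K
end

K = lqr_gain(Ts, reshape(Ta, :, 1), Matrix(Rs), Ra)
force(s) = dot(vec(K), s)    # the resulting linear controller
```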

Conclusion

To recap, we’ve just used LQR theory to derive our controller rather than hand-tuning one. We got something procedural that works a lot better than what I got by hand. (I even got the sign on velocity wrong in my original gain matrix.) Notice that we still had to specify something by hand, namely the rewards, but those were much easier to specify and still gave us something good.

This concludes this blog post, but it doesn’t conclude our cart-pole shenanigans. We’ve helped Tippy the Robot, but only with control theory. What if we don’t have continuous force output? What if the output force is bounded? What if we want to swing up from bad starting states? What if we don’t know the equations of motion? We’ll take a stab at those next time.