Wolfenstein 3D Raycasting in C

I recently happened upon a video on YouTube in which JDH implements a Doom clone. He does this in C, without any graphics libraries. Just SDL and a pixel buffer:

u32 pixels[SCREEN_SIZE_X * SCREEN_SIZE_Y]; 

The video blew me away. I have long admired how John Carmack et. al. were the first to create playable fast-paced 3D games. It was a technical achievement and a true breakthrough. There is also something clean and pure about restricting oneself to pure C. It is closer to the hardware. You’re setting each pixel byte by byte. You’re totally in control.

I had another project in the works, but this video just kept turning over in my head. I decided to give the first part, rendering for Wolfenstein 3D in pure C, a stab myself.

This post is my attempt at doing Wolfenstein-style rendering, based on gameplay I saw at the beginning of JDH’s video. It is probably not exactly what happens in Wolfenstein, but I’m hoping its somewhat close. I plan to check back with a future post to see where / how things diverge.

Setup

The biggest hurdle to getting started was setting myself up to program in C. I have no pure-C projects. Heck, even pure C++ projects can be a pain to set up.

I typically program in Sublime Text and compile in the terminal. However, for this I wanted to use Visual Studio Code. It seems to have the best debugger integration, and I feel like all game-programming folks that I follow use it. I have not used VS code on personal projects, so I figured this would be a great opportunity to change that.

As per usual, getting this all to work requiring some Googling and messing around for a while. I figure it’s worth talking about the process, at the very least so that I can refer to it myself in the future when I set up new projects. I’ll cover that at the end of this post, but for now let’s jump straight to C programming.

SDL Game Loop

We have to get a window we can render to and establish a basic game loop. This has the form:

int main(int argc, char *argv[]) {
   ... init things like the window ...
   
   // main loop
   while (state.quit == 0) {
      ProcessEvents();
      TickGame(dt);
      RenderToPixelBuffer();
      PresentPixelBuffer();
   }
}

One iteration through the main loop corresponds to one frame of our game. If the game runs at 30Hz, then we looping through this 30 times per second.

  • ProcessEvents – Games and other programs need to react to player inputs. Rather than have callbacks that interrupt your game in the middle of what its doing, SDL collects everything that happens in an event queue. We start every frame by emptying the queue and processing all of the events to determine how they impact our game.
    For example, if the player hits the ‘w’ key, we might want to store that information to later use it to move the player forward.
  • TickGame – This method updates the game state. In a larger game this would do a whole lot more work. Here we’re basically just updating the player’s position, orientation, and velocity based on keyboard inputs.
  • RenderToPixelBuffer – This is the core part of what makes this project exciting. Here we render the world state by setting color values in our pixel buffer. We have to render based on the camera’s current location and orientation, and use that information to get the pixels right.
  • PresentPixelBuffer – Here we ask SDL to take our pixels and show them on the screen. In my case, this presentation code has vsync enabled, so it is sychronized with the monitor refresh rate. This automatically waits the appropriate amount of time to update at that rate. In my case, that means the program ends up executing at 30Hz, despite being able to technically run much faster.

As far as SDL goes, we have to initialize a window, a texture, and a renderer. The window controls the actual window panel that shows up on our screen. The texture associated with the window contains the pixel data that is presented in the window. The renderer is needed for transferring our pixel buffer to the texture.

We copy our pixel buffer into this texture at the end of every frame. JDH created his texture with the pixel format SDL_PIXELFORMAT_ABGR8888, which means that every byte is an unsigned 32-bit number with 1 byte each for alpha, blue, green, and red. Thus, 0xFF00FF00 is opaque green.

We additionally maintain a pixel buffer:

u32 pixels[SCREEN_SIZE_X * SCREEN_SIZE_Y];

This pixel buffer is what we will manipulate in order to depict the 3D world “seen” by our player camera. Each pixel is also a 32-bit ABGR value.

For example, the following produces:

int x = 52;
int y = 75;
state.pixels[(y * SCREEN_SIZE_X) + x] = 0xFFFF00FF;
(we’ve got one magenta pixel here close to the bottom left)

We’ve got a lot of pixel work ahead of us to go from this to 3D graphics. But therein lies the challenge! No rendering libraries, just algorithms and math 😀

Problem

A Wolfenstein level is defined on a 2D grid. Grid tiles can be solid or empty, with walls taking up exactly one grid tile. Our camera only ever rotates horizontally, never vertically. These facts make rendering 3D scenes so much easier than general 3D rendering.

The grid dimensions are defined by two numbers – the tile width and the wall height:

In this case I’m using 1.0 and 1.2 units of length, respectively.

Our player exists at some 2D location within this world, and is facing in some direction. Our camera is coincident with the player and faces the same direction:

The camera has a horizontal and vertical field of view. In the top-down 2D perspective we can see its horizontal field of view:

In a side 2D perspective we can see its vertical field of view:

We can imagine our game screen being a plane lying in this field of view, fairly close to the player. Our objective is to set each pixel based on what the camera “sees” at that pixel’s location. The color is determined by the object we intersect with if we cast a ray from the camera’s origin, through the pixel, and then intersect with the world:

In the case of a 2D cross-section, a bottom stretch of pixels are colored the floor’s color, a top stretch of pixels are colored the ceiling’s color, and everything in between is colored the block’s color:

Wolfenstein’s blocky world and consistent floor and ceiling height mean we can get all the information we need to render a column of pixels by doing a single horizontal ray cast:

We can use the fact that the larger triangles formed with the full raycast distance are similar to the triangles that from the camera plane a distance \(d=1\) from the player. If the raycast distance is \(\ell\), the player’s height is \(z\), and the wall height is \(H\), then:

\[z’ = \frac{d z}{\ell}, \qquad z” = \frac{d(H-z)}{\ell}\]

If we have \(n\) pixels per column (in my case, 360), then we need to fill pixels according to:

\[\begin{cases} \text{floor color} & \text{if } y < n/2 – \frac{d z}{\ell} n \\ \text{wall color} & \text{if } y < n/2 + \frac{d(H-z)}{\ell} n \\ \text{ceiling color} & \text{otherwise} \end{cases}\]

// Calculate the pixel bounds that we fill the wall in for
int y_lo = (int)(SCREEN_SIZE_Y/2.0f - cam_len*state.camera_z/ray_len * SCREEN_SIZE_Y / state.camera_height);
int y_hi = (int)(SCREEN_SIZE_Y/2.0f + cam_len*(WALL_HEIGHT - state.camera_z)/ray_len * SCREEN_SIZE_Y / state.camera_height);
y_lo = max(y_lo, 0);
y_hi = min(y_hi, SCREEN_SIZE_Y-1);

fill_column(x, 0, y_lo-1, color_floor);
fill_column(x, y_lo, y_hi, color_wall);
fill_column(x, y_hi + 1, SCREEN_SIZE_Y-1, color_ceil);

Next we have to figure out how to do the ray cast to get \(\ell\). We need to perform a ray cast for every column of pixels on our screen. Each such ray will originate at the player’s location, and then head in the direction of its pixel column until it strikes a wall. The distance traveled is \(\ell\). For example:

We can leverage the fact that we have a grid. We know that the intersection point, when we eventually find it, will lie on one of the grid’s horizontal or vertical lines.

Let’s start by decomposing our initial coordinate, the camera position \((p_x, p_y)\), into its tile indices and offsets from the tile’s bottom-left corner:

int x_ind_cam = (int)(floorf(state.camera_pos.x / TILE_WIDTH));
int y_ind_cam = (int)(floorf(state.camera_pos.y / TILE_WIDTH));
f32 x_rem_cam = state.camera_pos.x - TILE_WIDTH*x_ind_cam;
f32 y_rem_cam = state.camera_pos.y - TILE_WIDTH*y_ind_cam;

The raycast direction can be calculated from the pixel column position. We assume the camera screen is a unit distance \(d = 1\) from the camera, \(\boldsymbol{c}\). The raycast direction is a function of the camera’s width \(w\), the number of screen pixels \(m\), and the direction the camera is facing, \(\hat{\boldsymbol{b}}\):

We can locate the pixel’s location \(\boldsymbol{p}\) as:

\[\boldsymbol{p} = \boldsymbol{c} + d \hat{\boldsymbol{b}} + \left(\frac{w}{2} – w \frac{x}{m}\right) \texttt{rotr}(\hat{\boldsymbol{b}}) = \boldsymbol{c} + d \hat{\boldsymbol{b}} + \left(\frac{w}{2} – w’\right) \texttt{rotr}(\hat{\boldsymbol{b}})\]

Here \(\texttt{rotr}(\hat{\boldsymbol{b}})\) is the camera direction rotated by 90 degrees in the right-hand direction (counter clockwise).

The raycast direction is aligned with \(\bar{\boldsymbol{pc}}\):

\[\bar{\boldsymbol{pc}} = \boldsymbol{p} – \boldsymbol{c} = \boldsymbol{c} + d \hat{\boldsymbol{b}} + \left(\frac{w}{2} – w \frac{x}{m}\right) \texttt{rotr}(\hat{\boldsymbol{b}})\]

We can normalize \(\bar{\boldsymbol{pc}}\) by dividing by its length to produce our raycast direction \(\hat{\boldsymbol{r}}\).

Raycasting

Now that we have our ray origin and unit direction, we can cast it out into the map. If we imagine that the ray is travelling at unit speed, then its position vs. time within its current cell is:

\[\begin{aligned} x(t) & = x_\text{rem} + \hat{r}_x dt \\ y(t) & = y_\text{rem} + \hat{r}_y dt \end{aligned} \]

The direction the ray is facing determines when it will cross certain tile boundaries:

  • We cross \(x_\text{rem} = 0\) if \(\hat{r}_x <0\), at \(dt = -x_\text{rem} / \hat{r}_x\)
  • We cross \(x_\text{rem} = w_\text{TILE}\) if \(\hat{r}_x > 0\), at \(dt = (w_\text{TILE} – x_\text{rem}) / \hat{r}_x\)
  • We cross \(y_\text{rem} = 0\) if \(\hat{r}_y <0\), at \(dt = -y_\text{rem} / \hat{r}_y\)
  • We cross \(y_\text{rem} = w_\text{TILE}\) if \(\hat{r}_y > 0\), at \(dt = (w_\text{TILE} – y_\text{rem}) / \hat{r}_y\)

We can generalize this to avoid having a bunch of if statements in our code:

  • if \(\hat{r}_x < 0\), then \(dt = -1/\hat{r}_x \cdot x_\text{rem} + 0\) and \(x_\text{ind}\) will decrease by 1
  • if \(\hat{r}_x > 0\), then \(dt = -1/\hat{r}_x \cdot x_\text{rem} + w_\text{TILE}/\hat{r}_x\) and \(x_\text{ind}\) will increase by 1
  • if \(\hat{r}_x = 0\), then \(dt = 0 \cdot x_\text{rem} + \infty\) and \(x_\text{ind}\) will not change

We can use the same statements for the y direction.

This simplifies walking across the grid to calculating \(dt\) for x and y, and selecting whichever one is smaller (the earlier crossing). We then update our decomposed position (tile index and remaining offset) appropriately. This process continues until we enter a solid tile.

while (true) {
   f32 dt_best = INFINITY;
   dx_ind = 0;
   dy_ind = 0;
            
   f32 dt_x = dx_a*x_rem + dx_b;
   f32 dt_y = dy_a*y_rem + dy_b;
   if (dt_x < dt_y) {
       dt_best = dt_x;
       dx_ind = dx_ind_dir;
       dy_ind = 0;
   } else {
       dt_best = dt_y;
       dx_ind = 0;
       dy_ind = dy_ind_dir;
   }

   // Move up to the next cell
   x_ind += dx_ind;
   y_ind += dy_ind;
   x_rem += dir.x * dt_best - TILE_WIDTH*dx_ind;
   y_rem += dir.y * dt_best - TILE_WIDTH*dy_ind;

   // Check to see if the new cell is solid
   if (MAPDATA[y_ind*8 + x_ind] > 0) {
      break;
   }
}

Once we’ve collided, we can back out the raycast length. We know from the tile and from which direction we entered it what color we should render it, if we have different colors for different blocks. We can even employ a basic lighting trick where we color x-sides differently than y-sides:

The right image employs a basic lighting trick where y-faces are lighter.

And that’s it! We’ve got basic 3d graphics on the screen.

Profiling

I was curious how efficient this code is, and tried to measure it. I initially used SDL_GetTicks to measure the framerate in milliseconds, only to discover that SDL_RenderPresent is automatically vsyncing with the display to 30 Hz. So I wrapped everything before the SDL calls between sys/time.h gettimeofday calls. I am currently getting about 4.5ms per frame, which is 222 frames per second.

I’m certainly happy with that. I’m sure there are plenty of additional tricks that could be employed, and if things grow in complexity that headroom could very quickly be eaten away. I am planning on looking into how Wolfenstein rendering is _actually_ done.

Conclusion

Rendering in general 3D spaces with arbitrary object geometry and orientations is pretty tricky. Rendering when you constrain yourself to a grid world where you can only look horizontally is a lot easier. Granted, this demo only scratched the basics, but its still pretty cool to see how quickly one can get something neat up and running with some math and code.

The code for this demo is available here.

Special thanks to JDH for the inspiration to pursue this and for the reference code to get up and running.

Getting Started with VS Code on Ubuntu

In this section I wanted to give some pointers on how to set oneself up with Visual Studio Code on Ubuntu.

To start, I downloaded VS Code. For Ubuntu, that means downloading the .deb file. I then installed it with:

sudo apt install ./<file>.deb

I then installed SDL2:

sudo apt install libsdl2-dev libsdl2-2.0-0

I then installed some recommended VS code extensions. To do that, you have to navigate via the left-hand bar to the tetris-like icon:

Then, use the provided search bar to find C/C++:

You can click on that extension and install it.

I similarly installed:

  • C/C++ Makefile Project by Adriano Markovic
  • C/C++ Runner by franneck94
  • Code Runner by Jun Han
  • CodeLLDB by Vadim Chugunov

Each VS Code project is contained inside a workspace. I created a new folder on my machine and navigated to it inside the VS code terminal. I then initialized a new project with Ctrl + Shift + P, and selected “C/C++ Make: INIT Project”. This creates a new C/C++ project with a Makefile.

I then edited the Makefile to link to SDL2 and math.h:

CXXFLAGS = -std=c11 -Wall -g # Note: -g adds proper debugging with symbols
LDFLAGS = -lSDL2 -lm

and named my program:

APPNAME = TOOM

I created a new directory, src, and created a new file, main.c inside of it. That’s where the coding starts!

#include <math.h>
#include <stdio.h>
#include <SDL2/SDL.h>
int main(int argc, char *argv[]) {
    printf("Hello World!\n");
    return 0;
}

At this point I could make the project in the terminal by typing make, and could execute it by typing ./TOOM. However, to use the debugger, I needed to set up a tasks.json file in the generated .vscode directory:

Then, opening the debugger via the left-column icon and hitting the settings gear icon creates a launch.json file. I set this to run the prelaunch task “build” which we just set up in tasks.json. The overall config looks like this:

This lets you set breakpoints in your code and execute in debug mode. You can use VS code’s fantastic introspection tools to see variable values and see what your program is up to.

Subsumption Architectures and Transformers

I’ve run into subsumption architectures a few times now. They’re basically a way to structure a complicated decision-making system by letting higher-level reasoning modules control lower-level reasoning modules. Recently, I’ve been thinking about some parallels there with the transformer architecture that has dominated large-scale deep learning. This article is a brief overview of subsumption architectures and attempts to tie them in with attention models, with the hope to better understand both.

What is a Subsumption Architecture?

The Wikipedia entry on subsumption architecture describes it as an approach to robotics from the 80s and 90s that composes behavior into hierarchical modules. The modules operate in parallel, with higher-level modules providing inputs to lower-level modules, thereby being able to operate and reason at a higher level.

These architectures were introduced by Rodney Brooks (of Baxter fame, though also much more) and his colleagues in the 80s. Apparently he “started creating robots based on a different notion of intelligence, resembling unconscious mind processes”. Brooks argued that rather than having a monolithic symbolic reasoner that could deduce what to do for the system as a whole, a subsumption approach could allow each module to be tied closer to the inputs and outputs that it cared about, and thus conduct local reasoning. This argument is often tied together with his view that, rather than making a big symbolic model of the world, which is a hopeless endeavor, that “the world is its own best model” and submodules should be able to directly interact with the portion of the world they interact with.

Perhaps this is best described using an example. Consider a control system for a fixed-wing aircraft. A basic software engineering approach might architect the aircraft’s code to look something like this:

Basic decomposition for the lowest level of a fixed-wing autopilot

where, for most flight dynamics, we can decouple forward speed / pitching / climbing from yaw / roll. Unfortunately, we run into some complexity as we consider the behavior of the longitudinal controller. When taking off, we want full throttle and to maintain a fixed target pitch, when flying below our target altitude, we want to use full throttle and regulate airspeed with pitch control, when near our target altitude we want to control our altitude with pitch and control our airspeed with throttle, and when above our target altitude we want zero throttle and to regulate airspeed with pitch control:

Longitudinal control strategies based on altitude. Effectively Figure 6.14 from Small Unmanned Aircraft.

As such, our longitudinal controller has quite a lot of internal complexity! Note that this is only for takeoff to cruise; landing would involve more logic. We want to control our aircraft with different strategies depending on the context.

Subsumption architectures are a means of tackling this sort of context-based control. Rather than programming a bunch of if/then statements in a single longitudinal control module, we can further break apart our controller and allow independent modules to operate concurrently, taking control when needed:

Breaking down the Longitudinal Controller into Hierarchical Modules

This architecture works by splitting our monolithic logic into independent modules. Each module can receive inputs from any number of upstream modules. For example, the throttle output module, which forwards the final throttle command to the motor, can receive direct throttle commands from the airspeed-via-throttle-controller, the descent controller, the climb controller, or the takeoff controller.

Each module thus needs a method for knowing which input to follow. There are multiple ways to achieve this, but the simplest is to assign different modules different priorities, and their outputs inherit those priorities. The highest propriety input into any given model would be the one acted upon. In our little example above, the cruise controller could just be running continuously, with its output being overwritten by the takeoff controller when it activates during takeoff.

Our longitudinal controller modules running in cruise, where only the cruise controller is active. The descent, climb, and takeoff controllers are dormant. As such, the airspeed-via-pitch and pitch controllers are dormant.

We get our emergent takeoff behavior by allowing the takeoff controller, when it activates, to override the cruise controller’s output:

Our longitudinal controller modules running in takeoff. The cruise controller is always running, but its lower-priority outputs are overwritten by the higher-priority takeoff controller.

This modularization has some major advantages. First, modules are self-contained, and it is much easier to define their desired behavior and to test it. It is much easier to validate the behavior of a simple pitch controller than the complicated behavior of the monolithic longitudinal controller. Second, the self-contained nature means that it is much easier to extend behavior by swapping or implementing new modules. Good luck having a new member of the team hook up everything properly in a monolithic code base.

We can similarly take this approach further and take it farther up the stack, with our longitudinal and lateral controllers taking inputs from trajectory trackers, or waypoint followers, a loitering controller, or a collision-avoidance module. All of the same principles apply. Finally, it is easier to monitor and debug the behavior of a modular system, especially if we make the modules as stateless as possible. We can log their inputs and outputs, and then can (more) easily figure out what the system was doing at any given time.

The approach is not without its disadvantages. The most obvious change is that we now have a bunch of disparate modules that can all act at various times. In the monolithic case we can often guarantee (with if statements) that only certain modules are active at certain times, and curtail complexity that way. Another issue is added complexity in handoff – how do we handle the transitions between module activations? We don’t want big control discontinuities. Finally, we have the issue of choosing priority levels. For simple systems it may be possible to have a strict priority ordering, but in more complicated situations the relative ordering becomes trickier.

Subsumption architectures were originally invented to be able to easily introduce new behaviors to a robotic system. Our previous controller, for example, could be extended with multiple higher-level modules that use the same architectural approach to command the lower-level modules:

Composition lets us subsume lower-level behaviors in higher-level modules. This is the main idea of subsumption architectures.

The running example is for a fixed-wing controller, but you could imagine the same approach being applied to a robot. The Wikipedia article suggests a higher-level exploration behavior that can call lower-level walk-around or avoid-object modules.

Unconscious Mind Processes

Brooks was originally motivated by subsumption architectures because it allowed for “unconscious mind processes”. By this I think Brooks meant that higher level controllers typically need not concern themselves with low level details, and could instead rely on lower level controllers to handle those details for them. In our example above, a trajectory follower can just direct the cruise controller to follow the trajectory, and does not need to think about elevator or throttle rates directly.

One might argue that claiming that this sort of an approach resembles how the human mind works is a stretch. We are all familiar with sensational articles making big claims on AI approaches. However, I think there are core ideas here that are in line with how human reasoning might come to be.

For example, Brooks argues that subsumption models have emergence. We don’t think of any of the lower-level modules as being particularly intelligent. A pitch controller is just a PID loop. However, the combined behavior of the symphony of higher-level modules knowing when to jump in and guide behavior causes the overall system to exhibit quite sophisticated reasoning. You, a human, also have some “dumb” low-level behavior – for example what happens when you put your hand on a hot stove top. Your higher-level reasoning can choose to think at the level of individual finger movements, but more often thinks about higher-level behavior like picking up a cup or how far you want to throw a baseball.

I think this line of reasoning bears more fruit when we consider the progress we have made with deep learning architectures. The field has more or less converged on a unified architecture – the transformer. Transformers depart from prior sequence learning architectures by doing away with explicit recurrence and convolution layers in favor of attention. Attention, it just so happens, is the process of deciding which of a set of inputs to pay attention to. Subsumption architectures are also very much about deciding which of a set of inputs to pay attention to.

We could restructure our subsumption architecture to use attention:

The transformer (attention-based) architecture is basically a deep learning counterpart to subsumption architectures. Here we have attention blocks between layers of fidelity.
Alternatively, every module could just attend to whatever inputs it chooses. There are still multiple levels of fidelity, but the adherence need not be as strict.

It would make a lot of sense for a robust deep learning model to have produced, inside of its layers, a diverse set of modules that can spring into action at appropriate moments and make use of other modules in lower levels. Such an outcome would have big wins in robustness (because lower-level modules become general tools used by higher-level modules), have the benefit of progressive learning (we’d refine our basic behaviors first, like when a child first learns how to walk before reasoning about where to walk to), and we’d be able to develop more nuanced behaviors later in life by building new higher-level modules.

Interestingly, this sort of emergence is the opposite of what a classical deep neural network typically produces. For perception, it has been shown that the first layers of a neural network produce high level features like edge detectors, and then the deeper you go, the more complicated the features get. Eventually, at the deepest levels, these neural networks have features for dogs or cats or faces or tails. For the subsumption architecture, the highest fidelity reasoning happens closest to the output, and the most abstract reasoning happens furthest from the output.

The Tribulations of Learning

Unfortunately, the deep-learning view of things runs into a critical issue – our current reliance on gradient information to refine our models.

First off, the way we have things structured now requires gradient information from the output (i.e., motor controllers) to propagate back up the system to whichever cause produced it. That works pretty well for the modules closest to the output, but modules further from the output can suffer from vanishing gradients.

Vanishing gradients affect any large model. Modern architectures include skip connections (a la ResNet) to help mitigate this effect. It helps some.

Second, these models are typically trained with reinforcement learning, and more specifically, with policy gradient methods. These are notorious for being high variance. It can take a very long time for large-scale systems to learn to generalize well. Apparently Alpha Go played 4.9 million times in order to train itself to play well, which is way more than any human master ever played. Less structured tasks, like robot grasping, seem to also require massive quantities of training time and data.

This second problem is more or less the main stickler of reinforcement learning. We know how to build big systems and train them if we are willing and able to burn a whole lot of cash to do so. But we haven’t figured out how to do it efficiently. We don’t even know how to leverage some sort of general pre-training. There are certainly projects that attempt this, it just isn’t there yet.

The third problem is more specific to the hierarchical module architecture. If we get sparse activation, where some modules are only affected by certain parent modules, then there are presumably a bunch of modules that currently do not contribute to the output at all. The back-propagated gradient to any inactive module will be zero. That is bad, because a module with zero gradient will not learn (and may degrade, if we have a regression term).

The white modules here, being inactive, would not get any gradient information backpropagated to them if we stick to traditional policy gradient methods

I think the path forward is to find a way past these three problems. I suspect that subsumption architectures have untapped potential in the space of reinforcement learning, and that these problems can be overcome. (Human babies learn pretty quickly, after all). There is something elegant about learning reliable lower-level modules that can be used by new higher level modules. That just seems like the way we’ll end up getting transfer learning and generalizable reasoning. We just need a way to get rewards from places other than our end-of-the-line control outputs.

Conclusion

So we talked about subsumption architectures, that’s cool. What did we really learn? Well, I hope that you can see the parallel that we drew between the subsumption architectures and transformers. Transformers have rapidly taken over all of large-scale deep learning, so they are something to pay attention to. If there are parallels to draw from them to a robotics architecture, then maybe that robotics architecture is worth paying attention to as well.

We also covered some of the primary pitfalls. These challenges more or less generally need to be overcome to figure out scalable / practical reinforcement learning, at least practical enough to actually apply to complex problems. I don’t provide specific suggestions here, but oftentimes laying out the problem helps us get a better perspective in actually solving it. I remain hopeful that we’ll figure it out.

2D Player Collision against Static Geometry

A few weeks ago I got an email asking me about a game I worked on ages ago. I had to inform the sender that, unfortunately, my inspired dreams of creative development had not come to full fruition.

The message did get me thinking. What had I been doing back then, about 8 years ago? What problems did I run into? Why did I stop?

Now, I took a look at my old source code, and I’m pretty sure I stopped because of the mounting complexity of classes and cruft that happens in a growing codebase, especially when one has comparatively little idea of what one is doing. That, and I’m pretty sure this was around the time of my qualification exams, and well, I think that took over my life at the time.

Anyhow, I was thinking about this top-down 2D game, and the thoughts that kept circling around my brain centered around 2D character movement and collision. This was only magnified when I played some iPhone games to supplement my entertainment when I flew long distance over the holidays, and grew dismayed at some of the movement systems. I thought player movement was hard back then, and even today a fair number of games don’t really cut the mustard.

The Problem

So this post is borne out of mulling over this old problem of how to move a player around in a 2D game without them running into things. We want a few things:

  • Nice continuous movement
  • To not intersect static geometry
  • To nicely slide along static geometry
  • To not get hung up on edges, and nicely roll around them

And, as bonus objectives, we’d love to:

  • Do all of these things efficiently
  • Be able to handle non-tile maps with arbitrary geometry

The central issue with many games is that you have things you don’t want to run into. For example, below we have a screenshot from my old game-in-development Mystic Dave. We have a player character (Mystic Dave) and some objects (red) that we don’t want Dave to run into:

The player typically has a collision volume centered on their location. This collision volume is not allowed to intersect with any of the level collision geometry:

So what we’re really dealing with is moving this player collision volume around with the player, and making sure that it doesn’t intersect with anything.

Basic Player Movement

The player movement part is pretty simple. We can just look at whatever direction the player input is getting us to move in, and move ourselves in that direction:

void UpdatePlayer(Vec2 input_dir) {
   Vec2 pos_delta = input_dir * kPlayerStepSize;
   pos_player_ += pos_delta;
}

We’ll run into tuning problems later on if we don’t make this a function of the time step, so let’s go ahead and do that now:

void UpdatePlayer(Vec2 input_dir, float dt) {
   Vec2 pos_delta = input_dir * kPlayerStepSizePerTime * dt;
   pos_player_ += pos_delta;
}

This will give us some pretty basic movement:

That movement is a bit boring. It stutters around.

We can make it smoother by storing the player’s velocity and adding to that instead. We enforce a maximum velocity and apply some air friction to slow the player down as we stop our inputs:

void UpdatePlayer(Vec2 input_dir, float dt) {
   // Update the player's velocity
   vel_player_ += input_dir * kPlayerInputAccel * dt;
   
   // Clamp the velocity to a maximum magnitude
   float speed = Norm(vel_player_);
   if (speed > kPlayerMaxSpeed) {
       vel_player_ *= kPlayerMaxSpeed / norm;
   }

   // Update the player's position
   Vec2 pos_delta = vel_player_ * dt;
   pos_player_ += pos_delta;

   // Apply air friction
   vel_player_ *= kFrictionAir;
}

Alright, now we’re zooming around:

So how do we do the colliding part?

Naive Collision

The way I first approached player collision, way back in the day, was to say “hey, we have a top-down 2D tile-based game, so our collision objects are all squares, and we can just check for collision against squares”. A reasonable place to start.

Okay, so we have a rectangle for our player and consistently-sized squares for all of our static geometry. That can’t be too hard, right?

Checking for collisions between axis-aligned rectangles isn’t all that hard. You can use something like the separating axis theorem. A quick Google search gives you code to work with.

Our strategy now is to calculate the new player position, check it against our collision geometry for intersections, and to not move our player if that fails. If we do that, we find that we need (1) to zero out the player’s speed on collision, and (2) we still actually want to move the player, except we want to move them up against the colliding body:

Doing this part is harder. We can do it, but it takes some math. In the case of an axis-aligned rectangle, if we know which side our collision point is closest to, we can just shift our new position \(P’\) out of the closest face by half the player-rect’s width in that axis. The code looks something like:

void UpdatePlayer(Vec2 input_dir, float dt) {
   // Update the player's velocity, clamp, etc.
   ... same as before

   // Update the player's position
   Vec2 pos_delta = vel_player_ * dt;
   pos_player_ += pos_player_ + pos_delta;
   Box player_box_next = Box(pos_next, player_height, player_width);
   for (const Box& box : collision_objects_) {
      auto collision_info = CheckCollision(box, player_box_next)
      if (collision_info.collided) {
         // Collision!
         vel_player_ = Vec2(0,0);
         pos_player_ = box.center;
         if (collision_info.nearest_side == NORTH) {
            pos_player_ += Vec2(0, box.height/2 + player_height/2);
         } else if (collision_info.nearest_side == EAST) {
            pos_player_ += Vec2(box.width/2 + player_width/2, 0);
         } else if (collision_info.nearest_side == SOUTH) {
            pos_player_ -= Vec2(0, box.height/2 + player_height/2);
         } else if (collision_info.nearest_side == WEST) {
            pos_player_ -= Vec2(box.width/2 + player_width/2, 0);
         }
      }
   }

   // Apply air friction
   vel_player_ *= kFrictionAir;
}

So this works, but has some issues. First off, we’re iterating over all boxes. That’s a big inefficiency right off the bat, especially on a large level with a lot of filled tiles. That particular problem is pretty easy to avoid, typically by using a more efficient data structure for keeping track of our collision geometry, like a quad tree for general collision objects and searching around a simple tile-based index for something like Mystic Dave.

Second, we project out of the nearest face. This can lead to all sorts of glitches. If we start on one side of the box, but move such that our closest face is a different one, we get shunted out the wrong side:

This effect is particularly problematic when we have multiple collision bodies. The code above processes them all in series, which could be totally wrong. We want to collide with the first thing we’d hit first.

Alternatively we could use a while loop, and keep on checking for collisions after we get shunted around by our collision logic, but this could end up causing weird infinite loops:

After we process any collisions, we set the player’s speed to zero. This causes the player to just stop. There is no nice sliding along the surface, let alone nicely rolling around corners:

Thinking in Points

Okay, maybe we can do this differently. Let’s take a step back and refine our problem.

We have a point \(P\), which we are moving a small step to some \(P’\). We assume \(P\) is valid (the player is not colliding with anything).

We don’t want to move to \(P’\) if it collides with anything. Furthermore, we don’t want to move to \(P’\) if we collide with anything on the way there:

So far we’ve been talking about points (\(P\) and \(P’\)), but what complicates things are the bounding boxes around them. Can we do something to remove the boxes from the equation?

Let’s look at the set of points where \(P’\) collides with our collision volume:

We can change our strategy to simply test \(P’\), a single point, against this larger collision volume. By using it, we remove the complexity of the player’s bounding rectangle, and instead get to work directly with point-in-shape tests.

When checking for a collision, we just have to check whether the ray segment \(P\rightarrow P’\) intersects with the extended volumes, and if it does, only move the player up to the earliest intersection.

If we have multiple collision bodies, this approach will ensure that we stop at the right location:

So we are now working with extended volumes. These are obtained by using the Minkowski sum of a collision body with the player’s collision volume. The idea is quite general, and works for arbitrary polygons:

Computing the Minkowski sum of two polygons is actually pretty straightforward:

// Compute the Minkowski sum of two convex polygons P and Q, whose
// vertices are assumed to be in CCW order.
vector<Vec2i> MinkowskiSum(const vector<Vec2i>& P, const vector<Vec2i>& Q) {
    // Vertices (a,b) are a < b if a.y < b.y || (a.y == b.y && a.x < b.x)
    size_t i_lowest = GetLowestVertexIndex(P);
    size_t j_lowest = GetLowestVertexIndex(Q);

    std::vector<Vec2i> retval;

    size_t i = 0, j = 0;
    while (i < P.size() || j < Q.size()) {
        // Get p and q via cyclic coordinates
        // This uses mod to allow to indices to wrap around
        const Vec2i& p = GetCyclicVertex(P, i_lowest + i);
        const Vec2i& q = GetCyclicVertex(Q, j_lowest + j);
        const Vec2i& p_next = GetCyclicVertex(P, i_lowest + i + 1);
        const Vec2i& q_next = GetCyclicVertex(Q, j_lowest + j + 1);

        // Append p + q
        retval.push_back(p + q);

        // Use the cross product to gage polar angle.
        // This implementation works on integers to avoid floating point
        // precision issues. I just convert floats say, in meters, to
        // integers by multiplying by 1000 first so we maintain 3 digits
        // of precision.
        long long cross = Cross(p_next - p, q_next - q);

        if (cross >= 0 && i < P.size()) {
            i++;
        }
        if (cross <= 0 && j < Q.size()) {
            j++;
        }
    }

    return retval;
}

This post isn’t really about how the Minkowski sum works, so I won’t go too deep into it, but you basically store your polygon vertices in a counter clockwise order, start with a vertex on each polygon that is furthest in a particular direction, and then rotate that furthest direction around in a circle, always moving an index (or both if the directions match) to keep your indices matching that furthest direction.

Sliding and Rolling around Corners

Before we get too deep, let’s quickly cover how to slide along the collision face. Our code above was zeroing out the player’s velocity every time we hit a face. What we want to do instead is zero out the velocity into the face. That is, remove the component of the velocity that points into the face.

If we know the face unit normal vector \(u\), then we just project our velocity \(v\) into the direction perpendicular to \(u\).

Because running into things slows us down, we typically then multiply our resulting speed vector by a friction coefficient to mimic sliding friction.

// Lose all velocity into the face.
// This results in the vector along the face.
player_vel_ = VectorProjection(player_vel_, v_face);
player_vel_ *= player_friction_slide_;  // sliding friction

To roll around edges we can just trade out our rectangular player collision volume for something rounder. Ideally we’d use a circle, but a circle is not a polygon, so the Minkowski sum would result in a more awkward shape with circular segments. We can do that, but its hard.

We can instead approximate a circle with a polygon. I’m just going to use a diamond for now, which will give us some rolling-like behavior around box corners.

Thinking in Triangles

Now that we took that brief aside, let’s think again about what this Minkowski sum stuff means for the code we prototyped. We could maintain a list of these inflated collision objects and check our points against them every time step. That’s a bit better before, because we’re primarily doing point-in-polygon checks rather than polygon-polygon intersection, but it is still rather complicated, and we want to avoid having to repeatedly iterate over all objects in our level.

I had a big aha! moment that motivated this entire project, which elegantly avoids this list-of-objects problem and ties in nicely with a previous blog post.

What if we build a triangulation of our level that has edges everywhere the inflated collision polygons have edges, and we then mark all triangles inside said polygons as solid. Then all we have to do is keep track of the player’s position and which triangle they are currently in. Any time the player moves, we just check to see if they pass across any of the faces of their current triangle, and if they cross into a solid triangle, we instead move them up to the face.

TLDR: 2D player movement on a constrained Delaunay mesh with solid cells for Minkowksi-summed collision volumes.

To illustrate, here is the Mystic Dave screenshot with some inflated collision volumes:

And here is a constrained triangulation on that, with solid triangles filled in:

And here it all is in action:

We see the static collision geometry in red, the player’s collision volume in blue, and the inflated collision volumes in yellow. We render the constrained Delaunay triangulation, which has edges at the sides of all of the inflated collision volumes. For player movement we simply keep track of the player’s position, the player’s speed, and which triangle they are in (highlighted).

We don’t have to limit ourselves to tile-centered boxes for this. We don’t even have to limit ourselves to tile-aligned boxes:

Player movement now has to track both the player’s position and the triangle that the player is in. We can find the enclosing triangle during initialization, and can use the same techniques covered the Delaunay Triangulation post. I’m going to repeat that section here, since it is so relevant.

Suppose we start with some random face and extract its vertices A, B, and C. We can tell if a point P is inside ABC by checking whether P is to the left of the ray AB, to the left of the ray BC, and to the left of the ray CA:

Checking whether P is left of a ray AB is the same as asking whether the triangle ABP has a counter-clockwise ordering. Conveniently, the winding of ABP is given by the determinant of a matrix:

\[\det\left(\begin{bmatrix}A_x & A_y & 1 \\ B_x & B_y & 1 \\ P_x & P_y & 1\end{bmatrix}\right)\]

If this value is positive, then we have right-handed (counter-clockwise) winding. If it is negative, then we have left-handed (clockwise) winding. If it is zero, then our points are colinear.

If our current triangle has positive windings then it encloses our point. If any winding is negative, the we know we can cross over that edge to get closer to our point.

Here triangle ABP and BCP have positive windings, but triangle CAP has a negative winding. As such, we need to cross the blue edge to get closer to P.

We simply iterate in this manner until we find our enclosing triangle. 

Every tick, when we update the player’s position, we run a similar procedure. We similarly are interested in whether the player’s new position has crossed any triangle edges.The basic concept is that, if the new position enters a new triangle, and the new triangle is solid, we need to prevent movement across the edge. If the new position enters a new triangle, and the new triangle is not solid, we can simply move the player and update the enclosing triangle.

Below we see the simplest case, where the move from P to P’ leads to a new triangle:

If we check the winding of CAP’ and find that it is negative, then we know that we have crossed AC.

There are a few corner cases that make simply accepting the new triangle the wrong thing to do. First off, a single move could traverse multiple triangles:

We need to make sure we iterate, and continue to check for edge transitions.

Second, we may cross multiple edges in the first triangle:

In the above example, the new point P’ is across both the line through BC and the line through CA. The order in which we choose to process these traversals matters, because some intermediate triangles may be solid.

The correct traversal order is to process them in the same order that the ray P -> P’ crosses them. In this case, we first cross CA:

Every time we cross a segment, we find the intersection with the edge and move our player up to that edge. If the new triangle is not solid, we can update our current triangle and iterate again, now with a segment from P” to P’.

If the new triangle is solid, we don’t update our triangle and we process a collision on the colliding face. This means removing all velocity into the face (perpendicular to it) and adding sliding friction along the face (parallel to it). We continue to iterate again, now with a reduced step in our new direction along the edge:

The code structure is thus a while loop that, in each iteration, checks for traversal across all three edges of the enclosing triangle, and retains the earliest collision (if any). If there is a collision, we move the player up to the crossed edge (P”) and potentially handle collision physics or update the enclosing triangle, and iterate again. If there is no collision, then we stay inside our current triangle and are done.

Finding the intersection point for a crossed segment (suppose it is AB) can be done as follows:

\[\begin{aligned}V & = B – A \\ W &= P – A \\ \alpha &= \frac{V \times W }{P’ \times V} \\ P^{\prime\prime} & = P + \alpha (P’ – P) \end{aligned}\]

The interpolant \(\alpha \in [0,1]\) tells us how far we’ve moved toward P’, so is a great way to track which edge we cross first. We first cross the edge with the lowest \(\alpha\).

The code for this is thus:

void UpdatePlayer(Vec2 input_dir, float dt) {
   // Update the player's velocity, clamp, etc.
   ... same as before

   // Update the player's position
   float dt_remaining = dt;
   while (dt_remaining > 0.0f) {
        Vec2f player_pos_delta = player_vel_ * dt_remaining;
        Vec2f player_pos_next = player_pos_ + player_pos_delta;

        // Exit if the step is too small, to avoid math problems.
        if (Norm(player_pos_delta) < 1e-4) {
            break;
        }

        // Check to see if we cross over any mesh edges
        QuarterEdge* qe_ab = qe_enclosing_triangle_->rot;
        QuarterEdge* qe_bc = qe_enclosing_triangle_->next->rot;
        QuarterEdge* qe_ca = qe_enclosing_triangle_->next->next->rot;

        // Get the vertices
        const auto& a = *(qe_ab->vertex);
        const auto& b = *(qe_bc->vertex);
        const auto& c = *(qe_ca->vertex);

        // The fraction of the distance we will move
        float interp = 1.0f; // (alpha)
        QuarterEdge* qe_new_triangle = nullptr;
        Vec2f v_face;

        if (GetRightHandedness(a, b, player_pos_next) < -1e-4) {
            // We would cross AB
            Vec2f v = b - a;
            Vec2f w = player_pos_ - a;
            float interp_ab = Cross(v, w) / Cross(player_pos_delta, v);
            if (interp_ab < interp) {
                interp = interp_ab;
                qe_new_triangle = qe_ab->rot;
                v_face = v;
            }
        }
        if (GetRightHandedness(b, c, player_pos_next) < -1e-4) {
            // We would cross BC
            Vec2f v = c - b;
            Vec2f w = player_pos_ - b;
            float interp_bc = Cross(v, w) / Cross(player_pos_delta, v);
            if (interp_bc < interp) {
                interp = interp_bc;
                qe_new_triangle = qe_bc->rot;
                v_face = v;
            }
        }
        if (GetRightHandedness(c, a, player_pos_next) < -1e-4) {
            // We would cross CA
            Vec2f v = a - c;
            Vec2f w = player_pos_ - c;
            float interp_ca = Cross(v, w) / Cross(player_pos_delta, v);
            if (interp_ca < interp) {
                interp = interp_ca;
                qe_new_triangle = qe_ca->rot;
                v_face = v;
            }
        }

        // Move the player
        // If we would have crossed any triangle, this merely moves
        // the player up to the edge instead
        player_pos_ += player_pos_delta * interp;
        dt_remaining *= (1.0f - interp);
        dt_remaining -= 1e-5;  // eps factor for numeric safety

        if (qe_new_triangle != nullptr) {
            // We would have crossed into a new triangle
            if (is_solid(qe_new_triangle)) {
                // The new triangle is solid, so do not change triangles
                // Process the collision
                player_vel_ = VectorProjection(player_vel_, v_face);
                player_vel_ *= player_friction_slide_; // sliding friction
            } else {
                // Accept the new triangle
                qe_enclosing_triangle_ = qe_new_triangle;
            }
        }
   }

   // Apply air friction
   vel_player_ *= kFrictionAir;
}

This code block isn’t drastically more complicated than what we had before. We get a lot for it – efficient collision physics against general static objects. There is a bit of repeated code, so you could definitely simplify this even further.

Conclusion

This post revisited the problem of 2D player collision against static geometry. We covered some basics of player movement, including integrating velocity over time, basic collision physics, and basic collision approaches based on geometric primitives. We used the Minkowski sum to inflate collision polygons, thereby changing polygon vs. polygon intersections into cheaper point vs. polygon intersections. We concluded by leveraging constrained Delaunay triangulations for our environment instead, which simplified the collision detection logic down to checking for when we cross enclosing triangle line segments.

Please note that I am not the first to come up with this. It turns out I independently discovered Nav Meshes. I since learned that they are used in motion planning for robotics and in path planning for games (notably, Starcraft II). I assume they are also used for collision physics in games, but I haven’t been able to verify that yet.

I implemented this demo in ROS 2 using C++ 20. This was a deviation from my previous posts, which are mostly created in Julia notebooks. This project had more game-like interactivity to it than I thought a Julia notebook could easily handle (I could be wrong) and I simply felt like I could get up and running faster with ROS 2.

The most complicated part to implement in this whole process turns out to be polygon unions. In order to get the constrained Delaunay triangulation to work, you want to union any overlapping collision polygons together. Polygon unions (and other Boolean operations between polygons, like clipping) is quite complicated. I tried to write my own, but ended up using Clipper2.

I hope you found this process interesting! I was a lot of fun to revisit.

Cart Pole Actor Critic

In the previous two posts, we used various control strategies to solve the classic cart-pole problem. In the first, we used a classic LQR controller to stabilize the cart pole in the vicinity of its equilibrium point. In the second, we built a system that could reach stability from anywhere, by forming a hierarchical problem using LQR controllers around various reference states and a top-level search strategy to find the right sequence of controls to reach stability.

All of this was well and good, but what if we change our problem even more drastically?

So far, everything we have worked with has had intimate knowledge of the underlying dynamics. The LQR approach only works because we know the equations of motion, and can take the necessary derivatives to form a linear system. What if we didn’t have those equations of motion? In many real-world cases, we simply don’t. Now we’re going to pretended we don’t know the equations of motion, and will learn a good controller anyways.

From Controls to Reinforcement Learning

We are going to take a journey, leaving the land of controls behind, and head to the realm of reinforcement learning. This land is at first glance completely foreign, but once you get your bearings you begin to realize their strange customs are really just alternate manifestations of things we’re already used to.

Previously, in the land of controls, we derived controllers that produced control inputs to drive the system’s state to equilibrium while minimizing cost. The people of the realm of reinforcement learning do the same thing, except they find policies \(\pi(s)\) that produce actions to drive the system’s state to equilibrium while maximizing reward.

Now, unlike in the land of controls, the people of the realm of reinforcement learning are very used to working on systems where they do not know the equations of motion. They know that a transition function \(T(s’\mid s,a)\) and a reward function \(R(s,a)\) exist, but do not necessarily know what it is. Instead, they allow their agent to interact with its environment, and thereby gain information about these from the state transitions and rewards observed.

In this post I am going to use an actor-critic method. These are covered in chapter 13 of Alg4DM. The “actor” part refers to the fact that we learn a policy that our agent then follows. The “critic” part refers to the fact that we also learn a value function that critiques the agent’s policy. Both are trained in parallel, the critic becoming better at maximizing the critic’s reward, and the critic getting better at reflecting the true value of the actor’s behavior.

Problem Definition

We always start by defining the actual problem we are solving. This is super important. The problem that we solve is always a model of the ground truth. As George Box famously said, “All models are wrong, some models are useful.” The real world doesn’t care that modeling actuator saturation is harder – it will simply do what it always does.

To shake things up a tad, we are going to use cart pole with discrete actions, as it was originally presented. Note that in the land of controls we worked with continuous actions. (We could totally do that here – its just nice to do something different).

Now, the real robot likely can exert continuous forces. We are making a modeling choice, and have to live with the consequences. A series of discrete force outputs will naturally be more jerky / stuttery than a nice smooth output. Maybe we’ll want to tidy that up later, perhaps with a low-pass filter. For now, we’ll go with this.

As such, our problem definition is:

  • State Space \(\mathcal{S}\): The same classic cart-pole states consisting of the lateral position \(x\), the lateral velocity \(v\), the pole angle \(\theta\), and the pole’s angular rate \(\omega\).
  • Action Space \(\mathcal{A}\): We assume the cart has a maximum actuation force of \(F\). The available actions at each time step (same time step of 50Hz as before) are \(F\), \(0\), and \(-F\). As such, we have three discrete actions.
  • Transition Function \(T(s’\mid s, a)\): We use the deterministic cart-pole transition function under the hood, with the same transition dynamics that we have already been using.
  • Reward Function \(R(s,a)\): The original paper just gives a reward of 1 any time the cart is in-bounds in \(x\) and \(\theta\). We are instead going to use the reward function formulation that penalizes deviations from stability:
    \[R(s,a) = \frac{1}{100}r_x + r_\theta = \frac{1}{100}(\Delta x_\text{max}^2 – x^2) + (1 + \cos(\theta))\]
    We continue to give zero reward if we are out of bounds in \(x\), but not in \(\theta\).

Note that our learning system does not know the details of the transition and reward function. Note also that our reward function is non-negative, which isn’t necessary, but can make things a bit more intuitive.

Vanilla Actor Critic

The core ideas of actor-critic are pretty straightforward. We are simultaneously learning a policy and a value function.

Our policy, (which is stochastic to make the math easier), is a function that produces a categorical distribution over our three actions. We’re going to use a simply deep neural net for this, one with three fully connected layers:

Notice the softmax activation function on the final layer, which ensures that our output is a valid probability distribution.

There are multiple actor-critic approaches with different value functions. In this case I’ll learn a state-value function \(U(s)\), which predicts the expected reward from the given state (the discounted sum of all future rewards). I’m also going to use a simple neural network for this:

The final layer has no activation function, so the out put can be any real number.

These can both readily be defined using Flux.jl:

get_stochastic_policy(n_hidden::Int=32) = Chain(Dense(4 => n_hidden,relu), Dense(n_hidden => n_hidden,relu), Dense(n_hidden => 3), softmax)
get_value_function(n_hidden::Int=32) = Chain(Dense(4 => n_hidden,relu), Dense(n_hidden => 1))

In every iteration we run a bunch of rollouts and:

  • compute a policy gradient to get it to maximize the value functions
  • compute a value gradient to get our value function to reflect the true expected value

Alg4DM presents this in one nicely compact pseudocode block:

This formulation uses the likelihood ration approach to computing the policy gradient, which for a stochastic policy is:

\[\nabla U(\theta) = \underset{\tau}{\mathbb{E}}\left[\sum_{k=1}^d \gamma^{k-1} \nabla_\theta \log \pi_\theta\left(a^{(k)}\mid s^{(k)}\right) A_\theta\left(s^{(k)},a^{(k)}\right)\right]\]

where we estimate the advantage using our critic value function:

\[A_\theta(s,a) = \underset{r, s’}{\mathbb{E}}\left[ r + \gamma U_\phi(s’) – U_\phi(s)\right]\]

For the critic, we simply minimize the difference between the predicted expected value \(U_\phi(s)\) and the expected value obtained by running the current policy \(U^{\pi_\theta}(s)\):

\[\ell(\phi) = \frac{1}{2}\underset{s}{\mathbb{E}}\left[\left(U_\phi(s) – U^{\pi_\theta}(s)\right)^2\right]\]

The critic gradient is thus:

\[\nabla \ell(\phi) = \underset{s}{\mathbb{E}}\left[ \sum_{k=1}^d \left( U_\phi(s^{(k)}) – r_\text{to-go}^{(k)} \right) \nabla_\phi U_\phi(s^{(k)}) \right]\]

While the math may look a bit tricky, the code block above shows how straightforward it is to implement. You just simulate a bunch of rollouts (\(\tau\)), and then calculate the gradients.

The code I wrote is functionally identical to the code in the textbook, but is laid out a bit differently to help with legibility (and later expansion). Notice that I compute a few extra things so we can plot them later:

function run_epoch(
    M::ActorCritic,
    π::Chain, # policy π(s)
    U::Chain,  # value function U(s)
    mem::ActorCriticMemory,
    )
    t_start = time()
    𝒫, d_max, m = M.𝒫, M.d, M.m
    data = ActorCriticData()
    params_π, params_U = Flux.params(π), Flux.params(U)
    ∇π, ∇U, vs, Us = mem.∇π, mem.∇U, mem.vs, mem.Us
    states, actions rewards = mem.states, mem.actions, mem.rewards
    a_selected = [0.0,0.0,0.0]
    # Zero out the gradients
    for grad in ∇π
        fill!(grad, 0.0)
    end
    for grad in ∇U
        fill!(grad, 0.0)
    end
    # Run m rollouts and accumulate the gradients.
    for i in 1:m
        # Run a rollout, storing the states, actions, and rewards.
        # Keep track of max depth reached.
        max_depth_reached = rollout(M, π, U, mem)
        fit!(data.max_depth_reached, max_depth_reached)
        # Compute the actor and critic gradients
        r_togo = 0.0
        for d in max_depth_reached:-1:1
            v = vs[d]
            s = states[d]
            a = actions[d]
            r = rewards[d]
            s′ = states[d+1]
            a_selected[1] = 0.0
            a_selected[2] = 0.0
            a_selected[3] = 0.0
            a_selected[a] = 1.0
            r_togo = r + γ*r_togo
            A = r_togo - Us[d]
            fit!(data.advantage, A)
            ∇logπ = gradient(() -> A*γ^(d-1)*log(π(v) ⋅ a_selected), params_π)
            ∇U_huber = gradient(() -> Flux.Losses.huber_loss(U(v), r_togo), params_U)
            for (i_param,param) in enumerate(params_π)
                ∇π[i_param] -= ∇logπ[param]
            end
            for (i_param,param) in enumerate(params_U)
                ∇U[i_param] += ∇U_huber[param]
            end
        end
    end
    # Divide to get the mean
    for grad in ∇π
        grad ./= m
    end
    for grad in ∇U
        grad ./= m
    end
    data.∇_norm_π = sqrt(sum(sum(v^2 for v in grad) for grad in ∇π))
    data.∇_norm_U = sqrt(sum(sum(v^2 for v in grad) for grad in ∇U))
    data.w_norm_π = sqrt(sum(sum(w^2 for w in param) for param in params_π))
    data.w_norm_U = sqrt(sum(sum(w^2 for w in param) for param in params_U))
    data.t_elapsed = time() - t_start
    return (∇π, ∇U, data)
end

Theoretically, that is all you need to get things to work. Each iteration you run a bunch of rollouts, accumulate gradients, and then use these gradients to update both the policy and the value function parameterizations. You keep iterating until the parameterizations converge.

With such an implementation, we get the striking result of….

As we can see, the policy isn’t doing too well. What did we do wrong?

Theoretically, nothing wrong, but practically, we’re missing some bells and whistles that make a massive difference. These sorts of best practices are incredibly important in reinforcement learning (and generally in deep learning).

Decomposing our Angles

The first change we are going to make is to the inputs to our deep neural net. The current inputs use the raw angle, \(\theta\). This may work well in the basic cart-pole problem where our angle is bounded within a small range, but in a more general cart-pole setting where the pole can rotate multiple times, we end up with weird angular effects that are difficult for a neural network to learn how to handle:

In order to better handle the angle, we are going to instead input \(\cos(\theta)\) and \(\sin(\theta)\):

Wrangling the Loss Functions

One of the biggest issues we have is with the reward function. The expected value at a given state can be quite large. A single state can \(R(s,a)\) produce values as large as 4.8^2/100 + 2 = 2.23, but these then get accumulated over 2000 time steps with a discount factor of \(\gamma = 0.995\), resulting in values as large as \(U(s) = \sum_{k=1}^2000 2.23 \gamma^{k-1} \approx 446\). Now, 446 isn’t massive, but it is still somewhat large. Pretty much everything deep-learning likes to work close to zero.

We can plot our gradient norm during training to see how this large reward can cause an issue. The magnitude of the value function gradient is massive! This leads to large instabilities, particularly late in training:

The first thing we will do is dampen out the loss signal for large deviations in reward. Right now, if we predict, say, \(U(s) = 50\) but the real reward is \(U(s) = 400\), we’d end up with a squared loss value of \(350^2 = 122500\). That is somewhat unwieldy.

We want to use a squared loss, for its simplicity and curvature, but only if we are pretty close to the correct value. If we are far off the mark, we’d rather use something gentler.

So we’ll turn to the Huber loss, which uses the squared loss for nearly accurate values and a linear loss for sufficiently inaccurate values:

This helps alleviate the large reward problem, but we can do better. The next thing we will do is normalize our expected values. This step adjusts the expected values to have mean zero and a unit standard deviation (i.e., follow a normal distribution):

\[\bar{u} = (u – \mu) / \sigma\]

Right now, our average advantage starts off very large (~150), with a large standard deviation (~50), and over time settles to around mean 0 and stdev 30:

To normalize, we need to keep a running estimate of the value function mean \(\mu\) and standard deviation \(\sigma\). OnlineStats.jl makes this easy. One important thing to note, however, is that we compute the reward-to-go for multiple depths, and the estimated values are going to be different depending on the depth. A reward-to-go from \(k=1\) with a max depth of 2000 is going to be different than a reward-to-go from \(k=1999\). As such, we actually maintain separate mean and standard deviation estimates for each depth.

The last thing we do is clamping, both to the standardized reward-to-go:

r_togo = r + γ*r_togo
r_togo_standardized = (r_togo - r_togo_μs[d]) / (r_togo_σs[d] + 1e-3)
r_togo_standardized = clamp(r_togo_standardized, -max_standardized_r_togo, max_standardized_r_togo)
fit!(r_togo_est[d], r_togo)

and to the gradient:

for grad in ∇π
    grad ./= m
    clamp!(grad, -max_gradient, max_gradient)
end
for grad in ∇U
    grad ./= m
    clamp!(grad, -max_gradient, max_gradient)
end

Together, these changes allow the value function to predict values with zero-mean and unit variance, which is a lot easier to learn than widely-ranging large values. This is especially important here because the true value function values depend on the policy, which is changing. A normalized reward will consistently be positive for “good” states and negative for “bad” states, whereas a non-normalized will be some large value for “good” states and some not as large value for “bad” states. Normalization thus stabilizes learning.

Regularization

The final change to include is regularization. I am not certain that this makes a huge difference here, but it often does when working with deep learning projects.

Directly minimizing the critic loss and maximizing the expected policy reward can often be achieved with multiple parameterizations. These parameterizations may be on different scales. As we’ve seen before, we don’t like working with large (in an absolute sense) numeric values. Regularization adds a small penalty for large neural network weights, thereby breaking ties and causing the process to settle on a smaller parameterization.

The math is simple if we try to minimize the squared size our parameters:

\[\ell_\text{regularization}(\theta) = \frac{1}{2} \theta^\top \theta\]

Here, the gradient is super simple:

\[\nabla \ell_\text{regularization}(\theta) = \theta\]

We can thus simply add in our parameter values, scaled by a regularization weight:

for (param, grad) in zip(params_π, ∇π)
    grad += M.w_reg_π * param  # regularization term
    Flux.update!(opt_π, param, grad)
end
for (param, grad) in zip(params_U, ∇U)
    grad += M.w_reg_U * param
    Flux.update!(opt_U, param, grad)
end

Putting it all Together

With all of these changes in place, learning is much better. We are able to get Little Tippy the robot to stabilize:

The performance isn’t perfect, because we are, after all, working without explicit knowledge of the system dynamics, and are outputting discrete forces. Nevertheless, this approach was able to figure it out.

The policy was trained for 500 epochs, each with 128 rollouts to a maximum depth of 1000. A rollout terminates if the cart-pole gets further than 4.8m from the origin. The policy pretty quickly figures out how to stay within bounds: (plus or minus one stdev are also shown):

For contrast, here is the same plot with the vanilla setup:

The actual performance is better measured by the mean reward-to-go (i.e., expected value) that the system maintains over time. We can see that we have more-or-less converged:

With regularization, our advantage is nicely kept roughly standardized:

Conclusion

Actor-critic methods are nice for many more general problems, because we often do not know real-world dynamics. Neural networks can be very good function approximations, adapting to the problem at hand (whether that be Go, or an aircraft controller). That being said, actor-critic methods alone are not a panacea, and still require special knowledge and insight. In this post alone, we covered the following decision points that we as a designer made:

  • Whether to use discrete actions
  • The cart-pole reward function
  • Which policy gradient method to use
  • Which critic loss function to minimize
  • Maximum rollout depth
  • Number of rollouts per epoch
  • Number of epochs / training termination condition
  • Policy neural network architecture
  • Critic neural network architecture
  • Neural network weight initialization method
  • Whether to use gradient clamping
  • Using sine and cosine rather than raw angles
  • Standardizing the expected value (reward-to-go)
  • Which optimizer to use (I used Adam).

Being a good engineer often requires being aware of so much more than the vanilla concepts. It is especially important to understand when one is making a decision, and what impact that design has.

Thank you for reading!

The code for this post is available here. A notebook is here.

Cart Pole with Actuator Limits

In the previous post, we used an LQR controller to solve the classic cart-pole problem. There, the action space is continuous, and our controller is allowed to output any force. This leads to some very smooth control.

LQR Control

We can plot the force over time:

We can see that our force starts off at 8 N and oscillates down to about -4 N before settling down as the cart-pole system stabilizes. This raises a question – don’t real world actuators saturate? What happens if we model that saturation?

LQR with Saturation

The original LQR problem finds the sequence of \(N\) actions (here, cart lateral forces) \(a^{(1)}, \ldots, a^{(N)}\) that produces the sequence of states \(s^{(1)}, \ldots, s^{(N+1)}\) that maximizes the discounted reward across states and actions:

\[\begin{matrix}\underset{\boldsymbol{x}}{\text{maximize}} & R(s^{(1)}, a^{(1)}) + R(s^{(2)}, a^{(2)}) + \ldots + R(s^{(N)}, a^{(N)}) + R_\text{final}(s^{(N+1)}) \\ \text{subject to} & s^{(i+1)} = T_s s^{(i)} + T_a a^{(i)} \qquad \text{ for } i = 1:N \end{matrix}\]

Well, actually we discount future rewards by some factor \(\gamma \in (0,1)\) and we don’t bother with the terminal reward \(R_\text{final}\):

\[\begin{matrix}\underset{\boldsymbol{x}}{\text{maximize}} & R(s^{(1)}, a^{(1)}) + \gamma R(s^{(2)}, a^{(2)}) + \ldots + \gamma^{N-1} R(s^{(N)}, a^{(N)}) \\ \text{subject to} & s^{(i+1)} = T_s s^{(i)} + T_a a^{(i)} \qquad \text{ for } i = 1:N \end{matrix}\]

Or more simply:

\[\begin{matrix}\underset{\boldsymbol{x}}{\text{maximize}} & \sum_{i=1}^N \gamma^{i-1} R(s^{(i)}, a^{(i)}) & \\ \text{subject to} & s^{(i+1)} = T_s s^{(i)} + T_a a^{(i)} & \text{for } i = 1:N \end{matrix}\]

Our dynamics are linear (as seen in the constraint) and our reward is quadratic:

\[R(s, a) = s^\top R_s s + a^\top R_a a\]

It turns out that efficient methods exist for solving this problem exactly (See section 7.8 of Alg4DM), and the solution does not depend on the initial state \(s^{(1)}\).

With actuator limits comes force saturation. Now, we could just use the controller that we got without considering saturation, and just clamp its output to a feasible range:

Running the saturating controller (-1 <= F <= 1) from the same initial conditions.

As we can see, our force saturates from the beginning, and is insufficient to stabilize the robot:

We can instead include the saturation constraint in our optimization problem:

\[\begin{matrix}\underset{\boldsymbol{x}}{\text{maximize}} & \sum_{i=1}^N \gamma^{i-1} R(s^{(i)}, a^{(i)}) & \\ \text{subject to} & s^{(i+1)} = T_s s^{(i)} + T_a a^{(i)} & \text{for } i = 1:N \\ & a_\text{min} \leq a^{(i)} \leq a_\text{max} & \text{for } i = 1:N \end{matrix}\]

This problem is significantly more difficult to solve. The added constraints make our problem non-linear, which messes up our previous analysis. Plus, our optimal policy now depends on the initial state.

The problem is still convex though, so we can just throw it into a convex solver and get a solution.

N = 200
a = Convex.Variable(N)
s = Convex.Variable(4, N+1)
problem = maximize(sum(γ^(i-1)*(quadform(s[:,i], Rs) + quadform(a[i], Ra)) for i in 1:N))
problem.constraints += s[:,1] == s₁
for i in 1:N
    problem.constraints += s[:,i+1] == Ts*s[:,i] + Ta*a[i]
    problem.constraints += a_min <= a[i]
    problem.constraints += a[i] <= a_max
end
optimizer = ECOS.Optimizer()
solve!(problem, optimizer)

Solving to 200 steps from the same initial state gives us:

Yellow is the original policy, red is the optimized policy

Unfortunately, the gif above shows the optimized trajectory. That’s different than the actual trajectory we get when we use the nonlinear dynamics. There, Little Tippy tips over:

What we really should be doing is re-solve our policy at every time step. Implementing this on a real robot requires getting it to run faster than 50Hz, which shouldn’t be a problem.

Unfortunately, if we implement that, we can’t quite stabilize:

As it turns out, there are states from which Little Tippy, with limited actuation power, does not stabilize with an LQR policy. We can plot the states in which the original LQR policy with infinite actuation stabilizes from \(x = v = 0\) (and does not exceed the \(x\) bounds):

All states in the center between the blue contour lines are initial states within which the cart-pole will reach the stable target state (and not exceed the position boundaries).

We can produce the same plot for the same LQR policy run with limited actuation:

States within a blue region are initial states from which the LQR policy, when run with clamped outputs, reaches the stable target state.

We can see how severe the actuator limits are! To get our robot to really work, we’re going to want to be able to operate despite these actuator limitations. We’re not going to want to assume that we start from somewhere feasible – we’ll want Little Tipple to know how to build up the momentum necessary to swing up the pole and stabilize.

A Cart Pole Search Problem

There are multiple ways that we can get Little Tippy to swing up. We could try something like iLQR, but that won’t necessarily save us from the nonlinear troughs that are the cart-pole problem. It follows a descent direction just like LQR, so unless our problem is convex or we get lucky, we’ll end up in a mere local minimum.

To help Little Tippy find his way to equilibrium, we’ll do something more basic – transform the problem into a search problem. Up until this point, we’ve been using continuous controls techniques to find optimal control sequences. These are excellent for local refinement. Instead, we need to branch out and consider multiple options, and choose between them for the best path to our goal.

We only need a few things to define a search problem:

  • A state space \(\mathcal{S}\), which defines the possible system configurations
  • An action space \(\mathcal{A}\), which defines the actions available to us
  • A deterministic transition function \(T(s,a)\) that applies an action to a state and produces a successor state
  • A reward function \(R(s,a)\) that judges a state and the taken action for goodness (higher values are better)

Many search problems additionally define some sort of is_done method that checks whether we are in a terminal (ex: goal) state. For example, in the cart-pole problem, we terminate if Little Tippy deviates too far left or right. We could technically incorporate is_done into our reward, by assigning zero reward whenever we match its conditions, but algorithms like forward search can save some compute by explicitly using it.

The cart-pole problem that we have been solving is already a search problem. We have:

  • A state space \(\mathcal{S} = \mathbb{R}^4\) that consists of cart-pole states
  • An action space \(\mathcal{A} = \{ a \in \mathbb{R}^1 | a \in [-1,1]\}\) of bounded, continuous force values
  • A transition function \(T(s,a)\) that takes a cart-pole state and propagates it with the instantaneous state derivative for \(\Delta t\) to obtain the successor state \(s’\)
  • A reward function that, so far, we’ve defined as quadratic in order to apply the LQR machinery, \(R(s,a) = s^\top R_s s + a^\top R_a a\)

Unfortunately, this search problem is made difficult by the fact that it both has an infinite number of actions (since our actions are continuous) and has very long trajectories, as \(\Delta t = 0.02\). Searching for good paths would be pretty difficult. So, what I am going to try is to create a simpler search problem that acts as a higher-level, coarser problem representation, but is powerful enough to let us find our way to our destination.

Let’s create a new problem. We’ll call it Hierarchical Cart Pole. The states in this problem are cart-pole states, but the actions are to apply an LQR controller for 1 second. We will create a number of LQR controllers, each for the cart-pole problem linearized around different control points. This gives us the nice smooth control capabilities of an LQR controller, but in a search problem that we can actually do things with.

We have:

  • A state space \(\mathcal{S} = \mathbb{R}^4\) that consists of cart-pole states
  • An action space \(\mathcal{A} = \left\{a_1, a_2, \ldots, a_K \right\}\) that consists of choosing between \(K\) controllers
  • A transition function \(T(s,a)\) that takes a cart-pole state and propagates it using the selected LQR controller for 1s
  • A reward function:
    \[R(s,a) = \begin{cases} \frac{1}{100}r_x + r_\theta & \text{if } x \in [-x_\text{min}, x_\text{max}] \\ 0 & \text{otherwise}\end{cases}\]

The reward function simply gives zero reward for out-of-bounds carts, and otherwise produces more reward the closer Little Tipple is to our target state (\(x = 0, \theta = 0\)). I used \(x_\text{max} = -x_\text{min} = 4.8\) just like the original formulation, \(r_x = x_\text{max}^2 – x^2\), and \(r_\theta = 1 + \cos \theta\).

We can now apply any search algorithm to our Hierarchical Cart Pole problem. I used a simple forward search that tries all possible 4-step action sequences. That is enough to give us pretty nice results:

Starting from rest, Little Tippy builds up momentum and swings up the pole!

We can plot the forces:

The forces are nice and bounded. We can see the quintessential riding of the rails, with the force bouncing back and forth between its limits. This is quite natural. If we have limits, we’re often going to use everything available to us to get the job done.

In this problem, I linearized the dynamics around 12 control states, all with \(v = \omega = 0\):

reference_states = [
    CartPoleState(-1.0, 0.0, 0.0, 0.0),
    CartPoleState( 0.0, 0.0, 0.0, 0.0),
    CartPoleState( 1.0, 0.0, 0.0, 0.0),
    CartPoleState(-1.0, 0.0, deg2rad(-90), 0.0),
    CartPoleState( 0.0, 0.0, deg2rad(-90), 0.0),
    CartPoleState( 1.0, 0.0, deg2rad(-90), 0.0),
    CartPoleState(-1.0, 0.0, deg2rad(90), 0.0),
    CartPoleState( 0.0, 0.0, deg2rad(90), 0.0),
    CartPoleState( 1.0, 0.0, deg2rad(90), 0.0),
    CartPoleState(-1.0, 0.0, deg2rad(180), 0.0),
    CartPoleState( 0.0, 0.0, deg2rad(180), 0.0),
    CartPoleState( 1.0, 0.0, deg2rad(180), 0.0),
]

We can plot the macro actions as well, which at each time step is the index of the selected reference state:

Notice how, at the end, the system chooses 1, which is the LQR controller linearized about \(x = -1\). This is what gets the cart to drive left while remaining stabilized.

Linearization

Alright, I know what you’re thinking. Sweet little solver, but how did you get those LQR controllers?

Before we merely linearized about zero. Now we’re linearizing about other control points.

To do this, we start by linearizing our equations of motion. This means we’re forming the 1st order Taylor approximation. The first-order Taylor approximation of some function \(g(x)\) about a reference \(x_\text{ref}\) is \(g(x) \approx g(x_\text{ref}) + g'(x_\text{ref}) (x – x_\text{ref})\). If the equations of motion are \( \dot{s} = F(s, a)\), then the 1st order Taylor approximation about a reference state \(s_\text{ref}\) and a reference action \(a_\text{ref}\) is

\[\dot{s} \approx F(s_\text{ref}, a_\text{ref}) + J_s \cdot (s – s_\text{ref}) + J_a \cdot (a – a_\text{ref})\]

where \(J_s\) and \(J_a\) are the state and action Jacobians (matrices of gradients). Here I always used a reference action of zero.

Why compute this by hand when you could have the computer do it for you? This is what auto-differentiating packages like Zygote.jl are for. I ended up using Symbolics.jl so I could see the symbolic output:

function linearize(mdp::CartPole, s_ref::CartPoleState)
    Δt = mdp.Δt
    @variables x v θ ω ℓ g m M F
    η = (F + m**sin(θ)*ω^2) / M
    pole_α = simplify((M*g*sin(θ) - M*η*cos(θ))/(*(4//3*M - m*cos(θ)^2)))
    cart_a = simplify(η - m ** pole_α * cos(θ) / M)
    # Our first order derivatives
    equations_of_motion = [
        v,
        cart_a,
        ω,
        pole_α
    ]
    state_vars = [x,v,θ,ω]
    action_vars = [F]
    ns = length(state_vars)
    na = length(action_vars)
    df = Array{Float64}(undef, ns)
    Js = Array{Float64}(undef, ns, ns)
    Ja = Array{Float64}(undef, ns, na)
    subs = Dict(x => s_ref.x, v => s_ref.v, θ => s_ref.θ, ω => s_ref.ω,
                m => mdp.mass_pole, M => mdp.mass_pole + mdp.mass_cart,
                g => mdp.gravity, ℓ => mdp.pole_length, F => 0.0)
    for (i, eom) in enumerate(equations_of_motion)
        df[i] = value.(substitute.(eom, (subs,))[1])
        for (j, var) in enumerate(state_vars)
            derivative = expand_derivatives(Differential(var)(eom))
            Js[i,j] = value.(substitute.(derivative, (subs,))[1])
        end
        for (j, var) in enumerate(action_vars)
            derivative = expand_derivatives(Differential(var)(eom))
            Ja[i,j] = value.(substitute.(derivative, (subs,))[1])
        end
    end
    return (df, Js, Ja)
end

The intermediate system, before the reference state is substituted in, is:

\[\begin{bmatrix} \dot{x} \\ \dot{v} \\ \dot{\theta} \\ \dot{\omega} \end{bmatrix} \approx \left. \begin{bmatrix} v \\ \zeta – \frac{m_p \ell \ddot{\theta} \cos(\theta)}{m_p + m_c} \\ \omega \\ \frac{g\sin(\theta) – \zeta \cos(\theta)}{\ell\left( \frac{4}{3} – \frac{m_p \cos(\theta)^2}{m_p + m_c} \right)} \end{bmatrix} \right|_{s = s_\text{ref}} + \left. \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{-m_p g}{\frac{4}{3}M – m_p} & \frac{\partial}{\partial \omega} \ddot{x} \\ 0 & 0 & 0 & 1 \\ 0 & 0 & \frac{Mg}{\ell \left(\frac{4}{3}M – m_p\right)} & \frac{-m_p \omega \sin (2 \theta)}{\frac{4}{3}M – m_p \cos^2 (\theta)} \end{bmatrix}\right|_{s = s_\text{ref}} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix} + \left. \begin{bmatrix} 0 \\ \frac{\frac{4}{3}}{\frac{4}{3}M – m_p} \\ 0 \\ -\frac{1}{\ell\left(\frac{4}{3}M – m_p\right)}\end{bmatrix}\right|_{s = s_\text{ref}} a\]

where \(\ell\) is the length of the pole, \(m_p\) is the mass of the pole, \(M = m_p + m_c\) is the total mass of the cart-pole system,

\[\zeta = \frac{a + m_p \ell \sin(\theta) \omega^2}{m_p + m_c}\text{,}\]

and

\[\frac{\partial}{\partial \omega} \ddot{x} = m_p \omega \ell \frac{m_p \cos (\theta)\sin(2\theta) – 2 m_p\cos^2(\theta) \sin (\theta) + \frac{8}{3}M \sin(\theta)}{M\left(\frac{4}{3}M – m_p\cos^2(\theta)\right)}\]

For example, linearizing around \(s_\text{ref} = [1.0, 0.3, 0.5, 0.7]\) gives us:

\[\begin{bmatrix} \dot{x} \\ \dot{v} \\ \dot{\theta} \\ \dot{\omega} \end{bmatrix} \approx \begin{bmatrix} \hphantom{-}0.300 \\ -0.274 \\ \hphantom{-}0.700 \\ \hphantom{-}3.704 \end{bmatrix} + \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & -0.323 & 0.064 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 6.564 & -0.042 \end{bmatrix} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix} + \begin{bmatrix} 0 \\ \hphantom{-}0.959 \\ 0 \\ -0.632\end{bmatrix} a\]

We’re not quite done yet. We want to get our equations into the form of linear dynamics, \(s’ = T_s s + T_a a\). Just as before, we apply Euler integration to our linearized equations of motion:

\[\begin{aligned}s’ & \approx s + \Delta t \dot{s} \\ &= s + \Delta t \left( F(s_\text{ref}, a_\text{ref}) + J_s \cdot (s – s_\text{ref}) + J_a a \right) \\ &= s + \Delta t F(s_\text{ref}, a_\text{ref}) + \Delta t J_s s – \Delta t J_s s_\text{ref} + \Delta t J_a a \\ &= \left(I + \Delta t J_s \right) s + \left( \Delta t F(s_\text{ref}, a_\text{ref}) – \Delta t J_s s_\text{ref} \right) + \Delta t J_a a \end{aligned}\]

So, to get it in the form we want, we have to introduce a 5th state that is always 1. This gives us the linear dynamics we want:

\[\begin{bmatrix} s \\ 1 \end{bmatrix} \approx \begin{bmatrix} I + \Delta t J_s & \Delta t F(s_\text{ref}, a_\text{ref}) – \Delta t J_s s_\text{ref} \\ [0\>0\>0\>0] & 1 \end{bmatrix} \begin{bmatrix} s \\ 1\end{bmatrix} + \begin{bmatrix} \Delta t J_a \\ 0 \end{bmatrix} a\]

For the example reference state above, we get:

\[\begin{bmatrix}x’ \\ v’ \\ \theta’ \\ \omega’ \\ 1\end{bmatrix} \approx \begin{bmatrix}1 & 0.02 & 0 & 0 & 0 \\ 0 & 1 & -0.00646 & 0.00129 & -0.00315 \\ 0 & 0 & 1 & 0.02 & 0 \\ 0 & 0 & 0.131 & 0.999 & 0.00903 \\ 0 & 0 & 0 & 0 & 1\end{bmatrix}\begin{bmatrix}x \\ v \\ \theta \\ \omega \\ 1\end{bmatrix} + \begin{bmatrix}0 \\ 0.019 \\ 0 \\ -0.0126 \\ 0\end{bmatrix}\]

We can take our new \(T_s\) and \(T_a\), slap an extra dimension onto our previous \(R_s\), and use the same LQR machinery as in the previous post to come up with a controller. This gives us a new control matrix \(K\), which we use as a policy:

\[\pi(s) = K \begin{bmatrix}x – x_\text{ref} \\ v – v_\text{ref} \\ \text{angle_dist}(\theta, \theta_\text{ref}) \\ \omega – \omega_\text{ref} \\ 1\end{bmatrix}\]

For our running example, \(K = [2.080, 3.971, 44.59, 17.50, 2.589]\). Notice how the extra dimension formulation can give us a constant force component.

The policy above was derived with the \(R_s\) from the previous post, diagm([-5, -1, -5, -1]). As an aside, we can play around with it to get some other neat control strategies. For example, we can set the first element to zero if we want to ignore \(x\) and control about velocity. Below is a controller that tries to balance the pole while maintaining \(v = 1\):

That is how you might develop a controller for, say, an RC car that you want to stay stable around an input forward speed.

Conclusion

In this post we helped out Little Tippy further by coming up with a control strategy by which he could get to stability despite having actuator limits. We showed that it could (partially) be done by explicitly incorporating the limits into the linearized optimization problem, but that this was limited to the small local domain around the control point. For larger deviations we used a higher-level policy that selected between multiple LQR strategies to search for the most promising path to stability.

I want to stress that the forward search strategy used isn’t world-class by any stretch of the imagination. It does, however, exemplify the concept of tackling a high-resolution problem that we want to solve over an incredibly long time horizon (cart-pole swingup) with a hierarchical approach. We use LQR theory for what it does best – optimal control around a reference point – and build a coarser problem around it that uses said controllers to achieve our goals.

The code for this little project is available here.

Cart Pole Controllers

Here we have a sad little cart-pole robot:

The cart-pole robot is sad because it can’t balance its pole. The poor thing is struggling. Look at it, just floundering around.

Let’s help it out.

Linear Control Theory

Ideally we’d just tell the robot how to balance its pole, but unfortunately we don’t exactly know how to do that. Sure – once the pole is up you kind of want to shift back and forth to counteract its sway, but that only works once its mostly balanced. Getting it up there requires other behavior. Seems kind of complicated.

But fear not! We can break out control theory and then construct a controller.

Let’s see. Okay, we start with the nonlinear dynamics given by Open AI’s cart-pole problem. Our state is given by a position \(x\), a speed \(v\), an angle from vertical \(\theta\) (in radians) and an angular speed \(\omega\). The cart has mass \(m_c\), the pole has mass \(m_p\) and the pole has length \(\ell\). The robot can use its wheels to exert a lateral force \(F\). Our transition dynamics are:

\[\zeta = \frac{F + m_p \ell \sin(\theta) \omega^2}{m_p + m_c}\]

\[\ddot{\theta} = \frac{g\sin(\theta) – \zeta \cos(\theta)}{\ell\left( \frac{4}{3} – \frac{m_p \cos(\theta)^2}{m_p + m_c} \right)}\]

\[\ddot{x} = \zeta – \frac{m_p \ell \ddot{\theta} \cos(\theta)}{m_p + m_c}\]

Once we compute these values, we can thus use Euler integration to update our state by a small timestep \(\Delta t\):

\[ \begin{bmatrix} x’ \\ v’ \\ \theta’ \\ \omega’ \end{bmatrix} = \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix} + \Delta t \begin{bmatrix} v \\ \ddot{x} \\ \omega \\ \ddot{\theta} \end{bmatrix} \]

Linear control theory requires that our dynamics be linear. As such, we linearize our system about our target stable point (\(x = v = \theta = \omega = 0\)) by computing the Jacobian and forming a linear system:

\[\begin{bmatrix} \dot{x} \\ \dot{v} \\ \dot{\theta} \\ \dot{\omega} \end{bmatrix} \approx \begin{bmatrix} \frac{\partial}{\partial x} v & \frac{\partial}{\partial v} v & \frac{\partial}{\partial \theta} v & \frac{\partial}{\partial \omega} v \\ \frac{\partial}{\partial x} \ddot{x} & \frac{\partial}{\partial v} \ddot{x} & \frac{\partial}{\partial \theta} \ddot{x} & \frac{\partial}{\partial \omega} \ddot{x} \\ \frac{\partial}{\partial x} \omega & \frac{\partial}{\partial v} \omega & \frac{\partial}{\partial \theta} \omega & \frac{\partial}{\partial \omega} \omega \\ \frac{\partial}{\partial x} \ddot{\theta} & \frac{\partial}{\partial v} \ddot{\theta} & \frac{\partial}{\partial \theta} \ddot{\theta} & \frac{\partial}{\partial \omega} \ddot{\theta} \end{bmatrix} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix} + \begin{bmatrix} \frac{\partial}{\partial F} v \\ \frac{\partial}{\partial F} \ddot{x} \\ \frac{\partial}{\partial F} \omega \\ \frac{\partial}{\partial F} \ddot{\theta} \end{bmatrix} F \]

I used Symbolics.jl to do the heavy lifting for me, which resulted in:

\[\begin{bmatrix} \dot{x} \\ \dot{v} \\ \dot{\theta} \\ \dot{\omega} \end{bmatrix} \approx \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{-m_p g}{\frac{4}{3}M – m_p} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & \frac{Mg}{\ell \left(\frac{4}{3}M – m_p\right)} & 0 \end{bmatrix} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{\frac{4}{3}}{\frac{4}{3}M – m_p} \\ 0 \\ -\frac{1}{\ell\left(\frac{4}{3}M – m_p\right)}\end{bmatrix} F\]

where \(M = m_p + m_c\).

We can produce a linear controller \(F = K s\) for our linear system. In this case, \(K\) is simply a vector that determines the feedback gains for each of the state components.

I am using a basic cart pole environment, with \(\Delta t = 0.02, m_c = 1, m_p = 0.1, g = 9.8\). After some experimentation, \(K = [1.0, -0.5, 100.0, 1.0]\) worked pretty well. Here is how the little cart-pole guys is doing with that:

Much better! We’ve got the little guy balancing around his stable point. As long as he starts relatively close to stability, he’ll pop right back. And we did it with some math and tuning.

Automatic Tuning

Unfortunately, our gain matrix \(K\) sort of came out of nowhere. Ideally there would be a better way to derive one.

Fortunately for us, there is. We are going to derive the gain matrix using the optimal policy for linear quadratic regulator (LQR) problems. (These are covered in section 7.8 of Alg4DM.)

An LQR problem has linear dynamics:

\[s^{(k+1)} = T_s s^{(k)} + T_a a^{(k)} + w\]

where \(T_s\) and \(T_a\) are matrices, \(s\) is our state, \(a\) is our action, and \(w\) is zero-mean noise.

We already have the equation \(s^{(k+1)} = s^{(k)} + \dot{s}^{(k)} \Delta t\), and can get our linear dynamics that way:

\[\begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k+1)} = \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k)} + \left(\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{-m_p g}{\frac{4}{3}M – m_p} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & \frac{Mg}{\ell \left(\frac{4}{3}M – m_p\right)} & 0 \end{bmatrix} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k)} + \begin{bmatrix} 0 \\ \frac{\frac{4}{3}}{\frac{4}{3}M – m_p} \\ 0 \\ -\frac{1}{\ell\left(\frac{4}{3}M – m_p\right)}\end{bmatrix} F^{(k)}\right) \Delta t + w\]

which simplifies to:

\[\begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k+1)} = \begin{bmatrix} 1 & \Delta t & 0 & 0 \\ 0 & 1 & \frac{-m_p g \Delta t}{\frac{4}{3}M – m_p} & 0 \\ 0 & 0 & 1 & \Delta t \\ 0 & 0 & \frac{M g \Delta t}{\ell \left(\frac{4}{3}M – m_p\right)} & 1 \end{bmatrix} \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}^{(k)} + \begin{bmatrix} 0 \\ \frac{\frac{4}{3} \Delta t}{\frac{4}{3}M – m_p} \\ 0 \\ -\frac{\Delta t}{\ell\left(\frac{4}{3}M – m_p\right)}\end{bmatrix} F^{(k)} + w\]

An LQR problem additionally has quadratic rewards:

\[R(s,a) = s^\top R_s s + a^\top R_a a\]

which is accumulated across all timesteps.

As engineers, we get to choose the reward function that best captures quality in our problem, and then the LQR machinery will produce an optimal policy with respect to that reward function. Note that the LQR process only works if \(R_s\) and \(R_a\) overall produce costs, which amounts to \(R_s\) being negative semidefinite and \(R_a\) being negative definite.

Let’s simply penalize deviations from zero, with more weight given to the linear and angular positions:

\[R_s = \begin{bmatrix} -5 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -5 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix} \qquad R_s = \begin{bmatrix} -1 \end{bmatrix}\]

and the value of \(R_s\) determines how much to penalize large magnitudes of force.

If we plug these into an LQR solver, crank up the horizon, and pick the maximum-horizon control matrix, we get \(K = [2.06942, 3.84145, 43.32, 15.745]\). And if we load that up into our little robot, we get:

Wow! That’s pretty darn stable.

Let’s show it with a more adverse starting state:

Very cool.

Conclusion

To recap, we’ve just used LQR theory to derive our controller rather than just hand tune one. We got something procedural that works a lot better than what I got by hand. (I even got the sign on velocity wrong in my original gain matrix). Notice that we still had to specify something by hand, namely the rewards, but they were much easier to specify to still get something good.

This concludes this blog post, but it doesn’t conclude our cart-pole shenanigans. We’ve helped Tippy the Robot, but only with control theory. What if we don’t have continuous force output? What if we don’t have an unbounded output force? What if we want to be able to swing up from bad starting states? What if we don’t know the equations of motion? We’ll take a stab at it next time.

Convex Duality

Convex duality is the sort of thing where you see the symbolic derivation and it sort of makes sense, but then you forget the steps and a day or two later you just sort of have to trust yourself that the theorems and facts hold but forget where they really come from. I recently relearned convex duality from the great lectures of Stephen Boyd, and in so doing, also learned about a neat geometry interpretation that helps solidify things.

Working with Infinite Steps

Everything starts with an optimization problem in standard form:

\[\begin{matrix}\underset{\boldsymbol{x}}{\text{minimize}} & f(\boldsymbol{x}) \\ \text{subject to} & \boldsymbol{g}(\boldsymbol{x}) \leq \boldsymbol{0} \\ & \boldsymbol{h}(\boldsymbol{x}) = \boldsymbol{0} \end{matrix}\]

This problem has an objective \(f(\boldsymbol{x})\), some number of inequality constraints \(\boldsymbol{g}(\boldsymbol{x}) \leq \boldsymbol{0}\), and some number of equality constraints \(\boldsymbol{h}(\boldsymbol{x}) = \boldsymbol{0}\).

Solving an unconstrained optimization problem often involves sampling the objective and using its gradient, or several local values, to know which direction to head in to find a better design \(\boldsymbol{x}\). Solving a constrained optimization problem isn’t as straightforward – we cannot accept a design that violates any constraints.

We can, however, view our constrained optimization problem as an unconstrained optimization problem. We simply incorporate our constraints into our objective, setting them to infinity when the constraints are violated and setting them to zero when the constraints are met:

\[\underset{\boldsymbol{x}}{\text{minimize}} \enspace f(\boldsymbol{x}) + \sum_i \infty \cdot \left( g_i(\boldsymbol{x}) > 0\right) + \sum_j \infty \cdot \left( h_j(\boldsymbol{x}) \neq 0\right)\]

For example, if our problem is

\[\begin{matrix}\underset{x}{\text{minimize}} & x^2 \\ \text{subject to} & x \geq 1 \end{matrix}\]

which looks like this

then we can rewrite it as

\[\underset{x}{\text{minimize}} \enspace x^2+\infty \cdot \left( x < 1\right)\]

which looks like

We’ve simply raised the value of infeasible points to infinity. Problem solved?

By doing this, we made our optimization problem much harder to solve with local methods. The gradient of designs in the infeasible region is zero, so we cannot use those to know which way to progress toward feasibility. It is also discontinuous. Furthermore, infinity is rather difficult to work with – it just produces more infinities when added or multiplied to most numbers. This new optimization problem is perhaps simpler, but we’ve lost a lot of information.

What we can do instead is reinterpret our objective slightly. What if, instead of using this infinite step, we simply multiplied our inequality function \(g(x) = 1 – x\) by a scalar \(\lambda\)?

\[\underset{x}{\text{minimize}} \enspace x^2+ \lambda \left( 1 – x \right)\]

If we have a feasible \(x\), then \(\lambda = 0\) gives us the original objective function. If we have an infeasible \(x\), then \(\lambda = \infty\) gives us the penalized objective function.

We’re taking the objective and constraint, which are separate functions:

and combining them into a single objective based on \(\lambda\):

Let’s call our modified objective function the Lagrangian:

\[\mathcal{L}(x, \lambda) = x^2+ \lambda \left( 1 – x \right)\]

If we fix \(\lambda\), we can minimize our objective over \(x\). I am going to call these dual values:

We find that all of these dual values are below the optimal value of the original problem:

This in itself is a big deal. We can produce a lower bound on the optimal value simply by minimizing the Lagrangian. Why is this?

Well, it turns out that we are guaranteed to get a lower bound any time that \(\lambda \geq 0\). This is because a positive \(\lambda\) ensures that our penalized objective is at or below the original objective for all feasible points. Minimizing is thus guaranteed to give us a design at or below the original value.

This is such a big deal, that we’ll call this minimization problem the Lagrange dual function, and give it a shorthand:

\[\mathcal{D}(\lambda) = \underset{x}{\text{minimize}} \enspace \mathcal{L}(x, \lambda)\]

which for our problem is

\[\mathcal{D}(\lambda) = \underset{x}{\text{minimize}} \enspace x^2 + \lambda \left( 1 – x \right)\]

Okay, so we have a way to get a lower bound. How about we get the *best* lower bound? That gives rise to the Lagrange dual problem:

\[\underset{\lambda \geq 0}{\text{maximize}} \enspace \mathcal{D}(\lambda)\]

For our example problem, the best lower bound is given by \(\lambda = 2\), which produces \(x = 1\), \(f(x) = 1\). This is the solution to the original problem.

Having access to a lower bound can be incredibly useful. It isn’t guaranteed to be useful (the lower bound can be \(-\infty\)), but it can also be relatively close to optimal. For convex problems, the bound is tight – the solution to the Lagrange dual problem has the same value as the solution to the original (primal) problem.

A Geometric Interpretation

We’ve been working with a lot of symbols. We have some plots, but I’d love to build further intuition with a geometric interpretation. This interpretation also comes from Boyd’s lectures, and textbook.

Let’s work with a slightly beefier optimization problem with a single inequality constraint:

\[\begin{matrix}\underset{\boldsymbol{x}}{\text{minimize}} & \|\boldsymbol{A} \boldsymbol{x} – \boldsymbol{b} \|_2^2 \\ \text{subject to} & x_1 \geq c \end{matrix}\]

In this case, I’ll use

\[\boldsymbol{A} = \begin{bmatrix} 1 & 0.3 \\ 0.3 & 2\end{bmatrix} \qquad \boldsymbol{b} = \begin{bmatrix} 0.5 \\ 1.5\end{bmatrix} \qquad c = 1.5\]

Let’s plot the objective, show the feasible set in blue, and show the optimal solution:

Our Lagrangian is again comprised of two terms – the objective function \( \|\boldsymbol{A} \boldsymbol{x} – \boldsymbol{b} \|_2^2\) and the constraint \(\lambda (c – x_1)\). We can build a neat geometric interpretation of convex duality by plotting (constraint, objective) pairs for various values of \(\boldsymbol{x}\):

We get a parabolic region. This isn’t a huge surprise, given that we have a quadratic objective function.

Note that the horizontal axis is the constraint value. We only have feasibility when \(g(\boldsymbol{x}) \leq 0\), so our feasible designs are those left of the vertical line.

I’ve already plotted the optimal value – it is at the bottom of the feasible region. (Lower objective function values are better, and the bottom has the lowest value).

Let’s see if we can include our Lagrangian. The Lagrangian is:

\[\mathcal{L}(\boldsymbol{x}, \lambda) = f(\boldsymbol{x}) + \lambda g(\boldsymbol{x})\]

If we fix \(\boldsymbol{x}\), it is linear!

The Lagrange dual function is

\[\mathcal{D}(\lambda) = \underset{\boldsymbol{x}}{\text{minimize}} \enspace f(\boldsymbol{x}) + \lambda g(\boldsymbol{x})\]

If we evaluate \(\mathcal{D}\) for a particular \(\lambda\), we’ll minimize and get a design \(\hat{\boldsymbol{x}}\). The value of \(\mathcal{D}(\lambda)\) will be \(f(\hat{\boldsymbol{x}}) + \lambda g(\hat{\boldsymbol{x}})\). Remember, for \(\lambda \geq 0\) this is a lower bound. Said lower bound is the y-intercept of the line with slope \(-\lambda\) that passed through our (constraint, objective) at \(\hat{\boldsymbol{x}}\).

For example, with \(\lambda = 0.75\), we get:

We got a lower bound, and it is indeed lower than the true optimum. Furthermore, that line is a supporting hyperplane of our feasible region.

Let’s try it again with \(\lambda = 3\):

Our \(\lambda\) is still non-negative, so we get a valid lower bound. It is below our optimal value.

Now, in solving the Lagrange dual problem we get the best lower bound. This best lower bound just so happens to correspond to the \(\lambda\) that produces a supporting hyperplane at our optimum (\(\lambda \approx 2.15\)):

Our line is tangent at our optimum. This ties in nicely with the KKT conditions (out of scope for this blog post), which require that the gradient of the Lagrangian be zero at the optimum (for a problem with a differentiable objective and constraints), and thus the gradients of the objective and the constraints (weighted by \(\lambda\)) balance out.

Notice that in this case, the inequality constraint was active, with our optimum at the constraint boundary with \(g(\boldsymbol{x}) = 0\). We needed a positive \(\lambda\) to balance against the objective function gradient.

If we change our problem such that the unconstrained optimum is feasible, then we do not need a positive dual value to balance against the objective function gradient. Let’s adjust our constraint to max our solution feasible:

We now get an objective / constraint region with a minimum inside the feasible area:

Our best lower bound is obtained with \(\lambda = 0\):

This demonstrates complementary slackness, the idea that \(\lambda_i g_i(\boldsymbol{x})\) is always zero. If constraint \(i\) is active, \(g_i(\boldsymbol{x}) = 0\);\ and if constraint \(i\) is inactive, \(\lambda_i = 0\).

So with this geometric interpretation, we can think of choosing \(\lambda\) as picking the slope of a line that supports this multidimensional volume from below and to the left. We want the line with the highest y-intercept.

Multiple Constraints (and Equality Constraints)

I started this post off with a problem with multiple inequality and equality constraints. In order to be thorough, I just wanted to mention that the Lagrangian and related concepts extend the way you’d expect:

\[\mathcal{L}(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f(\boldsymbol{x}) + \sum_i \lambda_i g_i(\boldsymbol{x}) + \sum_j \nu_j h_j(\boldsymbol{x})\]

and

\[\mathcal{D}(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \underset{\boldsymbol{x}}{\text{maximize}} \enspace \mathcal{L}(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})\]

We require that all \(\lambda\) values be nonnegative, but do not need to restrict the \(\nu\) values for the equality constraints. As such, our Lagrangian dual problem is:

\[\underset{\boldsymbol{\lambda} \geq \boldsymbol{0}, \>\boldsymbol{\nu}}{\text{maximize}} \enspace \mathcal{D}(\boldsymbol{\lambda}, \boldsymbol{\nu})\]

The geometric interpretation still holds – it just has more dimensions. We’re just finding supporting hyperplanes.

Embedded Contours with Marching Squares

I recently started working on new chapters for Algorithms for Optimization, starting with a new chapter on quadratic programs. We already have a chapter on linear programs, but quadratic programs are sufficiently different and are ubiquitous enough that giving them their own chapter felt warranted.

A simple example quadratic program.

The chapter starts with a general formulation for quadratic programs and transforms the problem a few times into simpler and simpler formulations, until we get a nonnegative least squares problem. This can be solved, and then we can use that solution to back our way up the transform stack back to a solution of our original solution.

In texts, these problem transforms are typically solely the realm of mathematical manipulations. While math is of course necessary, I also tried hard to depict what these various problem representations looked like to help build an intuitive understanding of what some of the transformations were. In particular, there was one figure where I wanted to show how the equality constraints of a general-form quadratic program could be removed when formulating a lower-dimensional least squares program:

A 3d general-form quadratic program with an equality constraint on the left, and a 2d least squares program on the right.

This post is about this figure, and more specifically, about the blue contour lines on the left side.

What is this Figure?

The figure on left shows the following quadratic program:

general-form quadratic program

As we can see, the quadratic program is constrained to lie within a square planar region \(0 \leq x \leq 1\), \(0 \leq x_3 \leq 1\), and \(x_2 = x_3\). The feasible set is shown in blue.

The right figure on the right shows the least-squares program that we get when we factor out the equality constraint. No need to get into the details here, it isn’t really important for this post, but we can use our one equality equation to remove one variable in our original problem, producing a 2d problem without any equality constraints:

least squares program

The original feasible set is rotated and stretched, and the objective function contours are rotated and stretched, but other than that we have a similar topology and a pretty standard problem that we can solve.

How do we Make the Figures?

I make the plots with PGFPlots.jl, which uses the pgfplots LaTeX package under the hood. It is a nice synergy of Julia coding and LaTeX within the LaTeX textbook.

Right Plot

The right plot is pretty straight forward. It has a contour plot for the objective function:

Plots.Contour(x -> norm(A_lsi*x - b_lsi, 2),
              xdomain, ydomain, xbins=600, ybins=600)

some over + under line plots and a fill-between for the feasible region:

x1_domain_feasible = (0.0, (0.5+√2)/2)
push!(plots,
      Plots.Linear(x1->max(-x1, x1 - sqrt(2)), x1_domain_feasible,
                   style="name path=A, draw=none, mark=none"))
push!(plots,
      Plots.Linear(x1->min(x1, 1/2-x1), x1_domain_feasible,
                   style="name path=B, draw=none, mark=none"))
push!(plots,
      Plots.Command("\\addplot[pastelBlue!50] fill between[of=A and B]"))

and a single-point scatter plot plus a label for the solution:

push!(plots,
      Plots.Scatter([x_val[1]], [x_val[2]],
      style="mark=*, mark size=1.25, mark options={solid, draw=black, fill=black}"))
push!(plots,
      Plots.Command("\\node at (axis cs:$(x_val[1]),$(x_val[2])) [anchor=west] {\\small \\textcolor{black}{\$y_2^\\star\$}}"))

These plots get added to an Axis and we’ve got our plot.

Left Plot

The left plot is in 3D. If we do away with the blue contour lines, its actually even simpler than the right plot. We simply have a square patch, a single-point scatter plot, and our label:

Axis([
   Plots.Command("\\addplot3[patch,patch type=rectangle, faceted color=none, color=pastelBlue!40] coordinates {(0,0,0) (1,0,0) (1,1,1) (0,1,1)}"),
   Plots.Linear3([x_val[1]], [x_val[2]], [x_val[3]], mark="*", style="mark size=1.1, only marks, mark options={draw=black, fill=black}"),
   Plots.Command("\\node at (axis cs:$(x_val[1]),$(x_val[2]),$(x_val[3])) [anchor=west] {\\small \\textcolor{black}{\$x^\\star\$}}"),
  ], 
  xlabel=L"x_1", ylabel=L"x_2", zlabel=L"x_3", width="1.2*$width", 
  style="view={45}{20}, axis equal",
  legendPos="outer north east"
)

The question is – how do we create that contour plot, and how do we get it to lie on that blue patch?

How do you Draw Contours?

Contour plots don’t seem difficult to plot, until you have to sit down and plot them. They are just lines of constant elevation – in this cases places where \(\|Ax – b\| = c\) for various constants \(c\). But, unlike line plots, where you can just increase \(x\) left to right and plot successive \((x, f(x))\) pairs, for contour plots you don’t know where your line is going to be ahead of time. And that’s assuming you only have one line per contour value – you might have multiple:

A contour plot with multiple contour lines per level.

So, uh, how do we do this?

The first thing I tried was trying to write my function out the traditional way – as \((x_1, f(x_1))\) pairs. Unfortunately, this doesn’t work out, because the contours aren’t 1:1.

The second thing I briefly considered was a search method. If I fix my starting point, say, \(x_1 = 0, x_2 = 0.4\), I can get the contour level \(c = \|Ax – b\| \). Then, I could search toward the right in an arc to find where the next contour line was, maybe bisect over an arc one step distance away and proceed that way:

An iterative search procedure for following a contour line, proceeding left to right.

I think this approach would have worked for this figure, since there were never multiple contour lines for the same contour value, but it would have been more finicky, especially around the boundaries, and it would not have generalized to other figures.

Marching Squares

I happened upon a talk by Casey Muratori on YouTube, “Papers I have Loved“, a few weeks ago. One of the papers Casey talked about was Marching Cubes, A High Resolution 3D Surface Construction Algorithm. This 1987 paper by W. Lorenson and H. Cline presented a very simply way to generate surfaces (in 3D or 2D) in places of constant value: \(f(x) = c\).

Casey talked about it in the context of rendering metaballs, but when I found myself presented with the contour problem, it was clear that marching cubes (or squares, rather, since its 2d) was relevant here too.

The algorithm is a simple idea elegantly applied.

The contour plot we want to make is on a 2d surface in a 3d plot, but let’s just consider the 2d surface for now. Below we show that feasible square along with our objective function, \(f(x) = \|Ax – b\|\):

An image plot showing our object function. We want to find contour lines.

Our objective is to produce contour lines, the same sort as are produced when we make a PGFPlots contour plot:

We already know that there might be multiple connected contour lines for a given level, so a local search method won’t be sufficient. Instead, marching squares uses a global approach combined with local refinement.

We begin by evaluating a grid of points over our plot:

We’ll start with a small grid, but for higher-quality images we’d make it finer.

Let’s focus on the contour line \(f(x) = 3\). We now determine, for each vertex in our mesh, whether its value is above or below our contour target of 3. (I’ll plot the contour line too for good measure):

Blue points are where the objective function <= 3, and yellow points are where it is > 3.

Marching squares will now tile the image based on each 4-vertex square, and the binary values we just evaluated. Any squares with all vertices equal can simply be ignored – the contour line doesn’t pass through them. This leaves the following tiles:

The white tiles are those that contain our contour line(s).

Now, each remaining tile comes in multiple possible “flavors” – the various ways that its vertices can be above or below the objective value. There are \(2^4\) possible tiles, two of which we’ve already discarded, leaving 14 options. The Wikipedia article has a great set of images showing all of the tile combinations, but here we will concern ourselves with two more kinds: tiles with one vertex below or one vertex above, and tiles with two neighboring vertices above and two neighboring vertices below:

The white squares have one vertex above or one vertex below. The gray squares have two neighboring vertices above and two neighboring vertices below.

Let’s start with the first type. These squares have two edges where the function value changes. We know, since we’re assuming our objective function is continuous, that our contour line must pass through those edges. As such, if we run a bisection search along each of those edges, we can quickly identify those crossing points:

The red points are where bisection search has found a crossing point.

We then get our contour plot, for this particular contour value, by drawing all of the line segments between these crossing points:

We can get higher resolution contour lines by using a finer grid:

The same contour line with double the resolution.

To get our full contour plot, we simply run marching squares on multiple contour levels. To get the original plot, I run this algorithm to get all of the little tile line segments, then I render them as separate Linear3 line segments on top of the 3d patch. Pretty cool!

The created figure

What does the Code Look Like?

The code for marching squares is pretty elegant.

We start off with defining our grid:

nbins = 41
x1s = collect(range(0.0, stop=1.0, length=nbins))
x2s = collect(range(0.0, stop=1.0, length=nbins))
is_above = falses(length(x1s), length(x2s))

Next we define a bisection method for quickly interpolating between two vertices:

function bisect(
    u_target::Float64, x1_lo::Float64, x1_hi::Float64, 
    x2_lo::Float64, x2_hi::Float64, k_max::Int=8)
    f_lo = norm(A*[x1_lo, x2_lo, x2_lo]-b)
    f_hi = norm(A*[x1_hi, x2_hi, x2_hi]-b)
 
    if f_lo > f_hi # need to swap low and high
        x1_lo, x1_hi, x2_lo, x2_hi = x1_hi, x1_lo, x2_hi, x2_lo
    end
 
    x1_mid = (x1_hi + x1_lo)/2
    x2_mid = (x2_hi + x2_lo)/2
    for k in 1 : k_max
        f_mid = norm(A*[x1_mid, x2_mid, x2_mid] - b)
        if f_mid > u_target
            x1_hi, x2_hi, f_hi = x1_mid, x2_mid, f_mid
        else
            x1_lo, x2_lo, f_lo = x1_mid, x2_mid, f_mid
        end
        x1_mid = (x1_hi + x1_lo)/2
        x2_mid = (x2_hi + x2_lo)/2
    end
 
    return (x1_mid, x2_mid)
end

In this case I’ve baked in the objective function, but you could of course generalize this and pass one in.

We then iterate over our contour levels. For each contour level (u_target), we evaluate the grid and then evaluate all of our square patches and add little segment plots as necessary:

for (i,x1) in enumerate(x1s)
    for (j,x2) in enumerate(x2s)
        is_above[i,j] = norm(A*[x1, x2, x2] - b) > u_target
    end
end
for i in 2:length(x1s)
    for j in 2:length(x2s)
        # bottom-right, bottom-left, top-right, top-left
        bitmask = 0x00
        bitmask |= (is_above[i,j-1] << 3)
        bitmask |= (is_above[i-1,j-1] << 2)
        bitmask |= (is_above[i,j] << 1)
        bitmask |= (is_above[i-1,j])
        plot = true
        x1_lo1 = x1_hi1 = x2_lo1 = x2_hi1 = 0.0
        x1_lo2 = x1_hi2 = x2_lo2 = x2_hi2 = 0.0
        if bitmask == 0b1100 || bitmask == 0b0011
            # horizontal cut
            x1_lo1, x1_hi1, x2_lo1, x2_hi1 = x1s[i-1], x1s[i-1], x2s[j-1], x2s[j]
            x1_lo2, x1_hi2, x2_lo2, x2_hi2 = x1s[i], x1s[i], x2s[j-1], x2s[j]
        elseif bitmask == 0b0101 || bitmask == 0b1010
            # vertical cut
            x1_lo1, x1_hi1, x2_lo1, x2_hi1 = x1s[i-1], x1s[i], x2s[j-1], x2s[j-1]
            x1_lo2, x1_hi2, x2_lo2, x2_hi2 = x1s[i-1], x1s[i], x2s[j], x2s[j]
        elseif bitmask == 0b1000 || bitmask == 0b0111
            # cut across bottom-right
            x1_lo1, x1_hi1, x2_lo1, x2_hi1 = x1s[i-1], x1s[i], x2s[j-1], x2s[j-1]
            x1_lo2, x1_hi2, x2_lo2, x2_hi2 = x1s[i], x1s[i], x2s[j-1], x2s[j]
        elseif bitmask == 0b0100 || bitmask == 0b1011
            # cut across bottom-left
            x1_lo1, x1_hi1, x2_lo1, x2_hi1 = x1s[i-1], x1s[i], x2s[j-1], x2s[j-1]
            x1_lo2, x1_hi2, x2_lo2, x2_hi2 = x1s[i-1], x1s[i-1], x2s[j-1], x2s[j]
        elseif bitmask == 0b0010 || bitmask == 0b1101
            # cut across top-right
            x1_lo1, x1_hi1, x2_lo1, x2_hi1 = x1s[i-1], x1s[i], x2s[j], x2s[j]
            x1_lo2, x1_hi2, x2_lo2, x2_hi2 = x1s[i], x1s[i], x2s[j-1], x2s[j]
        elseif bitmask == 0b0001 || bitmask == 0b1110
            # cut across top-left
            x1_lo1, x1_hi1, x2_lo1, x2_hi1 = x1s[i-1], x1s[i], x2s[j], x2s[j]
            x1_lo2, x1_hi2, x2_lo2, x2_hi2 = x1s[i-1], x1s[i-1], x2s[j-1], x2s[j]
        else
            plot = false
        end
        if plot
            x1a, x2a = bisect(u_target, x1_lo1, x1_hi1, x2_lo1, x2_hi1)
            x1b, x2b = bisect(u_target, x1_lo2, x1_hi2, x2_lo2, x2_hi2)
            push!(contour_segments, 
                Plots.Linear3([x1a, x1b], [x2a, x2b], [x2a, x2b],
                               style="solid, pastelBlue, mark=none")
            )
        end
    end
end

In this code, all we ever do is figure out which two line segments to bisect over, and if we get then, run the bisection and plot the ensuing line. Its actually quite neat and compact.

Conclusion

So there you have it – a home-rolled implementation for contour plotting turns out to be useful for embedding 2d contours in a 3d plot.

Note that the implementation here wasn’t truly general. In this case, we didn’t handle some tile types, namely those where opposing corners have the same aboveness, but neighboring corners have differing aboveness. In those cases, we need to bisect across all four segments, and we typically need to remove ambiguity by sampling in the tile’s center to determine which way the segments cross.

I find it really satisfying to get to apply little bits of knowledge like this from time to time. Its part of computer programming that is endlessly rewarding. I have often been amazed by how a technique that I learned about in one context surprised me by being relevant in something else later on.

Delaunay Triangulations

I recently read a fantastic blog post about Delaunay triangulations and quad-edge data structures, and was inspired to implement the algorithm myself. The post did a very good job of breaking the problem and the algorithm down, and had some really nice interactive elements that made everything make sense. I had tried to grok the quad-edge data structure before, but this was the first time where it really made intuitive sense.

This post covers more or less the same content, in my own words and with my own visualizations. I highly recommend reading the original post.

What is a Delaunay Triangulation

A triangulation is any breakdown of a set of vertices into a set of non-overlapping triangles. Some triangulations are better than others, and the Delaunay triangulation is a triangulation that makes each triangle as equilateral as possible. As a result, it is the triangulation that keeps all of the points in a given triangle as close to each other as possible, which ends up making it quite useful for a lot of spacial operations, such as finite element meshes.

As we can see above, there are a lot of bad triangulations with a lot of skinny triangles. Delaunay triangulations avoid that as much as possible.

One of the neatest applications of these triangulations is in motion planning. In constrained Delaunay triangulation, you for all walls in your map to be edges in your triangulation, and otherwise construct a normal Delaunay triangulation. Then, your robot can pass through any edge of sufficient length (i.e., if its a wide enough gap), and you can quickly find shortest paths through complicated geometry while avoiding regions you shouldn’t enter:

How to Construct a Delaunay Triangulation

So we want a triangulation that is as equilateral as possible. How, technically, is this achieved?

More precisely, a triangulation is Delaunay if no vertex is inside the circumcircle of any of the triangles. (A circumcircle of a triangle is the circle that passes through its three vertices).

The circumcircles for every triangle in a Delaunay triangulation. No circle contains a vertex.

One way to construct a Delaunay triangulation for a set of points is construct one iteratively. The Guibas & Stolfi incremental triangulation algorithm starts with a valid triangulation and repeatedly adds a new point, joining it into the triangulation and then changing edges to ensure that the mesh continues to be a Delaunay triangulation:

Iterative construction of a Delaunay triangulation

Here is the same sequence, with some details skipped and more points added:

More iterations of the Delaunay triangulation algorithm

This algorithm starts with a bounding triangle that is large enough to contain all future points. The triangle is standalone, and so by definition is Delaunay:

We start with a bounding triangle

Every iteration of the Guibas & Stolfi algorithm inserts a new point into our mesh. We identify the triangle that new point is inside, and connect edges to the new point:

When adding a new point, first join it to its enclosing triangle

Note that in the special case where our new vertex is on an existing edge (or sufficiently close to an existing edge) we remove that edge and introduce four edges instead:

By introducing these new edges we have created three (or four) new triangles. These new triangles may no longer fulfill the Delaunay requirement of having vertices that are within some other triangle’s circumcircle.

Guibas & Stolfi noticed that the only possible violations occur between the new point and those opposite its surrounding face. That is, we need only walk around the edges of the triangle (or quadrilateral) that we inserted it into, and check whether we should flip those border edges in order to preserve the Delaunay condition.

For example, for the new vertex above, we only need to check the four surrounding edges:

Here is an example where we need to flip an edge:

Interestingly, flipping an edge simply adds a new spoke to our new vertex. As such, we continue to only need to check the perimeter edges around the hub. This guarantees that we won’t propagate out and have to spend a lot of time checking edges, but instead just iterate around our one new point.

That’s one iteration of the algorithm. After that you just start the next iteration. Once you’re done adding points you remove the bounding triangle and you have your Delaunay triangulation.

The Data Structure

So the concept is pretty simple, but we all know that the devil is in the details. What really excited me about the Delaunay triangulation was the data structure used to construct it.

If we look at the algorithm, we can see what we might need our data structure to support:

  • Find the triangle enclosing a new point
  • Add new edges
  • Find vertices across from edges
  • Walk around the perimeter of a face

Okay, so we’ve got two concepts here – the ability to think about edges in our mesh, but also the ability to reason about faces in our mesh. A face, in this case, is usually a triangle, but it could also sometimes be a quadrilateral, or something larger.

The naive approach would be to represent our triangulation (i.e. mesh) as a a graph. That is, a list of vertices and a way to represent edge adjacency. Unfortunately, this causes problems. The edges would need to be ordered somehow to allow us to navigate the mesh.

The data structure of choice is called a quad-edge tree. It gives us the ability to quickly navigate both clockwise or counter-clockwise around vertices, or clockwise or counter-clockwise around faces. At the same time, quad-edge trees make it easy for new geometry to be stitched in or existing geometry to be removed.

Many of the drawings and explanations surrounding quad-edge trees are a bit complicated and hard to follow. I’ll do my best to make this explanation as tractable as possible.

A quad-edge tree for 2D data represents a vertex mesh on a sphere:

A 2D mesh consisting of three vertices and three edges, on a 2D closed surface.

Said sphere is mostly irrelevant for what we are doing. In fact, we can think of it as very large. However, there is one important consequence of being on a sphere. Now, every edge between two vertices is between two faces. For example, the simple 2D mesh above has two faces:

Our 2D mesh has two faces – an outer face and an inner face. Each edge separates these two faces.

The quad-edge data structure can leverage this by associating each edge both with its two vertices and with its two faces, hence the name. Each real edge is represented as four directional edges in the quad-edge data structure: two directional edges between vertices and two directional edges between faces.

The quad-edge data structure represents each edge with four directional edges, two between vertices and two between faces.

We can thus think of each face as being its own kind of special vertex:

The purple vertices and arrows here are associated with the faces. We have one vertex for the inner face and one for the outer face.

As such, the quad-edge data structure actually contains two graphs – the directed graph of mesh edges and the dual graph that connects mesh faces. Plus, both of these graphs have inter-connectivity information in that each directed edge knows its left and right neighbors.

Okay, so pretty abstract. But, this gives us what we want. We can easily traverse the mesh by either flipping between the directed edges for a given mesh edge, or rotating around the base vertices for a given directed edge:

So now we are thinking about a special directed graph where each directed edge knows its left and right-hand neighbors and the other three directed edges that make up the edge in the original mesh:

struct QuarterEdge
    data::VertexIndex # null if a dual quarter-edge
    next_in_same_edge_on_right::QuarterEdgeIndex
    next_in_same_edge_on_left::QuarterEdgeIndex
    next_with_same_base_on_right::QuarterEdgeIndex
    next_with_same_base_on_left::QuarterEdgeIndex
end
struct Mesh
    vertices::Vector{Point}
    edges::Vector{QuarterEdge}
end

We have to update these rotational neighbors whenever we insert new edges into the mesh. It turns out we can save a lot of headache by dropping our left-handed neighbors:

struct QuarterEdge
    data::VertexIndex # null if a dual quarter-edge
    rot::QuarterEdgeIndex # next in same edge on right
    next::QuarterEdgeIndex # next with same base on right
end

We can get the quarter-edge one rotation to our left within the same edge simply by rotating right three times. Similarly, we can get the quater-edge one rotation to our left with the same base vertex by following rot – next – rot:

Rotating left around a vertex by following rot – next – rot.

Now we have our mesh data structure. When we add or remove edges, we have to be careful (1) to always construct new edges using four quarter edges whose rot entries point to one another, and (2) to update the necessary next entries both in the new edge and in the surrounding mesh. There are some mental hurdles here, but it isn’t too bad.

Finding the Enclosing Triangle

The first step of every Delaunay triangulation iteration involves finding the triangle enclosing the new vertex. We can identify a triangle using a quarter edge q whose base is associated with that face. If we do that, its three vertices, in counter-clockwise order, are:

  • A = q.rot.data
  • B = q.next.rot.data
  • C = q.next.next.rot.data

(Note that in code this looks a little different, because in Julia we would be indexing into the mesh arrays rather than following pointers like you would in C, but you get the idea).

Suppose we start with some random face and extract its vertices A, B, and C. We can tell if our new point P is inside ABC by checking whether P is to the left of the ray AB, to the left of the ray BC, and to the left of the ray CA:

Checking whether P is left of a ray AB is the same as asking whether the triangle ABP has a counter-clockwise ordering. Conveniently, the winding of ABP is given by the determinant of a matrix:

\[\text{det}\left(\begin{bmatrix} A_x & A_y & 1 \\ B_x & B_y & 1 \\ P_x & P_y & 1 \end{bmatrix} \right)\]

If this value is positive, then we have right-handed (counter-clockwise) winding. If it is negative, then we have left-handed (clockwise) winding. If it is zero, then our points are colinear.

If our current face has positive windings then it encloses our point. If any winding is negative, the we know we can cross over that edge to get closer to our point.

Here triangle ABP and BCP have positive windings, but triangle CAP has a negative winding. As such, we need to cross the blue edge to get closer to P.

We simply iterate in this manner until we find our enclosing face. (I find this delightfully clean.)

Is a Point inside a Circumcircle

Checking whether vertices are inside circumcircles of triangles is central to the Delaunay triangulation process. The inspiring blog post has great coverage of why the following technique works, but if we want to check whether a point P is inside the circumcircle that passes through the (counter-clockwise-oriented) points of triangle ABC, then we need only compute:

\[\text{det}\left(\begin{bmatrix} A_x & A_y & A_x^2 + A_y^2 & 1 \\ B_x & B_y & B_x^2 + B_y^2 & 1 \\ C_x & C_y & C_x^2 + C_y^2 & 1 \\ P_x & P_y & P_x^2 + P_y^2 & 1 \end{bmatrix}\right)\]

If this value is positive, then P is inside the circumcircle. If it is negative, then P is external. If it is zero, then P is on its perimeter.

Some Last Implementation Details

I think that more or less captures the basic concepts of the Delaunay triangulation. There are a few more minor details to pay attention to.

First off, it is possible that our new point is exactly on an existing edge, rather than enclosed inside a triangle. When this happens, or when the new point is sufficiently close to an edge, it is often best to remove that edge and add four new edges. I illustrate this earlier in the post. That being said, make sure to never remove / split an original bounding edge. It is, after all, a bounding triangle.

Second, after we introduce the new edges we have to consider flipping edges around the perimeter of the enclosing face. To do this, we simply rotate around until we arrive back where we started. The special cases here are bounding edges, which we never flip, and edges that connect to our bounding vertices, which we don’t flip if it would produce new triangles with clockwise windings:

This newly added point (yellow) has an edge (also yellow) that cannot be flipped.

Once you’re done adding vertices, you remove the bounding triangle and any edges connected to it.

And all that is left as far as implementation details are concerned is the implementation itself! I put this together based on that other blog post as a quick side project, so it isn’t particularly polished. (Your mileage may vary). That being said, its well-visualized and I generally find reference implementations to be nice to have.

You can find the code here.

Conclusion

I thought the quad-edge data structure was really interesting, and I bet with some clever coding they could be really efficient as well. I really liked how solving this problem required geometry and visualization on top of typical math and logic skills.

Delaunay triangulations have a lot of neat applications. A future step might be to figure out constrained Delaunay triangulations for path-finding, or to leverage the dual graph to generate Voronoi diagrams.

How We Wrote Another Textbook

Hard copies of Algorithms for Decision Making (available in PDF form here for free) are shipping on August 16th! This book, the second one I have been fortunate to help write, was a large undertaking that spanned 4.5 years.

Back when Algorithms for Optimization came out, I wrote a blog post that detailed aspects of the special nature of our development setup. This blog post, 4 years later, is a look at how those aspects have changed for the second book.

But first, what is Algorithms for Decision Making, and how is it different than Algorithms for Optimization?

What Alg4DM is About

From the book website. I made the nifty cover art based on a POMDP solver evaluation tree.

Algorithms for Optimization covers the sort of decision making theory that one thinks of when considering modern artificial intelligence. Think Go AIs, aircraft collision avoidance, and using drones for wildfire suppression.

The book’s central focus is on the Markov Decision Process, or MDP. The MDP framework and its variants are simply general ways to frame decision making problems. For a problem to be represented as an MDP, you have to specify the problem’s states, its actions, how actions take you to successor states, and what rewards (or costs) you incur when doing so. Simple. The game 2048 can be formulated as an MDP, but so too can the problem of balancing investment portfolios.

Without going into too much detail (there’s an entire book, after all), the core insight of the MDP framework is that problems can be solved with dynamic programming according to the Bellman equation. All sorts of techniques stem from this, from optimal solvers for small discrete problems to fancy reinforcement learning techniques for large deep neural network policies.

The book has the following progression:

  • Starts with the basics of probability and utility, and introduces Bayesian networks and how to train them. (This is because BNs were used by Mykel for ACAS-X, and are a great way to represent transition models).
  • MDPs, including all sorts of fancy things including policy gradient optimization and actor-critic methods.
  • Model uncertainty, where we do not know the transition and reward functions ahead of time. There is some great content on imitation learning here too.
  • POMDPs, which extend MDPs to domains where we don’t observe the full state.
  • Multiagent systems, which involve problems with multiple decision-making agents. Think a bunch of robots jointly solving a problem together.

How Alg4DM is Different

Alg4DM took about 4.5 years between first commits to hard-copy. There were 3,417 commits to our git repository along with 826 filed issues. The printed book is 700 pages, with 225 color figures. In comparison, Alg4Opt is 500 pages, which makes the new book 40% longer.

While sheer size is a significant factor, there were other contributing complexities that made this book more challenging to write. I am proud to say that we still define the entire book in a LaTeX document and do all of figure generation and typesetting ourselves. We continued to use the Tufte LaTeX template. We typeset the book ourselves with LaTeX (as opposed to the publisher doing the typesetting), control our own lexing and styles, and have a suite of tests. We just had to overcome some additional hurdles to do so.

Interconnected Concepts (and Code)

Alg4DM crucially differs in how interconnected the material is. The chapters build on one another, with code from later chapters using or expanding on code from earlier chapters. These dependencies vastly increased the complexity of the work. We had to find unified representations that worked across the entire book, and any changes in one place could propagate to many others.

For example, the rollout method has an MDP version that accepts states and a POMDP method that accepts beliefs. Many POMDP methods call the state version, and thus depend on previous chapters.

Textbook-wide Codebase

When writing Alg4Opt, every chapter can be compiled alone, with standalone code. This made development pretty easy. We didn’t have any central shared code and could simply rely on pythontex blocks to execute Julia code for us:

A jlcode block that executes Julia code and produces a plot in the first book.

For Alg4DM, we needed later chapters to use the code from earlier chapters. As such, we needed a new tactic.

The approach we settled on was to construct a Julia package that would contain all of our code. This package would include both utility methods like plotting functions and example problem formulations, but also all of the code presented in the textbook. Algorithm blocks like the one below would be automatically detected and their contents exported into our package:

Blocks like the one above existed in Alg4Opt and resulted in typeset code, but that code never actually ran. We usually had extra copies of every algorithm that were used to _actually_ run and produce our figures. By scraping the book for our code, we could ensure that the code that was typeset was the code that we were actually using to generate our figures and run our tests. Furthermore, we could still compile chapters individually, as each chapter could just load the Julia package.

The end result looked basically like this:

Furthermore, we improved our test setup by writing tests directly inside the .tex files alongside everything else. These test blocks would also be parsed out during extraction, and could then be used to test our code:

Pretty simple, and it solved a lot of problems.

Now, it wasn’t without its trade-offs. First, whenever I was working on the book, I would have to remember to first run the extraction script to make sure that I had the latest and greatest. This added overhead, and forgetting to do it caused me some hours of pain.

Second, our base support code was stored in a different repository than the LaTeX for the book. This meant that another author could update the base code and I would have to remember to check out the latest Julia package to get those updates. Again, a pain point. Now that I know how to have Julia packages that aren’t stored in the standard Julia location, we can actually move our Julia package into the main book repo and avoid this issue.

Third, loading one large central package significantly increased book, and individual chapter, compile time. For Alg4Opt, each chapter only included the packages and code it needed. For Alg4DM, every chapter includes the entire support package, which has some hefty dependencies (such as Flux.jl). While we could do more work to split out the biggest dependencies, this fundamental increase in size would definitely still be noticeable. (Yes, package precompilation helps, but only so much).

Overall, I think our solution ended up being sufficient, but there are some improvements we could make moving forward.

Bigger, Fancier Problems

Many of the example problems and implementations behind the figures in Alg4Opt were relatively simple. They were mostly all 2D optimization problems, which are very quick to run. We could compile the book quickly and regenerate the figures every time without much hassle.

In Alg4DM that was no longer the case. Training an agent policy to solve a POMDP problem takes time, even for something as simple as the mountain-car problem. Compiling the book from scratch, including recompilation of all figures from the original Julia code, would take a very, very long time. As such, there are multiple cases where we had to save the .tex code produced for a particular figure, comment out the code that generated it, and include that instead:

An example of how a figure is loaded from its intermediary .tex file rather than being regenerated from source code every time.

This change was pretty seamless and easy to execute. It did not disrupt our workflow much. Furthermore, because we were still producing the PDF from .tex input, the figure would still be recompiled with the latest colors and formatting. The primary incurred overhead was remembering when to regenerate a figure because something underneath changed, such as the algorithm, something it calls, or the problem definition it was run on.

Breaking it Down in ways it hasn’t been Broken Down Before

The content covered in the optimization book was more intuitive for me, in many ways. While there certainly was some synthesis / repackaging of ideas, the amount of ground that had to be covered with respect to the new book was far greater.

The decision making book covers a large span of academic literature. We’ve got Bayesian networks, deep learning, traditional MDPs and POMDPs, imitation learning, reinforcement learning, multiagent planning, Markov games, and more. We needed consistent nomenclature that didn’t have overlapping concepts, while staying as true as possible to the literature. (For example, we spent a long time deciding what the symbol for a conditional plan would be. That issue took over a year to resolve.)

Furthermore, as with the first book, the 2nd book tries to cover the principle components of the algorithmic ideas. That is, we break down the little nuggets of progress. An algorithm like SARSOP has a lot of ideas inside of it, and we break those apart rather than presenting the SARSOP algorithm as a whole. Similarly, 1D optimization is often solved with Brent’s method, but you won’t find Brent’s method in our first book. You will, however, find the bisection, secant, and inverse quadratic interpolation methods – all methods that are used inside of Brent’s method.

Perhaps the greatest contribution this text makes is how we cover policy gradient optimization. We tease many of the concepts apart and present them individually in a way I haven’t seen before. The restricted gradient update naturally evolves into the natural gradient update, which evolves into a trust region update, etc. That chapter, and the actor-critic chapter thereafter, where the hardest ones for me to write, but I am quite happy with the result.

Conclusion

And that’s it! The new book is something I worked very hard on and it came out great. Mykel and Kyle worked very hard on it to, we got a ton of feedback from all sorts of helpful individuals, and we had great support from MIT Press and their editors.

I’ve been really fortunate to have been able to write this book with Mykel and Kyle. Decision-making is Mykel’s specialty, and Kyle has a lot of valuable multiagent knowledge (and is extremely detail-oriented and perceptive). I learned a lot doing the course of Alg4DM’s development, and I hope that its readers can learn a lot as well.