Global Path Planning and Local Refinement

We often want robots, drones, or video game mobs to navigate from one location to another as efficiently as possible, without running into anything.

In such times of need, many of us turn to A*, in which we discretize our problem and make path planning about finding a sequence of discrete steps. But that only solves half the problem – we often end up with a chunky, often zig-zaggy solution:

The actual shortest path doesn’t look like that. It hugs the corners and moves off-axis:

This post will cover one way to solve this problem. It will continue to leverage the power of graph algorithms like A*, but will conduct local path optimization in the second phase to get a shortest path.

Phase 1: Global, Discrete Path Planning

In the first phase, we find a coarse shortest path through a graph. Graphs are incredibly useful, and let us apply discrete approaches to the otherwise continuous world.

For this to work, the planning environment has to be represented as a graph. There are many ways to achieve this, the simplest of which is to just use a grid:

From Optimal Target Assignment and Path Finding for Teams of Agents, showing a Kiva warehouse layout.

For more general terrain it is more common to have tessellated geometry, often in the form of a navigation mesh. Below we see a path for a marine in a StarCraft 2 navmesh:

From RTS Pathfinding 3 by jdxdev, who in turn got it from Pathing it’s not a solved Problem

I first talked about navigation meshes in my post about Delaunay triangulations, and I’ve been using them to represent my game map geometry in TOOM.

Navigation meshes, typically represented using doubly connected edge lists (DCELs), are graphs that represent both the geometry (via primal vertices and edges), and connections between faces (via dual vertices and edges). This sounds more complicated than it is. Below we see the E1M1 map from DOOM, after I’ve loaded it into a DCEL. The map walls are black and the remaining triangle edges are shown in purple:

This image above only shows the primal graph, that is, the actual map geometry. The cool thing about DCELs is that they also contain a dual graph that represents connectivity between faces (triangles). Below I additionally show that dual graph in red. I’ve ignored any edges that are not traversible by the player (i.e. walls or large z changes):

The first phase of path planning is to find the shortest path in this dual graph from our source triangle to our destination triangle. We can drop nodes that aren’t reachable:

We can run a standard graph-search algorithm on this dual graph. Each node lies at the centroid of its triangular face, and the distance between nodes can simply be the distance between the centroids. As the problem designer, you can of course adjust this as needed – maybe walking through nukage should cost more than walking over normal terrain.

In Julia this can be as simple as tossing the graph to Graphs.jl:

shortest_paths = dijkstra_shortest_paths(g, [i_src], edge_distances)

This gives us a shortest path through the triangulated environment. It is typically jagged because it can only jump between centroids. Below we see a shortest path that traverses the map:

The discrete phase of motion planning is typically very efficient, because graph search algorithms like A* or Dijkstra’s algorithm are very efficient. If our environment is more or less static, we can even precompute all shortest paths. This makes the discrete phase of motion planning very cheap.

Our problem, however, is that the resulting path is not refined. To rectify that, we need a second, local phase.

Phase 2: Local Refinement

The second phase of path planning takes our discrete path and cleans it up. Where before we were constrained to the nodes in the dual graph, in this phase we will be able to shape our path more continuously. However, we will restrain our path to continue to traverse the same sequence of faces as it did before.

Because we are restricting ourselves to the same sequence of faces, we can ignore the rest of the graph. The first thing we’ll do is switch from traversing between centroids to traversing between edges along a specific corridor. Below we show the path slightly modified (red), with traversals between the centers of the crossed faces (gray):

That is already somewhat better, but it is not good enough. Next we will shift these crossing points along the crossed edges in order to shorten the path. This can be thought if as laying down a long rubber band along the center of the traversed corridor, and then stretching it out to tighten it so that it hugs the edges. As such, this process is often called elastic band optimization or string pulling.

This is an optimization problem. How do we formulate it?

Let \(\boldsymbol{p}^{(0)}\) be our starting point, \(\boldsymbol{p}^{(n+1)}\) be our destination point, and \(\boldsymbol{p}^{(i)}\) for \(i\) in \(1:n\) be the \(n\) vertices at the edges that our path traverses. We can slide each \(\boldsymbol{p}^{(i)}\) along its edge, between its leftmost and rightmost values:

\[\boldsymbol{p}^{(i)} = \alpha_i \boldsymbol{p}_\ell^{(i)} + (1-\alpha_i)\boldsymbol{p}_r^{(i)} \]

Each interpolant \(\alpha_i\) ranges from 0 to 1. Below we depict a handful of these:

We are trying to minimize the total path length, which is the sum of the distances between sequential points:

\[\sum_{i=0}^n \| \boldsymbol{p}^{(i+1)} – \boldsymbol{p}^{(i)} \|_2\]

which in 2D is just:

\[\sum_{i=0}^n \sqrt{(\boldsymbol{p}^{(i+1)}_x – \boldsymbol{p}^{(i)}_x)^2 + (\boldsymbol{p}^{(i+1)}_y – \boldsymbol{p}^{(i)}_y)^2}\]

Our optimization problem is thus:

\[\begin{aligned}\underset{\boldsymbol{\alpha}_{1:n}}{\text{minimize}} & & \sum_{i=0}^n \| \boldsymbol{p}^{(i+1)} – \boldsymbol{p}^{(i)} \|_2 \\ \text{subject to} & & \boldsymbol{p}^{(i)} = \alpha_i \boldsymbol{p}_\ell^{(i)} + (1 – \alpha_i) \boldsymbol{p}_r^{(i)} \\ & & \boldsymbol{0} \leq \boldsymbol{\alpha} \leq \boldsymbol{1} \end{aligned}\]

This problem is convex because \(\| \boldsymbol{p}^{(i+1)} – \boldsymbol{p}^{(i)} \|_2\) is convex and the sum of convex functions is convex, the first constraint is linear (which is convex), and the inequality constraints are simple inequality constraints. We can thus solve this problem exactly.

In a game engine or robotics product we might roll our own solver, but here I’m just going to use Convex.jl:

α = Variable(n)     # interpolants
ps = Variable(2, n) # optimized points, including endpoints

problem = minimize(sum(norm(ps[:,i+1] - ps[:,i]) for i in 1:n-1))
for i in 1:n
    problem.constraints += ps[:,i] == p_ls[i] + α[i]*(p_rs[i] - p_ls[i])
end
problem.constraints += α >= 0
problem.constraints += α <= 1

solve!(problem, () -> ECOS.Optimizer(), verbose=false, silent_solver=true)

α_val = evaluate(α)
ps_opt = evaluate(ps) 

Easy!

Solving our optimization problem gives us our locally-optimized shortest path:

We can see how the optimized path hugs the corners and walls and heads straight across open areas. That is exactly what we want!

Agents have Width

Okay, so this gives us an optimized path, but it treats our robot / drone / enemy unit as a point mass. Our agents typically are not point masses, but have some sort of width to them. Can we handle that?

Let’s treat our agent as being a circle with radius \(r\). You can see such a circle in the earlier image for the StarCraft 2 marine.

We now modify both phases of our path planning algorithm. Phase 1 now has to ignore edges that are not at least \(2r\) across. Our agent cannot fit through anything more narrow than that. We set the path traversal cost for such edges to infinity, or remove them from the graph altogether.

Phase 2 has the more radical change. We now adjust the leftmost and rightmost points \(\boldsymbol{p}_\ell^{(1:n)}\) and \(\boldsymbol{p}_r^{(1:n)}\) to be a distance \(r\) from the line segment endpoints. This reflects the fact that our agent, with radius \(r\), cannot get closer than \(r\) to the edge of the crossed segment.

Below we show an optimized path using a small player radius. We can see how the new path continues to try to hug the walls, but maintains at least that fixed offset:

Notice that this naive augmentation introduced a few artifacts into the smoothed path, particularly near the start and end vertices. This stems from the fact that the map geometry in those location includes edges that are not actually constrained by walls on both sides. For example, the room in the bottom right has some extra triangles in order to be able to carve out that exit sign in the ceiling. It doesn’t actually hamper movement:

This can be fixed by using a navigation mesh that more accurately reflects player navigation. In this case, we’d remove the exit sign from the nav mesh and end up with an exit room where every edge has vertices that end at the walls.

Conclusion

This post covered a method for path planning applicable to basic robotics motion planning or for AI path planning in video games. It is a two-phase approach that separates the problem into broadly finding the homotopy of the shortest path (which side of a thing to pass on) and locally optimizing that path to find the best continuous path within the identified corridor. This method of breaking down the problem often shows up when tackling complicated problems – a discrete solver is often better at the big-picture and a local optimizer is often better for (often much-needed) local refinement.

There are many variations to the path planning problem. For example, path planning for a robot car may require restricting our path to allow for a certain turning radius, and path planning for an airplane may require planning in 3D and restricting our climb or descent rates. We may wish to minimize things other than the path distance, such as the amount of energy required, or find the shortest path that gets us there within a given energy budget. The real world is full of complexity, and having the necessary building blocks to construct your problem and the various tools at your disposal to solve them in the way that makes most sense for you makes all the difference.

TOOM with Non-Euclidean Geometry

This month continues our foray into the world of DOOM and C software renderers. In the last post, we moved from a blocky, axis-aligned world to general 2D geometry. In this post, not only will we manipulate the 3rd dimension, thereby reaching DOOM-geometry-parity, we will leverage our raycasting approach to unlock something the original DOOM couldn’t do.

Variable Floor and Ceiling Heights

In Wolfenstein 3D, the floors are always at z = 0 and the ceilings are always at z = 1. DOOM could adjust these floor and ceiling heights, and its players were thus able to ascend stairs, traverse next to chasms, and generally enjoy a much richer gaming environment.

This gif shows the basic z-manipulation I’m talking about:

Here we’re adjusting the floor and ceiling height of our corridor. When we raise the floor, we need to render a wall texture between our floor and the raised corridor floor. When we lower the floor, we need to render a wall texture between the lowered floor and the normal floor on the other end.

Changing the ceiling height can also expose a bit of wall texture:

Our renderer has to be able to model these floor sections in order to keep track which of piece of floor has which floor and ceiling height, and we have to keep track of the textures to draw when they are exposed.

Like DOOM, I will denote regions of the map with a particular floor and ceiling height (and later, floor and ceiling textures) to be called a sector. Sectors will naturally be closed polygons in our top-down 2D map.

Here you can see a top-down view of the TOOM level editor, showing the same region as rendered above. The red triangle is the camera, which looks toward the right.

Here is the same shot, filled in a bit to make it easier to interpret:

As covered in the last post, the world is defined using a triangle mesh. Our corridor, for example, is comprised of two triangles. The sector for that corridor must be referenced by the lines along the corridor’s boundary:

The sector itself just stores the z heights right now. It doesn’t know which edges it corresponds to. Instead, we have the side geometry refer to the sector it should correspond to. This approach is again, very much like DOOM.

We are using a Doubly Connected Edge List (DCEL) to represent our triangle mesh. Edges in a DCEL are bi-directional . If we think of edges as right-handed, and an edge A->B as being on the right-hand side of AB, then we can just have every directed edge at the boundary of a sector keep track of which sector it helps enclose.

The sides that refer to the corridor are yellow, whereas the sides that refer to the primary sector are orange. When we raycast from the camera down the corridor, we can check for z changes when we traverse over edges where the sector changes. Here we see a raycast with two locations where the sector heights might change:

The data structures for this are pretty simple. A directed edge may have a SideInfo:

struct SideInfo {
    u16 flags;
    u16 sector_id;
    TextureInfo texture_info_lower;
    TextureInfo texture_info_middle;
    TextureInfo texture_info_upper;
};

A SideInfo is simply a struct that contains a lower, middle, and upper texture, the index of the sector to refer to, and some flags. Right now, the only flag I use is one to determine whether a wall is passable, and if so, we can avoid rendering the middle texture.

The TextureInfo entry is pretty simple as well:

struct TextureInfo {
    u16 texture_id; // Index in the texture atlas
    i16 x_offset;   // Texture x offset
    i16 y_offset;   // Texture y offset
};

And Sectors are just basic data as well:

struct Sector
{
    u16 flags;
    f32 z_floor;
    f32 z_ceil;
};

Raycasting with Varying Heights

With these data structures in-hand, rendering becomes a little more complicated. We continue to cast our ray out, as before. Now, however, we need to keep track of how much we’ve already drawn.

To illustrate, let’s look at a horizontal cross-section of geometry:

Here I have drawn the area swept out as we cast a single ray to the right, down a single pixel-column. The light purple boxes show all of the regions in our field-of-view where we need to render ceiling pixels, the dark purple boxes show where we need to render floor pixels, and the green boxes show where we need to render wall pixels.

When we start raycasting, we start having not drawn any pixels. The bounds on where we can draw contains the whole vertical pixel column, between zero and the number of pixels:

\[y_\text{lo} = 0 \qquad y_\text{hi} = \text{SCREEN_SIZE_Y}\]

We start to traverse the mesh, and hit our first sector change:

At this point we calculate the y-location (i.e., location in the vertical pixel column) of the ends of the floor and ceiling. We render the floor and ceiling into our pixel column. We’re already pretty used to this trigonometry by now.

\[\begin{aligned} y_\text{ceil} & = \frac{\text{SCREEN_SIZE_Y}}{2} + \gamma \left( z_\text{ceil} – z_\text{camera} \right) \\ y_\text{floor} & = \frac{\text{SCREEN_SIZE_Y}}{2} + \gamma \left( z_\text{floor} – z_\text{camera} \right)\end{aligned}\]

where

\[\gamma = \frac{\text{SCREEN_SIZE_Y}}{\text{y field of view}} \cdot \frac{\text{cam length}}{\text{ray length}}\]

The y yield of view is how many units up the camera views for every unit we raycast outward. I’m using about 0.84. The camera length is close to 1, and adjusts for the fact that our screen is flat and not curved, so rays travelling left or right of center have to have the length adjusted. Other than that, its just a function of the heights and how far our ray has traveled.

In this case, the next sector has an increase in the ceiling height and a decrease in the floor height. That means we do not have any local wall textures to render. We set \(y_\text{lo} = y_\text{floor}\) and \(y_\text{hi} = y_\text{ceil}\) and keep on raycasting:

At the next sector transition, we once again compute \(y_\text{floor}\) and \(y_\text{ceil}\). This time, however, the ceiling drops and the floor rises, so we also have wall segments to render. We compute their pixel extents in the same way:

\[\begin{aligned} y_\text{upper} & = \frac{\text{SCREEN_SIZE_Y}}{2} + \gamma \left( z_\text{upper} – z_\text{camera} \right) \\ y_\text{lower} & = \frac{\text{SCREEN_SIZE_Y}}{2} + \gamma \left( z_\text{lower} – z_\text{camera} \right)\end{aligned}\]

where \(z_\text{upper}\) and \(z_\text{lower}\) are the new sector ceiling and floor heights, respectively.

When we continue our raycast, we find that the next sector transition only increases the ceiling height. The new \(y_\text{floor}\) is greater than \(y_\text{lo}\), so we render the patch of floor. The new \(y_\text{ceil}\) is not any lower than our current \(y_\text{hi}\), due to the ceiling being parallel to our camera view, so we don’t need to render anything there.

We finally proceed and hit a closed sector. We render the last bit of floor and then render its center texture:

So that’s how the renderer works when you have variable z heights and upper, middle, and lower textures. Each raycast has to do more work to determine what to draw, and it keeps track of where it is within its y pixel bounds to make sure we don’t mistakenly draw something on top of what we’ve already rendered.

Rendering Textures

In this previous blog post, I covered sprites, and how they use the DOOM picture format. In short, sprites like monsters are stored in vertical pixel strips. Why? Because when we render the vertical portions of the world, we’re rendering convenient vertical sections of these sprites. The vertical orientation forms a right-triangle, which makes the math of figuring out which pixel to draw more efficient. DOOM uses the same format for wall textures (both upper and lower). I ended up doing the same thing.

So, wall textures are the same as sprites. What about horizontal textures? We’d like to draw textures on our ceilings and floors too:

(Yeah, these don’t fit thematically. I haven’t loaded my own floor and ceiling textures yet. These just come from DOOM.)

We can again leverage our previous work to figure out how to draw floor and ceiling textures. Unlike the vertically-oriented wall textures, these “flat” textures are most efficiently drawn horizontally across the screen. As such, in DOOM they are called flats.

Our render is most adept at rendering things vertically. It works on one pixel column at a time. So how do we hope to render anything horizontally?

Well, the raycasts start with the leftmost pixel column and then work their way to the right. I keep track of active spans, which are horizontal sections of the screen that need to eventually be rendered with a single flat at a single floor height. Whenever we need to start a new span, we close out the previous one and render it.

The gif above shows how the renderer produces a scene. We draw vertical sections as we raycast, but start/maintain horizontal spans that only get rendered (1) when another span in the same y-coordinate starts or (2) we are done raycasting.

Each active span contains all of the information needed to render it efficienctly, once it needs to be rendered:

struct ActiveSpanData {
    // Location of ray_dir_lo's intersection with the ground.
    common::Vec2f hit;
    // Shift 'hit' by one pix column moved toward hi's intersection.
    common::Vec2f step;
    u16 sector_id;
    u16 flat_id;
    bool is_active;
    // The camera x pixel index where this span starts.
    int x_start;
    // The most recent camera x pixel where this span currently ends.
    int x_end;
};

We can do the computation work of calculating the hit and step values when we first create the span. We set x_start at that time, and initially set x_end to the same value. If we go to render the same sector in the same location in the next column, we increment x_end.

Colors and Light Levels

You may have noticed that these images get darker the further we’re looking. This is an intentional effect from DOOM that enhances the sense of depth.

This is easiest to see without the floor textures:

Conceptually, this is really simple. Render pixels darker when they are further away.

The light level of a pixel column drops off as the raycast distance increases. Or rather, the darkness level increases with the raycast distance.

Pixel values are written to the screen as RGB values. How does one take an RGB value and make it darker? Well, white is 0xFFFFFF and black is 0x000000, so the simplest way is to decrease each channel by some percentage. To do it “more correctly”, you can convert your RGB color into the Lab colorspace, reduce its lightness, and then convert it back. That is an expensive operation.

The original DOOM didn’t have access to RGB values. Or rather, it didn’t have access to all of them. According to the DOOM Black Book, PC video games of the 90s had 320×200 resolution with 1 byte per pixel drawn from a 256-color palette. This meant that programmers could set the 256 colors in the palette, and drawing a pixel to the screen effectively meant indexing into this palette by setting a 1-byte pixel index rather than a full RGB value.

So how does one change a pixel’s brightness?

The DOOM developers precomputed mappings from pixel values at full brightness to pixel values at reduced brightness. There were many such colormaps, 32 that were computed for all light levels from 100% brightness all the way down to 0%. There were even some additional colormaps, such as an inverted colormap used when the player becomes invulnerable:

All wall textures and flats store pixel indices rather than RGB values. These pixel indices pass through the chosen colormap and then index into the palette to get the RGB value. In the original DOOM system, indexing into the palette was done for you by the video card. I’ll be doing it myself here.

u8 palette_index = patch.post_data[column_offset];
palette_index = colormap.map[palette_index];
u32 abgr = *(u32*)(palette.colors + 3 * palette_index) | 0xFF000000;

The palette stores sequential BGR values, so we just grab the section we need and then overwrite the FF for the alpha channel.

struct Palette {
    u8 rgbs[768];
};

struct Colormap {
    u8 map[256];
};

As a final note, there is one additional lighting effect that helps make the map more legible. Like the trick used in Wolfenstein, walls closer to being x-aligned are drawn 1 point brighter than normal, and walls closer to being y-aligned are drawn 1 point darker than normal. Furthermore, each sector has its own baseline light level. This lets the map creator make some areas naturally darker, and allows for some nice floor shadow effects:

DOOM Levels

Ah, that last image gave it away. The geometry changes and lighting changes now bring us more-or-less to parity with the original DOOM. We can load and traverse the original DOOM levels!

This rendering is done using the raycasting method, via our DCEL, rather than using DOOM’s BSP. This means we can totally avoid loading SEGS, SSECTORS, NODES, and the BLOCKMAP.

E1M1 loaded into the TOOM editor.

Work was required to be able to load the geometry into our DCEL correctly. Mainly in being able to properly flip mesh edges in order to be able to produce edges that match the map geometry, and then constrain them. I might cover that in a future post.

Portals

Okay, we’ve reached geometric parity with DOOM. Is there something this DCEL approach can do that DOOM cannot?

So how does that work?

When we raycast or when the player walks around, we traverse between faces in our triangle mesh. We can create a portal by linking two edges in the triangle mesh and adding the necessary logic such that, when traversed, we come out the linked edge instead.

Below we see the top-down view of the portal setup playing out. The red triangle shows the player camera and its approximate view cone.

The player movement for when we traverse a portal is:

// Suppose we are crossing the line segment AB from pos to pos_next,
// where pos_next now lies on AB.
Vec2f v_face = b - a;

// Calculate where along the segment we intersected.
f32 v_face_len = Norm(v_face);
f32 x_along_texture = v_face_len - Norm(pos_next - a);

// Update our quarter edge using the destination quarter edge
// stored in the side info.
camera_state->qe = mesh.Tor(side_info->qe_portal);

// The vector along the new face is
Vec2f c = mesh.GetVertex(side_info->qe_portal);
Vec2f d = mesh.GetVertex(mesh.Sym(side_info->qe_portal));
Vec2f cd = d - c;
cd = Normalize(cd);

// Our new position is:
camera_state->pos = c + x_along_texture * cd;

f32 angle_ab = std::atan2(v_face.y, v_face.x);
f32 angle_cd = std::atan2(cd.y, cd.x);
f32 angle_rot = CounterClockwiseAngleDistance(angle_ab, angle_cd) + M_PI;
f32 cos_rot = std::cos(angle_rot);
f32 sin_rot = std::sin(angle_rot);

// Now to get the new direction, we need to rotate out of the new face
dir = {cos_rot * dir.x - sin_rot * dir.y,
       sin_rot * dir.x + cos_rot * dir.y};

// We also need to rotate our speed
vel = {cos_rot * vel.x - sin_rot * vel.y,
       sin_rot * vel.x + cos_rot * vel.y};

Geometrically, we are figuring out the rotation angle between our portals (which could be precomputed) and then rotating our movement direction and speed when we traverse.

Note that the +pi term compensates for the fact that the destination portal side is oriented 180 degrees with respect to the source portal side.

When we raycast through portals, we have properly account for any height changes on the receiving sector. We want to view the other side as if we were the appropriate height:

camera_z += (sector_dst->z_floor - sector_src->z_floor);

There is a bit of extra bookkeeping we have to do with portals – namely that for flats to render correctly we have to update the camera position to the place it would be as if the portal did not exist and we were peering toward our destination:

We then have to recompute some of the vectors used to create spans. This is all because flats are rendered on a fixed grid, and we compute where we are in that grid based on the camera location. We have to fake the camera being in the proper location “behind” where the portal surface is. Please take a look at the source code for the nitty-gritty.

There is a lot of neat gamemap stuff that one can do with portals and the non-Euclidean geometry they unlock. Some of it is more blatant, like the first portal, and some of it can be quite subtle, like a hallways that is shorter or longer than expected.

Conclusion

This blog post covered adding variable z heights for floors and ceilings, which then required being able to render wall sections revealed at floor or ceiling height changes. We rendered those vertically, because that is most efficient. We then figured out a way to render horizontal textures too, which were most efficient to render in horizontal strips. We added some lighting effects. We then had everything we needed to represent the original DOOM level geometry. We concluded with an overview of portals – sides in our mesh that are linked together such that crossing one results in you exiting out the other.

I made the code for TOOM and TOOMED available in separate repos. TOOM is the “game”, written in pure C without graphics libraries. TOOMED is the TOOM Editor, written in C++. It basically doesn’t use graphics libraries either, but it does use Dear ImGui with an accelerated backend, as well as have more liberal data structures for easier manipulation of game data and assets.

That’s more or less it for TOOM, for now. There are a few things the renderer doesn’t support that DOOM has, such as animated nukage, pickups, or proper handling for partially-transparent walls. There obviously isn’t any gameplay. No shooting or doors or anything like that. Those things are neat, of course, but I feel like I already understand how they work, and it’d just be pretty tedious work for me to get TOOM running with all that. Maybe I’ll do it, but for now I’m thinking I’ll be turning to other interests. Thanks for reading!

A Note on Efficiency

One of the reasons that this TOOM project interested me in the first place was the promise that I might learn how to program something “for real” and really have to care about performance. I admire folks like Casey Muratori and Jonathan Blow who ardently advocate for efficient programming, and that software engineers should understand what goes on under modern abstractions.

I tried to keep TOOM running at over 30fps. I was able to track this using a profiling technique covered by Casey in his performance-aware programming class.

For a moderately-expensive scene in E1M1, TOOM clocks in at about 130fps:

Frame dt:   28.9459ms
Total time: 7.6722ms, 130.3 fps (CPU freq 1992267920)
  Poll:   32405 (0.21%)
  Tick:   7097 (0.05%)
  Render: 15245367 (99.74%)
  DKBS:   249 (0.00%)

As we can see, 99.74% of the time is spent rendering. That is to be expected, since that is where all the work is.

Here the frame dt is the 30Hz enforced by my monitor, because SDL is syncing to my monitor refresh rate. This is with a 640 x 320 window, which is double the resolution of the original DOOM. Yes, this is like 30 years after DOOM came out, but I am nevertheless quite happy with these results.

TOOMED, however, is a bit slower:

Total time: 35.2314ms (CPU freq 1991906430)
  GetRightHandedness  [ 25801]:     792837 (1.13%)
  RenderPatchColumn   [  1760]:    8381240 (11.94%)
  RenderSpan          [   318]:    3498674 (4.99%)
  Raycast             [   640]:   35425448 (50.48%, 72.23% w/children)
  Raycast_EdgeChecks  [  8600]:    2322161 (3.31%, 4.43% w/children)
  Raycast_WithValidSideInfo[  2720]:    3599229 (5.13%, 17.32% w/children)
  RenderScene         [     1]:     146851 (0.21%, 77.17% w/children)

It takes about 4.6x as long for TOOMED to render the same scene, giving it a frame rate of about 28fps. That is not great.

Yes, TOOMED is doing a little more, in that it also renders the top-down GUI. However, that GUI stuff is pretty negligible. I believe the real culprit here more-or-less boils down to two things: memory layout and dictionary lookups.

In TOOM, everything is accessed directly in the binary save file. Entries are all nicely indexed and can be quickly accessed. In TOOMED, since we’re editing things, we store them in std::vectors and std::maps, which results in disparate structures spread out in different locations. In many cases it simply takes more effort for the CPU to get access to what it needs. Furthermore, dictionary lookups, while being O(1), are still doing more work than a simple array access. The delta here, 4.6x, is pretty substantial, and this is the sort of takeaway that I think serious gamedev folk mean when they say paying attention to your data structures and memory strategies matters.

As a final fun learning point, my raycast method was initially recursive. That is, every time we step our ray and transfer over a triangle edge, I’d recurse down in order to process the next triangle step. I figured that this would make it easier to later back-track and render things like monsters or item pickups. However, TOOMED was running _extremely_ slowly, around 70ms per frame. Changing from recursion to a loop cut that time in about half. Its the same amount of work, theoretically, only now we don’t have to make a bunch of copies every time we push to the stack.

Taking TOOM Off-Grid, without BSPs

I’ve continued tinkering with my little C software renderer covered in previous blog posts. So far we’ve rendered a blocky, textured environment reminiscent of Wolfenstein 3D:

You’ll notice from the window header that the project is called TOOM, as in, Tim DOOM. It isn’t Timenstein 3D, so we have some work to do.

In this blog post we’ll go off-grid and move to general 2D geometry.

Binary Space Partitioning

John Carmack’s great innovation in creating DOOM, the design change that made rendering general 2D geometry in a constrained computing environment possible, was the BSP tree. BSP trees make it efficient to walk over the surfaces in your world in a front-to-back order, based on the camera location.

Each node in the tree represents a splitting plane (or line, rather, as this is 2D) that subdivides the level. All faces below that node lie entirely on one or the other side of that dividing line. At the bottom of the tree you get leaf nodes, which each represent a single convex map region:

DOOM’s E1M1 map, image from the doomwiki

While the BSP tree is a genius addition, it does come with its disadvantages. The main one being its construction. For DOOM this meant running Node Builder after level design, a process that does a lot of iteration to find the best splitting planes for a well-balanced tree. Not only does this take a while, it also requires the level geometry to remain static. This means no 2D geometry changes. (DOOM’s doors all move vertically).

I kind of want to be able to support geometry changes. Plus, I kind of like the idea of figuring out something different. So, we’re going to take another approach.

Keep on Raycasting

In Wolfenstein Raycasting in C, we talked about how the software renderer casts rays out into the world for every pixel column in the screen:

This raycasting was fairly efficient because we limited our geometry to a regular grid:

Now, we’re going to do the exact same thing, except on a triangle mesh:

Moving to a triangle mesh has several consequences. On the plus side, it supports general 2D geometry, so we no longer have to constrain ourselves to a grid of blocks. We can fairly easily move vertices and mutate our environment. On the other hand, casting a ray through a set of triangles tends to be more computationally expensive, we have to develop and maintain a more complicated data structure, and we aren’t entirely free to mutate our environment – our triangles have to remain valid (right-handed).

It isn’t 1993 anymore, so I’m hoping that the performance hit from raycasting on a mesh isn’t too bad on a modern processor. I’ve already been tinkering with triangle meshes, so I’m eager to adapt what I learned there to this new application. I also think we can overcome the editing limitations.

Below is a GIF from TOOM. The left half of the image is rendered with the new triangle meshes. The right half of the image is rendered with the old blocky technique. (The new map was made to be identical to the old one, for now). The textures assigned to the walls aren’t the same just yet, so you can see a fairly clear division.

The debug view gives us a better sense of what is going on:

Raycasting through triangles is the same, conceptually, as raycasting through a grid. We keep casting through cells until we reach a solid cell. Now, instead of square cells with regular spacing, we have arbitrarily-sized, triangular cells.

Each raycast starts with a ray at the camera location \(\boldsymbol{p}\) heading in a direction \(\hat{\boldsymbol{d}}\). As we iterate \(\boldsymbol{p}\) will be shifted out into the world. In the first iteration, \(\boldsymbol{p}\) will be the camera location and can lie inside a triangle face. In future iterations, \(\boldsymbol{p}\) is lies on an edge that we crossed in the previous iteration.

We take the three bounding vertices \(\boldsymbol{a}\), \(\boldsymbol{b}\), and \(\boldsymbol{c}\) and intersect the line through \(\boldsymbol{p}\) with the direction \(\hat{\boldsymbol{d}}\) with AB, BC, and CA, representing each intersection point as \(\boldsymbol{p}’ = \boldsymbol{p} + t \hat{\boldsymbol{d}}\). We step forward to the edge that we hit first. That is, the edge for which \(t\) is smallest (but positive). We ignore any edge we are already on.

We start by computing \(\boldsymbol{q}\), a point far enough along \(\boldsymbol{d}\) that it should exit the triangle:

Before we calculate an intercept, we can disregard checking an edge EF if the triangle EFQ is not left-handed. This helps us avoid cases where an intersection will either never occur or lie behind our ray. For example, we can skip checking AB because ABQ is right-handed (counter-clockwise):

We should check BC, because the triangle BCQ is left-handed (clockwise):

We can now calculate the interection point’s \(t\)-value. For the edge BC, this is:

\[t_{BC} = \frac{(\boldsymbol{c}-\boldsymbol{b}) \times (\boldsymbol{p}-\boldsymbol{b})}{(\boldsymbol{q}-\boldsymbol{p}) \times (\boldsymbol{c}-\boldsymbol{b})}\]

In code, this works out to:

if (GetRightHandedness(b, c, q) < -eps) {
   // We would cross BC
   v2 v = c - b;
   v2 w = p - b;
   f32 t_bc = cross(v, w) / cross(q - p, v);
   if (t_bc < t_min) {
      t_min = t_bc;
      qe_side = qe_bc;
      v_face = v;
   }
}

We just repeat this block three times, once for each edge. Each time we get an intersection, we keep it if it is earlier than any previous intersection. When we do, we also store the edge that we pass over (via its quarter edge in the DCEL).

Just like before, we propagate our point \(\boldsymbol{p}\) up to the new edge and then either iterate again if the new face is empty, or break out if the new face is solid.

We do have to make some minor modifications in how we access the face texture. With the grid we could safely assume that every wall was square, with a corresponding 64×64 pixel texture. Now walls can have arbitrary aspect ratios, so we have to do a tiny bit of math to index into the texture based on our intersection location along the edge:

f32 PIX_PER_DISTANCE = TEXTURE_SIZE / TILE_WIDTH;
QuarterEdge *qe_face_src = qe_dual->rot->rot->rot;
f32 x_along_texture = length(v_face) - length(pos - qe_face_src->vertex);
u32 texture_x = (int)(PIX_PER_DISTANCE * x_along_texture) % TEXTURE_SIZE;

Collision Detection

One immediate advantage of using the DCEL is that we can leverage it for collision detection. In DOOM, collision detection is done via a blockmap, which is just a discrete 128×128 pixel grid. Each grid cell stores the lines in that block. An entity moving in a cell can just check for collision against all of the lines in the cell. The blockmap is constructed on level export, and again assumes static 2D geometry.

We don’t have to use a bockmap, because we have the DCEL. We can simply store the face that our player is in and check for collisions every time we move them. Our player movement code is the same as the player movement code in this previous blog post. Any time we would cross over a solid edge between two faces, we simply move up to the edge instead, and lose any velocity into the edge.

The current approach assumes the player is a single point. If we want, we can create a second collision mesh that uses inflated solid geometry to compensate for the player’s collision volume. We would have to keep that second mesh synchronized with any changes to the geometry mesh.

DCELs make pathing and shooting computations easy too. DCELs are primarily used in games for pathing (via Nav Meshes) and collision checking for a gunshot is just a raycast where we also care about intersections with enemies.

Game Map Representation

The biggest difference here was not so much the rendering code, but more the integration of the mesh data structure and how we represent the game map. With the Wolfenstein-style renderer we were able to represent the map using 2D arrays:

tiles = UInt8[
    1 1 1 1 1 1 1 1 2 1 1 1 2;
    1 0 0 0 0 0 0 1 2 0 0 0 2;
    1 0 0 3 0 0 2 2 2 0 0 0 2;
    1 0 0 0 0 0 0 0 0 0 0 0 2;
    1 0 0 0 0 0 2 2 2 0 0 0 2;
    1 0 2 0 0 0 0 1 2 0 0 0 2;
    1 0 0 0 0 0 0 1 2 0 0 0 2;
    1 1 1 1 1 1 1 1 2 3 3 3 2]

floor = UInt8[
    1 1 1 1 1 1 1 1 2 1 1 1 2;
    1 4 4 4 4 4 4 1 2 5 5 5 2;
    1 4 4 3 4 4 2 2 2 5 5 5 2;
    1 4 4 4 4 4 2 2 2 5 5 5 2;
    1 4 4 4 4 4 2 2 2 5 5 5 2;
    1 4 2 4 4 4 4 1 2 5 5 5 2;
    1 4 4 4 4 4 4 1 2 5 5 5 2;
    1 1 1 1 1 1 1 1 2 3 3 3 2]

ceiling = UInt8[
    1 1 1 1 1 1 1 1 2 1 1 1 2;
    1 4 4 4 4 4 4 1 2 4 4 4 2;
    1 4 4 3 4 4 2 2 2 4 4 4 2;
    1 4 4 4 4 4 2 2 2 4 4 4 2;
    1 4 4 4 4 4 2 2 2 4 4 4 2;
    1 4 2 4 4 4 4 1 2 4 4 4 2;
    1 4 4 4 4 4 4 1 2 4 4 4 2;
    1 1 1 1 1 1 1 1 2 3 3 3 2]

Each cell could thus be marked as solid or not depending on whether the cell’s entry in tiles is 0. Nonzero values corresponded to texture indices. Similar arrays were used for the floor and ceiling textures.

With the general 2D geometry, we no longer have a convenient text short-hand. The DCEL has a lot of pointers and is no fun to edit manually. We have to store all vertices, quarter edges, and the associations between them. We then optionally mark edges as being solid, and when we do, set edge texture information.

At this point the game map consists of two things:

  • a triangle mesh
  • optional side information

My representation for side information is pretty simple. It is just a structure with some flags, a texture id, texture x and y offsets, and the associated quarter edge:

// The information associated with one side of an edge between vertices in 
// the map. If this represents the directed edge A->B, then it describes the
// edge viewed on the right side of A->B.
struct SideInfo {
    u16 flags;
    u16 texture_id;
    i16 x_offset;
    i16 y_offset;
    QuarterEdgeIndex qe;
};

The map data is packed into the binary assets blob (similar to a DOOM WAD file) and loaded by the game at start.

Editing this game data is rather hard to do by hand, so I started working on an editor on the side. The editor is written in C++, and I allow myself the use of whatever libraries I want. I might go over that in more detail in a future post, but for now I just want to show how it can be used to adjust the map:

The gif here is fairly large. It may have to be viewed separately.

Having an editors gives us a place to put supporting code and lets us use more convenient data structures that aren’t streamlined for the C software renderer. The editor can then do the lifting required to export our game data into the form expected by the game.

Conclusion

The TOOM source code for this post is available here. I have been overwriting my assets file, and the new ones are no longer compatible, so unfortunately do not have a good assets file to share.

In this post we took our blocky Wolfenstein-style environment and moved to general 2D geometry. We didn’t follow DOOM and use a BSP, but instead stuck with raycasting and used a DCEL.

What we have now is much closer to DOOM, but still lacks a few things. Geometrically, we don’t have variable floor heights, which means we also don’t yet worry about wall surfaces between variable floor heights. We also aren’t animating doors or elevators, and we don’t have textures with transparency. We aren’t doing any lighting, other than manually setting some faces to be darker. DOOM really was groundbreaking.

A Handcrafted Wedding

I’m now married!

We had a wonderful wedding, with loving friends and family members, many of which traveled pretty far to celebrate with us. It was a pretty small wedding, but it was a heartfelt one, and we put a lot of effort into things to make it the wedding we dreamed of. In this blog post I’ll cover some of the personal touches we were able to add.

TikZ Place Cards

Photo by @wendyandgeorgephoto

We had guests RSVP via a Google form, which asked them information like their name, which meal they wanted, and whether they had any food allergies. This produces a Google sheet, which I exported to CSV and loaded into a little Julia notebook. My notebook then typeset place cards for each guest, displaying their name, table assignment, meal choice, and any food allergies. This used TikZ via TikzPictures.jl, and inverted each image in order to get a fold-able card:

The script did some basic layout to produce paper sheets with multiple cards such that we could print them out with maximum effectiveness:

We had to address some corner cases, such as babies (which don’t have meals), guests with multiple allergies, and guests with long names.

We used the same approach to make a few other place cards for things like the desserts table:

Table Numbers

Each dining table had a number, and we needed a way to indicate which table was which. My partner had a fantastic idea, and made table signs with table numbers that included photos of us at that age. So, table 2 had photos of each of us at 2 years of age, table 3 had each of us at 3 years of age, etc:

Photo by @wendyandgeorgephoto

These were made in Keynote, and were printed out on photo paper and cut out by hand.

Succulent Centerpieces

We also wanted table centerpieces, but balked at the extreme price of floral arrangements. Again, my partner had a fantastic idea. She had been learning pottery, and decided that she would throw her own bowls, glaze and fire them, and then make succulent arrangements. She gradually worked on these bowls all year, learning to make bigger and better bowls as she progressed, and made some very beautiful works of art. Many of the succulents were clippings from my mom’s garden. The end result was amazing!

Photo by @wendyandgeorgephoto

Portable Piano

We wanted our best man to play the piano during the ceremony. The only problem was that we had to use a portable piano that wasn’t plugged into an electrical outlet.

Thankfully, we were able to use the best man’s electric keyboard, but said electric keyboard had to be modified to use a portable power source. After getting on YouTube, including watching a delightful video of a man playing music outdoors, we determined that building a battery was pretty straightforward. The piano takes 12v, so I ordered a 12v, 20 Ah battery.

The battery life in hours is its capacity (in amp-hours) divided by the current draw (in amps). The current draw is watts / volts, which for us is around 40 W / 12 v ~ 3.33 amps. Thus we expect the piano to last 20 Ah / 3.33 A ~ 6 hours. We figured that should be more than plenty.

I also ordered a charger, and a Yamaha piano wall wart. I cut the plug off of the wall wart and connected that, plus a fuse, to the battery:

(with lots of protective electric tape – don’t want to get shocked.)

We put the battery in the cardboard box it came in, spray painted it black, and 3d-printed a nice little cover:

We additionally 3d-printed a little wedge so that the battery could sit on the X-frame stand that supports the piano. Unfortunately, I forgot to get a photo of that.

And then there was the problem that the piano was initially dead after we borrowed it. After much troubleshooting, my partner’s internet investigations suggested that we replace one particular capacitor that was known to wear out on this brand of electric keyboard. So (with permission) we ordered it, took the piano apart, and replaced the capacitor.

Ring Box

We used the 3d printer for other things as well! My partner printed a wonderful ring box whose design she found:

Photo by @wendyandgeorgephoto

Our names are inlaid on the other side.

Homemade Desserts

You know what’s harder than making desserts for your own wedding? Transporting them to the wedding.

Enter – the Macaron Transport Vehicle (MTV). My partner is fantastic at baking macarons, and has spent something like 10 years perfecting the recipes. She baked the macarons, and also came up with the method for getting them to the venue.

Each MTV consisted of an Ikea cooler bag with a cooler-bag-bottom-sized ice pack. This could be loaded with trays of macarons, with 3d-printed spacers to allow for tray stacking. The macarons made it safe and sound, and cold!

Photo by @wendyandgeorgephoto

Making the macarons was a 3-day process. The fillings were made on Day 1, the shells on Day 2, and they were assembled on Day 3.

Wedding Favors

We wanted to have something for our guests to take home with them. We thought about giving out succulents, but figured those would be hard to travel with, and maybe guests wouldn’t want them. We figured that pretty much everyone likes chocolate.

There are services that let you make your own custom chocolate wrappers. We tried a sample or two and weren’t really blown away by the chocolate itself. So we decided to just buy chocolate bars that we liked and make our own basic sleeves:

These were made in Keynote, and were printed on heavier paper.

Funny story about the text on the back. I initially wrote something much longer, and my partner had the great idea to use ChatGPT to consolidate it a bit. It worked really well, with some minor manual adjustments.

We then tried asking it to add some chocolate puns. ChatGPT 3.5 was decidedly terrible at adding puns. It basically just inserted “choco” into the text in random places. Ah, well.

Website

Lastly, rather than use a wedding website service, I figured I had my own website already, and could just add to that. It was just a simple page based on the same HTML5 Up template that I use for my main website today. Per a friend’s suggestion, I used Leaflet and Open Street Maps to display the location of the venue and the hotel. This was also the page that guests could access the RSVP Google Form from.

Conclusion

We had a wonderful wedding, with wonderful guests and a lot of wholesome, positive energy. Things went incredibly smoothly, and the worst thing about it was how quickly it all flew by.

I think the little touches that we added made the wedding feel more personal.

Billboard Sprites

I’m still on a Wolfenstein 3D kick, so this month we’re going to further extend the software render. We’re doing it in C and are manually setting individual pixels (no 3D lib). As before, I’m deriving this myself, so while we should end up with something similar to Wolfenstein 3D, things are probably not going to be a perfect match.

The last post left us with a textured blocky world. I felt like the next thing to do was to be able to populate that world.

Wolfenstein 3D (and DOOM) were before the time of 3d meshes. Enemies and set pieces (like barrels) were rendered as flat images. These billboard sprites always face the camera, since that simplifies the rendering math:

In Wolfenstein 3d, all wall textures are 64×64 images. They kept things simple and made all sprites 64×64 images as well. Unlike wall textures which were always solid, sprites have a bunch of fully transparent pixels. This was indicated using magenta.

Behold my fantastic programmer art:

We can load this image in like we do with our wall textures. Now we just have to figure out how to render it.

Render Math

Each entity has a position in 3D space. First we transform the object’s location in game space to a camera-relative space:

We can do this with:

\[ \begin{bmatrix} e’_x \\ e’_y \end{bmatrix} = \begin{bmatrix} \hphantom{-}\cos \theta & \sin \theta \\ -\sin \theta & \cos \theta \end{bmatrix} \begin{bmatrix} e_x – c_x \\ e_y – c_y \end{bmatrix} \]

We can skip rendering the entity at this point if it is behind the camera (\(e’_x < 0\)).

Next we calculate its extents in the screen’s horizontal dimension:

The billboard’s width is the same as the tile width. We can do some math to get the camera x-pixel locations:

\[\begin{aligned} x_\text{lo} = \left(\frac{1}{2} – \frac{e’_y + w_\text{sprite}/2}{e’_x \, \text{fov}_x}\right) \text{screenpix}_x \\ x_\text{hi} = \left(\frac{1}{2} – \frac{e’_y – w_\text{sprite}/2}{e’_x \, \text{fov}_x}\right) \text{screenpix}_x \end{aligned}\]

We’re just moving half the sprite width in either direction, then dividing by the screen’s width at the entity’s location. We have to flip about 0.5 because pixels go left to right in screen space, and then scale up by the number of pixels.

The image’s height is determined by the distance the entity is from the camera. The calculation is the same one we used for determining the height of walls. This gives us \(y_\text{lo}\) and \(y_\text{hi}\).

Now we can iterate from \(x_\text{lo}\) to \(x_\text{hi}\) in 1-pixel increments, rendering each column of pixels as we go between \(y_\text{lo}\) and \(y_\text{hi}\). For every \(x\)-increment of 1 pixel we move \(64 / (x_\text{hi} – x_\text{lo})\) pixels horizontally in the sprite image. Similarly, for every \(y\)-increment of 1 pixel we move \(64 / (y_\text{hi} – y_\text{lo})\) pixels vertically in the sprite image.

This gives us a billboard sprite in the correct location that scales as we move closer or farther away. Unfortunately, we’re rendering the magenta pixels and we’re drawing over walls that should be blocking our view.

We can trivially skip rendering magenta pixels with an if statement.

Fixing the wall occlusions also turns out to be pretty simple. We are rendering the entity after we render the walls. When we render the walls we have to calculate the raycast distance. We keep track of this distance for each pixel column, and only render a given entity pixel column when it is closer to the camera.

So there we have it – billboard sprites with occlusions. We first render the floors and ceilings, we then render the walls, and we then finally render all entity sprites.

DOOM Sprites

Unfortunately, my programmer art is pretty terrible. It works okay for the walls, ceilings, and floors, but it doesn’t work so well for entities.

I considered using Wolfenstein 3D sprites, but I don’t want to go with the Nazi theme. I decided to go with DOOM sprites since the assets are easily available (and have been used by modders for decades).

We have to do some work to load from the DOOM assets (stored in a WAD file). I was surprised to find that the WAD format is quite similar to the asset format I made up that I was already using. They both consist of a header, followed by a series of blobs, terminated with a table of contents. We can just load the whole thing into memory and then point to the data we need.

Wolfenstein 3D stored everything in 64×64 images. DOOM was a bigger project, with things of different sizes, and so went for a more compressed format (at the expense of added complexity). Sprites are stored as patches, which slice the image up by pixel column, and only specify continuous pixel column regions that should be drawn (ignoring transparent pixels).

We don’t have to waste any space on transparent columns:

Many columns simply consist of a single continuous range of pixels:

Some columns have several pixel intervals:

Each patch has a header defining its width and height, and its position relative to the entity’s origin. (If a monster has an arm stretched out to the left, we want to know to offset the image to the left a bit). This is followed by byte offsets to pixel data. One of the cool things here is that if two pixel columns happen to be the same, we don’t have to store them twice, but can just point to the same data.

Pixel columns are each a sequence of pixel intervals with a given offset from vertical, some number of pixels, and then the pixel values. We know we’ve reached the end when the vertical offset is 0xFF.

DOOM is restricted to a 256-color palette. The pixel data in each patch stores 1-byte indices into this palette. When rendering, we take the index and look up the appropriate color. (DOOM also has a colormap that it uses to darken things further from the camera, but I’m not doing that quite yet).

While DOOM does use billboard sprites, they also rendered each frame from 8 different viewing angles. We can determine which sprite to use for a given frame based on our viewing angle:

f32 monster_heading_rel = 
                 fmod(monster_heading+camera_heading+PI+PI/8.0,2*PI);
int monster_frame = (int)(monster_heading_rel * 8.0/(2.0*PI)) & 0x07;

This gives us some nice visual continuity as we walk around the monster:

Conclusion

The code for this post is available here. The zip file contains my assets, but you’ll have to separately find the DOOM assets, since I didn’t think it’s okay to host them myself. Thank you DOOM team for making those available!

I ended up splitting my source code into multiple files, since the code was steadily growing and becoming less manageable. You’ll notice that I use fewer static functions and I pass in the objects that I use instead.

The code now generates a second window, in which I render a top-down debug view. I allow myself to use SDL_RenderDrawLine and SDL_RenderFillRect here. (They’d be pretty easy to implement myself, but whatever).

I’m not sure how far I’ll take this project. It certainly has been fun so far! Fully fleshed games are pretty involved, and as we’ve seen I am pretty limited when it comes to asset creation. I do have a neat idea that I want to try out with respect to more general 3D rendering and collision detection, and I kind of want to have some rudimentary monster fighting interactions. We’ll see.

Textures for Wolfenstein

In the previous post we derived raycasting for a game in the style of Wolfenstein 3D. That is, a blocky game world with a regular grid and with the player’s view restricted to looking out horizontally. These restrictions let us simplify rendering significantly. At the end of the last post we were drawing colored blocks by casting a ray out into the world for every pixel column, and drawing the appropriate color after some geometry based on the distance the ray traveled before hitting the wall:

This gave us frames that consisted of solid floor and ceiling colors, and solid wall colors for boxes:

In this post we’ll add some textures. As before, this is all in C.

Wall Textures

We’ve already done most of the work required to render walls. Now, rather than drawing a column of pixels in a solid wall color, we’re instead going to render a column of pixels based on a wall texture.

After we’ve done the raycasting, we have our distance to the wall that we intersect and we have calculated the y-interval on the screen for the column of pixels we need to draw the wall in. In order to texture that wall, we need to find the corresponding column of pixels in the texture for that wall and render it into that column:

In Wolfenstein 3D, all textures are 64 x 64 pixels. This uniformity simplifies the problem.

We have the collision location after intersecting with our world geometry. This collision location is decomposed into a block index (x and y) and the remainder \((x_\text{ref}, y_\text{ref})\) – the position within that tile:

In this case, our tiles have width 1.0, to keep things simple. The world coordinate (3.152,1.000) is decomposed into tile indices (3,0) to give us the blue wall tile that we hit, and remainders (0.152, 1.000) to tell us where within the tile we hit. Because we always intersect at a point on the grid lines, one of our remainders is always either zero or one.

In this case, we use the \(x\) remainder to map into our texture. We’re viewing the texture from the north, so \(x_\text{rem} = 0.0\) corresponds to \(x_\text{texture} = 63\), and \(x_\text{rem} = 1.0\) corresponds to \(x_\text{texture} = 0.0\). The mapping in between is linear:

\[x_\text{texture} = 64 \frac{w_\text{tile} – x_\text{rem}}{w_\text{tile}}\]

The remainder we use, and whether we invert it, depends on the north/south/east/west direction the wall is facing. We can get this information from our decomposed coordinate:

f32 rem = 0.0f;
if (dx_ind == 0) {
   rem = dy_ind < 0 ? TILE_WIDTH - x_rem : x_rem;
} else {
   rem = dx_ind < 0 ? y_rem : TILE_WIDTH - y_rem;
}
u32 texture_x = min((int)(TEXTURE_SIZE*rem/TILE_WIDTH),TEXTURE_SIZE-1);

We have to cast \(x_\text{texture}\) to an integer to get our pixel index, and ‘and’ it by 64-1 to prevent accidental bounds issues.

Now that we have located our texture pixel column, we need to scan vertically through the screen’s pixel column and identify the corresponding pixel \(y\) positions in the texture as we go:

The pixel column on the screen is bounded between \(y_\text{hi}\) and \(y_\text{lo}\). The texture is a matrix, and so the \(y\) coordinate increases down the texture. Thus, our mapping is:

\[\begin{aligned} y = y_\text{hi} & \mapsto y_\text{texture} = 0 \\ y = y_\text{lo} & \mapsto y_\text{texture} = 63 \end{aligned}\]

with a linear mapping in between:

\[y_\text{texture} = 64 \frac{y_\text{hi} – y}{y_\text{hi} – y_\text{lo}} \]

(Note that we use 64 rather than 63 since we’ll be flooring this when we code it).

Our column-rendering code is thus:

u32 denom = max(1, y_hi - y_lo);
int y_lo_capped = max(y_lo, 0);
int y_hi_capped = min(y_hi, SCREEN_SIZE_Y-1);
for (int y = y_hi_capped; y >= y_lo_capped; y--) {
   u32 texture_y = (y_hi - y) * TEXTURE_SIZE / denom;
   u32 pixel_index = texture_y + texture_x*bitmap->n_pixels_per_column;
   u32 color = BITMAP_WALL.abgr[pixel_index];
   state.pixels[(y * SCREEN_SIZE_X) + x] = color;
}

We can make a slight optimization by factoring out the column lookup, and just increasing our pixel baseline:

u32 baseline = texture_y + texture_x*bitmap->n_pixels_per_column;
u32 denom = max(1, y_hi - y_lo);
int y_lo_capped = max(y_lo, 0);
int y_hi_capped = min(y_hi, SCREEN_SIZE_Y-1);
for (int y = y_hi_capped; y >= y_lo_capped; y--) {
   u32 texture_y = (y_hi - y) * TEXTURE_SIZE / denom;
   u32 color = BITMAP_WALL.abgr[texture_y+baseline];
   state.pixels[(y * SCREEN_SIZE_X) + x] = color;
}

We can calculate how many \(y\)-pixels to step every frame, and just use that instead:

u32 baseline = texture_y + texture_x*bitmap->n_pixels_per_column;
u32 denom = max(1, y_hi - y_lo);
f32 y_loc = (f32)((y_hi - y_hi_capped) * TEXTURE_SIZE) / denom;
f32 y_step = (f32)(TEXTURE_SIZE) / denom;
for (int y = y_hi_capped; y >= y_lo_capped; y--) {
   u32 texture_y = min((u32) (y_loc), TEXTURE_SIZE-1);
   u32 color = BITMAP_WALL.abgr[texture_y+baseline];
   state.pixels[(y * SCREEN_SIZE_X) + x] = color;
   y_loc += y_step;
}

This give us some nice texturing:

We can introduce a shading effect (also done in Wolfenstein 3D), by having different textures for north/south faces than east/west faces. I just have darkened copies of every texture. This makes the walls pop more, and subtly adds to the feeling of 3d immersion:

Speaking of multiple textures, let’s go ahead and do that:

How do we achieve this? Well, first, rather than having a single 64 x 64 pixel texture, here I am using a sheet of textures:

Right now this sheet has 2 columns, one for the lighter faces and one for the darker faces. It also currently has 3 wall types. To use a particular texture, we thus need to know its y index and whether it is light or dark. The wall type is based on the block type, which we store in our map data. We then apply a multiple of TEXTURE_WIDTH in the x and y directions when accessing this larger image:

u32 texture_x_offset = dx_ind == 0 ? 0 : TEXTURE_SIZE;
u32 texture_y_offset = (GetMapDataAt(x_ind,y_ind) - 1) * TEXTURE_SIZE;

Aside – Precompiled Texture Scans

I bought Fabien Sanglard’s Little Black Book on Wolfenstein 3D since writing the last post. It is a great overview of the original game, both technically and holistically. Now, the code written there looks and feels very different than the code we’re writing here, mainly because of the large differences in the hardware. Large parts of the book are about explaining how Wolfenstein 3D handled interrupts, was able to pipe pixel data through the VGA interface, and how fixed point arithmetic is used throughout the game’s logic.

I had a vague idea going into this project that the primary achievement behind Wolfenstein 3D was its raycasting engine. Yes, the raycasting engine was a breakthrough, but it is quite a small aspect in between myriad other engineering tricks that had to be pulled to get the game to run at a playable framerate back in the day.

One of the book’s tricks had to do with the texture column drawing code we just covered. Namely, this loop here:

for (int y = y_hi_capped; y >= y_lo_capped; y--) {
   u32 texture_y = min((u32) (y_loc), TEXTURE_SIZE-1);
   u32 color = BITMAP_WALL.abgr[texture_y+baseline]
   state.pixels[(y * SCREEN_SIZE_X) + x] = color;
   y_loc += y_step;
}

John Carmack (I’m assuming) had figured out a more efficient way to draw these pixel columns. In short, the machine code for these loops were precompiled for specific cases, and the renderer would just call the appropriate case. This would avoid all of the bookkeeping and for-loop if statements.

I learned that Wolfenstein 3D had hard-coded the player height to 1/2 the wall height, so y_lo_capped = y_hi_capped and y_lo = y_hi, always. This somewhat simplifies things. They could then precompile a loop based on the number of pixels in the column (which is 2*y_lo_capped). For example, if the column has height 2 (for a very distant wall), the precompiled function would be:

void draw_wall_2pix(
  uint32* screen_pixels,
  uint32* texture_pixels,
  uint32 screen_baseline,
  uint32 texture_baseline,
  ) {
  screen_pixels[screen_baseline++] = texture_pixels[texture_baseline];
  screen_pixels[screen_baseline++] = texture_pixels[texture_baseline+63];
}

Or, for 4 pixels:

void draw_wall_4pix(
  uint32* screen_pixels,
  uint32* texture_pixels,
  uint32 screen_baseline,
  uint32 texture_baseline,
  ) {
  screen_pixels[screen_baseline++] = texture_pixels[texture_baseline];
  screen_pixels[screen_baseline++] = texture_pixels[texture_baseline+16];
  screen_pixels[screen_baseline++] = texture_pixels[texture_baseline+32];
  screen_pixels[screen_baseline++] = texture_pixels[texture_baseline+63];
}

I thought that was a pretty neat trick. Sure, these precompiled functions take up space. (If I recall correctly, they generated one of these for every pixel column that was a multiple of 2, and used the closest). Drawing pixels is what the game is spending most of its time doing, so it makes sense to optimize. At least, it did back then. Right now I have no trouble hitting 60 Hz, so I don’t (yet) have to do this. (I’m also don’t have a full game – just some basic texture rendering).

Floor and Ceiling Textures

Wolfenstein 3D did not have floor and ceiling textures. Instead, it just filled them in with solid colors. In fact, it actually just filled in the floor and ceiling first and then drew the walls over that afterwards.

Ceiling and wall textures are similar to walls, we just have to work out the math again:

Here we have a ray cast through a pixel above the center of the screen such that it eventually hits the ceiling. The horizontal distance that the ray travels is \(r\), but the actual ray length is \(\ell\). From triangle similarity we get the same relation we used when working on walls in the previous post:

\[\frac{z”}{r”} = \frac{H-z}{r} \qquad \mapsto \qquad r = \frac{H-z}{z”} r”\]

We can calculate \(z”\) for a given pixel by calculating it from the pixel’s \(y\) index:

\[z” = \frac{y\, – \frac{1}{2}n_y}{n_y} h_\text{camera} \]

where \(n_y\) is the number of pixels in our screen’s y dimension, and \(h_\text{camera}\) is the camera’s physical height in world dimensions (a parameter we choose and can vary).

We’ve been using \(r” = 1\) for horizontal raycasting. Unfortunately, that’s only right when we cast a ray directly in the direction the player is facing. We have to adjust this by a small factor based on how far our pixel’s \(x\)-location is off from the screen’s center:

\[r” = \sqrt{1 + x_\delta^2}, \qquad x_\delta = \frac{x – \frac{1}{2}n_x}{n_x} w_\text{camera}\]

where \(n_x\) is the number of pixels in our screen’s x dimension and \(w_\text{camera}\) is the camera’s physical width in world dimensions (It should be related to \(n_y\) by our screen’s aspect ratio).

Knowing these, we can calculate the horizontal radius using the first equation:

int x, y; // pixel x and y coordinates, y > SCREEN_SIZE_Y/2
f32 x_delta = (x-SCREEN_SIZE_X/2.0f)/SCREEN_SIZE_X*state.camera_width; 
f32 rpp = sqrt(1.0f + x_delta*x_delta);
f32 zpp = (y-(SCREEN_SIZE_Y/2.0f))*(state.camera_height/SCREEN_SIZE_Y);
f32 r = (WALL_HEIGHT - state.camera_z)*rpp/zpp;

The horizontal location where we intersect with the ceiling is simply \(r\) in the direction through the screen’s pixel relative to the camera’s direction. If the direction the camera is facing is \(\boldsymbol{\hat{b}}\), and \(\texttt{rotr}(\boldsymbol{\hat{b}})\) is the right-hand 90-degree rotated camera direction, then the horizontal raycast direction \(\boldsymbol{\hat{d}}\) is:

\[\begin{aligned} \boldsymbol{\hat{d}} &= \texttt{normalize}( \boldsymbol{\hat{b}} + x_\delta \texttt{rotr}(\boldsymbol{\hat{b}}) ) \\ &= \frac{1}{\sqrt{1 + x_\delta^2}} \left(\boldsymbol{\hat{b}} + x_\delta \texttt{rotr}(\boldsymbol{\hat{b}}\right) \\ &= \frac{1}{r”} \left(\boldsymbol{\hat{b}} + x_\delta \texttt{rotr}(\boldsymbol{\hat{b}}\right) \end{aligned}\]

This makes the horizontal 2D intersection point:

\[\begin{aligned} \boldsymbol{p}_\text{hit} &= \boldsymbol{p}_\text{camera} + r \boldsymbol{\hat{d}} \\ &= \boldsymbol{p}_\text{camera} + r \frac{1}{r”} \left(\boldsymbol{\hat{b}} + x_\delta \texttt{rotr}(\boldsymbol{\hat{b}})\right) \\ & = \boldsymbol{p}_\text{camera} + \left((H – z_\text{camera}) \frac{r”}{z”}\right) \frac{1}{r”} \left(\boldsymbol{\hat{b}} + x_\delta \texttt{rotr}(\boldsymbol{\hat{b}})\right) \\ &= \boldsymbol{p}_\text{camera} + \frac{H – z_\text{camera}}{z”} \left(\boldsymbol{\hat{b}} + x_\delta \texttt{rotr}(\boldsymbol{\hat{b}})\right) \end{aligned}\]

It turns out, we don’t need to calculate \(r”\) at all!

Once we have the location of the intersection, we can use fmod based on the tile width to get the position in that texture relative to its local coordinates. We then use that texture location to set the pixel color:

u32 texture_x = (int)(fmod(hit_x,TILE_WIDTH)/TILE_WIDTH*TEXTURE_SIZE)
                      & (TEXTURE_SIZE-1);
u32 texture_y = (int)(fmod(hit_y,TILE_WIDTH)/TILE_WIDTH*TEXTURE_SIZE)
                      & (TEXTURE_SIZE-1);
u32 color = GetColumnMajorPixelAt(&BITMAP_FLOOR,
               texture_x+texture_x_offset, texture_y+texture_y_offset);
state.pixels[(y * SCREEN_SIZE_X) + x] = color;

While this is technically enough to get ceiling pixels on the screen, the current method requires a raycast for every pixel. We can greatly simplify this.

We can use the fact that all of the intersection points for a given row of pixels lie on a line:

This means that, for every pixel row, we only need to compute the intersection points for the leftmost and rightmost pixel, and then we can interpolate between them.

// Ray direction for x = 0
f32 half_camera_width = state.camera_width/2.0f;
f32 ray_dir_lo_x = state.camera_dir.x +
        half_camera_width*state.camera_dir_rotr.x;
f32 ray_dir_lo_y = state.camera_dir.y +
        half_camera_width*state.camera_dir_rotr.y;

// Ray direction for x = SCREEN_SIZE_X
f32 ray_dir_hi_x = state.camera_dir.x -
        half_camera_width*state.camera_dir_rotr.x;
f32 ray_dir_hi_y = state.camera_dir.y -
        half_camera_width*state.camera_dir_rotr.y;

// Draw ceiling
for (int y = SCREEN_SIZE_Y/2 + 1; y < SCREEN_SIZE_Y; y++) {
   // Radius
   f32 zpp = (y-(SCREEN_SIZE_Y/2.0f))*(state.camera_height/SCREEN_SIZE_Y);
   f32 radius = (WALL_HEIGHT-state.camera_z)/zpp;

   // Location of the 1st ray's intersection
   f32 hit_x = state.camera_pos.x + radius * ray_dir_lo_x;
   f32 hit_y = state.camera_pos.y + radius * ray_dir_lo_y;

   // Each step toward hit2
   f32 step_x = radius*(ray_dir_hi_x-ray_dir_lo_x)/SCREEN_SIZE_X;
   f32 step_y = radius*(ray_dir_hi_y-ray_dir_lo_y)/SCREEN_SIZE_X;

   for (int x = 0; x < SCREEN_SIZE_X; x++) {
      int x_ind_hit = (int)(floorf(hit_x/TILE_WIDTH));
      int y_ind_hit = (int)(floorf(hit_y/TILE_WIDTH));
      f32 x_rem_hit = hit_x - TILE_WIDTH*x_ind_hit;
      f32 y_rem_hit = hit_y - TILE_WIDTH*y_ind_hit;

      u32 texture_x = (int)(x_rem_hit/TILE_WIDTH * TEXTURE_SIZE);
      u32 texture_y = (int)(y_rem_hit/TILE_WIDTH * TEXTURE_SIZE);
      u32 color = GetColumnMajorPixelAt(&BITMAP,texture_x,texture_y);
      state.pixels[(y * SCREEN_SIZE_X) + x] = color;
      // step
      hit_x += step_x;
      hit_y += step_y;
      }
} 

We can render the floor in the same way.

Combining the two, and then rendering our walls on top of it, gives us:

Right now we’re just rendering a single repeated texture for both the floor and the ceiling. We can, if we want, specify floor and ceiling textures on a per-cell basis in much the same way that we specify wall types. This just involves figuring out the tile index of the intersection (handling cases where we see pixels outside of the map’s bounds) and then rendering from the corresponding texture.

Conclusion

This post took our colored blocky Wolfenstein level and textured it. We started with wall textures, which we can draw at the appropriate height given our raycast distance. We have to figure out which column to render from our texture, and then scale it appropriately as we render it. For this reason, the wall textures are stored in column-major order.

We then added floor and ceiling textures, which we render first to avoid having to figure out occlusion with the walls. That makes raycasting trivial. However, unlike walls, the floors and ceiling are at a weird angle with respect to our camera heading. The big simplification here comes from being able to linearly interpolate across the screen.

Once again we were able to leverage what basically amounts to trigonometry and some indexing in order to get some neat interactive results.

This post did not cover texture creation and loading. In short, I created them in Gimp, saved them to .tif, and wrote a simple Julia script to export them into an assets binary. The code can thus load a single assets file and just point into that. I might talk about this in a future post.

The code for this demo is available here (with assets here).

Wolfenstein 3D Raycasting in C

I recently happened upon a video on YouTube in which JDH implements a Doom clone. He does this in C, without any graphics libraries. Just SDL and a pixel buffer:

u32 pixels[SCREEN_SIZE_X * SCREEN_SIZE_Y]; 

The video blew me away. I have long admired how John Carmack et. al. were the first to create playable fast-paced 3D games. It was a technical achievement and a true breakthrough. There is also something clean and pure about restricting oneself to pure C. It is closer to the hardware. You’re setting each pixel byte by byte. You’re totally in control.

I had another project in the works, but this video just kept turning over in my head. I decided to give the first part, rendering for Wolfenstein 3D in pure C, a stab myself.

This post is my attempt at doing Wolfenstein-style rendering, based on gameplay I saw at the beginning of JDH’s video. It is probably not exactly what happens in Wolfenstein, but I’m hoping its somewhat close. I plan to check back with a future post to see where / how things diverge.

Setup

The biggest hurdle to getting started was setting myself up to program in C. I have no pure-C projects. Heck, even pure C++ projects can be a pain to set up.

I typically program in Sublime Text and compile in the terminal. However, for this I wanted to use Visual Studio Code. It seems to have the best debugger integration, and I feel like all game-programming folks that I follow use it. I have not used VS code on personal projects, so I figured this would be a great opportunity to change that.

As per usual, getting this all to work requiring some Googling and messing around for a while. I figure it’s worth talking about the process, at the very least so that I can refer to it myself in the future when I set up new projects. I’ll cover that at the end of this post, but for now let’s jump straight to C programming.

SDL Game Loop

We have to get a window we can render to and establish a basic game loop. This has the form:

int main(int argc, char *argv[]) {
   ... init things like the window ...
   
   // main loop
   while (state.quit == 0) {
      ProcessEvents();
      TickGame(dt);
      RenderToPixelBuffer();
      PresentPixelBuffer();
   }
}

One iteration through the main loop corresponds to one frame of our game. If the game runs at 30Hz, then we looping through this 30 times per second.

  • ProcessEvents – Games and other programs need to react to player inputs. Rather than have callbacks that interrupt your game in the middle of what its doing, SDL collects everything that happens in an event queue. We start every frame by emptying the queue and processing all of the events to determine how they impact our game.
    For example, if the player hits the ‘w’ key, we might want to store that information to later use it to move the player forward.
  • TickGame – This method updates the game state. In a larger game this would do a whole lot more work. Here we’re basically just updating the player’s position, orientation, and velocity based on keyboard inputs.
  • RenderToPixelBuffer – This is the core part of what makes this project exciting. Here we render the world state by setting color values in our pixel buffer. We have to render based on the camera’s current location and orientation, and use that information to get the pixels right.
  • PresentPixelBuffer – Here we ask SDL to take our pixels and show them on the screen. In my case, this presentation code has vsync enabled, so it is sychronized with the monitor refresh rate. This automatically waits the appropriate amount of time to update at that rate. In my case, that means the program ends up executing at 30Hz, despite being able to technically run much faster.

As far as SDL goes, we have to initialize a window, a texture, and a renderer. The window controls the actual window panel that shows up on our screen. The texture associated with the window contains the pixel data that is presented in the window. The renderer is needed for transferring our pixel buffer to the texture.

We copy our pixel buffer into this texture at the end of every frame. JDH created his texture with the pixel format SDL_PIXELFORMAT_ABGR8888, which means that every byte is an unsigned 32-bit number with 1 byte each for alpha, blue, green, and red. Thus, 0xFF00FF00 is opaque green.

We additionally maintain a pixel buffer:

u32 pixels[SCREEN_SIZE_X * SCREEN_SIZE_Y];

This pixel buffer is what we will manipulate in order to depict the 3D world “seen” by our player camera. Each pixel is also a 32-bit ABGR value.

For example, the following produces:

int x = 52;
int y = 75;
state.pixels[(y * SCREEN_SIZE_X) + x] = 0xFFFF00FF;
(we’ve got one magenta pixel here close to the bottom left)

We’ve got a lot of pixel work ahead of us to go from this to 3D graphics. But therein lies the challenge! No rendering libraries, just algorithms and math 😀

Problem

A Wolfenstein level is defined on a 2D grid. Grid tiles can be solid or empty, with walls taking up exactly one grid tile. Our camera only ever rotates horizontally, never vertically. These facts make rendering 3D scenes so much easier than general 3D rendering.

The grid dimensions are defined by two numbers – the tile width and the wall height:

In this case I’m using 1.0 and 1.2 units of length, respectively.

Our player exists at some 2D location within this world, and is facing in some direction. Our camera is coincident with the player and faces the same direction:

The camera has a horizontal and vertical field of view. In the top-down 2D perspective we can see its horizontal field of view:

In a side 2D perspective we can see its vertical field of view:

We can imagine our game screen being a plane lying in this field of view, fairly close to the player. Our objective is to set each pixel based on what the camera “sees” at that pixel’s location. The color is determined by the object we intersect with if we cast a ray from the camera’s origin, through the pixel, and then intersect with the world:

In the case of a 2D cross-section, a bottom stretch of pixels are colored the floor’s color, a top stretch of pixels are colored the ceiling’s color, and everything in between is colored the block’s color:

Wolfenstein’s blocky world and consistent floor and ceiling height mean we can get all the information we need to render a column of pixels by doing a single horizontal ray cast:

We can use the fact that the larger triangles formed with the full raycast distance are similar to the triangles that from the camera plane a distance \(d=1\) from the player. If the raycast distance is \(\ell\), the player’s height is \(z\), and the wall height is \(H\), then:

\[z’ = \frac{d z}{\ell}, \qquad z” = \frac{d(H-z)}{\ell}\]

If we have \(n\) pixels per column (in my case, 360), then we need to fill pixels according to:

\[\begin{cases} \text{floor color} & \text{if } y < n/2 – \frac{d z}{\ell} n \\ \text{wall color} & \text{if } y < n/2 + \frac{d(H-z)}{\ell} n \\ \text{ceiling color} & \text{otherwise} \end{cases}\]

// Calculate the pixel bounds that we fill the wall in for
int y_lo = (int)(SCREEN_SIZE_Y/2.0f - cam_len*state.camera_z/ray_len * SCREEN_SIZE_Y / state.camera_height);
int y_hi = (int)(SCREEN_SIZE_Y/2.0f + cam_len*(WALL_HEIGHT - state.camera_z)/ray_len * SCREEN_SIZE_Y / state.camera_height);
y_lo = max(y_lo, 0);
y_hi = min(y_hi, SCREEN_SIZE_Y-1);

fill_column(x, 0, y_lo-1, color_floor);
fill_column(x, y_lo, y_hi, color_wall);
fill_column(x, y_hi + 1, SCREEN_SIZE_Y-1, color_ceil);

Next we have to figure out how to do the ray cast to get \(\ell\). We need to perform a ray cast for every column of pixels on our screen. Each such ray will originate at the player’s location, and then head in the direction of its pixel column until it strikes a wall. The distance traveled is \(\ell\). For example:

We can leverage the fact that we have a grid. We know that the intersection point, when we eventually find it, will lie on one of the grid’s horizontal or vertical lines.

Let’s start by decomposing our initial coordinate, the camera position \((p_x, p_y)\), into its tile indices and offsets from the tile’s bottom-left corner:

int x_ind_cam = (int)(floorf(state.camera_pos.x / TILE_WIDTH));
int y_ind_cam = (int)(floorf(state.camera_pos.y / TILE_WIDTH));
f32 x_rem_cam = state.camera_pos.x - TILE_WIDTH*x_ind_cam;
f32 y_rem_cam = state.camera_pos.y - TILE_WIDTH*y_ind_cam;

The raycast direction can be calculated from the pixel column position. We assume the camera screen is a unit distance \(d = 1\) from the camera, \(\boldsymbol{c}\). The raycast direction is a function of the camera’s width \(w\), the number of screen pixels \(m\), and the direction the camera is facing, \(\hat{\boldsymbol{b}}\):

We can locate the pixel’s location \(\boldsymbol{p}\) as:

\[\boldsymbol{p} = \boldsymbol{c} + d \hat{\boldsymbol{b}} + \left(\frac{w}{2} – w \frac{x}{m}\right) \texttt{rotr}(\hat{\boldsymbol{b}}) = \boldsymbol{c} + d \hat{\boldsymbol{b}} + \left(\frac{w}{2} – w’\right) \texttt{rotr}(\hat{\boldsymbol{b}})\]

Here \(\texttt{rotr}(\hat{\boldsymbol{b}})\) is the camera direction rotated by 90 degrees in the right-hand direction (counter clockwise).

The raycast direction is aligned with \(\bar{\boldsymbol{pc}}\):

\[\bar{\boldsymbol{pc}} = \boldsymbol{p} – \boldsymbol{c} = \boldsymbol{c} + d \hat{\boldsymbol{b}} + \left(\frac{w}{2} – w \frac{x}{m}\right) \texttt{rotr}(\hat{\boldsymbol{b}})\]

We can normalize \(\bar{\boldsymbol{pc}}\) by dividing by its length to produce our raycast direction \(\hat{\boldsymbol{r}}\).

Raycasting

Now that we have our ray origin and unit direction, we can cast it out into the map. If we imagine that the ray is travelling at unit speed, then its position vs. time within its current cell is:

\[\begin{aligned} x(t) & = x_\text{rem} + \hat{r}_x dt \\ y(t) & = y_\text{rem} + \hat{r}_y dt \end{aligned} \]

The direction the ray is facing determines when it will cross certain tile boundaries:

  • We cross \(x_\text{rem} = 0\) if \(\hat{r}_x <0\), at \(dt = -x_\text{rem} / \hat{r}_x\)
  • We cross \(x_\text{rem} = w_\text{TILE}\) if \(\hat{r}_x > 0\), at \(dt = (w_\text{TILE} – x_\text{rem}) / \hat{r}_x\)
  • We cross \(y_\text{rem} = 0\) if \(\hat{r}_y <0\), at \(dt = -y_\text{rem} / \hat{r}_y\)
  • We cross \(y_\text{rem} = w_\text{TILE}\) if \(\hat{r}_y > 0\), at \(dt = (w_\text{TILE} – y_\text{rem}) / \hat{r}_y\)

We can generalize this to avoid having a bunch of if statements in our code:

  • if \(\hat{r}_x < 0\), then \(dt = -1/\hat{r}_x \cdot x_\text{rem} + 0\) and \(x_\text{ind}\) will decrease by 1
  • if \(\hat{r}_x > 0\), then \(dt = -1/\hat{r}_x \cdot x_\text{rem} + w_\text{TILE}/\hat{r}_x\) and \(x_\text{ind}\) will increase by 1
  • if \(\hat{r}_x = 0\), then \(dt = 0 \cdot x_\text{rem} + \infty\) and \(x_\text{ind}\) will not change

We can use the same statements for the y direction.

This simplifies walking across the grid to calculating \(dt\) for x and y, and selecting whichever one is smaller (the earlier crossing). We then update our decomposed position (tile index and remaining offset) appropriately. This process continues until we enter a solid tile.

while (true) {
   f32 dt_best = INFINITY;
   dx_ind = 0;
   dy_ind = 0;
            
   f32 dt_x = dx_a*x_rem + dx_b;
   f32 dt_y = dy_a*y_rem + dy_b;
   if (dt_x < dt_y) {
       dt_best = dt_x;
       dx_ind = dx_ind_dir;
       dy_ind = 0;
   } else {
       dt_best = dt_y;
       dx_ind = 0;
       dy_ind = dy_ind_dir;
   }

   // Move up to the next cell
   x_ind += dx_ind;
   y_ind += dy_ind;
   x_rem += dir.x * dt_best - TILE_WIDTH*dx_ind;
   y_rem += dir.y * dt_best - TILE_WIDTH*dy_ind;

   // Check to see if the new cell is solid
   if (MAPDATA[y_ind*8 + x_ind] > 0) {
      break;
   }
}

Once we’ve collided, we can back out the raycast length. We know from the tile and from which direction we entered it what color we should render it, if we have different colors for different blocks. We can even employ a basic lighting trick where we color x-sides differently than y-sides:

The right image employs a basic lighting trick where y-faces are lighter.

And that’s it! We’ve got basic 3d graphics on the screen.

Profiling

I was curious how efficient this code is, and tried to measure it. I initially used SDL_GetTicks to measure the framerate in milliseconds, only to discover that SDL_RenderPresent is automatically vsyncing with the display to 30 Hz. So I wrapped everything before the SDL calls between sys/time.h gettimeofday calls. I am currently getting about 4.5ms per frame, which is 222 frames per second.

I’m certainly happy with that. I’m sure there are plenty of additional tricks that could be employed, and if things grow in complexity that headroom could very quickly be eaten away. I am planning on looking into how Wolfenstein rendering is _actually_ done.

Conclusion

Rendering in general 3D spaces with arbitrary object geometry and orientations is pretty tricky. Rendering when you constrain yourself to a grid world where you can only look horizontally is a lot easier. Granted, this demo only scratched the basics, but its still pretty cool to see how quickly one can get something neat up and running with some math and code.

The code for this demo is available here.

Special thanks to JDH for the inspiration to pursue this and for the reference code to get up and running.

Getting Started with VS Code on Ubuntu

In this section I wanted to give some pointers on how to set oneself up with Visual Studio Code on Ubuntu.

To start, I downloaded VS Code. For Ubuntu, that means downloading the .deb file. I then installed it with:

sudo apt install ./<file>.deb

I then installed SDL2:

sudo apt install libsdl2-dev libsdl2-2.0-0

I then installed some recommended VS code extensions. To do that, you have to navigate via the left-hand bar to the tetris-like icon:

Then, use the provided search bar to find C/C++:

You can click on that extension and install it.

I similarly installed:

  • C/C++ Makefile Project by Adriano Markovic
  • C/C++ Runner by franneck94
  • Code Runner by Jun Han
  • CodeLLDB by Vadim Chugunov

Each VS Code project is contained inside a workspace. I created a new folder on my machine and navigated to it inside the VS code terminal. I then initialized a new project with Ctrl + Shift + P, and selected “C/C++ Make: INIT Project”. This creates a new C/C++ project with a Makefile.

I then edited the Makefile to link to SDL2 and math.h:

CXXFLAGS = -std=c11 -Wall -g # Note: -g adds proper debugging with symbols
LDFLAGS = -lSDL2 -lm

and named my program:

APPNAME = TOOM

I created a new directory, src, and created a new file, main.c inside of it. That’s where the coding starts!

#include <math.h>
#include <stdio.h>
#include <SDL2/SDL.h>
int main(int argc, char *argv[]) {
    printf("Hello World!\n");
    return 0;
}

At this point I could make the project in the terminal by typing make, and could execute it by typing ./TOOM. However, to use the debugger, I needed to set up a tasks.json file in the generated .vscode directory:

Then, opening the debugger via the left-column icon and hitting the settings gear icon creates a launch.json file. I set this to run the prelaunch task “build” which we just set up in tasks.json. The overall config looks like this:

This lets you set breakpoints in your code and execute in debug mode. You can use VS code’s fantastic introspection tools to see variable values and see what your program is up to.

Subsumption Architectures and Transformers

I’ve run into subsumption architectures a few times now. They’re basically a way to structure a complicated decision-making system by letting higher-level reasoning modules control lower-level reasoning modules. Recently, I’ve been thinking about some parallels there with the transformer architecture that has dominated large-scale deep learning. This article is a brief overview of subsumption architectures and attempts to tie them in with attention models, with the hope to better understand both.

What is a Subsumption Architecture?

The Wikipedia entry on subsumption architecture describes it as an approach to robotics from the 80s and 90s that composes behavior into hierarchical modules. The modules operate in parallel, with higher-level modules providing inputs to lower-level modules, thereby being able to operate and reason at a higher level.

These architectures were introduced by Rodney Brooks (of Baxter fame, though also much more) and his colleagues in the 80s. Apparently he “started creating robots based on a different notion of intelligence, resembling unconscious mind processes”. Brooks argued that rather than having a monolithic symbolic reasoner that could deduce what to do for the system as a whole, a subsumption approach could allow each module to be tied closer to the inputs and outputs that it cared about, and thus conduct local reasoning. This argument is often tied together with his view that, rather than making a big symbolic model of the world, which is a hopeless endeavor, that “the world is its own best model” and submodules should be able to directly interact with the portion of the world they interact with.

Perhaps this is best described using an example. Consider a control system for a fixed-wing aircraft. A basic software engineering approach might architect the aircraft’s code to look something like this:

Basic decomposition for the lowest level of a fixed-wing autopilot

where, for most flight dynamics, we can decouple forward speed / pitching / climbing from yaw / roll. Unfortunately, we run into some complexity as we consider the behavior of the longitudinal controller. When taking off, we want full throttle and to maintain a fixed target pitch, when flying below our target altitude, we want to use full throttle and regulate airspeed with pitch control, when near our target altitude we want to control our altitude with pitch and control our airspeed with throttle, and when above our target altitude we want zero throttle and to regulate airspeed with pitch control:

Longitudinal control strategies based on altitude. Effectively Figure 6.14 from Small Unmanned Aircraft.

As such, our longitudinal controller has quite a lot of internal complexity! Note that this is only for takeoff to cruise; landing would involve more logic. We want to control our aircraft with different strategies depending on the context.

Subsumption architectures are a means of tackling this sort of context-based control. Rather than programming a bunch of if/then statements in a single longitudinal control module, we can further break apart our controller and allow independent modules to operate concurrently, taking control when needed:

Breaking down the Longitudinal Controller into Hierarchical Modules

This architecture works by splitting our monolithic logic into independent modules. Each module can receive inputs from any number of upstream modules. For example, the throttle output module, which forwards the final throttle command to the motor, can receive direct throttle commands from the airspeed-via-throttle-controller, the descent controller, the climb controller, or the takeoff controller.

Each module thus needs a method for knowing which input to follow. There are multiple ways to achieve this, but the simplest is to assign different modules different priorities, and their outputs inherit those priorities. The highest propriety input into any given model would be the one acted upon. In our little example above, the cruise controller could just be running continuously, with its output being overwritten by the takeoff controller when it activates during takeoff.

Our longitudinal controller modules running in cruise, where only the cruise controller is active. The descent, climb, and takeoff controllers are dormant. As such, the airspeed-via-pitch and pitch controllers are dormant.

We get our emergent takeoff behavior by allowing the takeoff controller, when it activates, to override the cruise controller’s output:

Our longitudinal controller modules running in takeoff. The cruise controller is always running, but its lower-priority outputs are overwritten by the higher-priority takeoff controller.

This modularization has some major advantages. First, modules are self-contained, and it is much easier to define their desired behavior and to test it. It is much easier to validate the behavior of a simple pitch controller than the complicated behavior of the monolithic longitudinal controller. Second, the self-contained nature means that it is much easier to extend behavior by swapping or implementing new modules. Good luck having a new member of the team hook up everything properly in a monolithic code base.

We can similarly take this approach further and take it farther up the stack, with our longitudinal and lateral controllers taking inputs from trajectory trackers, or waypoint followers, a loitering controller, or a collision-avoidance module. All of the same principles apply. Finally, it is easier to monitor and debug the behavior of a modular system, especially if we make the modules as stateless as possible. We can log their inputs and outputs, and then can (more) easily figure out what the system was doing at any given time.

The approach is not without its disadvantages. The most obvious change is that we now have a bunch of disparate modules that can all act at various times. In the monolithic case we can often guarantee (with if statements) that only certain modules are active at certain times, and curtail complexity that way. Another issue is added complexity in handoff – how do we handle the transitions between module activations? We don’t want big control discontinuities. Finally, we have the issue of choosing priority levels. For simple systems it may be possible to have a strict priority ordering, but in more complicated situations the relative ordering becomes trickier.

Subsumption architectures were originally invented to be able to easily introduce new behaviors to a robotic system. Our previous controller, for example, could be extended with multiple higher-level modules that use the same architectural approach to command the lower-level modules:

Composition lets us subsume lower-level behaviors in higher-level modules. This is the main idea of subsumption architectures.

The running example is for a fixed-wing controller, but you could imagine the same approach being applied to a robot. The Wikipedia article suggests a higher-level exploration behavior that can call lower-level walk-around or avoid-object modules.

Unconscious Mind Processes

Brooks was originally motivated by subsumption architectures because it allowed for “unconscious mind processes”. By this I think Brooks meant that higher level controllers typically need not concern themselves with low level details, and could instead rely on lower level controllers to handle those details for them. In our example above, a trajectory follower can just direct the cruise controller to follow the trajectory, and does not need to think about elevator or throttle rates directly.

One might argue that claiming that this sort of an approach resembles how the human mind works is a stretch. We are all familiar with sensational articles making big claims on AI approaches. However, I think there are core ideas here that are in line with how human reasoning might come to be.

For example, Brooks argues that subsumption models have emergence. We don’t think of any of the lower-level modules as being particularly intelligent. A pitch controller is just a PID loop. However, the combined behavior of the symphony of higher-level modules knowing when to jump in and guide behavior causes the overall system to exhibit quite sophisticated reasoning. You, a human, also have some “dumb” low-level behavior – for example what happens when you put your hand on a hot stove top. Your higher-level reasoning can choose to think at the level of individual finger movements, but more often thinks about higher-level behavior like picking up a cup or how far you want to throw a baseball.

I think this line of reasoning bears more fruit when we consider the progress we have made with deep learning architectures. The field has more or less converged on a unified architecture – the transformer. Transformers depart from prior sequence learning architectures by doing away with explicit recurrence and convolution layers in favor of attention. Attention, it just so happens, is the process of deciding which of a set of inputs to pay attention to. Subsumption architectures are also very much about deciding which of a set of inputs to pay attention to.

We could restructure our subsumption architecture to use attention:

The transformer (attention-based) architecture is basically a deep learning counterpart to subsumption architectures. Here we have attention blocks between layers of fidelity.
Alternatively, every module could just attend to whatever inputs it chooses. There are still multiple levels of fidelity, but the adherence need not be as strict.

It would make a lot of sense for a robust deep learning model to have produced, inside of its layers, a diverse set of modules that can spring into action at appropriate moments and make use of other modules in lower levels. Such an outcome would have big wins in robustness (because lower-level modules become general tools used by higher-level modules), have the benefit of progressive learning (we’d refine our basic behaviors first, like when a child first learns how to walk before reasoning about where to walk to), and we’d be able to develop more nuanced behaviors later in life by building new higher-level modules.

Interestingly, this sort of emergence is the opposite of what a classical deep neural network typically produces. For perception, it has been shown that the first layers of a neural network produce high level features like edge detectors, and then the deeper you go, the more complicated the features get. Eventually, at the deepest levels, these neural networks have features for dogs or cats or faces or tails. For the subsumption architecture, the highest fidelity reasoning happens closest to the output, and the most abstract reasoning happens furthest from the output.

The Tribulations of Learning

Unfortunately, the deep-learning view of things runs into a critical issue – our current reliance on gradient information to refine our models.

First off, the way we have things structured now requires gradient information from the output (i.e., motor controllers) to propagate back up the system to whichever cause produced it. That works pretty well for the modules closest to the output, but modules further from the output can suffer from vanishing gradients.

Vanishing gradients affect any large model. Modern architectures include skip connections (a la ResNet) to help mitigate this effect. It helps some.

Second, these models are typically trained with reinforcement learning, and more specifically, with policy gradient methods. These are notorious for being high variance. It can take a very long time for large-scale systems to learn to generalize well. Apparently Alpha Go played 4.9 million times in order to train itself to play well, which is way more than any human master ever played. Less structured tasks, like robot grasping, seem to also require massive quantities of training time and data.

This second problem is more or less the main stickler of reinforcement learning. We know how to build big systems and train them if we are willing and able to burn a whole lot of cash to do so. But we haven’t figured out how to do it efficiently. We don’t even know how to leverage some sort of general pre-training. There are certainly projects that attempt this, it just isn’t there yet.

The third problem is more specific to the hierarchical module architecture. If we get sparse activation, where some modules are only affected by certain parent modules, then there are presumably a bunch of modules that currently do not contribute to the output at all. The back-propagated gradient to any inactive module will be zero. That is bad, because a module with zero gradient will not learn (and may degrade, if we have a regression term).

The white modules here, being inactive, would not get any gradient information backpropagated to them if we stick to traditional policy gradient methods

I think the path forward is to find a way past these three problems. I suspect that subsumption architectures have untapped potential in the space of reinforcement learning, and that these problems can be overcome. (Human babies learn pretty quickly, after all). There is something elegant about learning reliable lower-level modules that can be used by new higher level modules. That just seems like the way we’ll end up getting transfer learning and generalizable reasoning. We just need a way to get rewards from places other than our end-of-the-line control outputs.

Conclusion

So we talked about subsumption architectures, that’s cool. What did we really learn? Well, I hope that you can see the parallel that we drew between the subsumption architectures and transformers. Transformers have rapidly taken over all of large-scale deep learning, so they are something to pay attention to. If there are parallels to draw from them to a robotics architecture, then maybe that robotics architecture is worth paying attention to as well.

We also covered some of the primary pitfalls. These challenges more or less generally need to be overcome to figure out scalable / practical reinforcement learning, at least practical enough to actually apply to complex problems. I don’t provide specific suggestions here, but oftentimes laying out the problem helps us get a better perspective in actually solving it. I remain hopeful that we’ll figure it out.

2D Player Collision against Static Geometry

A few weeks ago I got an email asking me about a game I worked on ages ago. I had to inform the sender that, unfortunately, my inspired dreams of creative development had not come to full fruition.

The message did get me thinking. What had I been doing back then, about 8 years ago? What problems did I run into? Why did I stop?

Now, I took a look at my old source code, and I’m pretty sure I stopped because of the mounting complexity of classes and cruft that happens in a growing codebase, especially when one has comparatively little idea of what one is doing. That, and I’m pretty sure this was around the time of my qualification exams, and well, I think that took over my life at the time.

Anyhow, I was thinking about this top-down 2D game, and the thoughts that kept circling around my brain centered around 2D character movement and collision. This was only magnified when I played some iPhone games to supplement my entertainment when I flew long distance over the holidays, and grew dismayed at some of the movement systems. I thought player movement was hard back then, and even today a fair number of games don’t really cut the mustard.

The Problem

So this post is borne out of mulling over this old problem of how to move a player around in a 2D game without them running into things. We want a few things:

  • Nice continuous movement
  • To not intersect static geometry
  • To nicely slide along static geometry
  • To not get hung up on edges, and nicely roll around them

And, as bonus objectives, we’d love to:

  • Do all of these things efficiently
  • Be able to handle non-tile maps with arbitrary geometry

The central issue with many games is that you have things you don’t want to run into. For example, below we have a screenshot from my old game-in-development Mystic Dave. We have a player character (Mystic Dave) and some objects (red) that we don’t want Dave to run into:

The player typically has a collision volume centered on their location. This collision volume is not allowed to intersect with any of the level collision geometry:

So what we’re really dealing with is moving this player collision volume around with the player, and making sure that it doesn’t intersect with anything.

Basic Player Movement

The player movement part is pretty simple. We can just look at whatever direction the player input is getting us to move in, and move ourselves in that direction:

void UpdatePlayer(Vec2 input_dir) {
   Vec2 pos_delta = input_dir * kPlayerStepSize;
   pos_player_ += pos_delta;
}

We’ll run into tuning problems later on if we don’t make this a function of the time step, so let’s go ahead and do that now:

void UpdatePlayer(Vec2 input_dir, float dt) {
   Vec2 pos_delta = input_dir * kPlayerStepSizePerTime * dt;
   pos_player_ += pos_delta;
}

This will give us some pretty basic movement:

That movement is a bit boring. It stutters around.

We can make it smoother by storing the player’s velocity and adding to that instead. We enforce a maximum velocity and apply some air friction to slow the player down as we stop our inputs:

void UpdatePlayer(Vec2 input_dir, float dt) {
   // Update the player's velocity
   vel_player_ += input_dir * kPlayerInputAccel * dt;
   
   // Clamp the velocity to a maximum magnitude
   float speed = Norm(vel_player_);
   if (speed > kPlayerMaxSpeed) {
       vel_player_ *= kPlayerMaxSpeed / norm;
   }

   // Update the player's position
   Vec2 pos_delta = vel_player_ * dt;
   pos_player_ += pos_delta;

   // Apply air friction
   vel_player_ *= kFrictionAir;
}

Alright, now we’re zooming around:

So how do we do the colliding part?

Naive Collision

The way I first approached player collision, way back in the day, was to say “hey, we have a top-down 2D tile-based game, so our collision objects are all squares, and we can just check for collision against squares”. A reasonable place to start.

Okay, so we have a rectangle for our player and consistently-sized squares for all of our static geometry. That can’t be too hard, right?

Checking for collisions between axis-aligned rectangles isn’t all that hard. You can use something like the separating axis theorem. A quick Google search gives you code to work with.

Our strategy now is to calculate the new player position, check it against our collision geometry for intersections, and to not move our player if that fails. If we do that, we find that we need (1) to zero out the player’s speed on collision, and (2) we still actually want to move the player, except we want to move them up against the colliding body:

Doing this part is harder. We can do it, but it takes some math. In the case of an axis-aligned rectangle, if we know which side our collision point is closest to, we can just shift our new position \(P’\) out of the closest face by half the player-rect’s width in that axis. The code looks something like:

void UpdatePlayer(Vec2 input_dir, float dt) {
   // Update the player's velocity, clamp, etc.
   ... same as before

   // Update the player's position
   Vec2 pos_delta = vel_player_ * dt;
   pos_player_ += pos_player_ + pos_delta;
   Box player_box_next = Box(pos_next, player_height, player_width);
   for (const Box& box : collision_objects_) {
      auto collision_info = CheckCollision(box, player_box_next)
      if (collision_info.collided) {
         // Collision!
         vel_player_ = Vec2(0,0);
         pos_player_ = box.center;
         if (collision_info.nearest_side == NORTH) {
            pos_player_ += Vec2(0, box.height/2 + player_height/2);
         } else if (collision_info.nearest_side == EAST) {
            pos_player_ += Vec2(box.width/2 + player_width/2, 0);
         } else if (collision_info.nearest_side == SOUTH) {
            pos_player_ -= Vec2(0, box.height/2 + player_height/2);
         } else if (collision_info.nearest_side == WEST) {
            pos_player_ -= Vec2(box.width/2 + player_width/2, 0);
         }
      }
   }

   // Apply air friction
   vel_player_ *= kFrictionAir;
}

So this works, but has some issues. First off, we’re iterating over all boxes. That’s a big inefficiency right off the bat, especially on a large level with a lot of filled tiles. That particular problem is pretty easy to avoid, typically by using a more efficient data structure for keeping track of our collision geometry, like a quad tree for general collision objects and searching around a simple tile-based index for something like Mystic Dave.

Second, we project out of the nearest face. This can lead to all sorts of glitches. If we start on one side of the box, but move such that our closest face is a different one, we get shunted out the wrong side:

This effect is particularly problematic when we have multiple collision bodies. The code above processes them all in series, which could be totally wrong. We want to collide with the first thing we’d hit first.

Alternatively we could use a while loop, and keep on checking for collisions after we get shunted around by our collision logic, but this could end up causing weird infinite loops:

After we process any collisions, we set the player’s speed to zero. This causes the player to just stop. There is no nice sliding along the surface, let alone nicely rolling around corners:

Thinking in Points

Okay, maybe we can do this differently. Let’s take a step back and refine our problem.

We have a point \(P\), which we are moving a small step to some \(P’\). We assume \(P\) is valid (the player is not colliding with anything).

We don’t want to move to \(P’\) if it collides with anything. Furthermore, we don’t want to move to \(P’\) if we collide with anything on the way there:

So far we’ve been talking about points (\(P\) and \(P’\)), but what complicates things are the bounding boxes around them. Can we do something to remove the boxes from the equation?

Let’s look at the set of points where \(P’\) collides with our collision volume:

We can change our strategy to simply test \(P’\), a single point, against this larger collision volume. By using it, we remove the complexity of the player’s bounding rectangle, and instead get to work directly with point-in-shape tests.

When checking for a collision, we just have to check whether the ray segment \(P\rightarrow P’\) intersects with the extended volumes, and if it does, only move the player up to the earliest intersection.

If we have multiple collision bodies, this approach will ensure that we stop at the right location:

So we are now working with extended volumes. These are obtained by using the Minkowski sum of a collision body with the player’s collision volume. The idea is quite general, and works for arbitrary polygons:

Computing the Minkowski sum of two polygons is actually pretty straightforward:

// Compute the Minkowski sum of two convex polygons P and Q, whose
// vertices are assumed to be in CCW order.
vector<Vec2i> MinkowskiSum(const vector<Vec2i>& P, const vector<Vec2i>& Q) {
    // Vertices (a,b) are a < b if a.y < b.y || (a.y == b.y && a.x < b.x)
    size_t i_lowest = GetLowestVertexIndex(P);
    size_t j_lowest = GetLowestVertexIndex(Q);

    std::vector<Vec2i> retval;

    size_t i = 0, j = 0;
    while (i < P.size() || j < Q.size()) {
        // Get p and q via cyclic coordinates
        // This uses mod to allow to indices to wrap around
        const Vec2i& p = GetCyclicVertex(P, i_lowest + i);
        const Vec2i& q = GetCyclicVertex(Q, j_lowest + j);
        const Vec2i& p_next = GetCyclicVertex(P, i_lowest + i + 1);
        const Vec2i& q_next = GetCyclicVertex(Q, j_lowest + j + 1);

        // Append p + q
        retval.push_back(p + q);

        // Use the cross product to gage polar angle.
        // This implementation works on integers to avoid floating point
        // precision issues. I just convert floats say, in meters, to
        // integers by multiplying by 1000 first so we maintain 3 digits
        // of precision.
        long long cross = Cross(p_next - p, q_next - q);

        if (cross >= 0 && i < P.size()) {
            i++;
        }
        if (cross <= 0 && j < Q.size()) {
            j++;
        }
    }

    return retval;
}

This post isn’t really about how the Minkowski sum works, so I won’t go too deep into it, but you basically store your polygon vertices in a counter clockwise order, start with a vertex on each polygon that is furthest in a particular direction, and then rotate that furthest direction around in a circle, always moving an index (or both if the directions match) to keep your indices matching that furthest direction.

Sliding and Rolling around Corners

Before we get too deep, let’s quickly cover how to slide along the collision face. Our code above was zeroing out the player’s velocity every time we hit a face. What we want to do instead is zero out the velocity into the face. That is, remove the component of the velocity that points into the face.

If we know the face unit normal vector \(u\), then we just project our velocity \(v\) into the direction perpendicular to \(u\).

Because running into things slows us down, we typically then multiply our resulting speed vector by a friction coefficient to mimic sliding friction.

// Lose all velocity into the face.
// This results in the vector along the face.
player_vel_ = VectorProjection(player_vel_, v_face);
player_vel_ *= player_friction_slide_;  // sliding friction

To roll around edges we can just trade out our rectangular player collision volume for something rounder. Ideally we’d use a circle, but a circle is not a polygon, so the Minkowski sum would result in a more awkward shape with circular segments. We can do that, but its hard.

We can instead approximate a circle with a polygon. I’m just going to use a diamond for now, which will give us some rolling-like behavior around box corners.

Thinking in Triangles

Now that we took that brief aside, let’s think again about what this Minkowski sum stuff means for the code we prototyped. We could maintain a list of these inflated collision objects and check our points against them every time step. That’s a bit better before, because we’re primarily doing point-in-polygon checks rather than polygon-polygon intersection, but it is still rather complicated, and we want to avoid having to repeatedly iterate over all objects in our level.

I had a big aha! moment that motivated this entire project, which elegantly avoids this list-of-objects problem and ties in nicely with a previous blog post.

What if we build a triangulation of our level that has edges everywhere the inflated collision polygons have edges, and we then mark all triangles inside said polygons as solid. Then all we have to do is keep track of the player’s position and which triangle they are currently in. Any time the player moves, we just check to see if they pass across any of the faces of their current triangle, and if they cross into a solid triangle, we instead move them up to the face.

TLDR: 2D player movement on a constrained Delaunay mesh with solid cells for Minkowksi-summed collision volumes.

To illustrate, here is the Mystic Dave screenshot with some inflated collision volumes:

And here is a constrained triangulation on that, with solid triangles filled in:

And here it all is in action:

We see the static collision geometry in red, the player’s collision volume in blue, and the inflated collision volumes in yellow. We render the constrained Delaunay triangulation, which has edges at the sides of all of the inflated collision volumes. For player movement we simply keep track of the player’s position, the player’s speed, and which triangle they are in (highlighted).

We don’t have to limit ourselves to tile-centered boxes for this. We don’t even have to limit ourselves to tile-aligned boxes:

Player movement now has to track both the player’s position and the triangle that the player is in. We can find the enclosing triangle during initialization, and can use the same techniques covered the Delaunay Triangulation post. I’m going to repeat that section here, since it is so relevant.

Suppose we start with some random face and extract its vertices A, B, and C. We can tell if a point P is inside ABC by checking whether P is to the left of the ray AB, to the left of the ray BC, and to the left of the ray CA:

Checking whether P is left of a ray AB is the same as asking whether the triangle ABP has a counter-clockwise ordering. Conveniently, the winding of ABP is given by the determinant of a matrix:

\[\det\left(\begin{bmatrix}A_x & A_y & 1 \\ B_x & B_y & 1 \\ P_x & P_y & 1\end{bmatrix}\right)\]

If this value is positive, then we have right-handed (counter-clockwise) winding. If it is negative, then we have left-handed (clockwise) winding. If it is zero, then our points are colinear.

If our current triangle has positive windings then it encloses our point. If any winding is negative, the we know we can cross over that edge to get closer to our point.

Here triangle ABP and BCP have positive windings, but triangle CAP has a negative winding. As such, we need to cross the blue edge to get closer to P.

We simply iterate in this manner until we find our enclosing triangle. 

Every tick, when we update the player’s position, we run a similar procedure. We similarly are interested in whether the player’s new position has crossed any triangle edges.The basic concept is that, if the new position enters a new triangle, and the new triangle is solid, we need to prevent movement across the edge. If the new position enters a new triangle, and the new triangle is not solid, we can simply move the player and update the enclosing triangle.

Below we see the simplest case, where the move from P to P’ leads to a new triangle:

If we check the winding of CAP’ and find that it is negative, then we know that we have crossed AC.

There are a few corner cases that make simply accepting the new triangle the wrong thing to do. First off, a single move could traverse multiple triangles:

We need to make sure we iterate, and continue to check for edge transitions.

Second, we may cross multiple edges in the first triangle:

In the above example, the new point P’ is across both the line through BC and the line through CA. The order in which we choose to process these traversals matters, because some intermediate triangles may be solid.

The correct traversal order is to process them in the same order that the ray P -> P’ crosses them. In this case, we first cross CA:

Every time we cross a segment, we find the intersection with the edge and move our player up to that edge. If the new triangle is not solid, we can update our current triangle and iterate again, now with a segment from P” to P’.

If the new triangle is solid, we don’t update our triangle and we process a collision on the colliding face. This means removing all velocity into the face (perpendicular to it) and adding sliding friction along the face (parallel to it). We continue to iterate again, now with a reduced step in our new direction along the edge:

The code structure is thus a while loop that, in each iteration, checks for traversal across all three edges of the enclosing triangle, and retains the earliest collision (if any). If there is a collision, we move the player up to the crossed edge (P”) and potentially handle collision physics or update the enclosing triangle, and iterate again. If there is no collision, then we stay inside our current triangle and are done.

Finding the intersection point for a crossed segment (suppose it is AB) can be done as follows:

\[\begin{aligned}V & = B – A \\ W &= P – A \\ \alpha &= \frac{V \times W }{P’ \times V} \\ P^{\prime\prime} & = P + \alpha (P’ – P) \end{aligned}\]

The interpolant \(\alpha \in [0,1]\) tells us how far we’ve moved toward P’, so is a great way to track which edge we cross first. We first cross the edge with the lowest \(\alpha\).

The code for this is thus:

void UpdatePlayer(Vec2 input_dir, float dt) {
   // Update the player's velocity, clamp, etc.
   ... same as before

   // Update the player's position
   float dt_remaining = dt;
   while (dt_remaining > 0.0f) {
        Vec2f player_pos_delta = player_vel_ * dt_remaining;
        Vec2f player_pos_next = player_pos_ + player_pos_delta;

        // Exit if the step is too small, to avoid math problems.
        if (Norm(player_pos_delta) < 1e-4) {
            break;
        }

        // Check to see if we cross over any mesh edges
        QuarterEdge* qe_ab = qe_enclosing_triangle_->rot;
        QuarterEdge* qe_bc = qe_enclosing_triangle_->next->rot;
        QuarterEdge* qe_ca = qe_enclosing_triangle_->next->next->rot;

        // Get the vertices
        const auto& a = *(qe_ab->vertex);
        const auto& b = *(qe_bc->vertex);
        const auto& c = *(qe_ca->vertex);

        // The fraction of the distance we will move
        float interp = 1.0f; // (alpha)
        QuarterEdge* qe_new_triangle = nullptr;
        Vec2f v_face;

        if (GetRightHandedness(a, b, player_pos_next) < -1e-4) {
            // We would cross AB
            Vec2f v = b - a;
            Vec2f w = player_pos_ - a;
            float interp_ab = Cross(v, w) / Cross(player_pos_delta, v);
            if (interp_ab < interp) {
                interp = interp_ab;
                qe_new_triangle = qe_ab->rot;
                v_face = v;
            }
        }
        if (GetRightHandedness(b, c, player_pos_next) < -1e-4) {
            // We would cross BC
            Vec2f v = c - b;
            Vec2f w = player_pos_ - b;
            float interp_bc = Cross(v, w) / Cross(player_pos_delta, v);
            if (interp_bc < interp) {
                interp = interp_bc;
                qe_new_triangle = qe_bc->rot;
                v_face = v;
            }
        }
        if (GetRightHandedness(c, a, player_pos_next) < -1e-4) {
            // We would cross CA
            Vec2f v = a - c;
            Vec2f w = player_pos_ - c;
            float interp_ca = Cross(v, w) / Cross(player_pos_delta, v);
            if (interp_ca < interp) {
                interp = interp_ca;
                qe_new_triangle = qe_ca->rot;
                v_face = v;
            }
        }

        // Move the player
        // If we would have crossed any triangle, this merely moves
        // the player up to the edge instead
        player_pos_ += player_pos_delta * interp;
        dt_remaining *= (1.0f - interp);
        dt_remaining -= 1e-5;  // eps factor for numeric safety

        if (qe_new_triangle != nullptr) {
            // We would have crossed into a new triangle
            if (is_solid(qe_new_triangle)) {
                // The new triangle is solid, so do not change triangles
                // Process the collision
                player_vel_ = VectorProjection(player_vel_, v_face);
                player_vel_ *= player_friction_slide_; // sliding friction
            } else {
                // Accept the new triangle
                qe_enclosing_triangle_ = qe_new_triangle;
            }
        }
   }

   // Apply air friction
   vel_player_ *= kFrictionAir;
}

This code block isn’t drastically more complicated than what we had before. We get a lot for it – efficient collision physics against general static objects. There is a bit of repeated code, so you could definitely simplify this even further.

Conclusion

This post revisited the problem of 2D player collision against static geometry. We covered some basics of player movement, including integrating velocity over time, basic collision physics, and basic collision approaches based on geometric primitives. We used the Minkowski sum to inflate collision polygons, thereby changing polygon vs. polygon intersections into cheaper point vs. polygon intersections. We concluded by leveraging constrained Delaunay triangulations for our environment instead, which simplified the collision detection logic down to checking for when we cross enclosing triangle line segments.

Please note that I am not the first to come up with this. It turns out I independently discovered Nav Meshes. I since learned that they are used in motion planning for robotics and in path planning for games (notably, Starcraft II). I assume they are also used for collision physics in games, but I haven’t been able to verify that yet.

I implemented this demo in ROS 2 using C++ 20. This was a deviation from my previous posts, which are mostly created in Julia notebooks. This project had more game-like interactivity to it than I thought a Julia notebook could easily handle (I could be wrong) and I simply felt like I could get up and running faster with ROS 2.

The most complicated part to implement in this whole process turns out to be polygon unions. In order to get the constrained Delaunay triangulation to work, you want to union any overlapping collision polygons together. Polygon unions (and other Boolean operations between polygons, like clipping) is quite complicated. I tried to write my own, but ended up using Clipper2.

I hope you found this process interesting! I was a lot of fun to revisit.

Cart Pole Actor Critic

In the previous two posts, we used various control strategies to solve the classic cart-pole problem. In the first, we used a classic LQR controller to stabilize the cart pole in the vicinity of its equilibrium point. In the second, we built a system that could reach stability from anywhere, by forming a hierarchical problem using LQR controllers around various reference states and a top-level search strategy to find the right sequence of controls to reach stability.

All of this was well and good, but what if we change our problem even more drastically?

So far, everything we have worked with has had intimate knowledge of the underlying dynamics. The LQR approach only works because we know the equations of motion, and can take the necessary derivatives to form a linear system. What if we didn’t have those equations of motion? In many real-world cases, we simply don’t. Now we’re going to pretended we don’t know the equations of motion, and will learn a good controller anyways.

From Controls to Reinforcement Learning

We are going to take a journey, leaving the land of controls behind, and head to the realm of reinforcement learning. This land is at first glance completely foreign, but once you get your bearings you begin to realize their strange customs are really just alternate manifestations of things we’re already used to.

Previously, in the land of controls, we derived controllers that produced control inputs to drive the system’s state to equilibrium while minimizing cost. The people of the realm of reinforcement learning do the same thing, except they find policies \(\pi(s)\) that produce actions to drive the system’s state to equilibrium while maximizing reward.

Now, unlike in the land of controls, the people of the realm of reinforcement learning are very used to working on systems where they do not know the equations of motion. They know that a transition function \(T(s’\mid s,a)\) and a reward function \(R(s,a)\) exist, but do not necessarily know what it is. Instead, they allow their agent to interact with its environment, and thereby gain information about these from the state transitions and rewards observed.

In this post I am going to use an actor-critic method. These are covered in chapter 13 of Alg4DM. The “actor” part refers to the fact that we learn a policy that our agent then follows. The “critic” part refers to the fact that we also learn a value function that critiques the agent’s policy. Both are trained in parallel, the critic becoming better at maximizing the critic’s reward, and the critic getting better at reflecting the true value of the actor’s behavior.

Problem Definition

We always start by defining the actual problem we are solving. This is super important. The problem that we solve is always a model of the ground truth. As George Box famously said, “All models are wrong, some models are useful.” The real world doesn’t care that modeling actuator saturation is harder – it will simply do what it always does.

To shake things up a tad, we are going to use cart pole with discrete actions, as it was originally presented. Note that in the land of controls we worked with continuous actions. (We could totally do that here – its just nice to do something different).

Now, the real robot likely can exert continuous forces. We are making a modeling choice, and have to live with the consequences. A series of discrete force outputs will naturally be more jerky / stuttery than a nice smooth output. Maybe we’ll want to tidy that up later, perhaps with a low-pass filter. For now, we’ll go with this.

As such, our problem definition is:

  • State Space \(\mathcal{S}\): The same classic cart-pole states consisting of the lateral position \(x\), the lateral velocity \(v\), the pole angle \(\theta\), and the pole’s angular rate \(\omega\).
  • Action Space \(\mathcal{A}\): We assume the cart has a maximum actuation force of \(F\). The available actions at each time step (same time step of 50Hz as before) are \(F\), \(0\), and \(-F\). As such, we have three discrete actions.
  • Transition Function \(T(s’\mid s, a)\): We use the deterministic cart-pole transition function under the hood, with the same transition dynamics that we have already been using.
  • Reward Function \(R(s,a)\): The original paper just gives a reward of 1 any time the cart is in-bounds in \(x\) and \(\theta\). We are instead going to use the reward function formulation that penalizes deviations from stability:
    \[R(s,a) = \frac{1}{100}r_x + r_\theta = \frac{1}{100}(\Delta x_\text{max}^2 – x^2) + (1 + \cos(\theta))\]
    We continue to give zero reward if we are out of bounds in \(x\), but not in \(\theta\).

Note that our learning system does not know the details of the transition and reward function. Note also that our reward function is non-negative, which isn’t necessary, but can make things a bit more intuitive.

Vanilla Actor Critic

The core ideas of actor-critic are pretty straightforward. We are simultaneously learning a policy and a value function.

Our policy, (which is stochastic to make the math easier), is a function that produces a categorical distribution over our three actions. We’re going to use a simply deep neural net for this, one with three fully connected layers:

Notice the softmax activation function on the final layer, which ensures that our output is a valid probability distribution.

There are multiple actor-critic approaches with different value functions. In this case I’ll learn a state-value function \(U(s)\), which predicts the expected reward from the given state (the discounted sum of all future rewards). I’m also going to use a simple neural network for this:

The final layer has no activation function, so the out put can be any real number.

These can both readily be defined using Flux.jl:

get_stochastic_policy(n_hidden::Int=32) = Chain(Dense(4 => n_hidden,relu), Dense(n_hidden => n_hidden,relu), Dense(n_hidden => 3), softmax)
get_value_function(n_hidden::Int=32) = Chain(Dense(4 => n_hidden,relu), Dense(n_hidden => 1))

In every iteration we run a bunch of rollouts and:

  • compute a policy gradient to get it to maximize the value functions
  • compute a value gradient to get our value function to reflect the true expected value

Alg4DM presents this in one nicely compact pseudocode block:

This formulation uses the likelihood ration approach to computing the policy gradient, which for a stochastic policy is:

\[\nabla U(\theta) = \underset{\tau}{\mathbb{E}}\left[\sum_{k=1}^d \gamma^{k-1} \nabla_\theta \log \pi_\theta\left(a^{(k)}\mid s^{(k)}\right) A_\theta\left(s^{(k)},a^{(k)}\right)\right]\]

where we estimate the advantage using our critic value function:

\[A_\theta(s,a) = \underset{r, s’}{\mathbb{E}}\left[ r + \gamma U_\phi(s’) – U_\phi(s)\right]\]

For the critic, we simply minimize the difference between the predicted expected value \(U_\phi(s)\) and the expected value obtained by running the current policy \(U^{\pi_\theta}(s)\):

\[\ell(\phi) = \frac{1}{2}\underset{s}{\mathbb{E}}\left[\left(U_\phi(s) – U^{\pi_\theta}(s)\right)^2\right]\]

The critic gradient is thus:

\[\nabla \ell(\phi) = \underset{s}{\mathbb{E}}\left[ \sum_{k=1}^d \left( U_\phi(s^{(k)}) – r_\text{to-go}^{(k)} \right) \nabla_\phi U_\phi(s^{(k)}) \right]\]

While the math may look a bit tricky, the code block above shows how straightforward it is to implement. You just simulate a bunch of rollouts (\(\tau\)), and then calculate the gradients.

The code I wrote is functionally identical to the code in the textbook, but is laid out a bit differently to help with legibility (and later expansion). Notice that I compute a few extra things so we can plot them later:

function run_epoch(
    M::ActorCritic,
    π::Chain, # policy π(s)
    U::Chain,  # value function U(s)
    mem::ActorCriticMemory,
    )
    t_start = time()
    𝒫, d_max, m = M.𝒫, M.d, M.m
    data = ActorCriticData()
    params_π, params_U = Flux.params(π), Flux.params(U)
    ∇π, ∇U, vs, Us = mem.∇π, mem.∇U, mem.vs, mem.Us
    states, actions rewards = mem.states, mem.actions, mem.rewards
    a_selected = [0.0,0.0,0.0]
    # Zero out the gradients
    for grad in ∇π
        fill!(grad, 0.0)
    end
    for grad in ∇U
        fill!(grad, 0.0)
    end
    # Run m rollouts and accumulate the gradients.
    for i in 1:m
        # Run a rollout, storing the states, actions, and rewards.
        # Keep track of max depth reached.
        max_depth_reached = rollout(M, π, U, mem)
        fit!(data.max_depth_reached, max_depth_reached)
        # Compute the actor and critic gradients
        r_togo = 0.0
        for d in max_depth_reached:-1:1
            v = vs[d]
            s = states[d]
            a = actions[d]
            r = rewards[d]
            s′ = states[d+1]
            a_selected[1] = 0.0
            a_selected[2] = 0.0
            a_selected[3] = 0.0
            a_selected[a] = 1.0
            r_togo = r + γ*r_togo
            A = r_togo - Us[d]
            fit!(data.advantage, A)
            ∇logπ = gradient(() -> A*γ^(d-1)*log(π(v) ⋅ a_selected), params_π)
            ∇U_huber = gradient(() -> Flux.Losses.huber_loss(U(v), r_togo), params_U)
            for (i_param,param) in enumerate(params_π)
                ∇π[i_param] -= ∇logπ[param]
            end
            for (i_param,param) in enumerate(params_U)
                ∇U[i_param] += ∇U_huber[param]
            end
        end
    end
    # Divide to get the mean
    for grad in ∇π
        grad ./= m
    end
    for grad in ∇U
        grad ./= m
    end
    data.∇_norm_π = sqrt(sum(sum(v^2 for v in grad) for grad in ∇π))
    data.∇_norm_U = sqrt(sum(sum(v^2 for v in grad) for grad in ∇U))
    data.w_norm_π = sqrt(sum(sum(w^2 for w in param) for param in params_π))
    data.w_norm_U = sqrt(sum(sum(w^2 for w in param) for param in params_U))
    data.t_elapsed = time() - t_start
    return (∇π, ∇U, data)
end

Theoretically, that is all you need to get things to work. Each iteration you run a bunch of rollouts, accumulate gradients, and then use these gradients to update both the policy and the value function parameterizations. You keep iterating until the parameterizations converge.

With such an implementation, we get the striking result of….

As we can see, the policy isn’t doing too well. What did we do wrong?

Theoretically, nothing wrong, but practically, we’re missing some bells and whistles that make a massive difference. These sorts of best practices are incredibly important in reinforcement learning (and generally in deep learning).

Decomposing our Angles

The first change we are going to make is to the inputs to our deep neural net. The current inputs use the raw angle, \(\theta\). This may work well in the basic cart-pole problem where our angle is bounded within a small range, but in a more general cart-pole setting where the pole can rotate multiple times, we end up with weird angular effects that are difficult for a neural network to learn how to handle:

In order to better handle the angle, we are going to instead input \(\cos(\theta)\) and \(\sin(\theta)\):

Wrangling the Loss Functions

One of the biggest issues we have is with the reward function. The expected value at a given state can be quite large. A single state can \(R(s,a)\) produce values as large as 4.8^2/100 + 2 = 2.23, but these then get accumulated over 2000 time steps with a discount factor of \(\gamma = 0.995\), resulting in values as large as \(U(s) = \sum_{k=1}^2000 2.23 \gamma^{k-1} \approx 446\). Now, 446 isn’t massive, but it is still somewhat large. Pretty much everything deep-learning likes to work close to zero.

We can plot our gradient norm during training to see how this large reward can cause an issue. The magnitude of the value function gradient is massive! This leads to large instabilities, particularly late in training:

The first thing we will do is dampen out the loss signal for large deviations in reward. Right now, if we predict, say, \(U(s) = 50\) but the real reward is \(U(s) = 400\), we’d end up with a squared loss value of \(350^2 = 122500\). That is somewhat unwieldy.

We want to use a squared loss, for its simplicity and curvature, but only if we are pretty close to the correct value. If we are far off the mark, we’d rather use something gentler.

So we’ll turn to the Huber loss, which uses the squared loss for nearly accurate values and a linear loss for sufficiently inaccurate values:

This helps alleviate the large reward problem, but we can do better. The next thing we will do is normalize our expected values. This step adjusts the expected values to have mean zero and a unit standard deviation (i.e., follow a normal distribution):

\[\bar{u} = (u – \mu) / \sigma\]

Right now, our average advantage starts off very large (~150), with a large standard deviation (~50), and over time settles to around mean 0 and stdev 30:

To normalize, we need to keep a running estimate of the value function mean \(\mu\) and standard deviation \(\sigma\). OnlineStats.jl makes this easy. One important thing to note, however, is that we compute the reward-to-go for multiple depths, and the estimated values are going to be different depending on the depth. A reward-to-go from \(k=1\) with a max depth of 2000 is going to be different than a reward-to-go from \(k=1999\). As such, we actually maintain separate mean and standard deviation estimates for each depth.

The last thing we do is clamping, both to the standardized reward-to-go:

r_togo = r + γ*r_togo
r_togo_standardized = (r_togo - r_togo_μs[d]) / (r_togo_σs[d] + 1e-3)
r_togo_standardized = clamp(r_togo_standardized, -max_standardized_r_togo, max_standardized_r_togo)
fit!(r_togo_est[d], r_togo)

and to the gradient:

for grad in ∇π
    grad ./= m
    clamp!(grad, -max_gradient, max_gradient)
end
for grad in ∇U
    grad ./= m
    clamp!(grad, -max_gradient, max_gradient)
end

Together, these changes allow the value function to predict values with zero-mean and unit variance, which is a lot easier to learn than widely-ranging large values. This is especially important here because the true value function values depend on the policy, which is changing. A normalized reward will consistently be positive for “good” states and negative for “bad” states, whereas a non-normalized will be some large value for “good” states and some not as large value for “bad” states. Normalization thus stabilizes learning.

Regularization

The final change to include is regularization. I am not certain that this makes a huge difference here, but it often does when working with deep learning projects.

Directly minimizing the critic loss and maximizing the expected policy reward can often be achieved with multiple parameterizations. These parameterizations may be on different scales. As we’ve seen before, we don’t like working with large (in an absolute sense) numeric values. Regularization adds a small penalty for large neural network weights, thereby breaking ties and causing the process to settle on a smaller parameterization.

The math is simple if we try to minimize the squared size our parameters:

\[\ell_\text{regularization}(\theta) = \frac{1}{2} \theta^\top \theta\]

Here, the gradient is super simple:

\[\nabla \ell_\text{regularization}(\theta) = \theta\]

We can thus simply add in our parameter values, scaled by a regularization weight:

for (param, grad) in zip(params_π, ∇π)
    grad += M.w_reg_π * param  # regularization term
    Flux.update!(opt_π, param, grad)
end
for (param, grad) in zip(params_U, ∇U)
    grad += M.w_reg_U * param
    Flux.update!(opt_U, param, grad)
end

Putting it all Together

With all of these changes in place, learning is much better. We are able to get Little Tippy the robot to stabilize:

The performance isn’t perfect, because we are, after all, working without explicit knowledge of the system dynamics, and are outputting discrete forces. Nevertheless, this approach was able to figure it out.

The policy was trained for 500 epochs, each with 128 rollouts to a maximum depth of 1000. A rollout terminates if the cart-pole gets further than 4.8m from the origin. The policy pretty quickly figures out how to stay within bounds: (plus or minus one stdev are also shown):

For contrast, here is the same plot with the vanilla setup:

The actual performance is better measured by the mean reward-to-go (i.e., expected value) that the system maintains over time. We can see that we have more-or-less converged:

With regularization, our advantage is nicely kept roughly standardized:

Conclusion

Actor-critic methods are nice for many more general problems, because we often do not know real-world dynamics. Neural networks can be very good function approximations, adapting to the problem at hand (whether that be Go, or an aircraft controller). That being said, actor-critic methods alone are not a panacea, and still require special knowledge and insight. In this post alone, we covered the following decision points that we as a designer made:

  • Whether to use discrete actions
  • The cart-pole reward function
  • Which policy gradient method to use
  • Which critic loss function to minimize
  • Maximum rollout depth
  • Number of rollouts per epoch
  • Number of epochs / training termination condition
  • Policy neural network architecture
  • Critic neural network architecture
  • Neural network weight initialization method
  • Whether to use gradient clamping
  • Using sine and cosine rather than raw angles
  • Standardizing the expected value (reward-to-go)
  • Which optimizer to use (I used Adam).

Being a good engineer often requires being aware of so much more than the vanilla concepts. It is especially important to understand when one is making a decision, and what impact that design has.

Thank you for reading!

The code for this post is available here. A notebook is here.