Tuning a Sokoban Policy Net

In a previous post, I had covered a project in which I was trying to get a neural network to learn to play Sokoban. I trained a transformer that operated on encoded board positions and predicted both the next action and how many steps were left until a solution would be reached:

My training was all done with Flux.jl and my own transformer implementation, not because it was more efficient or anything, but because I like to learn by doing it myself.

This training all happened on my little personal laptop. I love my laptop, but it is not particularly beefy, and it does not have much of a GPU to speak of. Training a fairly simple transformer was taking a long time — around 8 hours — which is pretty long. That doesn’t leave much room for experimentation. I knew that all this training was happening on the CPU, and the best way to make it faster would be to move to the GPU.

Flux making moving to the GPU incredibly easy:

using CUDA
policy = TransformerPolicy(
    batch_size = BATCH_SIZE,
    max_seq_len = MAX_SEQ_LEN,
    encoding_dim = ENCODING_DIM,
    dropout_prob=dropout_prob,
    no_past_info=no_past_info) |> gpu

That’s it – you just use the CUDA package and pipe your model to the GPU.

I tried this, and… it didn’t work. Unfortunately my little laptop does not have a CUDA-capable GPU.

After going through a saga of trying to get access to GPUs in AWS, I put the project to the side. There is sat, waiting for me to eventually pick it back up again whenever I ultimately decided to get a machine with a proper GPU or try to wrangle AWS again.

Then, one fateful day, I happened to be doing some cleaning and noticed an older laptop that I no longer used. Said laptop is bigger, and, just perhaps it had a CUDA-capable GPU. I booted it up, and lo and behold, it did. Not a particularly fancy graphics card, but CUDA-capable nonetheless.

This post is about how I used said GPU to train my Sokoban models, and how I then set up a task system in order to run a variety of parameterizations.

Model Training with a GPU

I switched my training code to move as much stuff as possible over to the GPU. After some elbow grease, I kicked off the same training run before and found that it ran in about 15 min – a 32x speed increase.

The next thing I tried was increasing the encoding from a length of 16 to 32. On the CPU, such an increase would at least double the training time. On the GPU, the training time remained the same.

How could that be?

Simply put, the GPU is really fast at crunching numbers, and the training runtime is dominated by using the CPU to unpack our training examples, sending data to and from the GPU, and running the gradient update on the CPU. In this case there seems to be a free lunch!

Here is a simple depiction of what was happening before (top timeline) vs. what is happening now:

We pay the upfront cost of sending the model to the GPU, but then can more efficiently shovel data into it to get results faster than computing them ourselves. There is nothing magical happening here, just literally passing data there, having it crunch it, and receiving it once its done.

Parameter Tuning

Now that we have faster training, it is nice to look at how various training and model parameters affect our metrics. Good parameters can easily make or break a deep learning project.

Our model parameters are:

  • board dimension – always \(8 \times 8\)
  • encoding dimension – the size of the transformer embedding
  • maximum sequence length – how long of a sequence the transformer can handle (always 64)
  • number of transformer layers

Our training parameters are:

  • learning rate – affects the size of the steps that our AdamW optimizer takes
  • AdamW 1st and 2nd momentum decay
  • AdamW weight decay
  • batch size – the number of training samples per optimization step
  • number of training batches – the number of optimization steps
  • number of training entries – the size of our training set
  • dropout probability – certain layers in the transformer have a chance of randomly dropping outputs for robustness purposes
  • gradient threshold – the gradient is clipped to this value to improve stability

That’s a lot of parameters. How are we going to go about figuring out the best settings?

The tried and true method that happens in practice is try-and-see, where humans just try things they think are reasonable and see how the model performs. That’s what I was doing originally, when each training run took 8 hours. While that’s okay, it’d be nice to do better.

The next simplest approach is grid search. Here we discretize all parameters and train a model on every possible parameter combination. In 2 dimensions this ends up looking like a grid, hence the name:

We have about 10 parameters. Even if we only consider 2 values for each of them, doing a grid search over them all would require training \(2^{10} = 1024\) models, which at 15 min per model is ~10.7 days. That’s both too long and pretty useless – we want higher granularity than that.

With grid search out, the next alternative is to conduct local searches for specific parameters. We already have a training parameterization that works pretty well, the one from the last blog post, and we can vary a single parameter and see how that affects training. That’s much cheaper – just the cost of the number of training points we’d like to evaluate per parameter. If we want to evaluate 5 values per parameter, that’s just \(5 \cdot 10 = 50\) models, or ~12.5 hours of training time. I could kick that off and come back to look at it the next day.

What I just proposed is very similar to cyclic coordinate search or coordinate descent, which is an optimization approach that optimizes one input at a time. It is quite simple, and actually works quite well. In fact, Sebastian Thrun himself has expressed his enthusiasm for this method to me.

There’s a whole realm of more complicated sampling strategies that could be followed. I rather like uniform projection plans and quasi-random space-filling sets like the Halton sequence. They don’t take that much effort to set up and do their best to fill the search space with a limited number of samples.

The method I most leaned up was ultimately a mixture of coordinate search and random sampling. Random sampling is what Andrej Karpathy recommended in CS231n, because it lets you evaluate far more independent values for specific parameters than something like grid search:

Here we see about 1/3 as many samples as grid search covering way more unique values for param 1.

Okay, so we want a way to run random search and perhaps some other, more targeted search approaches. How would we go about doing that?

Tasks

In the next phase of my project I want to expand from just training a transformer policy to predict player up/down/left/right moves to more complicated models that may interact with this simpler model. I also want to use my models to discover solutions to random problems, and perhaps refine previously discovered solutions with better models. I thus don’t want to merely support training this one model, I want to be able to run more general tasks.

function run_tasks()

    done = false
    while !done
        
        task_file = get_next_task()
        if !isa(task_file, String)
            println("No tasks left!")
            done = true
            break
        end

        task_filepath = joinpath(FOLDER_TODO, task_file::String)
        res = run_task(task_filepath)
        dest_folder = res.succeeded ? FOLDER_DONE : FOLDER_TRIED

        # name the task according to the time
        task_dst_name = Dates.format(Dates.now(), "yyyymmdd_HHMMss") * ".task"
        mv(task_filepath, joinpath(dest_folder, task_dst_name))
        write_out_result(res, task_dst_name, dest_folder)
        println(res.succeeded ? "SUCCEEDED" : "FAILED")
    end
end

The task runner simply looks for its next task in the TODO folder, executes it, and when it is done either moves it to the DONE folder or the TRIED folder. It then writes out additional task text that it captured (which can contain errors that are useful for debugging failed tasks).

The task runner is its own Julia process, and it spawns a new Julia process for every task. This helps ensure that issues in a task don’t pollute other tasks. I don’t want an early segfault to waste an entire night’s worth of training time.

The task files are simply Julia files that I load and prepend with a common header:

function run_task(task_filepath::AbstractString)
    res = TaskResult(false, "", time(), NaN)

    try
        content = read(task_filepath, String)

        temp_file, temp_io = mktemp()
        write(temp_io, TASK_HEADER)
        write(temp_io, "\n")
        write(temp_io, content)
        close(temp_io)

        output = read(`julia -q $(temp_file)`, String)

        res.succeeded = true
        res.message = output
    catch e
        # We failed to
        res.succeeded = false
        res.message = string(e)
    end

    res.t_elapsed = time() - res.t_start

    return res
end

This setup is quite nice. I can drop new task files in and the task runner with just dutifully run them as soon as its done with whatever came before. I can inspect the TRIED folder for failed tasks and look at the output for what went wrong.

Results

I ran a bunch of training runs and then loaded and plotted the results to get some insight. Let’s take a look and see if we learn anything.

We’ve got a bunch of metrics, but I’m primarily concerned with the top-2 policy accuracy and the top-2 nsteps accuracy. Both of these measure how often the policy had the correct action (policy accuracy) or number of steps remaining (nsteps accuracy) in its top-2 most likely predictions. The bigger these numbers are the better, with optimal performance being 1.0.

First let’s look at the learning rate, the main parameter that everyone typically has to futz with. First, the top-2 policy accuracy:

We immediately see a clear division between training runs with terrible accuracies (50%) and training runs with reasonable performance. This tells us that some of our model training runs did pretty poorly. That’s good to know – the parameters very much matter and can totally derail training.

Let’s zoom in on the good results:

We don’t see a super-clear trend in learning rate. The best policy accuracy was obtained with a learning rate around 5e-4, but that one sample is somewhat of an outlier.

The nsteps accuracy also shows the bad models, so we’ll jump straight to the zoomed version:

Interestingly, the same learning rate of 5e-4 produces the best nsteps accuracy as well, which is nice for us. Also, the overall spread here tends to prefer larger learning rates, with values down at 1e-4 trending markedly worse.

Next let’s look at the dropout probability. Large enough values are sure to tank training performance, but when does that start happening?

We don’t really have great coverage on the upper end, but based on the samples here it seems that a dropout probability of about 0.01 (or 1%) performs best. The nsteps accuracy shows a similar result.

Next let’s look at the weight decay.

We’ve found our performance-tanking culprit! Weight decay values even moderately larger than zero appear to be highly correlated with terrible performance. It seems to drag the model down and prevent learning. Very small weight decay values appear to be fine, so we’ll have to be careful to just search those.

This is an important aspect of parameter tuning – parameters like the learning rate or weight decay can take on rather small values like 1e-4. Its often more about finding the right exponent rather than finding the right decimal value.

With those learning parameters out of the way, let’s look at some model parameters. First, the encoding dimension:

Naively we would expect that bigger encoding dimensions would be better given infinite training examples and compute, but we those are finite. We didn’t exhaustively evaluate larger encoding dimensions, but find that the nsteps prediction doesn’t benefit all that much from going from 32 to 64 entries, whereas the policy does.

We can also look at the number of transformer layers:

Having more layers means we have a bigger model, with more room to perform operations on our token embeddings as they pass through the model. Bigger is often better, but is ultimately constrained by our training data and compute.

In this case the nsteps predictor can achieve peak performance across a variety of depths, whereas the policy seems to favor larger layer counts (but can still do pretty well even with a single layer).

The next question we might ask is whether the mode size overall is predictive of performance. We can plot the total number of trainable parameters:

In terms of policy accuracy, we are seeing the best performance with the largest models, but the nsteps predictor doesn’t seem to need it as much. That is consistent with what we’ve already observed.

Let’s now identify the best model. How would we do that?

I’m going to consider the best model to be the one with the highest value of top-2 policy accuracy + top-2 nsteps accuracy. That’s the same as asking for the model most top-right in the following plot:

The two accuracies are highly correlated (which makes sense – its hard to predict how many steps it will take to reach the goal without also being a good Sokoban policy). The model that does the best has an encoding dim of 64 with 5 transformer layers, uses a learning rate of 0.0005, has weight decay = 0.0002, and a dropout probability of 0.01.

Conclusion

In this post I got excited about our ability to use the GPU to train our networks, and then I tried to capitalize on it by running a generic task runner. I did some sampling and collected a bunch of metrics in order to try to learn a thing or two about how best to parameterize my model and select the training hyperparameters.

Overall, I would say that the main takeaways are:

  • Its worth spending a little time to help the computer do more work for you
  • Its important to build insight into how the various parameters affect training

That’s all folks. Happy coding.

3D 2-Bone Inverse Kinematics

I am having fun figuring out how to write a 2D sidescroller from scratch, with 3D graphics. I’ve covered how posing and animating those skeletal meshes works. This post extends that work with inverse kinematics, which allows for the automatic placement of hands or feet at specific locations.

This wonderful low-poly knight comes from Davmid at SketchFab.

Some Recap

A character’s pose is defined by the transforms of their skeleton. A skeleton is made up of bones, where each bone is translated, offset, and scaled with respect to a parent bone.

The \(i\)th bone’s transform is given by:

\[\begin{aligned}\boldsymbol{T}_i &= \text{trans}(\boldsymbol{\Delta}_i) \cdot \text{rot}(\boldsymbol{q}_i) \cdot \text{scale}(\boldsymbol{s}) \\ &= \begin{bmatrix}1 & 0 & 0 & \Delta_x \\ 0 & 1 & 0 & \Delta_y \\ 0 & 0 & 1 & \Delta_z \\ 0 & 0 & 0 & 1\end{bmatrix} \cdot \begin{bmatrix} 1 – 2(q_y^2+q_z^2) & 2(q_x q_y – q_w q_z) & 2(q_x q_z + q_w q_y) & 0 \\ 2(q_x q_y + q_w q_z) & 1 – 2(q_x^2 + q_z^2) & 2(q_y q_z – q_w q_x) & 0 \\ 2(q_x q_z – q_w q_y) & 2(q_y q_z – q_w q_x) & 1 – 2(q_x^2 + q_y^2) & 0 \\ 0 & 0 & 0 & 1\end{bmatrix} \cdot \begin{bmatrix}s_x & 0 & 0 & 1 \\ 0 & s_y & 0 & 1 \\ 0 & 0 & s_z & 1 \\ 0 & 0 & 0 & 1\end{bmatrix}\end{aligned}\]

where \(\boldsymbol{\Delta}_i\) is its translation vector, \(\boldsymbol{q}_i\) is its orientation quaternion, and \(\boldsymbol{s}\) is its scaling vector. The transforms are \(4\times 4\) matrices because we’re using homogeneous coordinates such that we can support translations. A 3D position is recovered by dividing each of the first three entries by the fourth entry.

The resulting transform converts homogeneous locations in the bone’s local coordinate frame into the coordinate frame of its parent:

\[\boldsymbol{p}_{\text{parent}(i)} = \boldsymbol{T}_i \boldsymbol{p}_i\]

We can keep on applying these transforms all the way to the root bone to transform the bone into the root coordinate frame:

\[\boldsymbol{p}_{\text{root}} = \boldsymbol{T}_{\text{root}} \cdots \boldsymbol{T}_{\text{parent}(\text{parent}(i))} \boldsymbol{T}_{\text{parent}(i)} \boldsymbol{T}_i \boldsymbol{p}_i = \mathbb{T}_i \boldsymbol{p}_i\]

This aggregate transform \(\mathbb{T}_i\) ends up being very useful. If we have a mesh vertex \(\boldsymbol{v}\) defined in mesh space (with respect to the root bone), we can transform it to the \(i\)th bone’s frame via \(\mathbb{S}^{-1}_i \boldsymbol{v}\) according to the aggregate transform of the original skeleton pose and then transform it back into mesh space via \(\mathbb{T}_i \mathbb{S}^{-1}_i \boldsymbol{v}\). That gives us the vertex’s new location according to the current pose.

The Problem

We can define keyframes and interpolate between them to get some nice animations. However, the keyframes are defined in mesh space, and have no way to react to the world around them. If we have a run animation, the player’s foot will always be placed in the same location, regardless of the ground height. If the animation has the player’s arms extend to grab a bar, it can be very easy for the player’s position to be off and the hands not quite be where the edge is.

Here one foot is hovering in the air and the other is sticking into the ground.

The animation system we have thus far produces hand and foot positions with forward kinematics — the positions are based on the successive application of transforms based on bones in the skeleton hierarchy. We want to do the opposite — find the transforms needed such that an end effector like a hand or a foot ends up where we want it to be.

There are several common ways to formulate this problem. I am going to avoid heuristic approaches like cyclic coordinate descent and FABRIK. This method will be closed-form rather than iterative.

Instead of supporting any number of bones, I will stick to 2-bone systems. For example, placing a wrist by moving an upper an lower arm, or placing a foot by moving an upper and lower leg.

Finally, as is typical, I will only allow rotations. We don’t want our bones to stretch or displace in order to reach their targets.

Our inverse kinematic problem thus reasons about four 3D vectors:

  • a fixed hip position \(\boldsymbol{a}\)
  • an initial knee position \(\boldsymbol{b}\)
  • an initial foot position \(\boldsymbol{c}\)
  • a target foot position \(\boldsymbol{t}\)

Our job is to adjust the rotation transform of the hip and knee bones such that the foot ends up as close as possible to the target position.

The image makes it easy to forget that this is all happening in 3D space.

Working it Out

We are going to break this down into two parts. First we will find the final knee and foot positions \(\boldsymbol{b}’\) and \(\boldsymbol{c}’\). Then we’ll adjust the bone orientations to achieve those positions.

Finding the New Positions

There are three cases:

  • we can reach the target location and \(\boldsymbol{c}’ = \boldsymbol{t}\)
  • the bone segments aren’t long enough to reach the target location
  • the bone segments are too long to reach the target location

Let’s denote the segment lengths as \(\ell_{ab}\) and \(\ell_{bc}\), and additionally compute the distance between \(\boldsymbol{a}\) and \(\boldsymbol{t}\):

The target is too close if \(\ell_{at} < |\ell_{ab} – \ell_{bc}|\). When that happens, we extend the longer segment toward the target and the other segment away.

The target is too far if \(\ell_{at} > \ell_{ab} + \ell_{bc}\). When that happens, we extend the segments directly toward the target, to maximum extension.

The remaining, more interesting case is when the target can be reached exactly, and we know \(\boldsymbol{c}’ = \boldsymbol{t}\). In that case we still have to figure out where the knee \(\boldsymbol{b}’\) ends up.

The final knee position will lie on a circle that is the intersection of two spheres: the sphere of radius \(\ell_{ab}\) centered at \(\boldsymbol{a}\) and the sphere of radius \(\ell_{bc}\) centered at \(\boldsymbol{t}\):

We can calculate \(\theta\) from the law of cosines:

\[\cos \theta = \frac{\ell_{bc}^2 – \ell_{ab}^2 – \ell_{at}^2}{-2 \ell_{ab} \ell_{at}} \]

The radius of the circle that \(\boldsymbol{b}’\) lies on is:

\[r = \ell_{ab} \sin \theta = \ell_{ab} \sqrt{1 – \cos^2 \theta}\]

The center of the circle is then

\[\boldsymbol{m} = \boldsymbol{a} \> – \> \ell_{ab} \cos \theta \hat{\boldsymbol{n}}\]

where \(\hat{\boldsymbol{n}}\) is the unit vector from \(\boldsymbol{t}\) to \(\boldsymbol{a}\).

We can place the knee anywhere we want on this circle using some unit vector \(\hat{\boldsymbol{u}}\) perpendicular to \(\hat{\boldsymbol{n}}\):

\[\boldsymbol{b}’ = \boldsymbol{m} + r \hat{\boldsymbol{u}}\]

For the inverse kinematics to look nice, I want to move the knee as little as possible. As such, let’s try to keep \(\hat{\boldsymbol{u}}\) close to the current knee position:

\[\begin{aligned}\boldsymbol{v} &= \boldsymbol{b} \> – \boldsymbol{m} \\ \boldsymbol{u} &= \boldsymbol{v} \> – (\boldsymbol{v} \cdot \hat{\boldsymbol{n}}) \hat{\boldsymbol{n}} \\ \hat{\boldsymbol{u}} &= \text{normalize}(\boldsymbol{u}) \end{aligned}\]

We can implement this in code as follows:

f32 len_ab = glm::length(b - a);
f32 len_bc = glm::length(c - b);
f32 len_at = glm::length(t - a);

// Unit vector from t to a.
glm::vec3 n = (a - t) / len_at;
if (len_at < 1e-3f) {
    n = glm::vec3(1.0f, 0.0f, 0.0f);
}

bool reached_target = false;
glm::vec3 b_final, c_final;  // The final positions of the knee and foot.
if (len_at < std::abs(len_ab - len_bc)) {
    // The target is too close.
    // Extend the longer appendage toward it and the other away.
    int sign = (len_ab > len_bc) ? 1 : -1;
    b_final = a - sign * len_ab * n;
    c_final = b_final + sign * len_bc * n;

} else if (len_at <= len_ab + len_bc) {
    // The target is reachable
    reached_target = true;
    c_final = t;

    // The final knee position b_final will lie on a circle that is the intersection of two
    // spheres: the sphere of radius len_ab centered at a and the sphere of radius len_bc
    // centered at t.

    // Cosine of the angle t - a - b_final
    f32 cos_theta =
        (len_bc * len_bc - len_ab * len_ab - len_at * len_at) / (-2 * len_ab * len_at);
    f32 sin_theta = std::sqrt(1.0 - cos_theta * cos_theta);

    // Radius of the circle that b_final must lie on.
    f32 r = len_ab * sin_theta;

    // The center of the circle that b_final must lie on.
    glm::vec3 m = a - len_ab * cos_theta * n;

    // Unit vector perpendicular to n and pointing at b from m.
    // If b and m are coincident, we default to any old vector perpendicular to n.
    glm::vec3 u = GetPerpendicularUnitVector(n, b - m);

    // Compute the new knee position.
    b_final = m + r * u;

} else {
    // The target is too far.
    // Reach out toward it to max extension.
    b_final = a - len_ab * n;
    c_final = b_final - len_bc * n;
}

Adjusting the Orientations

Now that we have the updated knee and foot positions, we need to adjust the bone transforms to rotate the segments to match them.

The bone transforms are all defined relative to their parents. Let our end effector (the hand or foot) be associated with bone \(i\), its parent (the elbow or knee) be associated with bone \(j\), and its parent (the hip or shoulder) be associated with bone \(k\). We should already have the aggregate transforms for these bones computed for the existing pose.

First we’ll rotate bone \(k\) to point toward \(\boldsymbol{b}’\) instead of \(\boldsymbol{b}\). We need to calculate both positions in the bone’s local frame:

\[\begin{aligned}\boldsymbol{b}_\text{local} &= \mathbb{T}_k^{-1} \> \boldsymbol{b} \\ \boldsymbol{b}’_\text{local} &= \mathbb{T}_k^{-1} \> \boldsymbol{b}’ \end{aligned}\]

We then normalize these vectors and find the rotation that between them. A minimum rotation between two unit vectors \(\hat{\boldsymbol{u}}\) and \(\hat{\boldsymbol{v}}\) has an axis of rotation:

\[\boldsymbol{a} = \hat{\boldsymbol{u}} \times \hat{\boldsymbol{v}}\]

and a rotation angle:

\[\theta = \cos^{-1}\left(\hat{\boldsymbol{u}} \cdot \hat{\boldsymbol{v}}\right)\]

// Find the minimum rotation between the two given unit vectors.
// If the vectors are sufficiently similar, return the unit quaternion.
glm::quat GetRotationBetween(const glm::vec3& u, const glm::vec3& v, const f32 eps = 1e-3f) {
    glm::vec3 axis = glm::cross(u, v);
    f32 len_axis = glm::length(axis);
    if (len_axis < eps) {
        return glm::quat(0.0f, 0.0f, 0.0f, 1.0f);
    }

    axis /= len_axis;
    f32 angle = std::acos(glm::dot(u, v));
    return glm::angleAxis(angle, axis);
}

We then adjust the orientation of bone \(k\) using this axis and angle to get an updated transform, \(\boldsymbol{T}_k’\). We then propagate that effect down to the child transforms.

We do the same thing with bone \(j\), using the updated transform. The code for this is:

glm::mat4 iTTk = glm::inverse(TTk);
glm::vec3 local_currt = glm::normalize(To3d(iTTk * glm::vec4(b, 1.0f)));
glm::vec3 local_final = glm::normalize(To3d(iTTk * glm::vec4(b_final, 1.0f)));
frame->orientations[grandparent_bone] =
frame->orientations[grandparent_bone] * GetRotationBetween(local_currt, local_final);

(*bone_transforms)[grandparent_bone] =
(*bone_transforms)[grandgrandparent_bone] * CalcTransform(*frame, grandparent_bone);
(*bone_transforms)[parent_bone] =
(*bone_transforms)[grandparent_bone] * CalcTransform(*frame, parent_bone);

// Do the same with the next bone.
glm::mat4 TTj_new = (*bone_transforms)[parent_bone];
local_currt = glm::normalize(To3d(glm::inverse(TTj) * glm::vec4(c, 1.0f)));
local_final = glm::normalize(To3d(glm::inverse(TTj_new) * glm::vec4(c_final, 1.0f)));
frame->orientations[parent_bone] =
frame->orientations[parent_bone] * GetRotationBetween(local_currt, local_final);

Conclusion

This approach to inverse kinematics is efficient and effective. I’ve been using it to refine various action states, such as placing hands and feet onto ladder rungs:

It works pretty well when the hand or foot doesn’t have to be adjusted all that much. For larger deltas it can end up picking some weird rotations because it does not have any notion of maximum joint angles. I could add that, of course, at the expense of complexity.

Its really nice to be able to figure something out like this from the basic math foundations. There is something quite satisfying about being able to lay out the problem formulation and then systematically work through it.

Cheers!

Impulse Responses

This post expands on the previous post, which looked at collision detection and resolution for 2d polygons. By the end we were able to apply impulses to colliding bodies to properly get them to bounce and spin off of one another. Not only will we take it further with friction impulses and joints, but we’ll pay special attention to how these impulses are derived to emphasize that this line of reasoning is a tool that can be extended to new applications.

There is more math in this post than normal because I want to both provide a resource that I myself could have benefited from and show that this stuff can all be derived.

An impulse \(\boldsymbol{j}\) is an instantaneous change in momentum. They are very similar to forces, which influence velocity over time:

\[\boldsymbol{f} = m \boldsymbol{a} \quad \Rightarrow \quad \boldsymbol{v}’ \approx \boldsymbol{v} + \frac{1}{m} \boldsymbol{f} \Delta t\]

An impulse applied to a rigid body at a point \(\boldsymbol{r}\) relative to the body’s center of mass similarly affects its velocity, but without the need for the timestep:

\[\boldsymbol{v}’ = \boldsymbol{v} + \frac{1}{m} \boldsymbol{j}\]

This impulse also effects the angular velocity:

\[\boldsymbol{\omega}’ = \boldsymbol{\omega} + \boldsymbol{I}^{-1} \boldsymbol{r} \times \boldsymbol{j}\]

This update models the angular velocity in three dimensions, treating it as a vector. We can work out the equivalent 2d update where we just keep track of a scalar \(\omega\):

\[\begin{align}\boldsymbol{r} \times \boldsymbol{j} = \begin{bmatrix} r_x \\ r_y \\ 0\end{bmatrix}\end{align} \times \begin{bmatrix}j_x \\ j_y \\ 0\end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ r_x j_y – r_y j_x \end{bmatrix}\]

If we assume our inertia tensor is diagonal, and just pull out the z entry \(I_{zz}\), we get:

\[\omega’ = \omega + \frac{1}{I_{zz}} \left(r_x j_y – r_y j_x\right)\]

From here on out, if you see a scalar \(I\), it refers to \(I_zz\).

The Normal Impulse

In the last blog post, we used an impulse along the impact normal to cause two colliding rigid bodies to bounce off each other. We’re going to cover the same thing, but derive it instead of just take the result for granted.

We have two bodies, A and B, and a contact point located at \(\boldsymbol{r}_A\) relative to body A’s center and at \(\boldsymbol{r}_B\) relative to body B’s center:

The velocity of the contact points on each body as they enter the collision is:

\[\begin{align}\boldsymbol{v}_{cA} & = \boldsymbol{v}_A + \boldsymbol{\omega}_A \times \boldsymbol{r}_A \\ \boldsymbol{v}_{cB} &= \boldsymbol{v}_B + \boldsymbol{\omega}_B \times \boldsymbol{r}_B \end{align}\]

In 2D that amounts to:

\[\boldsymbol{v}_{cA} = \begin{bmatrix}v_{A,x} \\ v_{A,y} \\ 0\end{bmatrix} + \begin{bmatrix}0 \\ 0 \\ \omega_A\end{bmatrix} \times \begin{bmatrix} r_{A,x} \\ r_{A,y} \\ 0\end{bmatrix} = \begin{bmatrix}v_{A,x} \\ v_{A,y} \\ 0\end{bmatrix} + \begin{bmatrix}-\omega r_{A,y} \\ \hphantom{-}\omega r_{A,x} \\ 0\end{bmatrix} \]

One can do this stuff in 3D and then drop or ignore the \(z\)-axis, but in my code I just stick with 2D:

Vec2f r; // radius
Vec2f v; // linear velocity vector
f32 w;   // angular velocity
Vec2f v_c = v + w * Rotr(r);

The contact velocity — how fast the contact points collide — is the relative speed of the contact points:

\[\boldsymbol{v}_\text{rel} = \boldsymbol{v}_{cB} \> – \boldsymbol{v}_{cA}\]

To derive the impulse, we stipulate two things. First, that the impulse lies in the normal direction:

\[\boldsymbol{j} = j \hat{\boldsymbol{n}}\]

Second, that the relative velocity along the normal direction reverses sign and changes by a factor \(e\):

\[\boldsymbol{v}_\text{rel}’ \cdot \hat{\boldsymbol{n}} = \> – e \> \boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}}\]

This factor \(e\) is called the restitution. Setting it to zero produces a perfectly inelastic collision (because the resulting relative velocity in the normal direction is zero), whereas setting it to 1 produces a perfectly elastic collision that preserves the kinetic energy. Most objects use a value somewhere in-between.

We now have everything we need to derive the impulse. Let’s expand the left-hand side of the restitution equation:

\[\begin{align}\boldsymbol{v}_\text{rel}’ \cdot \hat{\boldsymbol{n}} \\ \left(\boldsymbol{v}_{cB}’ \> – \boldsymbol{v}_{cA}’\right) \cdot \hat{\boldsymbol{n}} \\ \left[\left(\boldsymbol{v}_B’ + \boldsymbol{\omega}_B’ \times \boldsymbol{r}_B\right) – \left(\boldsymbol{v}_A’ + \boldsymbol{\omega}_A’ \times \boldsymbol{r}_A\right)\right] \cdot \hat{\boldsymbol{n}} \\ \left[\left(\left[\boldsymbol{v}_B + \frac{1}{m_B}\boldsymbol{j}\right] + \left[\boldsymbol{\omega}_B + \boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \boldsymbol{j}\right] \times \boldsymbol{r}_B\right) – \left(\left[\boldsymbol{v}_A – \frac{1}{m_A}\boldsymbol{j}\right] + \left[\boldsymbol{\omega}_A – \boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \boldsymbol{j}\right] \times \boldsymbol{r}_A\right)\right] \cdot \hat{\boldsymbol{n}} \\ \left[\left(\left[\boldsymbol{v}_B + j \frac{1}{m_B}\hat{\boldsymbol{n}}\right] + \left[\boldsymbol{\omega}_B + j \boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_B\right) – \left(\left[\boldsymbol{v}_A – j \frac{1}{m_A}\hat{\boldsymbol{n}}\right] + \left[\boldsymbol{\omega}_A – j \boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_A\right)\right] \cdot \hat{\boldsymbol{n}} \\ \left[\boldsymbol{v}_\text{rel} + j \left(\frac{1}{m_B}\hat{\boldsymbol{n}} + \left[ \boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_B + \frac{1}{m_A}\hat{\boldsymbol{n}} + \left[ \boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_A\right)\right] \cdot \hat{\boldsymbol{n}} \\ \boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}} + j \left(\frac{1}{m_A}\hat{\boldsymbol{n}} + \frac{1}{m_B}\hat{\boldsymbol{n}} + \left[ \boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_A + \left[ \boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_B\right) \cdot \hat{\boldsymbol{n}} \\ \boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}} + j \left(\frac{1}{m_A}\hat{\boldsymbol{n}}\cdot \hat{\boldsymbol{n}} + \frac{1}{m_B}\hat{\boldsymbol{n}}\cdot \hat{\boldsymbol{n}} + \left[ \boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_A \cdot \hat{\boldsymbol{n}} + \left[ \boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_B \cdot \hat{\boldsymbol{n}}\right) \\ \boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}} + j \left(\frac{1}{m_A} + \frac{1}{m_B} + \left[ \boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_A \cdot \hat{\boldsymbol{n}} + \left[ \boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_B \cdot \hat{\boldsymbol{n}}\right) \end{align}\]

We then equate this to \( – e \> \boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}}\) and solve for the impulse scalar \(j\), yielding:

\[j = \frac{-(1+e) \boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}} }{ \frac{1}{m_A} + \frac{1}{m_B} + \left(\left[ \boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_A + \left[ \boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \hat{\boldsymbol{n}}\right] \times \boldsymbol{r}_B\right) \cdot \hat{\boldsymbol{n}} }\]

This is the equation that Wikipedia gives us.

The equation I gave in the last blog post looks a little different:

\[j = \frac{-(1+e)\boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}}}{\frac{1}{m_A} + \frac{1}{m_B} + \frac{(\boldsymbol{r}_A \times \boldsymbol{n})^2}{I_A} + \frac{(\boldsymbol{r}_B \times \boldsymbol{n})^2}{I_B}}\]

It only differs in the terms involving the moment of inertia.

Let’s expand the Wikipedia version:

\[\begin{align}\left[\left(\boldsymbol{I}^{-1} \boldsymbol{r} \times \hat{\boldsymbol{n}}\right) \times \boldsymbol{r}\right] \cdot \hat{\boldsymbol{n}} &= \left(\left(\frac{1}{I}\begin{bmatrix}r_x \\ r_y \\ 0\end{bmatrix} \times \begin{bmatrix} \hat{n}_x \\ \hat{n}_y \\ 0 \end{bmatrix}\right) \times \begin{bmatrix}r_x \\ r_y \\ 0\end{bmatrix}\right) \cdot \begin{bmatrix}\hat{n}_x \\ \hat{n}_y \\ 0\end{bmatrix} \\ &= \left(\frac{1}{I}\begin{bmatrix}0 \\ 0 \\ r_x \hat{n}_y – r_y \hat{n}_x\end{bmatrix} \times \begin{bmatrix}r_x \\ r_y \\ 0\end{bmatrix}\right) \cdot \begin{bmatrix}\hat{n}_x \\ \hat{n}_y \\ 0\end{bmatrix} \\ &= \frac{1}{I}\begin{bmatrix}-r_x r_y \hat{n}_y + r_y^2 \hat{n}_x \\ r_x^2 \hat{n}_y – r_x r_y \hat{n}_x \\ 0 \end{bmatrix} \cdot \begin{bmatrix}\hat{n}_x \\ \hat{n}_y \\ 0\end{bmatrix} \\ &= \frac{1}{I}\left( r_x^2 \hat{u}_y^2 – 2 r_x r_y \hat{u}_x \hat{u}_y + r_y^2 \hat{u}_y^2 \right) \end{align}\]

We get the same thing if we expand the other version:

\[\begin{align}\frac{1}{I}\left(\boldsymbol{r} \times \hat{\boldsymbol{n}}\right)^2 &= \frac{1}{I}\left(r_x \hat{n}_y – r_y \hat{n}_x\right)^2 \\ &= \frac{1}{I}\left( r_x^2 \hat{u}_y^2 – 2 r_x r_y \hat{u}_x \hat{u}_y + r_y^2 \hat{u}_y^2 \right)\end{align}\]

The two formulations are equal, but assume that we’re working in 2D and that the moment of inertia tensor is diagonal with every entry equal to \(I\). The Wikipedia version is more general, and can handle 3D collisions and general inertia tensors.

The normal impulse \(j\) in 2D is thus

\[j = \> -(1+e) \> m_\text{eff} \> \boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}} \]

where the effective mass is

\[m_\text{eff} = \left(\frac{1}{m_A} + \frac{1}{m_B} + \frac{(\boldsymbol{r}_A \times \hat{\boldsymbol{n}})^2}{I_A} + \frac{(\boldsymbol{r}_B \times \hat{\boldsymbol{n}})^2}{I_B}\right)^{-1}\]

After calculating it, we apply an impulse \(j \hat{\boldsymbol{n}}\) to body B and the negated impulse to body A. Note that we only apply an impulse if the relative velocity is causing the bodies to move into contact (\(\boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}} < 0\)), otherwise the contact point already has a separating velocity.

The restitution \(e\) will depend on the colliding bodies. One convenient way to get it is to assign an restitution to each rigid body, and then use the maximum between the two for the collision. That allows bounciness to win out:

\[e = \max\left(e_A, e_B\right)\]

Friction Impulse

The normal impulse takes care of the relative penetration speed, but does nothing in the tangent direction along the contact line. We typically want shapes that slide along each other to have friction — that is what the friction impulse, or tangent impulse, is for.

Wikipedia and other websites will at this point show you a fancy diagram for Coulomb friction depicting the friction cone. I don’t find that to be particularly illuminating.

Instead, I simply like to think about Coulomb friction as:

  • compute a tangent impulse that causes the relative tangent velocity to be zero (\(e=0\))
  • if the required tangent impulse is sufficiently small, apply it (static friction)
  • if the required tangent impulse is too large, cap it out and apply that instead (dynamic friction)

Let’s derive the update ourselves again.

The friction impulse acts in the tangent direction \(\hat{\boldsymbol{t}}\), which we can get by rotating the normal. We once again have a scalar impulse:

\[j = j \hat{\boldsymbol{t}}\]

We compute the tangent that causes a zero relative tangent velocity (\(e = 0\)).

These two conditions are structurally the same as the normal impulse, except \(e\) is fixed. We can directly apply our previous equations to get:

\[j = \> – m_\text{eff} \> \boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{t}}\]

where the effective mass is now in the tangent direction:

\[m_\text{eff} = \left(\frac{1}{m_A} + \frac{1}{m_B} + \frac{(\boldsymbol{r}_A \times \hat{\boldsymbol{t}})^2}{I_A} + \frac{(\boldsymbol{r}_B \times \hat{\boldsymbol{t}})^2}{I_B}\right)^{-1}\]

As long as this impulse is small enough (in magnitude), we apply it in order to prevent any tangential sliding. If it is large enough, we switch from static friction to dynamic friction:

\[j = \begin{cases} j & \text{if } |j| \leq j_\text{static} \\ j_\text{dynamic} & \text{otherwise} \end{cases}\]

Easy peasy.

The static and dynamic friction values are typically multiples of the normal impulse. We multiply the normal impulse by \(\mu_\text{static}\) to get the static friction impulse threshold and by \(\mu_\text{dynamic}\) to get the dynamic friction impulse threshold. Like the restitution, these are properties of each body, and we can obtain values to use for a pair of bodies by combining their values. Here it is recommended to use the square root of the product, which lets extremely low-friction (slippery) values dominate:

\[\mu_\text{static} = \sqrt{\vphantom{P}\mu_{A,\text{static}} \cdot \mu_{B,\text{static}}}\]

Revolute Joints

There are other things we can do with impulses. One of the main ones is to model attachment points between rigid bodies. These revolute joints allow rotation, but force the bodies to maintain a constant distance vector to the joint:

Revolute joints are useful for all sorts of things, including swinging:

I didn’t have a lot of luck finding information on how to enforce revolute joints between rigid bodies in 2D. Thankfully, we can derive the necessary impulse ourselves.

The property we want to enforce is that the contact velocity at the joint after the impulse be zero:

\[\boldsymbol{v}_\text{rel}’ = \boldsymbol{0}\]

We can expand that out:

\[\begin{align}\boldsymbol{v}_\text{rel}’ &= \boldsymbol{0} \\ \boldsymbol{v}_{cB}’ \> – \boldsymbol{v}_{cA}’ &= \boldsymbol{0} \\ \left(\boldsymbol{v}_B’ + \boldsymbol{\omega}_B’ \times \boldsymbol{r}_B\right) – \left(\boldsymbol{v}_A’ + \boldsymbol{\omega}_A’ \times \boldsymbol{r}_A\right) &= \boldsymbol{0} \\ \left(\left[\boldsymbol{v}_B + \frac{1}{m_B}\boldsymbol{j}\right] + \left[\boldsymbol{\omega}_B + \boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \boldsymbol{j}\right] \times \boldsymbol{r}_B\right) – \left(\left[\boldsymbol{v}_A – \frac{1}{m_A}\boldsymbol{j}\right] + \left[\boldsymbol{\omega}_A – \boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \boldsymbol{j}\right] \times \boldsymbol{r}_A\right) &= \boldsymbol{0} \\ \boldsymbol{v}_\text{rel} + \left(\frac{1}{m_B}\boldsymbol{j} + \left[\boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \boldsymbol{j}\right] \times \boldsymbol{r}_B\right) + \left(\frac{1}{m_A}\boldsymbol{j} + \left[\boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \boldsymbol{j}\right] \times \boldsymbol{r}_A\right) &= \boldsymbol{0} \\ \boldsymbol{v}_\text{rel} + \left(\frac{1}{m_A} + \frac{1}{m_B}\right)\boldsymbol{j} + \left(\boldsymbol{I}_A^{-1} \boldsymbol{r}_A \times \boldsymbol{j}\right) \times \boldsymbol{r}_A + \left(\boldsymbol{I}_B^{-1} \boldsymbol{r}_B \times \boldsymbol{j}\right) \times \boldsymbol{r}_B &= \boldsymbol{0}\end{align}\]

If we work out the cross products, then we end up with the following system of equations:

\[\begin{bmatrix}\frac{1}{m_A} + \frac{1}{m_B} + \frac{1}{I_A}r_{Ay}^2 + \frac{1}{I_B}r_{By}^2 & -\left(\frac{1}{I_A}r_{Ax}r_{Ay} + \frac{1}{I_B}r_{Bx}r_{By}\right) \\ -\left(\frac{1}{I_A}r_{Ax}r_{Ay} + \frac{1}{I_B}r_{Bx}r_{By}\right) & \frac{1}{m_A} + \frac{1}{m_B} + \frac{1}{I_A}r_{Ax}^2 + \frac{1}{I_B}r_{Bx}^2\end{bmatrix}\begin{bmatrix} j_x \\ j_y\end{bmatrix} = \begin{bmatrix}-v_{\text{rel},x} \\ -v_{\text{rel},y}\end{bmatrix}\]

We can write a simple function that can solve such a \(2 \times 2\) linear system:

// Solve a 2x2 system of linear equations
// [a b][x1] = [e]
// [c d][x2] = [f]
// This method assumes that it will succeed, which
// requires that the matrix be invertible.
template <typename T>
void Solve(T a, T b, T c, T d, T e, T f, T& x1, T& x2) {
    T det = d * a - c * b;
    x2 = (a * f - c * e) / det;
    x1 = (e - b * x2) / a;
}

and then use that to solve for our impulse.

// Compute the effective mass components.
f32 m_eff_11 = inv_m1 + inv_m2 + inv_I1 * r1.y * r1.y + inv_I2 * r2.y * r2.y;
f32 m_eff_12 = -(inv_I1 * r1.x * r1.y + inv_I2 * r2.x * r2.y);
f32 m_eff_22 = inv_m1 + inv_m2 + inv_I1 * r1.x * r1.x + inv_I2 * r2.x * r2.x;

// Solve for the impulse:
//  [m_eff_11   m_eff_12][j.x] = [-v_rel.x]
//  [m_eff_12   m_eff_22][j.y] = [-v_rel.y]
Vec2f j;
Solve(m_eff_11, m_eff_12, m_eff_12, m_eff_22, -v_rel.x, -v_rel.y, j.x, j.y);

We can then apply the resulting impulse to both bodies and perform a positional correction. I weight the correction such that the target joint location is biased toward the heavier object (that way joints on infinite-mass objects never move).

Conclusion

This post set the mathematical foundation for impulses, then derived the normal, friction, and revolute joint impulses. In each case, we looked at what was being preserved, formulated that mathematically, and massaged the equations to back out the impulse. While the equations were long at times, they did not lead us astray and we were able to find our way to some solid physics.

2D Collision Detection and Resolution

This post is an update on the sidescroller project. In short, I still find the project interesting, and as long as that is true, I want to work toward my original goal of a 2D sidescroller with arbitrary level geometry (i.e., not based on tiles or AABBs). I’ve achieved my stated goal of learning how to use the GPU and roll my own skeletal animation system, but I obviously haven’t built a game. A full game was, and still is, out of the question, but it would be cool to have something with a game loop and perhaps a complete level and a boss.

Fleshing out the game further requires fleshing out more of the game-play design. As such, physics must happen.

This blog post covers how I’m going about collision detection and resolution for convex polygons.

The Problem

We wish to simulate objects moving around in a 2D space. While we might eventually have fancy objects like ropes, let’s start by restricting ourselves to convex polygons:

A crude rendition of a 2D platformer scene (left) and the convex polygons in it (right).

As we can see, nonconvex polygons can be represented by multiple convex polygons, so this isn’t all that restrictive.

Most of the polygons will be static. The floors and ceilings, for the most part, do not move. Entities like the player, monsters, eventual ropes, crates, etc. will need to move. That movement needs to obey the laws of physics, which mainly means integrating accelerations into velocities, velocities into positions, and handling collisions.

More formally, we have a scene comprised of \(m\) rigid bodies, some of which are static and some of which are dynamic. Each rigid body is defined by a convex polygon and a 2D state. Each convex polygon is represented as a list of vertices in counter-clockwise (right-hand) order:

More serious implementations might include additional information, like precomputed edge normals or an enclosing axis-aligned box for broad-phase collision detection, but let’s leave those optimizations out for now. And yes, I could have used an std::vector but I like having trivially copyable types since I’m serializing and deserializing my levels to disk and I don’t want to think too hard when doing that.

I made the simplification of keeping track of static bodies separately from dynamic rigid bodies. For the static bodies, the polygon vertices are directly defined in the world frame. For the dynamic bodies, the polygon vertices are defined in the body frame of the entity, and are then transformed according to the entity’s state:

A vertex \(\boldsymbol{v}\) in the body frame of a polygon can be transformed into the world frame according to:

\[\begin{bmatrix}v_x \\ v_y\end{bmatrix}_\text{world} = \begin{bmatrix}\cos \theta & -\sin \theta \\ \sin \theta & \hphantom{-}\cos \theta\end{bmatrix}\begin{bmatrix}v_x \\ v_y\end{bmatrix}_\text{body} + \begin{bmatrix}p_x \\ p_y\end{bmatrix}\]

As the physics evolve, these states change, but the polygons do not. In every frame, we’ll accumulate forces and torques on our bodies and use them to update the state. We’ll also have to perform collision detection and handle any resulting forces and torques.

State Propagation

We need to simulate the laws of motion. That is, solve \(F = m \> a\) and compute

\[\begin{matrix}v(t + \Delta t) = v(t) + \int a(t) \> dt \\ p(t + \Delta t) = p(t) + \int v(t) \> dt \end{matrix}\]

except in 2 dimensions with torques and rotations.

We can’t do exact integration, so we’ll have to rely on approximations. Games tend to run at pretty high frame rates, so its reasonable to approximate the update across a single frame using the Euler method:

\[\begin{align} u(t + \Delta t) &= u(t) + a_x(t) \Delta t \\ v(t + \Delta t) &= v(t) + a_y(t) \Delta t \\ \omega(t + \Delta t) &= \omega(t) + \alpha(t) \Delta t \\ x(t + \Delta t) &= x(t) + u(t) \Delta t \\ y(t + \Delta t) &= y(t) + v(t) \Delta t \\ \theta(t + \Delta t) &= \theta(t) + \omega(t) \Delta t\end{align} \]

or written a bit more compactly:

\[\begin{align} u’ &= u + a_x \Delta t \\ v’ &= v + a_y \Delta t \\ \omega’ &= \omega + \alpha \Delta t \\ x’ &= x + u \Delta t \\ y’ &= y + v \Delta t \\ \theta’ &= \theta + \omega \Delta t\end{align} \]

Apparently, doing this would work okay, but using symplectic Euler integration is a bit more stable. It uses the updated velocities for the position update:

\[\begin{align} u’ &= u + a_x \Delta t \\ v’ &= v + a_y \Delta t \\ \omega’ &= \omega + \alpha \Delta t \\ x’ &= x + u’ \Delta t \\ y’ &= y + v’ \Delta t \\ \theta’ &= \theta + \omega’ \Delta t\end{align} \]

Our physics system will perform this update for every dynamic body in every game tick. After doing so, it needs to handle collisions.

Properly handling collisions is a whole big field and there is potentially a lot to talk about. In an ideal sense, if we were to identify a collision, we’d back up in time to where the collision first occurred, and handle it there. That’s more accurate, but also complicated. In a scene with a bunch of objects and a bunch of collisions, we don’t want to be sorting by a bunch of collision times.

Instead, I’ll be taking inspiration from Verlet integration and enforce constraints via iteration. A single tick of the physics system becomes:

void PhysicsEngine::Tick(f32 dt) {
    // Euler integration
    for (auto& rigid_body : rigid_bodies_) {
        EulerIntegrationStep(&rigid_body.state, dt);
    }

    for (int i = 0; i < 4; i++) {
        EnforceConstraints();
    }
}

We’re running Euler integration once, then running several iterations of constraint enforcement! This iterative approach helps things shake out because oftentimes, which you resolve one constraint, you start violating another. Its a pretty messy way to do things, but its so simple to implement and can be surprisingly robust.

What constraints will we enforce? For now we’re only going to prevent polygons from interpenetrating. In general we can enforce a bunch of other types of constraints, most notably joints between bodies.

Collision Detection

Alright, all that pretext to get to the meat and potatoes of this post.

In order to prevent our rigid bodies from interpenetrating, we need to detect when they collide:

We can do this in a pairwise manner, only considering the world-space polygons of two rigid bodies at once.

We start by observing that two polygons are in collision if they share an interior. While that may be technically true, it doesn’t really tell us how to efficiently compute intersections.

A more useful observation is that two polygons A and B are in collision if

  • A contains any vertex of B, or
  • B contains any vertex of A, or
  • any edge in A intersects any edge in B.

This definition actually works for nonconvex as well as convex polygons (but don’t ask me about degenerate polygons that self-intersect and such). Unfortunately, its a bit inefficient. If A has \(n\) vertices and B has \(m\) vertices, then we’d be doing \(n+m\) polygon–point collision checks and then \(nm\) edge intersection checks. If we keep our polygons simple then it isn’t all that bad, but there are better methods.

One go-to method for 2D polygon collision checking is the separating axis theorem, which states that if two convex polygons A and B are in collision, then you cannot draw a line between them. (Sometimes these things seem kind of obvious in retrospect…):

We can draw a separating line between the two polygons on the left, but cannot for the colliding ones on the right.

One direct consequence of the separating axis theorem is that if you can draw a separating line, then the projection of each shape perpendicular to that line will also have a gap:

Similarly, if you project the two shapes in any direction that they cannot be separated in, then their projections will not have a gap:

For convex polygons that are not colliding, a separating axis can always be constructed along one of the polygon’s edges. That is great — it means we only need to consider \(m + n\) possible separating axes.

Our collision detection algorithm is thus:

  • project both polygons onto all edge normals
  • if there are any gaps, then they do not intersect
  • otherwise, they intersect

Projecting a polygon along an axis is pretty easy:

struct PolygonProjection {
    f32 lo;
    f32 hi;
};

Overlap ProjectPolygonAlongDirection(
    const ConvexPolygon& polygon,
    const Vec2f normal)
{
    PolygonProjection proj;
    proj.lo = std::numeric_limits<f32>::infinity();
    proj.hi = -std::numeric_limits<f32>::infinity();
    for (u8 i = 0; i < polygon.n_pts; i++) {
        const f32 d = Dot(normal, polygon.pts[i]);
        proj.lo = std::min(d, overlap.lo);
        proj.hi = std::max(d, overlap.hi);
    }
    return proj;
}

Notice that the direction we pass in is the edge normal, not the edge tangent. This is because the normal is used to measure the distance away from the separating axis.

Two projects have a gap if proj_a.hi <= proj_b.lo || proj_b.hi <= proj_a.lo. Otherwise, they overlap.

Our collision detection algorithm is thus:

bool CollidesInAnAxisOfA(
    const ConvexPolygon& A,
    const ConvexPolygon& B)
{
    Vec2f a1 = A.pts[A.n_pts-1];
    for (u8 i = 0; i < A.n_pts; i++) {
        const Vec2f& a2 = A.pts[i];
        const Vec2f tangent = Normalize(b - a);
        const Vec2f normal = Rotr(tangent);
 
        PolygonProjection proj_a = 
        ProjectPolygonAlongDirection(A, normal);
        PolygonProjection proj_b = 
             ProjectPolygonAlongDirection(B, normal);
        if (proj_a.hi > proj_b.lo && proj_b.hi > proj_a.lo) {
            return true; // no gap, thus a collision
        }
    }
}

bool Collides(
    const ConvexPolygon& A,
    const ConvexPolygon& B)
{
   return CollidesInAnAxisOfA(A, B) || CollidesInAnAxisOfA(B, A);
}

Penetration Vectors

Once we have determined that two polygons are colliding, we need to separate them. In order to do that, we need to find the penetration vector, or the shortest vector that can be added to one of the polygons such that they no longer overlap:

Rather than figuring this out from scratch, let’s modify our collision-checking routine to also figure this out.

When we project the two polygons into a direction, then the overlap of their projections is the distance we would need to move them in the projection direction to separate them:

We can thus find the penetration vector by keeping track of the smallest overlap among all of the projections we do!

struct CollisionResult {
    bool collides;   // whether there is a collision
    f32 penetration; // the length of the penetration vector
    Vec2f dir;       // the unit penetration direction
}

// A helper function that projects along all of the edge tangents of A.
void CollisionHelper(
    const ConvexPolygon& A,
    const ConvexPolygon& B,
    CollisionResult* retval)
{
    Vec2f a1 = A.pts[A.n_pts-1];
    for (u8 i = 0; i < A.n_pts; i++) {
        const Vec2f& a2 = A.pts[i];
        const Vec2f tangent = Normalize(b - a);
        const Vec2f normal = Rotr(tangent);
        
        PolygonProjection proj_a = 
            ProjectPolygonAlongDirection(A, normal);
        PolygonProjection proj_b = 
            ProjectPolygonAlongDirection(B, normal);
        if (proj_a.hi > proj_b.lo && proj_b.hi > proj_a.lo) {
            retval.collides = true;
            return; // no gap, thus a collision
        }

        // Keep track of the smallest penetration
        f32 penetration = min(
            abs(proj_a.lo - proj_b.hi),
            abs(proj_a.hi - proj_b.lo))
        if (penetration < retval.penetration) {
            retval.penetration = penetration;
            retval.dir = normal;
        }
    }
}

CollisionResult Collides(
    const ConvexPolygon& A,
    const ConvexPolygon& B)
{
   CollisionResult retval = {};
   retval.penetration = std::numeric_limits<f32>::infinity();
   CollisionHelper(A, B, &retval);
   if (retval.collides) return retval;
   CollisionHelper(B, A, &retval);
   return retval;
}

Collision Manifolds

Our collision resolution logic can separate the polygons, but that isn’t enough to get realistic physics. We want to affect more than just their positions, but also their velocities and angular states. A collision should result in an impulse force that changes each body’s velocity and angular acceleration.

The torque is the counter clockwise force times the radius to the rigid body’s center of mass. That means we need the radius, which means knowing where the force is applied. So far we’ve only calculated the penetration normal, not where the penetration occurred.

Properly determining a collision response requires computing the contact point or contact manifold. Eric Catto has a nice GDC 2007 slide deck on this, but I wasn’t able to track down the actual talk that goes along with it, so it leaves something to be desired.

The contact manifold represents the points at which two objects come into contact. Ideally that would be the very first points that touch as two objects collide, but because of our discrete time steps, we have to work with partial overlaps:

In some cases, a contact manifold should be an entire edge, but we’ll be approximating it using at most two points.

The contact manifold can be constructed using either polygon as a reference. We’ll be moving the polygons out of collision afterwards, so the contact manifold will be roughly the same independent of which polygon we choose:

The reference polygon will be the one with the edge that the penetration vector is normal to. We’ll call that edge the reference edge. The other polygon is called the incident polygon, and we’ll need to identify an incident edge that crosses the reference edge.

We will start by running our latest collision resolution algorithm, which gives us a penetration unit vector and magnitude. We have to slightly modify that code to keep track of which polygon it came from, so that we know which polygon is the reference polygon.

If it determines that there is no collision, we are done. Otherwise, we then identify the incident edge, which shall be the edge in the incident polygon whose outward normal is most opposed to the reference edge normal:

int inc_edge_index = 0;
int inc_edge_index_succ = 0;
f32 min_dot = std::numeric_limits<f32>::infinity();

int i_prev = I->n_pts - 1;
for (u8 i = 0; i < I->n_pts; i++) {
    common::Vec2f normal_i = Rotl(Normalize(I->pts[i] - I->pts[i_prev]));
    f32 dot = common::Dot(normal, normal_i);
    if (dot < min_dot) {
        min_dot = dot;
        inc_edge_index = i_prev;
        inc_edge_index_succ = i;
        }
    i_prev = i;
}

We then clip the incident edge to lie between the end caps of the reference edge:

This clipping is done in two steps — first by clipping the incident edge (which is a line segment) to lie on the negative side of the left halfspace and then again to lie on the negative side of the right halfspace. We can clip a line segment to a halfspace using the dot product of the endpoints with the halfspace normal vector (which is tangent to the reference edge) and comparing that to the dot product of the reference edge endpoint.

struct ClipResult {
   int n_pts;
   Vec2f pts[2];
};
ClipResult ClipSegmentToHalfspace(
    const Vec2f a,
    const Vec2f b,
    const Vec2f plane_normal,
    const f32 plane_offset,
    const int ref_edge_index) {

    ClipResult retval = {};

    // Compute the distances of the end points to the plane
    f32 dist_a = Dot(plane_normal, a) - plane_offset;
    f32 dist_b = Dot(plane_normal, b) - plane_offset;

    // If the points are on the negative side of the plane, keep them
    if (dist_a <= 0.0f) {
        retval.pts[retval.n_pts] = a;
        retval.n_pts += 1;
    }
    if (dist_b <= 0.0f) {
        retval.pts[retval.n_pts] = b;
        retval.n_pts += 1;
    }

    // If the points are on different sides of the plane, 
    // then we need to find the intersection point. 
    // (We don't have to do anything in the case that the
    // points are both on the positive side of the plane.)
    if (dist_a * dist_b < 0.0f) {
        // Keep the intersection points
        f32 interp = distance_0 / (distance_0 - distance_1);
        retval.pts[retval.n_pts] = a + interp * (b - a);
        retval.n_pts += 1;
    }

    return retval;
}

If we fully clip the segment away (which shouldn’t happen), then we don’t have a real intersection.

The collision manifold will consist of all endpoints in the clipped incident edge that penetrate across the reference edge. This explains how we can end up with either a 1- or a 2-point collision manifold. We can once again use a dot product and compare it to a base offset:

f32 face_offset = Dot(reference_normal, reference_a);
for (int i = 0; i < 2; i++) {
    // Get the distance of the ith endpoint along the normal from the
    // reference face. A negative value indicates penetration into the
    // reference polygon.
    f32 separation = Dot(reference_normal, clip.pts[i]) - face_offset;
    if (separation <= kEps) {
        manifold.points[manifold.n_pts] = clip.pts[i];
        manifold.n_pts += 1;
    }
}

I coded up a visualization of this computation to help with debugging. The red vector is the unit normal along the reference edge, the little aligned blue vector is the same vector scaled by the penetration, the little red square is the 1 point in this collision manifold, and the little green square is the other endpoint of the clipped incident edge, which in this case was not accepted into the collision manifold:

We can set up a case where there are two points in our collision manifold:

I can interact with my visualization by translating or rotation the one square around:

Collision Resolution

Now that we can detect collisions, calculate penetration vectors, and calculate a collision manifold, we can finally do the work of resolving a collision. Wikipedia has a good overview of this, but I’ll write it out here for good measure.

If we have a collision, then our contact manifold will contain one or two contact points. For starters let’s work with the case of a single contact point \(\boldsymbol{c}\). We’ll assume that body A is the reference polygon.

We can simulate the effect of a collision by imparting opposed impulses to the colliding bodies. This impulse is the integral of the force acting on each body over that collision timestep.

The impulse force depends on the relative velocity between the two objects. Let \(\boldsymbol{r}_A\) and \(\boldsymbol{r}_B\) be the vectors from each body’s center of mass (usually its centroid) to the contact point. Then we can calculate the relative velocity at the contact point:

\[\boldsymbol{v}_\text{rel} = (\boldsymbol{v}_B + \omega_B \times \boldsymbol{r}_B) – (\boldsymbol{v}_A + \omega_A \times \boldsymbol{r}_A)\]

The impulse is zero if this relative velocity along the normal is positive.

The normal impulse for the objects is:

\[j = \frac{-(1+e)\boldsymbol{v}_\text{rel} \cdot \boldsymbol{n}}{\frac{1}{m_A} + \frac{1}{m_B} + \frac{(\boldsymbol{r}_A \times \boldsymbol{n})^2}{I_A} + \frac{(\boldsymbol{r}_B \times \boldsymbol{n})^2}{I_B}}\]

where \(e \in [0,1]\) is the coefficient of restitution that determines how elastic the collision is, \(m_A\) and \(m_B\) are the polygon masses, and \(I_A\) and \(I_B\) are their moments of inertia.

We then update each polygon’s linear and angular speed according to:

\[\begin{align} \boldsymbol{v}_A’ &= \boldsymbol{v}_A \> – \frac{j}{m_A} \boldsymbol{n} \\ \boldsymbol{v}_B’ &= \boldsymbol{v}_B + \frac{j}{m_B} \boldsymbol{n} \\ \omega_A’ &= \omega_A \> – \frac{j}{I_A} \left(\boldsymbol{r}_A \times \boldsymbol{n}\right) \\ \omega_B’ &= \omega_B + \frac{j}{I_B} \left(\boldsymbol{r}_B \times \boldsymbol{n} \right)\end{align}\]

When there are two points in the contact manifold, one might be tempted to approximate the collision with their mean. I found that that doesn’t quite work correctly in all cases. For example, in the situation below, the rotation causes the estimated contact point to move right, which imparts a torque in the wrong direction:

In order to fix this, I process both contact points, but weight the impulse in proportion to the penetration depth.

Positional correction is similar. To resolve the collision, we simply need to separate the two bodies along the penetration unit normal by a distance equal to the penetration \(\delta\). We could split the difference and shift the incident polygon by \(\delta / 2\) and the reference polygon by \(-\delta / 2\) along the normal, but we also want to account for their relative masses.

To account for the masses, we’d want to move the incident polygon by

\[\frac{\delta}{2} \frac{m_A}{m_A + m_B}\]

and the reference polygon by

\[-\frac{\delta}{2} \frac{m_B}{m_A + m_B}\]

where \(m_B\) is the mass of the incident polygon and \(m_A\) is the mass of the reference polygon. The heavier polygon is moved less.

However, as stated earlier, rather than fully resolve the collision, we will perform multiple, smaller resolution steps. If our penetration has a magnitude \(\delta\), we only apply \(\beta \delta\) where \(\beta\) is between 0 and 1, typically something like 0.4. We move the incident polygon in the penetration normal direction by that amount, and the reference polygon by the opposite amount. This will help jostle multiple conflicting collisions in a nice way such that they hopefully find a stable, valid solution.

For a nice code reference for all this, I recommend Randy Paul’s Impulse Engine.

Conclusion

And that’s enough to get us some pretty solid solid-body physics!

We actually did quite a lot in this post. We set off to do polygon collision resolution, started with polygon intersections, realized we also needed to figure out the penetration vector, did that, and then realized that to get appropriate impulses we needed to compute approximate collision manifolds. We then used that information to compute impulses, apply them, and to physically separate any colliding bodies.

We don’t have any friction. Its actually pretty easy to add, but I figured this post was already long enough.

We haven’t yet talked about any sort of spatial partitioning or other broad-phase collision tactics. No special memory allocation strategies, stable contact points, or joints. As always, there is a whole world of potential complexity out there that can be tackled if and when its needed, and we want to tackle it.

I highly recommend the Box2D repo as a helpful reference. It is a bit more mature, which can lead to some excessive complexity, but on the whole it is actually very cleanly written and extremely well documented.

Happy coding.

A Transformer Sokoban Policy

I’ve posted about Sokoban before. It is a problem I keep coming back to.

Sokoban is both very simple to specify but its difficulty can be readily scaled up. It is easy to accidentally push blocks into an inescapable kerfuffle, and more interestingly, it isn’t always obvious when you’ve done it.

Moreover, I think people solving Sokoban often beautifully exhibit hierarchical reasoning. The base game is played with up/down/left/right actions, but players very quickly start thinking about sequences of block pushes. We learn to “see” deadlock configurations and reason about rooms and packing orders. This ability to naturally develop higher-order thinking and execute on it is something I want to understand better. So I keep coming back to it.

In this blog post, I’ll recount training a Sokoban policy using deep learning, and more specifically, using the transformer architecture. I started pretty small, developing something I that hope to flesh out more in the future.

The Problem

This blog post is about policy optimization. A policy \(\pi(s)\) maps states to actions. These states and actions must exist in some sort of problem — i.e. a search problem or Markov decision process. It is often a good idea to define the problem before talking about how to structure the policy.

Our problem, of course, is Sokoban. The state space is the space of Sokoban boards. Each board is a grid, and each tile in the grid is either a wall or a floor, each floor may be a goal and may contain either a box or the player. For this post I’m working with 8×8 boards.

\[\text{state space}\enspace\mathcal{S} = \text{Sokoban boards}\]

Normal Sokoban has an action space of up/down/left/right. At least, that is what we’re told. I think that isn’t quite true. In normal Sokoban, the player can get stuck in a deadlock. When that happens, they typically reset the level or hit undo. So I’m going use an action space of up/down/left/right/undo. Trajectories with undos are of course never optimal, but you nevertheless might need to use them when you’re exploring.

\[\text{action space}\enspace\mathcal{A} = \left\{\text{up}, \text{down}, \text{left}, \text{right}, \text{undo}\right\}\]

Notice that we’re working directly with player movements, not with box pushes. Pretty much no-one ever works directly with player movements, because the search space gets so large so quickly. That being said, hierarchical reasoning is the point of my investigations, so I’m starting at the bottom of the reasoning hierarchy.

Our goal is to solve Sokoban problems. For now, let’s just say that we get a reward of 1 whenever we solve a board (all boxes are on goals), and zero reward otherwise. Given two action sequences that solve a board, we prefer the shorter one.

\[R(s,a) = \begin{cases}1 & \text{if } \texttt{issolved}(s) \\ 0 & \text{otherwise}\end{cases}\]

We won’t directly make use of this reward function in this blog post, but it will come into play later.

The Policy

A typical policy would take the current state and try to find the best action that leads toward having all boxes on goals. Its architecture would broadly look like:

The output is a probability distribution over the actions. There are 5 actions (up,down,left,right, and undo), so this is just a categorical distribution over 5 values. That’s the bread and butter of ML and we can easily produce that with a softmax output layer.

The input is a board state. I decided to represent boards as \(8\times 8\times 5\) tensors, where the first two dimensions are the board length and width, and then we have 5 channels that indicate whether a given tile is a wall space, a floor space, has a goal, has a player, and has a box. Altogether that is 320 scalars, which is somewhat long and sparse. However, it contains a lot of structured spatial information that we need to make use of.

The first architectural decision is to run the board state through an encoder head to produce a latent vector. Its basically a 5-channel image, so I’m using convolutional layers, just like you would for an image input. These can learn consistent features that get applied in parallel across the input.

This could work. We’d just stick some feedforward layers with maybe some residual connections into that trunk, and we’d be able to train a policy to place all boxes on goals. I didn’t quite want to go that route though, because I wanted the network to be slightly more configurable.

My eventual goal is to do hierarchical reasoning, so I want to be able to have another, higher-level policy use this lower-level policy as a tool. As such, I want an input that it can provide in order to steer the policy. To this end, I’m providing an additional goal input that tells the neural network what we want to happen:

In Sokoban its pretty important for the player to know when a board just isn’t solvable. Its really easy to end up in a deadlock state:

We want the model to be able to learn when this happens, and to be able to report that it thinks this is the case.

I thus include an additional output, which I’m calling the nsteps output. It predicts the number of steps left in the solution of the puzzle. If the network thinks there are an infinite number of steps left, then the board is not solvable.

In Stop Regressing, the authors suggest avoiding (continuous) regression and to train over discrete targets instead. As such, my nsteps output will actually output a discrete value:

nsteps = Int(clamp(round(log(nsteps+1)), 0, 5)) + 1

which can take on up to 6 values:

and a 7th bin for unsolvable.

That leaves our network as (with the softmaxes implied):

This architecture is a perfectly reasonable policy network. In fact, we’ll compare against this network later on. However, I was also very interested in working with transformers, so I built a transformer version of this network:

This architecture is roughly the same, except you have the massively parallel transformer that:

  • can train with supervised learning more efficiently, by receiving an entire sequence at once
  • can use past information to inform future decisions
  • can easily be scaled up or down based on just a few parameters
  • lends itself nicely to beam search for exploration

I am initially working with relatively small models consisting of 3-layer transformers with 16-dimensional embedding spaces, a maximum sequence length of 32, and 8-headed attention.

Training

This initial model is trained with supervised learning. In order to do that, we need supervised data.

Typical supervised learning would provide a starting state, the goal state, and an optimal set of actions to reach it. My training examples are basically that, but also include examples that cannot be solved and some examples with bad moves and a training mask that allows me to mask them out during training. (This theoretically lets the network learn undos and to recover from bad moves).

start:
########
# #
# #
# . bp #
# #
# #
# #
########

goal:
########
# #
# #
# B #
# #
# #
# #
########

solved
moves: LL
isbad: --

solved
moves: xLL
isbad: x--


solved
moves: UxLL
isbad: xx--

I have a few handcrafted training entries, but most of them are randomly generated problems solved with A*. I generate 1-box problems by starting with an 8×8 grid of all-walls, producing two random rectangular floor sections (rooms), randomly spawning the goal, box, and the player, and then speckling in some random walls. That produces a nice diversity that has structure, but not too much structure.

Most of these boards aren’t solvable, so I when I generate training examples, I set a target ratio for solved and unsolved boards and make sure to keep generating until I get enough of each type. I have 4000 generated problems with solutions and 4000 generated problems without solutions, resulting in 8,000 overall generated training examples.

I was able to cheaply boost the size of the training set by recognizing that a problem can be rotated 4 ways and can be transposed. Incorporating these transforms (and applying the necessary adjustments to the solution paths) automatically increases the number of training examples by 8x.

I train with a standard Flux.jl loop using the default Adam optimizer. I shuffle the training set with each epoch, and train with a batch size of 32. I’m using 1% dropout.

The training loop is set up to create a new timestamped directory with each run and save a .json file with basic metrics. I compute both the policy and nsteps loss on each batch and log these for later printing. This gives me nice training curves, effectively using the code from the Flux quick start guide:

This plot uses a log-scale x-axis, which gives it some distortion. Training curves are often plotted this way because a lot of stuff happens early on and more updates have to be be made later on to get similar movement.

I’ve been consistently seeing the plots play out like they do above. At first the loss drops very quickly, only to plateau between 100 and 1000 batches, and then to suddenly start reducing again. The model seems to suddenly understand something around the 1000-batch mark. Based on how flat the policy loss is, it seems to be related to that. Maybe it learns some reliable heuristics around spatial reasoning and walking to boxes.

The curves suggest that I could continue to train for longer and maybe get more improvement. Maybe that’s true, and maybe I’ll look into that in the future.

Validation

I have two types of validation – direct supervised metrics and validation metrics based on using the policy to attempt to solve problems. Both of these are run on a separate validation dataset of 4000 randomly generated problems (with a 50-50 split of solvable and not solvable).

The supervised metrics are:

  • accuracy at predicting whether the root position is solvable
  • top-1 policy accuracy – whether the highest-likelihood action matches the supervised example
  • top-2 policy accuracy – whether the supervised example action is in the top two choices
  • top-1 nsteps accuracy – whether the highest-likelihood nsteps bucket matches the example
  • top-2 nsteps accuracy – ditto, but top two choices

Note that a Sokoban problem can have multiple solutions, so we shouldn’t expect a perfect policy accuracy. (i.e. going up and then left is often the same as going left and then up).

I then use the model to solve each problem in the validation dataset using beam search. This lets me actually test efficacy. Beam search was run with a beam width of 32, which means I’m always going to find the solution if it takes fewer than 2 steps. The search keeps the top candidates based on the running sum of the policy likelihood. (I can try other formulations, like scoring candidates based on the predicted nsteps values).

Results

I trained two models – a transformer model and a 2nd model that does not have access to past data. The 2nd model is also a transformer, but with a mask that masks out everything but the current and goal states.

The metrics are:

TransformerBaseline
Solvability Accuracy0.9770.977
Top-1 Policy Accuracy0.8670.831
Top-2 Policy Accuracy0.9820.969
Top-1 Nsteps Accuracy0.9000.898
Top-2 Nsteps Accuracy0.9900.990
Solve Rate0.9210.879
Mean Solution Length6.9836.597

Overall, that’s pretty even. The baseline policy that does not have access to past information is ever so slightly worse with respect to top-1 accuracy. Our takeaway is that for this problem, so far, the past doesn’t matter all that much and we can be pretty Markovian.

A 97.7% solvability accuracy is pretty darn good. I am happy with that.

The top-k policy and nsteps accuracies are pretty good too. The full transformer outperforms the baseline a bit here, but only by a bit.

The primary difference comes out when we look at the solve rate, which we obtain by actually using the policies to solve problems. The full transformer succeeds 92.1% of the time, whereas when there is no history it only solves it 87.9% of the time. That’s only a 4.2% spread, but it is noticeable. I can only hypothesize, but I suspect past information is useful for longer traversals and simply the fact that you can pull more information in can give the model more space to compute with.

Differential Evolution

Algorithms for Optimization covers Differential Evolution in its chapter on population methods. It is fairly easy to specify what it does, but the combination of things it does is somewhat strange, and it isn’t all that obvious why it would work, or work particularly well. That being said, there is plenty of support for Differential Evolution in the literature, so I figured there probably is a method to this method’s madness. We’re in the process of updating the textbook for a 2nd edition, so in this post I’ll spend some time trying to build more intuition behind Differential Evolution to better present it in the text.

The Algorithm

Differential Evolution is a derivative-free population method. This means that instead of iteratively improving a single design, it iteratively improves a bunch of designs, and doesn’t need the gradient to do so. The individual updates make use of other individuals in the population. An effective population method will tend to converge its individuals on the (best) local minima of the objective function.

Differential Evolution applied to the Ackley function, proceeding left to right and top to bottom. Image from Algorithms for Optimization published by the MIT Press.

In each iteration, a new candidate individual \(\boldsymbol{x}’\) is created for every individual \(\boldsymbol{x}\) of the population. If the new candidate individual has a lower objective function value, it will replace the original individual in the next generation.

We first construct an interim individual \(\boldsymbol{z}\) using three randomly sampled and distinct other population members:

\[\boldsymbol{z} = \boldsymbol{a} + w \cdot (\boldsymbol{b} – \boldsymbol{c})\]

where \(w\) is a scalar weight.

The candidate design \(\boldsymbol{x}’\) is created via uniform crossover between \(\boldsymbol{z}\) and \(\boldsymbol{x}\), which just means that each component has a probability \(p\) of coming from \(\boldsymbol{z}\) and otherwise comes from \(\boldsymbol{x}\).

This is really easy to code up:

function diff_evo_step!(
population, # vector of designs
f, # objective function
p, # crossover probability
w) # differential weight

n, m = length(population[1]), length(population)
for x in population
a, b, c = sample(population, 3, replace=false)
z = a + w*(b-c)
x′ = x + (rand(m) .< p).*(z - x)
if f(x′) < f(x)
x[:] = x′
end
end
return population
end

What I presented above is a slight simplification to the normal way Differential Evolution is typically shown. First, while my implementation samples three distinct members of the population for \(\boldsymbol{a}\), \(\boldsymbol{b}\), and \(\boldsymbol{c}\), it does not guarantee that they are distinct from \(\boldsymbol{x}\). They will pretty much always be distinct from \(\boldsymbol{x}\) for sufficiently large population sizes, and if not it isn’t all that detrimental.

Second, the standard way to define Differential Evolution always forces at least one component to mutate. This isn’t important if the crossover probability is large enough in comparison to the number of components, but in practice users have found that a very low, or even zero, crossover probability with one guaranteed mutation works best for many objective functions.

What is Going On

Differential Evolution is weird. Or at the very least, its unintuitive. Let’s break it down.

The algorithm proceeds by replacing individuals with new candidate individuals if they are better. So far so good.

The candidate individuals are formed by three things:

  • a baseline individual \(\boldsymbol{a}\)
  • a perturbation formed by the difference \(w\cdot(\boldsymbol{b} – \boldsymbol{c})\)
  • uniform crossover

Differential Evolution gets its name from the second bullet – the difference between individuals. This is its secret sauce. Its nice in that it provides a natural way to take larger steps at the start, when the population is really spread out and hasn’t found local minima, and smaller steps later on when the population is concentrated around a few specific regions:

We can run a version of Differential Evolution that only has this differential, using \(\boldsymbol{x}’ = \boldsymbol{a} + w \cdot (\boldsymbol{b} – \boldsymbol{c})\). If we do that, we’ll get design improvements, but they’ll be local to the current individual. Including the baseline individual is going to do better:

The mean best objective function value and 1 standard deviation intervals using Differential Evolution and pure differential on the Ackley function, over 1000 trials. I’m using \(w = 0.2\) to exaggerate the effect.

Local perturbations alone are often insufficient to get individuals out of local minima to better local minima. This is where the baseline individual \(\boldsymbol{a}\) comes in. Perturbing off of a different individual than \(\boldsymbol{x}\) means we could be substituting \(\boldsymbol{x}\) with a perturbed individual in a better local minima. When that happens, we’re basically warping \(\boldsymbol{x}\) to that better minima:

So that’s where \(\boldsymbol{a} + w \cdot (\boldsymbol{b} – \boldsymbol{c})\) comes from. What about uniform crossover? Let’s run an experiment.

Below I show the average and \(\pm 1\) standard deviation on the best objective function value vs. iteration for 1000 trials on a bunch of different algorithm variations on the Ackley function:

  • no perturbation (p=0.5, w=0.0)
  • no baseline individual (p=0.5, w=0.5, a=x)
  • baseline Differential Evolution (p=0.5, w=0.5)
  • high mutation rate (p=0.9, w=0.5)
  • guaranteed crossover (p=1.0, w=0.5)

Without perturbations (w=0.0) we can’t really explore the space. We’re only looking at coordinate values that exist in our initial population. Its no surprise that that one does poorly.

Using just the baseline individual is the next-worst, because that one can’t as effectively jump individuals to new regions.

The top three performers are all Differential Evolution with varying mutation rates. We find that higher mutation rates in uniform crossover perform best. In fact, our best performance doesn’t use crossover at all, but just directly uses \(\boldsymbol{x}’ = \boldsymbol{z}\). Do we even need uniform crossover?

This table from Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces shows the parameters they used on a fairly large test suite of objective functions:

The CR column gives the crossover probability. Its basically always zero! Recall that in official implementations, crossover in Differential Evolution is guaranteed to mutate at least one component.

According to Differential Evolution: A Practical Guide to Global Optimization, this paper somewhat confused the community, because functions tended to either be solved best with \(p \approx 0\) or with \(p \approx 1\). It wasn’t until later that people made the connection that the functions that were solved with \(p \approx 0\) were decomposable, and the functions with \(p \approx 1\) were not.

A function is decomposable if it can be written as a sum of functions using distinct subsets of the inputs, such as:

\[f(\boldsymbol{x}) = \sum_i f^{(i)}(x_i)\]

When this is the case, we can improve individual components in isolation without having to be in-sync with the other components. Small mutation rates are reliable, because we can just reliably adjust a few components at a time and step our way to a minimum. By changing fewer components we have a smaller chance of messing something else up.

Most of the interesting objective functions have a high degree of coupling, and we can’t get away with optimizing some variables in isolation. For example, a neural network mixes its inputs in a convoluted manner. In order to make progress we often have to step in multiple component directions at the same time, which for Differential Evolution requires a higher crossover probability.

Optimizing Decomposable and Non-decomposable Functions

Let’s run another experiment. We’ll again do 1000 trials each (population size 20) with a bunch of versions of Differential Evolution. In the first trial, we’ll optimize a decomposable function, 10 parallel copies of Ackley’s function (making this a 20-dimensional optimization problem):

n_copies = 10
f = x -> sum(100*(x[2i-1]^2-x[2i])^2 + (1-x[2i-1])^2 for i in 1:n_copies)

We’re going to vary the mutation rate. If our results match with the literature, we expect very low mutation rates to do better.

We do indeed find that low mutation rates do better. In fact, a mutation rate of zero with one guaranteed mutation is the best, closely followed by \(p = 0.1\). This does well because taking a small step in just one or two axes can more reliably lead to a win rather than moving in a bunch of axes, and this “knowledge” gained in one individual can easily be copied to other individuals to improve them too.

We can do another experiment with a non-decomposable function, in this case the xor problem. Let’s optimize some basic 2-layer feedforward neural networks. These networks have 21 parameters, so this is a 21-parameter optimization problem:

relu(x) = max(x,0.0)
sigmoid(x) = 1/(1+exp(-x))

function binary_cross_entropy(y_true, y_pred; ϵ=1e-15)
y_pred = clamp.(y_pred, ϵ, 1-ϵ)
return -mean(y_true .* log.(y_pred) + (1.-y_true) .* log.(1.-y_pred))
end

# xor problem
f = x -> begin
inputs = randn(2,10)
W1 = reshape(x[1:10], (5,2))
b1 = x[11:15]
W2 = reshape(x[16:20], (1,5))
b2 = x[21]

hidden = relu.(W1*inputs .+ b1)
y_pred = sigmoid.(W2*hidden .+ b2)

truth_values = inputs .> 0.0
y_true = .!=(truth_values[1, :], truth_values[2, :])

return binary_cross_entropy(y_true, y_pred)
end

This optimization problem is not decomposable. Taking steps in only one axis at a time is less likely to lead to significant improvements.

Again, that is what we see! Here the single guaranteed mutation is the worst. Interestingly, full crossover with \(p=1.0\) is initially pretty good but then tapers off, so there does appear to be some advantage to some crossover randomness. The lowest loss is most consistently achieved with \(p=0.5\), but \(p=0.9\) does nearly as well.

Why is full crossover not as good? I suspect that version will more quickly concentrate its populations into local regions, whereas having some stochasticity spreads the populations out more. Let’s plot the mean pairwise distance in the population for this experiment:

Yup! That’s exactly what we see. With full crossover, our population prematurely converges in about 20 iterations. With \(p=0.9\) it also converges pretty quick, but it apparently has enough time to find a better solution. The other crossover rates are able to avoid that premature convergence.

Conclusion

So now we know a thing or two about what makes Differential Evolution work. To recap, it uses:

  • the differential between two individuals to provide perturbations, which let us explore
  • this differential automatically produces smaller perturbations as the population converges
  • applies the perturbation to a baseline individual which can automatically warp designs to more favorable regions
  • uses uniform crossover to either
    • \(p \approx 0\) force steps in a small set of coordinates; useful for separable functions
    • \(p \geq 0.5\) allow for steps in many coordinates, while preventing premature convergence

So all of the ingredients have their place! Now its time to condense these learnings and add them to the next edition of the textbook.

Cheers.

Transformers – How and Why They Work

This month I decided to take a break from my sidescroller project and instead properly attend to transformers (pun intended). Unless you’ve been living under a rock, you’ve noticed the rapid advanced of AI in the last year and the advent of extremely large models like ChatGPT 4 and Google Gemini. These models, and pretty much every other large and serious application of AI nowadays, are based on the transformer architecture first introduced in Attention is All You Need. So this post is about transformers, how and roughly why they work, and how to write your own.

The Architecture

The standard depiction of the transformer architecture comes from Figure 1 of Attention is All You Need:

I initially found this image difficult to understand. Now that I’ve implemented my own transformer from scratch, I’m actually somewhat impressed by its concise expressiveness. That’s appropriate for a research paper where information density is key, but I think we can do better if we’re explaining the architecture to someone.

Transformers at base operate on tokens, which are just discrete items. In effect, if you have 5 tokens, then you’re talking about having tokens 1, 2, 3, 4, and 5. Transformers first became popular when used on text, where tokens represents words like “potato” or fragments like “-ly”, but they have since been applied to chunks of images, chunks of sound, simple bits, and even discrete states, actions, and rewards. I think words are the most intuitive to work with, so let’s go with that.

Transformers predict the next token some given all of the tokens that came before:

\[P(x_{t} \mid x_{t-1}, x_{t-2}, \ldots)\]

For example, if we started a sentence with “the cat sat”, we might want it to produce a higher likelihood for “on” than “potato”. Conceptually, this looks like:

The output probability distribution is just a Categorical distribution over the \(n\) possible tokens, which we can easily produce with a softmax layer.

You’ll notice that my model goes top to bottom whereas academia for some unfathomable reason usually depicts deep neural nets bottom to top. We read top to bottom so I’m sticking with that.

We could use a model like this to recursively generate a full sentence:

So far we’ve come up with a standard autoregressive model. Those have been around forever. What makes transformers special is that they split the inputs from the outputs such that the inputs need only be encoded once, that they use attention to help make the model resilient to how the tokens are ordered, and that they solve the problem of vanishing gradients.

Separate Inputs and Outputs

The first weird thing about the transformer architecture figure is that it has a left side and a right side. The left side receives “inputs” and the right side receives “outputs”. What is going on here?

If we are trying to generate a sentence that starts with “the cat sat”, then “the cat sat” is the inputs and the subsequent tokens are the outputs. During training we’d know what the subsequent tokens are (our training set would split sentences into (input, output) pairs, such as randomly in the middle or via question/answer), and during inference we’d sample the outputs sequentially.

Conceptually, we’re now thinking about an architecture along these lines:

Is this weird? Yes. Why do they do it? Because it’s more efficient during training, and during inference you only need to run the input head once.

During training, we know all of the tokens and so we just stick a loss function on this to maximize the likelihood of the correct output token:

During inference, we run the encoder once and start by running the model with an empty output:

We only use the probability distribution over the first token, sample from it, and append the sampled token to our output. If we haven’t finished out sentence yet, we continue on:

We only use the probabilistic distribution over the next token, sample from it, etc. This iterative process is auto-regressive, just like we were talking about before. (aside 1) (aside 2)

The reason I think this approach is awkward is because you’d typically want the representation of \(P(x_{t} \mid x_{t-1}, x_{t-2}, \ldots)\) to not care about whether the previous tokens were original inputs or not. That is, we want \(P(\text{the} | \text{the cat sat on})\) to be the same irrespective of whether “the cat sat on” are all inputs or whether “the cat sat” is the input and we just sampled “on”. In a transformer, they are different.

Robust to Token Ordering

The real problem that the transformer architecture solves is a form of robustness to token ordering. Let’s consider the following two input / output pairs:

  • the cat sat | on the mat
  • the yellow cat sat | on the mat

They are the same except for the extra “yellow” token in the second sentence.

If our model was made up of simple feed-forward layers, that would present a problem:

A feedforward layer contains an affine transform \(\boldsymbol{x}’ \gets \boldsymbol{A} \boldsymbol{x} + \boldsymbol{b}\) that learns a different mapping for every input. We can even write it out for three inputs:

\[\begin{matrix}x’_1 &\gets A_{11}x_1 + A_{12}x_2 + A_{13}x_3 + b_1 \\ x’_2 &\gets A_{21}x_1 + A_{22}x_2 + A_{23}x_3 + b_2 \\ x’_3 &\gets A_{31}x_1 + A_{32}x_2 + A_{33}x_3 + b_3\end{matrix}\]

If we learn something about “cat” being in the second position, we’d have to learn it all over again to handle the case where “cat” is in the third position.

Transformers are robust to this issue because of their use of attention. Put very simply, attention allows the neural network to learn when particular tokens are important in a position-independent way, such that they can be focused on when needed.

Transformers use scaled dot product attention. Here, we input three things:

  • a query \(\boldsymbol{q}\)
  • a key \(\boldsymbol{k}\)
  • a value \(\boldsymbol{v}\)

Each of these are vector embeddings, but they can be thought of as:

  • query – a representation of what we are asking for
  • key – how well the current token we’re looking at reflects what we’re asking for
  • value – how important it is that we get something that matches what we’re asking for

For example, if we have “the cat sat on the ____”, and we’re looking to fill in that last blank, it might be useful for the model to have learned a query for representing things that sit, and to value the result of that query a lot when we need to fill in a word for something that sits.

We take the dot product of the query and the key to measure how well they match: \(\boldsymbol{q}^T \boldsymbol{k}\). Each token has a key, so we end up with a measure of how well each of them matches:

Those measures can take on any values. Taking the softmax turns these measures into values that lie in [0,1], preserving relative size: (aside)

In this example, our activation vector has a large value (closer to 1) for “cat”.

Finally, we can take the dot product of our activation vector \(\boldsymbol{\alpha}\) with our value vector to get the overall attention value: \(\boldsymbol{\alpha} \cdot \boldsymbol{v}\).

Notice that where cat was in the list of tokens didn’t matter all to much. If we shift it around, but give it the same key, it will continue to produce the same activation value. Then, as long as the value is high anywhere that activation is active, we’ll get a large output.

Putting this together, our attention function for a single query \(\boldsymbol{q}\) is:

\[ \texttt{attention}(\boldsymbol{q}, \boldsymbol{k}^{(1)}, \ldots, \boldsymbol{k}^{(n)}, \boldsymbol{v}) = \texttt{softmax}\left(\boldsymbol{q}^T [\boldsymbol{k}^{(1)}, \ldots, \boldsymbol{k}^{(n)}]\right) \cdot \boldsymbol{v} \]

We can combine the keys together into a single matrix \(\boldsymbol{K}\), which simplifies things to:

\[ \texttt{attention}(\boldsymbol{q}, \boldsymbol{K}, \boldsymbol{v}) = \texttt{softmax}\left(\boldsymbol{q}^T \boldsymbol{K}\right) \cdot \boldsymbol{v} \]

We’re going to want a bunch of queries, not just one. That’s equivalent to expanding our query and value vectors into matrices:

\[ \texttt{attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \texttt{softmax}\left(\boldsymbol{Q}^T \boldsymbol{K}\right) \boldsymbol{V} \]

We’ve basically recovered the attention function given in the paper. I just has an additional scalar term that helps keep the logits passed into softmax smaller in magnitude:

\[\texttt{attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \texttt{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^T}{\sqrt{d_k}}\right) \boldsymbol{V}\]

where \(d_k\) is the dimension of the keys — i.e. how many features long the embeddings are.

The output is a matrix that has larger entries where our queries matched tokens and our value was large. That is, the transformer learned to ask for something and extract it out if it exists.

Robust to Vanishing Gradients

The other important problem that transformers solve is the vanishing gradient problem. The previous alternative to the transformer, the recurrent neural network (RNN), tends to suffer from this issue.

A recurrent neural network represents a sequence by taking as input the current token and a latent state:

This state is referred to as the RNN’s memory.

The vanishing gradients problem arises when we try to propagate gradient information far back in time (where far can just be a few tokens). If we want the network to learn to associate “mat” with “cat”, then we need to propagate through 4 instances of the RNN:

Simply put, in this example we’re taking the gradient of RNN(RNN(RNN(RNN(“cat”, state)). The chain rule tells us that the derivative of \(f(f(f(f(x))))\) is:

\[f'(x) \> f'(f(x)) \> f'(f(f(x))) \> f'(f(f(f(x))))\]

If \(f'(x)\) is smaller than 1, then we very quickly drive that gradient signal toward zero as we try to propagate it back further. The gradient vanishes!

Transformers solve this problem by having all of the inputs be mixed in the attention layers. This allows the gradient to readily flow across.

It also mitigates the problem using residual connections. These are the “skip connections” that you’ll see below. They provide a way for the gradient to flow up the network unimpeded, so it doesn’t decrease if it has to travel through a bunch of transformer layers.

Building a Transformer

Let’s use what we’ve learned and build a transformer, setting aside the fact that we haven’t covered multiheaded attention just yet. We’re going to construct the following input head:

This is analogous to a single “input encoder layer” in the transformer architecture. Note that the real deal uses multiheaded attention and has dropout after each layer norm.

The trainable parameters in our input head are the three projection matrices \(\boldsymbol{W}^Q\), \(\boldsymbol{W}^K\), and \(\boldsymbol{W}^V\), as well as the learnable params in the feedforward layer. Layer normalization is simply normalization along each input vector rather than along the batch dimension, which keeps things nicely scaled.

I’m going to implement this input head in Julia, using Flux.jl. The code is remarkably straightforward and self-explanatory:

struct InputHead
    q_proj::Dense
    k_proj::Dense
    v_proj::Dense
    norm_1::LayerNorm
    affine1::Dense
    affine2::Dense
    norm_2::LayerNorm
end

Flux.@functor InputHead

function (m::InputHead)(
    X::Array{Float32, 3}) # [dim × ntokens × batch_size]

    dim = size(X, 1)

    # scaled dot product attention
    Q = m.q_proj(X) # [dim × ntokens × batch_size]
    K = m.k_proj(X) # [dim × ntokens × batch_size]
    V = m.v_proj(X) # [dim × ntokens × batch_size]
    α = softmax(Q*K'./√Float32(dim), dims=1)
    X′ = α*V

    # add norm
    X = m.norm_1(X + X′)

    # feedforward
    X′ = m.affine2(relu(m.affine1(X)))
    
    # add norm
    X = m.norm_1(X + X′)

    return X
end

Once we have this, building the rest of the simplified transformer is pretty easy. Let’s similarly define an output head:

The output head is basically the same, except it has an additional attention + layer normalization section that receives its query from within the head but its keys and values from the input head. This is sometimes called cross attention, and allows the outputs to receive information from the inputs. Implementing this in Julia would be pretty straightforward, so I’m not covering it here.

In order to fully flesh out a basic transformer, we just have to define what happens at the outputs and inputs. The output is pretty simple. Its just a feedforward layer followed by a softmax:

The feedforward layer both provides a chance for everything to get properly mixed (since its fully connected), but more importantly, changes the tensor dimension from the embedding size to the number tokens. The softmax then ensures that the output represents probabilities.

The inputs, by which I mean all of the tokens, have to be turned into embedding matrices. Remember that there’s only a finite set of tokens. We associate a vector embedding with each token, which we re-use every time the token shows up. This embedding can be learned, in which case its the same as having our token be a one-hot vector \(x_\text{one-hot}\) and learning some matrix \(m\times n\) matrix \(E\) where \(m\) is the embedding dimension and \(n\) is the number of tokens:

\[z = E x_\text{one-hot}\]

We then add a positional encoding to each embedding. Wait, I thought we wanted the network to be robust to position? Well yeah, it is. But knowing where something shows up, and whether it shows up before or after something else is also important. So to give the model a way to determine that, we introduce a sort of signature or pattern to each embedding. The original paper uses a sinusoidal pattern:

\[z_\text{pos encode}(i)_j = \begin{cases} \sin\left(i / 10000^{2j / j_\text{max}}\right) & \text{if } j \text{ is even} \\ \cos\left(i / 10000^{2j / j_\text{max}}\right) & \text{otherwise} \end{cases}\]

for the \(j\)th entry of the position encoding at position \(i\).

What I’m not showing here is that the original paper again uses dropout after the position encoding is added in.

Flux already supports token embeddings, so let’s just use that. We can just generate our position encodings in advance:

function generate_position_encoding(dim::Int, max_sequence_length::Int)
    pos_enc = zeros(Float32, dim, max_sequence_length)

    for j in 0:2:(dim-1)
        denominator::Float32 = 10000.0^(2.0*(j ÷ 2)/dim)
        for i in 1:max_sequence_length
            pos_enc[j+1, i] = sin(i / denominator)
        end
    end

    for j in 1:2:(dim-1)
        denominator::Float32 = 10000.0^(2.0*(j ÷ 2)/dim)
        for i in 1:max_sequence_length
            pos_enc[j+1, i] = cos(i / denominator)
        end
    end

    return pos_enc
end

The overall input encoder is then simply:

struct InputEncoder
    embedding::Embedding # [vocab_size => dim]
    position_encoding::Matrix{Float32} # [dim × n_tokens]
    dropout::Dropout
end

Flux.@functor InputEncoder

function (m::InputEncoder)(tokens::Matrix{Int}) # [n_tokens, batch_size]
    X = m.embedding(tokens) # [dim × n_tokens × batch_size]
    X = X .+ m.position_encoding
    return m.dropout(X)
end

When it gets down to it, coding this stuff up really is just like stacking a bunch of lego bricks.

We can put it all together to get a super-simple transformer:

struct Transformer
    input_encoder::InputEncoder
    trans_enc::TransformerEncoderLayer
    trans_dec::TransformerDecoderLayer
    linear::Dense # [vocab_size × dim]
end

Flux.@functor Transformer

function Transformer(vocab_size::Int, dim::Int, n_tokens::Int;
                   hidden_dim_scale::Int = 4,
                   init = Flux.glorot_uniform,
                   dropout_prob = 0.0)
    input_encoder = InputEncoder(
        Flux.Embedding(vocab_size => dim),
        generate_position_encoding(dim, n_tokens),
        Dropout(dropout_prob))
    trans_enc = TransformerEncoderLayer(dim,
        hidden_dim_scale=hidden_dim_scale,
        bias=true, init=init, dropout_prob=dropout_prob)
    trans_dec = TransformerDecoderLayer(dim,
        hidden_dim_scale=hidden_dim_scale,
        bias=true, init=init, dropout_prob=dropout_prob)
    linear = Dense(dim => vocab_size, bias=true, init=init)
    return Transformer(input_encoder, trans_enc, trans_dec, linear)
end

function (m::Transformer)(
    input_tokens::Matrix{Int},
    output_tokens::Matrix{Int}) # [n_tokens, batch_size]

    X_in = m.input_encoder(input_tokens) # [dim × n_tokens × batch_size]
    X_out = m.input_encoder(output_tokens) # [dim × n_tokens × batch_size]
    E = m.trans_enc(X_in)
    X = m.trans_dec(X_out, E)
    logits = m.linear(X) # [vocab_size × n_tokens × batch_size]
    return logits
end

Note that this code doesn’t run a softmax because we can directly use the crossentropy loss on the logits, which is often more accurate.

Running this model requires generating datasets of tokens, where each token is an integer. We pass in our batch of token inputs and our batch of token outputs, and see what logits we get.

Masking

We’ve got a basic transformer that will take input tokens and produce output logits. We can train it to the maximize the likelihood of the expected future tokens. We’ll probably be able to perform pretty well on the training set, but if we go and try to actually generate sentences, it won’t do all that well.

Why? Because the current setup allows the neural network to attend to future information.

Recall that we’re trying to predict the next token given the tokens that came before:

\[P(x_{t} \mid x_{t-1}, x_{t-2}, \ldots)\]

The way our output head is currently constructed, we’re allowing tokens there to attend all of the tokens, which includes future output tokens.

Let’s revisit our attention function for a single query \(\boldsymbol{q}\):

\[ \texttt{attention}(\boldsymbol{q}, \boldsymbol{k}^{(1)}, \ldots, \boldsymbol{k}^{(n)}, \boldsymbol{v}) = \texttt{softmax}\left(\boldsymbol{q}^T [\boldsymbol{k}^{(1)}, \ldots, \boldsymbol{k}^{(n)}]\right) \cdot \boldsymbol{v} \]

We want to modify the attention function such that we do not consider the keys for tokens beyond a certain index, say index \(t\). That means we want softmax to disregard those tokens. To achieve that, we need their values to be \(-\infty\):

\[ \underset{\leq t}{\texttt{attention}}(\boldsymbol{q}, \boldsymbol{k}^{(1)}, \ldots, \boldsymbol{k}^{(n)}, \boldsymbol{v}) = \texttt{softmax}\left([\boldsymbol{q}^T \boldsymbol{k}^{(1)}, \ldots, \boldsymbol{q}^T \boldsymbol{k}^{(t)}, -\infty, \ldots, -\infty]\right) \cdot \boldsymbol{v} \]

One easy way to do this is to pass in an additional mask vector that is zero for \(i \leq t\) and negative infinity otherwise:

\[ \begin{aligned}\underset{\leq t}{\texttt{attention}}(\boldsymbol{q}, \boldsymbol{k}^{(1)}, \ldots, \boldsymbol{k}^{(n)}, \boldsymbol{v}) = \texttt{softmax}( & [\boldsymbol{q}^T \boldsymbol{k}^{(1)}, \ldots, \boldsymbol{q}^T \boldsymbol{k}^{(t)}, \boldsymbol{q}^T \boldsymbol{k}^{(t+1)}, \ldots, \boldsymbol{q}^T \boldsymbol{k}^{(n)}] + \\ & [0, \ldots, 0, -\infty, \ldots, -\infty]) \cdot \boldsymbol{v}\end{aligned} \]

or

\[ \texttt{attention}(\boldsymbol{q}, \boldsymbol{K}, \boldsymbol{v}, \boldsymbol{m}) = \texttt{softmax}( \boldsymbol{q}^T \boldsymbol{K} + \boldsymbol{m}) \cdot \boldsymbol{v} \]

We don’t use just one query, we use a bunch. If you’re paying attention, you noticed that we actually get one query per output-head token because our query matrix is:

\[\boldsymbol{Q} = \boldsymbol{X}\cdot \boldsymbol{W}^Q\]

So we want \(\boldsymbol{q}^{(1)}\) to use \(t = 1\), \(\boldsymbol{q}^{(2)}\) to use \(t = 2\), etc. This means that when we move to attention with multiple queries, we get:

\[ \texttt{attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}, \boldsymbol{M}) = \texttt{softmax}\left( \frac{\boldsymbol{Q} \boldsymbol{K}^T}{\sqrt{d_k}} + \boldsymbol{M}\right) \boldsymbol{V} \]

where \(\boldsymbol{M}\) is a lookahead mask with the upper right triangle set to negative infinity:

Check out this blog post for some additional nice coverage of self-attention masks.

Are We Done Yet?

Alright, we’ve got our baby transformer and we’ve added lookahead attention. Are we done?

Well, sort of. This is the minimum amount of stuff necessary to appropriately learn something that can call itself a transformer and that will probably sort-of work. All of the concepts are there. A real transformer will be the same, just bigger.

First, it will use multi-headed rather than single-headed attention. This just means that it does single-headed attention multiple times in parallel:

We could do that by running a \(\texttt{for}\) loop over multiple attention heads, but in practice it can be implemented by splitting our tensor into a new dimension according to the number of heads:

function (mha::MultiHeadAttention)(
    Q_in::Array{Float32, 3}, # [dim × ntokens × batch_size]
    K_in::Array{Float32, 3}, # [dim × ntokens × batch_size]
    V_in::Array{Float32, 3}, # [dim × ntokens × batch_size]
    mask::Array{Float32, 2}) # [ntokens × ntokens]

    dims = size(Q_in)
    dim = dims[1]

    # All matrices end up being [dim × ntokens × batch_size]
    Q = mha.q_proj(Q_in) # [dim × ntokens × batch_size]
    K = mha.k_proj(K_in) # [dim × ntokens × batch_size]
    V = mha.v_proj(V_in) # [dim × ntokens × batch_size]

    # Reshape to # [dim÷nheads × nheads × ntokens × batch_size]
    Q = reshape(Q, dim ÷ mha.nheads, mha.nheads, dims[2:end]...)
    K = reshape(K, dim ÷ mha.nheads, mha.nheads, dims[2:end]...)
    V = reshape(V, dim ÷ mha.nheads, mha.nheads, dims[2:end]...)

    # We're going to use batched_mul, which operates on the first 2 dimensions.
    # We want Q, K, and V to act as `nheads` separate attention heads, so we 
    # need to move the 'nheads' dimension out of the first 2 dimensions.
    Kp = permutedims(K, (3, 1, 2, 4)) # This effectively takes care of the transpose too
    Qp = permutedims(Q, (1, 3, 2, 4)) ./ √Float32(dim)
    logits = batched_mul(Kp, Qp) # [ntokens × ntokens × nheads × batch_size]

    # Apply the mask
    logits = logits .+ mask # [ntokens × ntokens × nheads × batch_size]

    # Compute the activations
    α = softmax(logits, dims=1) # [ntokens × ntokens × nheads × batch_size]

    # Run dropout on the activations
    α = mha.dropout(α) # [ntokens × ntokens × nheads × batch_size]

    # Multiply by V, again with a batched_mul
    Vp = permutedims(V, (1, 3, 2, 4))
    X = batched_mul(Vp, α)
    X = permutedims(X, (1, 3, 2, 4)) # [dim÷nheads × nheads × ntokens × batch_size]
    X = reshape(X, :, size(X)[3:end]...) # [dim × ntokens × batch_size]

    # Compute the outward projection
    return mha.out_proj(X) # [dim × ntokens × batch_size]
end

Note that in this code, the inputs are all X in the first multi-headed attention layer of the output head, whereas in the second one (called the cross attention layer), the queries Q_in are set to X whereas the keys and values are set to the output of the input head.

The other thing that a real transformer has is multiple transformer layers. That is, we repeat what we currently have as our input head and output head multiple times:

The Attention is All You Need paper used 8 parallel attention layers (attention heads) and 6 identical layers each in the encoder (input head) and decoder (output head).

Conclusion

I hope this post helps demystify the transformer architecture. They are the workhorse of modern large-scale deep learning, and its worth familiarizing yourself with them if you’re going to be working in that area. While complicated at first glance, they actually arise from fairly straightforward principles and solve some fairly practical and understandable problems. The information here should be enough to get started using them, or even roll your own from scratch.

Happy coding!

Skeletal Animation

In the previous post, I updated the sidescroller project with skeletal meshes — meshes that can be posed. In this post we animate those meshes.

As a fun fact, we’re about 4 months into this project and I’ve written about 5600 lines of code. It keeps piling up.

Animations

An animation defines where the \(m\) mesh vertices are as a function of time:

\[\vec{p}_i(t) \text{ for } i \text{ in } 1:m\]

That’s rather general, and would require specifying a whole lot of functions for a whole lot of mesh vertices. We’ll instead leverage the mesh skeletons from the last blog post, and define an animation using a series of poses, also called keyframes:

5 keyframes from the walk animation

A single keyframe defines all of the bone transforms that pose the model in a certain way. A series of them acts like a series of still frames in a movie – they act as snapshots of the mesh’s motion over time:

When we’re in-between frames, we have to somehow interpolate between the surrounding frames. The simplest and most common method is linear interpolation, where we linearly blend between the frames based on the fraction of time we are between them:

Here \(\alpha\) ranges from 0 at the previous frame and 1 at the next frame:

\[\alpha = \frac{t \> – \> t^{(k)}}{t^{(k+1)} – t^{(k)}}\]

Each frame is a collection of bone transforms. We could interpolate between the transform matrices directly, but it is more common to decompose those into position \(\vec{p}\), orientation \(\vec{q}\), and scale \(\vec{s}\), and interpolate between those instead. If we’re a fraction \(\alpha\) between frames \(f^{(k)}\) and \(f^{(k+1)}\), then the transform components for the \(i\)th bone are:

\[\begin{align} \vec{p}_i(\alpha) & = (1- \alpha) \vec{p}^{(k)}_i + \alpha \vec{p}^{(k+1)}_i \\ \vec{q}_i(\alpha) & = \texttt{slerp}(\vec{q}^{(k)}_i, \vec{q}^{(k+1)}_i, \alpha) \\ \vec{s}_i(\alpha) & = (1- \alpha) \vec{s}^{(k)}_i + \alpha \vec{s}^{(k+1)}_i \end{align}\]

You’ll notice that the position and scaling vectors use linear interpolation, but we’re using spherical linear interpolation for the rotation. This is why we decompose it – basic linear interpolation doesn’t work right for rotations.

Conceptually, animation is super simple. A keyframe just stores some components and a duration:

struct Keyframe {
    f32 duration;
    std::vector<glm::vec3> positions;  // for each bone
    std::vector<glm::quat> orientations;
    std::vector<glm::vec3> scales;
};

and an Animation is just a named list of keyframes:

struct Animation {
    std::string name;
    std::vector<Keyframe> frames;
};

If you want a simple animation, you slap down some spaced-out frames and let interpolation handle what happens in-between. If you want a nice, detailed animation, you can fill out your animation with a higher density of frames, or define special interpolation methods or easing functions to give you a better effect.

Posing a mesh is as simple as interpolating the keyframes and then applying its transforms to the mesh. To do that, its useful to keep track of how long our animation has been playing, and which frames we’re between. To this end we define an AnimationIndex:

struct AnimationIndex {
    f32 t_total;  // elapsed time since the start of the animation
    f32 t_frame;  // elapsed time since the start of the current frame
    int i_frame;  // index of the current frame
};

Advancing the index requires updating our times and the frame index:

AnimationIndex Advance(
    const AnimationIndex& idx,
    f32 dt,
    const Animation& anim)
{
    AnimationIndex retval;
    retval.t_total = idx.t_total + dt;
    retval.t_frame = idx.t_frame + dt;
    retval.i_frame = idx.i_frame;

    const int n_frames = (int)anim.frames.size();
    while (retval.i_frame < n_frames && 
           retval.t_frame > anim.frames[retval.i_frame].duration) {
        retval.t_frame -= anim.frames[retval.i_frame].duration;
        retval.i_frame += 1;
    }

    return retval;
}

The interpolation code almost isn’t worth showing, since its the same as the math:

void Interpolate(
    Keyframe* dst,
    const AnimationIndex& idx,
    const Animation& anim)
{
    int i_lo = idx.i_frame;
    int i_hi = idx.i_frame + 1;
    f32 alpha = 0.0;
    int n_frames = (int)anim.frames.size();
    if (idx.i_frame + 1 >= n_frames) {
        // Animation is done
        i_lo = i_hi = anim.frames.size() - 1;
        alpha = 0.0;
    } else {
        alpha = idx.t_frame / anim.frames[idx.i_frame].duration;
    }

    Interpolate(dst, anim.frames[i_lo], anim.frames[i_hi], alpha);
}

void Interpolate(
    Keyframe* dst,
    const AnimationIndex& idx,
    const Animation& anim)
{
    int i_lo = idx.i_frame;
    int i_hi = idx.i_frame + 1;
    f32 alpha = 0.0;
    int n_frames = (int)anim.frames.size();
    if (idx.i_frame + 1 >= n_frames) {
        // Animation is done
        i_lo = i_hi = anim.frames.size() - 1;
        alpha = 0.0;
    } else {
        alpha = idx.t_frame / anim.frames[idx.i_frame].duration;
    }

    Interpolate(dst, anim.frames[i_lo], anim.frames[i_hi], alpha);
}

We can use our interpolated keyframe to pose our mesh:

void PoseMeshToAnimationFrame(
    std::vector<glm::mat4>* bone_transforms_curr,
    std::vector<glm::mat4>* bone_transforms_final,
    const Mesh& mesh,
    const Keyframe& frame)
{
    size_t n_bones = bone_transforms_final->size();
    for (size_t i_bone = 0; i_bone < n_bones; i_bone++) {
        glm::vec3 pos = frame.positions[i_bone];
        glm::quat ori = frame.orientations[i_bone];
        glm::vec3 scale = frame.scales[i_bone];

        // Compute the new current transform
        glm::mat4 Sbone = glm::translate(glm::mat4(1.0f), pos);
        Sbone = Sbone * glm::mat4_cast(ori);
        Sbone = glm::scale(Sbone, scale);

        // Incorporate the parent transform
        if (mesh.bone_parents[i_bone] >= 0) {
            const glm::mat4& Sparent = 
                (*bone_transforms_curr)[mesh.bone_parents[i_bone]];
            Sbone = Sparent * Sbone;
        }

        // Store the result
        const glm::mat4& Tinv = mesh.bone_transforms_orig[i_bone];
        (*bone_transforms_curr)[i_bone] = Sbone;
        (*bone_transforms_final)[i_bone] = Sbone * Tinv;
    }
}

Adding Animations to the Game

Now that we have animations, we want to integrate them into the sidescroller game. We can do that!

These aren’t beautiful masterpieces or anything, since I clumsily made them myself, but hey – that’s scrappy do-it-to-learn-it gamedev.

We are now able to call StartAnimation on a RigComponent, specifying the name of the animation we want. For example, if the player loses contact with the ground, we can start either the falling or jumping animations with a code snippet like this:

if (!player_movable->support.is_supported) {
    // Fall or Jump!
    StartAnimation(player_rig, jump_pressed ? "jump" : "fall");
    game->player_state_enum = PlayerActionState::AIRBORNE;
}

This looks up the requested animation and sets our animation index to its initial value.

There is one additional thing we need to get right. Instantaneously transitioning to a new animation can look jarring. Watch what happens here were I start punching in the air:

The knight’s legs suddenly snap to their idle positions!

What we’d like instead is to blend seamlessly from the pose the model is at when StartAnimation is called to the poses specified by the new animation. We can already blend between any two poses, so we just leverage that knowledge to blend between the initial pose and the animation pose.

We specify a fadeout time \(t_\text{fadeout}\) whenever we start a new animation. For the first \(t_\text{fadeout}\) seconds of our animation, we linearly interpolate between our initial pose based on how much time has elapsed. If we happen to start in a pose close to what the animation plays from, this effect won’t be all that noticeable. If we happen to start in a wildly different pose we’ll move the mesh over more naturally.

Editing Animations

Animations contain a lot of data. There a multiple frames and a bunch of things bones can do per frame. I didn’t want to specify those transforms in code, nor manually in some sort of textfile.

I made a basic animation editor using ImGUI and used that to craft my animations:

Its fairly rudimentary, but gets the job done.

One could argue that I should have stuck with Blender’s animation tools. There are a few advantages to rolling my own editor:

  • Blender doesn’t run on my Linux laptop and keeps crashing
  • My model format has somewhat departed from Blender’s
  • I want to be able to edit hitspheres and hurtspheres down the road
  • My objective is to learn, not turn a profit or release a real game

The right trade-off will of course depend on your unique situation.

Conclusion

Going from pose-able meshes to animations was mainly about figuring out how to tie a bunch of poses together over time. Now that we have that, and a way to cleanly transition between animations, we’re pretty much set.

Animations in big games can be a lot more complicated. We might have overlapping animations for various parts of a mesh, such as facial animations that play simultaneously with body-level animations. We could implement inverse kinematics to help with automatically placing our feet at natural stepping locations (or when climbing to grasp at natural handholds). There are a bunch of fancier techniques that give you tighter control over the mesh too, but at the core of it, what we did here covers the fundamentals.

Next I plan to work on a basic combat system and get some other entities into the game. In order for this to be a proper level, we kind of need that. And maybe we’ll do some ladders.

Posing Meshes with OpenGL

This month I’m continuing the work on the side scroller and moving from static meshes to meshes with animations. This is a fairly sizeable jump as it involves a whole host of new concepts, including bones and armatures, vertex painting, and interpolation between keyframes. For reference, the Skeletal Animation post on learnopengl.com prints out to about 16 pages. That’s a lot to learn. Writing my own post about it helps me make sure I understand the fundamentals well enough to explain them to someone else.

Moving Meshes

Our goal is to get our 3D meshes to move. This primarily means adding having the ability to make our player mesh move in prescribed ways, such as jumping, crouching, attacking, and climbing ladders.

So far we’ve been working with meshes, i.e. vertex and face data that forms our player character. We want those vertices to move around. Our animations will specify how those vertices move around. That is, where the \(m\) mesh vertices are as a function of time:

\[\vec{p}_i(t) \text{ for } i \text{ in } 1:m\]

Unfortunately, a given mesh has a lot of vertices. The low-poly knight model I’m using has more than 1.5k of them, and its relatively tiny. Specifying where each individual vertex goes over time would be an excessive amount of work.

Not only would specifying an animation on a per-vertex level be a lot of work, it would be very slow. One of the primary advantages of a graphics card is that we can ship the data over at setup time and not have to send it all over again:

With the rendering we’re doing now, we just update a few small uniforms every frame to change the player position. The mesh data itself is already on the GPU, stored as a vertex buffer object.

So fully specifying the vertex positions over time is both extremely tedious and computationally expensive. What do we do instead?

When a character moves, many of their vertices typically move together. For example, if my character punches, we expect all of the vertices in their clenched fist to together, their forearm arm to follow. Their upper arm follows that, maybe with a bend, etc. Vertices thus tend to be grouped, and we can leverage the grouping for increased efficiency.

Bones

You know how artists sometimes have these wooden figures that they can bend and twist into poses? That’s the industry-standard way to do 3D animation.

We create an armature, which is a sort of rigid skeleton that we can use to pose the mesh. The armature is made up of rigid entities, bones, which can move around to produce the movement expected in our animation. There are far fewer bones than vertices. We then compute our vertex positions based on the bones positions – a vertex in the character’s fist is going to move with the fist bone.

This approach solves our problems. The bones are much easier to pose and thus build animations out of, and we can continue to send the mesh and necessary bone information to the GPU once at startup, and just send the updated bone poses whenever we render the mesh.

The bones in an armature have a hierarchical structure. Every bone has a single parent and any number of children, except the root bone, which has no parent.

Unlike my oh-so-fancy drawing, bones don’t actually take up space. They are actually just little reference frames. Each bone’s reference frame is given by a transformation \(T\) with respect to its parent. More specifically, \(T\) transforms a point in the bone’s frame to that of its parent.

For example, we can use this transform to get the “location” of a bone. We can get the position of the root bone by transforming the origin of its frame to its parent – the model frame. This is given by \(T_\text{root} \vec{o}\), where \(\vec{o}\) is the origin in homogenous coordinates: \([0,0,0,1]\). Our transforms use 4×4 matrices so that we can get translation, which is ubiquitous in 3D graphics.

Similarly, the position of one of the root bone’s children with transform \(T_\text{c}\) can be obtained by first computing its position in the root bone’s frame, \(T_\text{c} \vec{o}\), and then convert that to the model frame, \(T_\text{root} T_\text{c} \vec{o}\).

The order of operation matters! It is super important. It is what gives us the property that moving a leaf bone only affects that one bone, but moving a bone higher up in the hierarchy affects all child bones. If you ever can’t remember which order it is, and you can’t just quickly test it out in code, try composing two transforms together.

The origin of a bone 4 levels deep is:

\[T_\text{root} T_\text{c1} T_\text{c2} T_\text{c3} T_\text{c4} \vec{o}\]

Let’s call the aggregated transform for bone \(c4\), this product of its ancestors, \(\mathbb{T}_{c4}\).

It is these transformations that we’ll be changing in order to produce animations. Before we get to that, we have to figure out how the vertices move with the bones.

Transforming a Vertex

The vertices are all defined in our mesh. They have positions in model-space.

Each bone has a position (and orientation) in model space, as we’ve already seen.

Let’s consider what happens when we associate a vertex with a bone. That is, we “connect” the vertex to the bone such that, if the bone moves, the vertex moves with it. We expect the vertex to move along in the bone’s local frame:

Here the bone both translates to the right and up, and rotates counter-clockwise about 30 degrees. In the image, the vertex does the same.

The image above has the bone starting off at the origin, with its axis aligned with the coordinate axes. This means the vertex is starting off in the bone’s reference frame. To get the vertex into the bone frame (it is originally defined in the model frame), we have to multiply it by \(\mathbb{T}^{-1}\). Intuitively, if \(\mathbb{T} \vec{p}\) takes a point from bone space to model space then the inverse takes a point from model space to bone space.

The bone moved. That means it has a new transform relative to its parent, \(S\). Plus, that parent bone may have moved, and so forth. We have to compute a new aggregate bone transform \(\mathbb{S}\). The updated position of the vertex in model space is thus obtained by transforming it from its original model space into bone space with \(\mathbb{T}^{-1}\) and then transforming it back to model space according to the new armature configuration with \(\mathbb{S}\):

\[\vec v’ = \mathbb{S} \mathbb{T}^{-1} \vec v\]

We can visualize this as moving a vertex in the original mesh pose to the bone-relative space, and then transforming it back to model-space based on the new armature pose:

This means we should store \(\mathbb{T}^{-1}\) for each bone in the armature – we’re going to need it a lot. In any given frame we’ll just have to compute \(\mathbb{S}\) by walking down the armature. We then compute \(\mathbb{S} \mathbb{T}^{-1}\) and pass that to our shader to properly pose the model.

Bone Weights

The previous section let us move a vertex associated with a single bone. That works okay for very blocky models composed of rigid segments. For example, a stocky robot or simple car with spinning wheels. Most models are less rigid. Yes, if you punch someone you want the vertices in your fist to follow the hand bone, but as you extend your elbow the vertices near the joint will follow both the bone from the upper arm and the bone from the lower arm. We want a way to allow this to happen.

Instead of a vertex being associated with one bone, we allow a vertex to be associated with multiple bones. We have the final vertex combination be a mix of any number of other bones:

\[\vec v’ = \sum_{i=1}^m w^{(i)} \mathbb{S}_i \mathbb{T}_i^{-1} \vec v\]

where the nonnegative weights \(\vec w\) sum to one.

In practice, most vertices are associated with one one or two bones. It is common to allow up to 4 bone associations, simply because that covers most needs and then we can use the 4-dimensional vector types supported by OpenGL.

Data-wise, what this means is:

  • In addition to the original mesh data, we also need to provide an armature, which is a hierarchy of bones and their relative transformations.
  • We also need to associate each vertex with some set of bones. This is typically done via a vec4 of weights and an ivec4 of bone indices. We store these alongside the other vertex data.
  • The vertex data is typically static and can be stored on the graphics card in the vertex array buffer.
  • We compute the inverse bone transforms for the original armature pose on startup.
  • When we want to render a model, we pose the bones, compute the current bone transforms \(\mathbb{S}\), and then compute and sent \(\mathbb{S} \mathbb{T}^{-1}\) to the shader as a uniform.

Coding it Up

Alright, enough talk. How do we get this implemented?

We start by updating our definition for a vertex. In addition to the position, normal, and texture coordinates, we now also store the bone weights:

struct RiggedVertex {
    glm::vec3 position;   // location
    glm::vec3 normal;     // normal vector
    glm::vec2 tex_coord;  // texture coordinate

    // Up to 4 bones can contribute to a particular vertex
    // Unused bones must have zero weight
    glm::ivec4 bone_ids;
    glm::vec4 bone_weights;
};

This means we have to update how we set up the vertex buffer object:

glGenVertexArrays(1, &(mesh->vao));
glGenBuffers(1, &(mesh->vbo));

glBindVertexArray(mesh->vao);
glBindBuffer(GL_ARRAY_BUFFER, mesh->vbo);
glBufferData(GL_ARRAY_BUFFER, mesh->vertices.size() * sizeof(RiggedVertex),
                              &mesh->vertices[0], GL_STATIC_DRAW);

// vertex positions
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(RiggedVertex),
                      (void*)offsetof(RiggedVertex, position));
// vertex normals
glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(RiggedVertex),
                      (void*)offsetof(RiggedVertex, normal));
// texture coordinates
glEnableVertexAttribArray(2);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(RiggedVertex),
                      (void*)offsetof(RiggedVertex, tex_coord));
// bone ids (max 4)
glEnableVertexAttribArray(3);
glVertexAttribIPointer(3, 4, GL_INT, sizeof(RiggedVertex),
                       (void*)offsetof(RiggedVertex, bone_ids));
// bone weights (max 4)
glEnableVertexAttribArray(4);
glVertexAttribPointer(4, 4, GL_FLOAT, GL_FALSE, sizeof(RiggedVertex),
                       (void*)offsetof(RiggedVertex, bone_weights));

Note the use of glVertexAttribIPointer instead of glEnableVertexAttribArray for the bone ids. That problem took me many hours to figure out. It turns out that glVertexAttribPointer accepts integers but has the card interpret them as floating point, which messes everything up it you indent to actually use integers on the shader side.

As far as our shader goes, we are only changing where the vertices are located, not how they are colored. As such, we only need to update our vertex shader (not the fragment shader). The new shader is:

#version 330 core

layout (location = 0) in vec3 aPos;
layout (location = 1) in vec3 aNormal;
layout (location = 2) in vec2 aTexCoord;
layout (location = 3) in ivec4 aBoneIds; 
layout (location = 4) in vec4 aBoneWeights;

out vec3 pos_world;
out vec3 normal_world;
out vec2 tex_coords;

const int MAX_BONES = 100;
const int MAX_BONE_INFLUENCE = 4;

uniform mat4 model; // transform from model to world space
uniform mat4 view; // transform from world to view space
uniform mat4 projection; // transform from view space to clip space
uniform mat4 bone_transforms[MAX_BONES]; // S*Tinv for each bone

void main()
{
    // Accumulate the position over the bones, in model space
    vec4 pos_model = vec4(0.0f);
    vec3 normal_model = vec3(0.0f);
    for(int i = 0 ; i < MAX_BONE_INFLUENCE; i++)
    {
        j = aBoneIds[i];
        w = aBoneWeights[j]
        STinv = bone_transforms[j];
        pos_model += w*(STinv*vec4(aPos,1.0f));
        normal_model += w*(mat3(STinv)*aNormal);
    }

    pos_world = vec3(model * pos_model);
    normal_world = mat3(model) * normal_model;
    gl_Position = projection * view * model * pos_model;
    tex_coords = vec2(aTexCoord.x, 1.0 - aTexCoord.y);
}

Note that the normal vector doesn’t get translated, so we only use mat3. The vertice’s texture coordinate is not affected.

Testing it Out

Let’s start with something simple, a 2-triangle, 2-bone system with 5 vertices:

The root bone is at the origin (via the identity transform) and the child bone is at [2,1,0]. The vertices of the left triangle are all associated only with the root bone and the rightmost two vertices are only associated with the child bone.

If we slap on a sin transform for the child bone, we wave the right triangle:

If we instead slap a sin transform on the root bone, we wave both triangles as a rigid unit:

We can apply a sin to both triangles, in which case the motions compose like they do if you bend both your upper and lower arms:

We can build a longer segmented arm and move each segment a bit differently:

Posing a Model

Now that we’ve built up some confidence in our math, we can start posing our model (The low-poly knight by Davmid, available here on Sketchfab) The ideas are exactly the same – we load the model, define some bone transforms, and then use those bone transforms to render its location.

I created an animation editing mode that lets us do this posing. As before, the UI stuff is all done with Dear ImGUI. It lets you select your model, add animations, add frames to those animations, and adjust the bone transforms:

A single frame represents an overall model pose by defining the transformations for each bone:

struct Keyframe {
    f32 duration;
    std::vector<glm::vec3> positions;
    std::vector<glm::quat> orientations;
    std::vector<glm::vec3> scales;
};

The position, orientation, and scale are stored separately, mainly because it is easier to reason about them when they are broken out like this. Later, when we do animation, we’ll have to interpolate between frames, and it is also easier to reason about interpolation between these components rather than between the overall transforms. A bone’s local transform \(S\) is simply the combination of its position, orientation, and scale.

We calculate our fine bone transforms \(\mathbb{S} \mathbb{T}^{-1}\) for our frame and then pass that off to the shader for rendering:

size_t n_bones = mesh->bone_transforms_final.size();
for (size_t i_bone = 0; i_bone < n_bones; i_bone++) {
    glm::vec3 pos = frame.positions[i_bone];
    glm::quat ori = frame.orientations[i_bone];
    glm::vec3 scale = frame.scales[i_bone];

    // Compute the new current transform
    glm::mat4 Sbone = glm::translate(glm::mat4(1.0f), pos);
    Sbone = Sbone * glm::mat4_cast(ori);
    Sbone = glm::scale(Sbone, scale);

    if (mesh->bone_parents[i_bone] >= 0) {
        const glm::mat4& Sparent =
            mesh->bone_transforms_curr[mesh->bone_parents[i_bone]];
        Sbone = Sparent * Sbone;
    }

    mesh->bone_transforms_curr[i_bone] = Sbone;

    glm::mat4 Tinv = mesh->bone_transforms_orig[i_bone];
    mesh->bone_transforms_final[i_bone] = Sbone * Tinv;
}

Conclusion

When you get down to it, the math behind rigged meshes isn’t all too bad. The main idea is that you have bones with transforms relative to their parent bone, and that vertices are associated with a combination of bones. You have to implement rigged meshes that store your bones and vertex weights, need a shader for rigged meshes, and a way to represent your model pose.

This main stuff doesn’t take too long to implement in theory, but this post took me way longer to develop than normal because I wrote a bunch of other supporting code (plus banged my head against the wall when things didn’t initially work – its hard to debug a garbled mess).

Adding my own animation editor meant I probably wanted to save those animations to disk. That meant I probably wanted to save my meshes to disk too, which meant saving the models, textures, and materials. Getting all that set up took time, and its own fair share of debugging. (I’m using a WAD-like data structure like I did for TOOM).

The animation editor also took some work. Dear ImGUI makes writing UI code a lot easier, but adding all of the widgets and restructuring a lot of my code to be able to have separate application modes for things like gameplay and the animation editor simply took time. The largest time suck (and hit to motivation) for a larger project like this is knowing you have to refactor a large chunk of it to make forward progress. Its great to reach the other end though and to have something that makes sense.

We’re all set up with posable models now, so the next step is to render animations. Once that’s possible we’ll be able to tie those animations to the gameplay — have the model walk, jump, and stab accordingly. We’ll then be able to attach other meshes to our models, i.e. a sword to its hand, and tie hitspheres and hurtspheres to the animation, which will form the core of our combat system.

Rendering with OpenGL

In the previous post I introduced a new side scroller project, and laid out some of the things I wanted to accomplish with it. One of those points was to render with the graphics card. I had never worked directly with 3D assets and graphics cards before. Unless some high-level API was doing rendering for me under the hood, I wasn’t harnessing the power of modern graphics pipelines.

For TOOM I was not using the graphics card. In fact, the whole point of TOOM was to write a software renderer, which only uses the CPU for graphics. That was a great learning experience, but there is a reason that graphics cards exist, and that reason pretty much is that software renderers spend a lot of time rendering pixels when it’d be a lot nicer to hand that work off to something that can churn through it more efficiently.

Before starting, I thought that learning to render with the graphics card would more-or-less boil down to calling render(mesh). I assumed the bulk of the benefit would be faster calls. What I did not anticipate was learning how rich render pipelines can be, with customized shaders and data formats opening up a whole world of neat possibilities. I love it when building a basic understanding of something suddenly causes all sorts of other things to click. For example, CUDA used to be an abstract, scary performance thing that only super smart programmers would do. Now that I know a few things, I can conceptualize how it works. It is a pretty nice feeling.

Learning OpenGL

I learned OpenGL the way many people do. I went to learnopengl.com and followed their wonderful tutorials. It took some doing, and it was somewhat onerous, but it was well worth it. The chapters are well laid out and gradually build up concepts.

This process had me install a few additional libraries. First off, I learned that one does not commonly directly use OpenGL. It involves a bunch of function pointers corresponding to various OpenGL versions, and getting the right function pointers from the graphics driver can be rather tedious. Thus, one uses a service like GLAD to generate the necessary code for you.

As recommended by the tutorials, I am using stb_image by the legendary Sean Barrett for loading images for textures, and assimp for loading 3D asset files. I still find it hilarious that they can get away with basically calling that last one Butt Devil. Lastly, I’m using the OpenGL mathematics library (glm) for math types compatible with OpenGL. I am still using SDL2 but switched to the imgui_impl_opengl3 backend. These were all pretty easy to use and install — the tutorials go a long way and the rest is pretty self-explanatory.

I did run into an issue with OpenGL where I didn’t have a graphics driver new enough to support the version that learnopengl was using (3.3). I ended up being able to resolve it by updating my OS to Ubuntu 22.04. I had not upgraded in about 4 years on account of relying on this machine to compile the textbooks, but given that they’re out I felt I didn’t have to be so paranoid anymore.

In the end, updating to Jammy Jellyfish was pretty easy, except it broke all of my screen capture software. I previously used Shutter and Peek to take screenshots and capture GIFs. Apparently that all went away with Wayland, and now I need to give Flameshot permission every time I take a screenshot and I didn’t have any solution for GIF recordings, even if login with Xorg.

In writing this post I buckled down and tried Kooha again, which tries to record but then hangs forever on save (like Peek). I ended up installing OBS Studio, recording a .mp4, and then converting that to a GIF in the terminal via ffmpeg:

ffmpeg -ss 1 -i input.mp4 \
  -vf "fps=30,scale=320:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" \
   -loop 0 2d_to_3d.gif

This was only possible because of this excellent stack overflow answer and this OBS blog post. The resulting quality is actually a lot better than Peek, and I’m finding I have to adjust some settings to make the file size reasonable. I suppose its nice to be in control, but I really liked Peek’s convenience.

Moving to 3D

The side scroller logic was all done in a 2D space, but now rendering can happen in 3D. The side scroller game logic will largely remain 2D – the player entities will continue to move around 2D world geometry:

but that 2D world geometry all exists at depth \(z=0\) with respect to a perspective-enabled camera:

Conceptually, this is actually somewhat more complicated than simply calling DrawLine for each edge, which is what I was doing before via SDL2:

for edge in mesh
    DrawLine(edge)

Rendering in 3D requires:

  • setting up a vertex buffer and filling it with my mesh data, including color
  • telling OpenGL how to interpret said data
  • using a vertex shader that takes the position and color attributes and transforms them via the camera transformations into screen space
  • using a dead-simple fragment shader that just passes on the received vertex colors
  • calling glDrawArrays with GL_LINES to execute all of this machinery

This is not hard to do per se, but it is more complicated. One nice consequence is that there is only a single draw call for all this mesh data rather than a bunch of independent line draw calls. The GPU then crunches it all in parallel:

The coordinate transforms for this are basically what learnopengl recommends:

  • an identity model matrix since the 2D mesh data is already in the world frame
  • a view matrix obtained via glm::lookat based on a camera position and orientation
  • a projection matrix obtained via glm::perspective with a 45-degree viewing angle, our window’s aspect ratio, and \(z\) bounds between 0.1 and 100:
glm::mat4 projection = glm::perspective(
     glm::radians(45.0f),
     screen_size_x / screen_size_y,
     0.1f, 100.0f);

This setup worked great to get to parity with the debug line drawing I was doing before, but I wanted to be using OpenGL for some real mesh rendering. I ended up defining two shader programs in addition to the debug-line-drawing one: one for textured meshes and one for material-based meshes. I am now able to render 3D meshes!

This low-poly knight was created by Davmid, and is available here on Sketchfab. The rest of the assets I actually made myself in Blender and exported to FBX. I had never used Blender before. Yeah, its been quite a month of learning new tools.

The recording above also includes a flashlight effect. This was of course based on the learnopengl tutorials too.

Rendering with shaders adds a ton of additional options for effects, but also comes with a lot of decisions and tradeoffs around how data is stored. I would like to simplify my approach and end up with as few shaders as possible. For now its easiest for me to make models in Blender with basic materials, but maybe I can move the material definitions to a super-low-res texture (i.e., one pixel per material) and then only use textured models.

Architecture

It is perhaps worth taking a step back to talk about how this stuff is architected. That is, how the data structures are arranged. As projects get bigger, it is easier to get lost in the confusion, and a little organization goes a long way.

First off, the 3D stuff is mostly cosmetic. It exists to look good. It is thus largely separate from the underlying game logic:

I created a MeshManager class that is responsible for storing my mesh-related data. It currently keeps track of four things:

  1. Materials – color properties for untextured meshes
  2. Textures – diffuse and specular maps for textured meshes
  3. Meshes – the 3D data associated with a single material or texture
  4. Models – a named 3D asset that consists of 1 or more meshes

I think some structs help get the idea across:

struct Vertex {
    glm::vec3 position;   // location
    glm::vec3 normal;     // normal vector
    glm::vec2 tex_coord;  // texture coordinate
};
struct Material {
    std::string name;
    glm::vec3 diffuse;   // diffuse color
    glm::vec3 specular;  // specular color
    f32 shininess;       // The exponent in the Phong specular equation
};
struct Texture {
    GLuint id;             // The OpenGL texture id
    std::string filepath;  // This is essentially its id
};
struct Mesh {
    std::string name;  // The mesh name

    std::vector<Vertex> vertices;

    // Indices into `vertices` that form our mesh triangles
    std::vector<u32> indices;

    // An untextured mesh has a single material, typemax if none.
    // If the original mesh has multiple materials, 
    // the assimp import process splits them.
    u16 material_id;

    // A textured mesh has both a diffuse and a specular texture
    // (typemax if none)
    u16 diffuse_texture_id;
    u16 specular_texture_id;

    GLuint vao;  // The OpenGL vertex array object id
    GLuint vbo;  // The OpenGL vertex buffer id associated with `vertices`
    GLuint ebo;  // The element array buffer associated with `indices`
};
struct Model {
    std::string filepath;       // The filepath for the loaded model
    std::string name;           // The name of the loaded model
    std::vector<u16> mesh_ids;  // All mesh ids
};

Visually, this amounts to:

All of this data can be loaded from the FBX files via assimp. In the future, I’ll also be writing this data out and back in via my own game data format.

My game objects are currently somewhat of a mess. I have an EntityManager that maintains a set of entities:

struct Entity {
    u32 uid;  // Unique entity id
    u32 flags;

    common::Vec2f pos;               // Position in gamespace

    // A dual quarter edge for the face containing this position
    core::QuarterEdgeIndex qe_dual;

    u16 movable_id;       // The movable id, or 0xFFFF otherwise
    u16 collider_id;      // The collider id, or 0xFFFF otherwise
};

Entities have unique ids and contain ids for the various resources they might be associated with. For now this includes a movable id and a collider id. I keep those in their own managers as well.

A movable provides information around how an entity is moving. Right now it is just just a velocity vector and some information about what the entity is standing on.

A collider is a convex polygon that represents the entity’s shape with respect to collision with the 2D game world:

The player is the only entity that currently does anything, and there is only one collider size from which the Delaunay mesh is built. I’ll be cleaning up the entity system in the future, but you can see where its headed – it is a basic entity component system.

Editing Assets

I want to use these meshes to decorate my game levels. While I could theoretically create a level mesh in Blender, I would much prefer to be able to edit the levels in-game using smaller mesh components.

I created a very basic version of this using ImGUI wherein one can create and edit a list of set pieces:

This is leaps and bounds better than editing the level using a text file, or worse, hard-coding it. At the same time, it is pretty basic. I don’t highlight the selected set piece, I don’t have mouse events hooked up for click and drag, and my camera controls aren’t amazing. I do have some camera controls though! And I can set the camera to orthographic, which helps a lot:

The inclusion of this set piece editor meant that I needed a way to save and load game data. (You can see the import and export buttons on the top-left). I went with the same WAD-file-like approach that I used for TOOM, which is a binary finally consisting of a basic header, a list of binary blobs, followed by a table of contents:

Each blob consists of a name and a bunch of data. It is up to me to define how to serialize or deserialize the data associated with a particular blob type. Many of them just end up being a u32 count followed by a list of structs.

One nice result of using a WAD-like binary is that I don’t always have to copy the data out to another data structure. With TOOM, I was copying things out in the C++ editor but was referencing the data as-is directly from the binary in the C code that executed the game. Later down the road I’ll probably do something similar with a baked version of the assets that the game can refer to.

There are some tradeoffs. Binary files are basically impossible to edit manually. Adding new blob types doesn’t invalidate old save files, but changing how a blob type is serialized or deserialized does. I could add versions, but as a solo developer I haven’t found the need for it yet. I usually update the export method first, run my program and load using the old import method, export with the new method, and then update my import method. This is a little cumbersome but works.

With TOOM I would keep loading and overwriting a single binary file. This was fine, but I was worried that if I ever accidentally corrupted it, I would have a hard time recreating all of my assets. With this project I decided to export new files every time. I have a little export directory and append the timestamp to the filename, ex: sidescroller_2023_12_22_21_28_55.bin. I’m currently loading the most recent one every time, but I still have older ones available if I ever bork things.

Conclusion

This post was a little less targeted than normal. I usually talk about how a specific thing works. In this case, I didn’t want to re-hash how OpenGL works, and instead covered a somewhat broad range of things I learned and did while adding meshes to my game project. A lot of things in gamedev are connected, and by doing it I’m learning what the implications of various decisions are and oftentimes, why things are as they are.

There are a lot of concepts to learn and tools to use. The last month involved:

  • OpenGL
  • GLAD
  • assimp
  • stb_image
  • Blender
  • OBS Studio
  • textures
  • Phong lighting
  • shaders

I am often struck by how something foreign and arcane like using a GPU seems mysterious and hard to do from the outset, but once you dive in you can pull back the curtains and figure it out. Sure, I still have a ton to learn, but the shape of the thing is there, and I can already do things with it.

Happy coding, folks.