I spent most of December on vacation, so no big post. However, I did learn about Emscripten, a C++ compiler that produces an executable in WebAssembly. I tried it out and have to say, it was remarkably easy to get something up and running.
I made a very simple crossword game. You can try it out here.
Best on desktop. While it does load in the browser, it is looking for key events, which is really cumbersome on mobile.
Compilation is pretty straightforward. Instead of running gcc <srcs> <libs> you run emcc <srcs> <libs>. The compiler does some magic to reinterpret use of OpenGL with WebGL, and spits out a .wasm binary blob, a .js script that can execute it, and a .html page that runs it all. My game link above is simply to that .html page.
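For example, a build of a small OpenGL game might look something like this (the flags here are illustrative rather than my exact invocation, and will depend on which libraries you use):
emcc main.cpp game.cpp -O2 -s USE_SDL=2 -s FULL_ES3=1 --preload-file assets -o index.html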
To test it locally without uploading it to a web page, you can simply start a Python server in your dev directory:
python -m http.server
and then open your .html file in the browser:
http://localhost:8000/your_thing.html
I spend a lot of time tinkering with games and other programs in C++, but I’m developing on a Linux machine. As a result, I can’t really share my games with my friends. Emscripten might be a nice way to achieve that.
Last month I wrote about moving the Sokoban policy training code from CPU to GPU, yielding massive speedups. (Training time dropped by about 32x for the base case, and it scales much better with model size.) That significantly shortened both training time and the time it takes to compute basic validation metrics. It has not, unfortunately, significantly changed how long it takes to run rollouts, and relatedly, how long it takes to run beam search.
The Bottleneck
The training that I’ve done so far has all been with teacher forcing, which allows all inputs to be passed to the net at once:
When we do a rollout, we can’t pass everything in at once. We start with our initial state and use the policy to discover where we end up:
The problem is that the left-side of that image, the policy call, is happening on the GPU, but the right side, the state advancement, is happening on the CPU. If a rollout involves 62 player steps, then instead of one data transfer step like we have for training, we’re doing 61 transfers! Our bottleneck is all that back-and-forth communication:
Let’s move everything to the GPU.
CPU Code
So what is currently happening on the CPU?
At every state, we are:
Sampling an action for each board from the action logits
Applying that action to each board to advance the state
Sampling from the actions is pretty straightforward to run on the GPU. That’s the bread and butter of transformers and RL in general.
# policy_logits are [a×s×b] (a=actions, s=sequence length, b = batch size)
policy_logits, nsteps_logits = policy(inputs)
# Sample from the logits using the Gumbel-max trick
sampled_actions = argmax(policy_logits .+ gumbel_noise, dims=1)
where we use the Gumbel-max trick and the Gumbel noise is sampled in advance and passed to the GPU like the other inputs:
using Distributions
gumbel_noise = rand(Gumbel(0, 1), (a, s, b))
Advancing the board states is more complicated. Here is the CPU method for a single state:
function maybe_move!(board::Board, dir::Direction)::Bool
□_player::TileValue=find_player_tile(board)
step_fore = get_step_fore(board, dir)
□ = □_player # where the player starts
▩ = □ + step_fore # where the player potentially ends up
if is_set(board[▩], WALL)
return false # We would be walking into a wall
end
if is_set(board[▩], BOX)
# We would be walking into a box.
# This is only a legal move if we can push the box.
◰ = ▩ + step_fore # where box ends up
if is_set(board[◰], WALL + BOX)
return false # We would be pushing the box into a box or wall
end
# Move the box
board[▩] &= ~BOX # Clear the box
board[◰] |= BOX # Add the box
end
# At this point we have established this as a legal move.
# Finish by moving the player
board[□] &= ~PLAYER # Clear the player
board[▩] |= PLAYER # Add the player
return true
end
There are many ways to represent board states. This representation is a simple Matrix{UInt8}, so an 8×8 board is just an 8×8 matrix. Each tile is a bitfield with bits that can be set for whether that tile has/is a wall, box, floor, or player.
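For concreteness, a sketch of that representation might look like this (the specific bit values and flag set are assumptions for illustration, not necessarily what my code uses):
const TileValue = UInt8
const Board = Matrix{TileValue}
# Illustrative tile flags
const WALL   = 0x01
const FLOOR  = 0x02
const BOX    = 0x04
const PLAYER = 0x08
# True if any of the given flags are set on the tile
is_set(tile::TileValue, flags::TileValue) = (tile & flags) != 0x00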
Moving the player has 3 possible paths:
successful step: the destination tile is empty and we just move the player to it
successful push: the destination tile has a box, and the next one over is empty, so we move both the player and the box
failed move: otherwise, this is an illegal move and the player stays where they are
Moving this logic to the GPU has to preserve this flow, use the GPU’s representation of the board state, and handle a tensor’s worth of board states at a time.
GPU Representation
The input to the policy is a tensor of size \([h \times w \times f \times s \times b]\), where one board is encoded as a sparse \((h = \text{height}) \times (w=\text{width}) \times (f = \text{num features} = 5)\) tensor (we don't really need floors as a feature, since floor is just not-wall; I originally thought it would be useful for the model, and can try ablating it in the future to see):
and we have board sequences of length \(s\) and \(b\) sequences per batch of them:
I purposely chose 4-step boards here, but sequences can generally be much longer and of different lengths, and the first state in each sequence is the goal state.
Our actions will be the \([4\times s \times b]\) actions tensor — one up/down/left/right action per board state.
Shifting Tensors
The first fundamental operation we’re going to need is to be able to check tile neighbors. That is, instead of doing this:
□ = □_player # where the player starts
▩ = □ + step_fore # where the player potentially ends up
we’ll be shifting all tiles over and checking that instead:
The shift_tensor method takes in a tensor and shifts it by the given number of rows and columns, padding in new values:
We pass in the number of rows or columns to shift, figure out what that means in terms of padding, and then leverage NNlib’s pad_constant method to give us a new tensor that we clamp to a new range:
function shift_tensor(
tensor::AbstractArray,
d_row::Integer,
d_col::Integer,
pad_value)
pad_up = max( d_row, 0)
pad_down = max(-d_row, 0)
pad_left = max( d_col, 0)
pad_right = max(-d_col, 0)
tensor_padded = NNlib.pad_constant(
tensor,
(pad_up, pad_down, pad_left, pad_right,
(0 for i in 1:2*(ndims(tensor)-2))...),
pad_value)
dims = size(tensor_padded)
row_lo = 1 + pad_down
row_hi = dims[1] - pad_up
col_lo = 1 + pad_right
col_hi = dims[2] - pad_left
return tensor_padded[row_lo:row_hi, col_lo:col_hi,
(Colon() for d in dims[3:end])...]
end
This method works on tensors with varying numbers of dimensions, and always operates on the first two dimensions as the row and column dimensions.
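As a quick sanity check of the shifting behavior (a toy example on a small integer matrix, assuming NNlib is loaded), shifting down by one row pads the top with the fill value:
using NNlib
A = [1 2 3;
     4 5 6;
     7 8 9]
shift_tensor(A, 1, 0, 0)
# 3×3 Matrix{Int64}:
#  0  0  0
#  1  2  3
#  4  5  6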
Taking Actions
If we know the player move, we can use the appropriate shift direction to get the “next tile over”. Our player moves can be reflected by the following row and column shift values:
UP    = (d_row=-1, d_col= 0)
DOWN  = (d_row=+1, d_col= 0)
LEFT  = (d_row= 0, d_col=-1)
RIGHT = (d_row= 0, d_col=+1)
This lets us convert the CPU-movement code into a bunch of Boolean tensor operations:
function advance_boards(
inputs::AbstractArray{Bool}, # [h,w,f,s,b]
d_row::Integer,
d_col::Integer)
boxes = inputs[:,:,DIM_BOX, :,:]
player = inputs[:,:,DIM_PLAYER,:,:]
walls = inputs[:,:,DIM_WALL, :,:]
player_shifted = shift_tensor(player, d_row, d_col, false)
player_2_shift = shift_tensor(player_shifted, d_row, d_col, false)
# A move is valid if the player destination is empty
# or if it's a box and the next space over is empty
not_box_or_wall = .!(boxes .| walls)
# 1 if it is a valid player destination tile for a basic player move
move_space_empty = player_shifted .& not_box_or_wall
# 1 if the tile is a player destination tile containing a box
move_space_isbox = player_shifted .& boxes
# 1 if the tile is a player destination tile whose next one over
# is a valid box push receptor
push_space_empty = player_shifted .& shift_tensor(not_box_or_wall, -d_row, -d_col, false)
# 1 if it is a valid player move destination
move_mask = move_space_empty
# 1 if it is a valid player push destination
# (which also means it currently has a box)
push_mask = move_space_isbox .& push_space_empty
# new player location
mask = move_mask .| push_mask
player_new = mask .| (player .* shift_tensor(.!mask, -d_row, -d_col, false))
# new box location
box_destinations = shift_tensor(boxes .* push_mask, d_row, d_col, false)
boxes_new = (boxes .* (.!push_mask)) .| box_destinations
return player_new, boxes_new
end
The method appropriately moves any player tile that has an open space in the neighboring tile, or any player tile that has a neighboring pushable box. We create both a new player tensor and a new box tensor.
This may seem extremely computationally expensive — we’re operating on all tiles rather than on just the ones we care about. But GPUs are really good at exactly this, and it is much cheaper to let the GPU churn through that than wait for the transfer to/from the CPU.
The main complication here is that we’re using the same action across all boards. In a given instance, there are \(s\times b\) boards in our tensor. We don’t want to be using the same action in all of them.
Instead of sharding different actions to different boards, we’ll compute the results of all 4 actions and then index into the resulting state that we need:
Working with GPUs sure makes you think differently about things.
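Here is a minimal sketch of how that dispatch could work (hedged: the action-index-to-direction mapping and the rebuild of the full feature tensor are assumptions about details not shown above):
# Action indices 1:4 mapped to (d_row, d_col); the exact ordering is an assumption.
const ACTION_DIRS = ((-1, 0), (+1, 0), (0, -1), (0, +1)) # up, down, left, right
function advance_boards(
    inputs::AbstractArray{Bool},       # [h,w,f,s,b]
    actions::AbstractArray{<:Integer}) # [s,b]
    s, b = size(actions)
    player_new = nothing
    boxes_new = nothing
    for (ai, (d_row, d_col)) in enumerate(ACTION_DIRS)
        p_ai, x_ai = advance_boards(inputs, d_row, d_col) # each [h,w,s,b]
        # Keep this direction's result only for boards whose sampled action is ai
        sel = reshape(actions .== ai, 1, 1, s, b)
        p_ai = p_ai .& sel
        x_ai = x_ai .& sel
        player_new = player_new === nothing ? p_ai : player_new .| p_ai
        boxes_new  = boxes_new  === nothing ? x_ai : boxes_new  .| x_ai
    end
    # Reassemble the full feature tensor; walls and other static features are unchanged.
    inputs_new = copy(inputs)
    inputs_new[:, :, DIM_PLAYER, :, :] = player_new
    inputs_new[:, :, DIM_BOX, :, :] = boxes_new
    return inputs_new
end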
We’re almost there. So far this advances each board where it sits. To get the new inputs tensor, we also want to shift our boards in the sequence dimension, propagating successor boards to the next sequence index. However, we can’t just shift the entire tensor. We want to keep the goals and the initial states:
The code for this amounts to a cat operation and some indexing:
function advance_board_inputs(
inputs::AbstractArray{Bool}, # [h,w,f,s,b]
actions::AbstractArray{Int}) # [s,b]
inputs_new = advance_boards(inputs, actions)
# Right shift and keep the goal and starting state
return cat(inputs[:, :, :, 1:2, :],
inputs_new[:, :, :, 2:end-1, :], dims=4) # [h,w,f,s,b]
end
And with that, we’re processing actions across entire batches!
Rollouts on the GPU
We can leverage this new propagation code to propagate our inputs tensor during a rollout. The policy and the inputs have to be on the GPU, which in Flux.jl can be done with gpu(policy). Note that this requires a CUDA-compatible GPU.
A single iteration is then:
# Run the model
# policy_logits are [4 × s × b]
# nsteps_logits are [7 × s × b]
policy_logits_gpu, nsteps_logits_gpu = policy0(inputs_gpu)
# Sample from the action logits using the Gumbel-max trick
actions_gpu = argmax(policy_logits_gpu .+ gumbel_noise_gpu, dims=1)
actions_gpu = getindex.(actions_gpu, 1) # Int64[1 × s × b]
actions_gpu = dropdims(actions_gpu, dims=1) # Int64[s × b]
# Apply the actions
inputs_gpu = advance_board_inputs(inputs_gpu, actions_gpu)
The overall rollout code just throws this into a loop and does some setup:
function rollouts!(
inputs::Array{Bool, 5}, # [h×w×f×s×b]
gumbel_noise::Array{Float32, 3}, # [4×s×b]
policy0::SokobanPolicyLevel0,
s_starts::Vector{Board}, # [b]
s_goals::Vector{Board}) # [b]
policy0 = gpu(policy0)
h, w, f, s, b = size(inputs)
@assert length(s_starts) == b
@assert length(s_goals) == b
# Fill the goals into the first sequence channel
for (bi, s_goal) in enumerate(s_goals)
set_board_input!(inputs, s_goal, 1, bi)
end
# Fill the start states in the second sequence channel
for (bi, s_start) in enumerate(s_starts)
set_board_input!(inputs, s_start, 2, bi)
end
inputs_gpu = gpu(inputs)
gumbel_noise_gpu = gpu(gumbel_noise)
for si in 2:s-1
# Run the model
# policy_logits are [4 × s × b]
# nsteps_logits are [7 × s × b]
policy_logits_gpu, nsteps_logits_gpu = policy0(inputs_gpu)
# Sample from the action logits using the Gumbel-max trick
actions_gpu = argmax(policy_logits_gpu .+ gumbel_noise_gpu, dims=1)
actions_gpu = getindex.(actions_gpu, 1) # Int64[1 × s × b]
actions_gpu = dropdims(actions_gpu, dims=1) # Int64[s × b]
# Apply the actions
inputs_gpu = advance_board_inputs(inputs_gpu, actions_gpu)
end
return cpu(inputs_gpu)
end
There are several differences:
The code is simpler. We only have a single loop, over the sequence length (number of steps to take). The content of that loop is pretty compact.
The code does more work. We’re processing more stuff, but because it happens in parallel on the GPU, it’s okay. We’re also propagating all the way to the end of the sequence whether we need to or not. (The CPU code would check whether all boards had finished already.)
If we time how long it takes to do a batch worth of rollouts before and after moving to the GPU, we get about a \(60\times\) speedup. Our efforts have been worth it!
Beam Search on the GPU
Rollouts aren’t the only thing we want to speed up. I want to use beam search to explore the space using the policy and try to find solutions. Rollouts might happen to find solutions, but beam search should be a lot better.
The code ends up being basically the same, except a single goal and board is used to seed the entire batch (giving us a number of beams equal to the batch size), and we have to do some work to score the beams and then select which ones to keep:
function beam_search!(
inputs::Array{Bool, 5}, # [h×w×f×s×b]
policy0::SokobanPolicyLevel0,
s_start::Board,
s_goal::Board)
policy0 = gpu(policy0)
h, w, f, s, b = size(inputs)
# Fill the goals and starting states into the first sequence channel
for bi in 1:b
set_board_input!(inputs, s_goal, 1, bi)
set_board_input!(inputs, s_start, 2, bi)
end
# The scores all start at zero
beam_scores = zeros(Float32, 1, b) |> gpu # [1, b]
# Keep track of the actual actions
actions = ones(Int, s, b) |> gpu # [s, b]
inputs_gpu = gpu(inputs)
# Advance the games in parallel
for si in 2:s-1
# Run the model
# policy_logits are [4 × s × b]
# nsteps_logits are [7 × s × b]
policy_logits, nsteps_logits = policy0(inputs_gpu)
# Compute the probabilities
action_probs = softmax(policy_logits, dims=1) # [4 × s × b]
action_logls = log.(action_probs) # [4 × s × b]
# The beam scores are the running log likelihoods
action_logls_si = action_logls[:, si, :] # [4, b]
candidate_beam_scores = action_logls_si .+ beam_scores # [4, b]
candidate_beam_scores_flat = vec(candidate_beam_scores) # [4b]
# Get the top 'b' beams
topk_indices = partialsortperm(candidate_beam_scores_flat, 1:b; rev=true)
# Convert flat indices back to action and beam indices
selected_actions = (topk_indices .- 1) .% 4 .+ 1 # [b] action indices (1 to 4); column-major, so the action index varies fastest
selected_beams = (topk_indices .- 1) .÷ 4 .+ 1 # [b] beam indices (1 to b)
selected_scores = candidate_beam_scores_flat[topk_indices] # [b]
# Keep the surviving beams, their action histories, and their running scores
inputs_gpu = inputs_gpu[:,:,:,:,selected_beams]
actions = actions[:, selected_beams]
actions[si,:] = selected_actions
beam_scores = reshape(selected_scores, 1, :) # [1, b]
# Apply the actions to the selected beams
inputs_gpu = advance_board_inputs(inputs_gpu, actions)
end
return (cpu(inputs_gpu), cpu(actions))
end
This again results in what looks like way simpler code. The beam scoring and such is all done on tensors, rather than a bunch of additional for loops. It all happens on the GPU, and it is way faster (\(23\times\)).
Conclusion
The previous blog post was about leveraging the GPU during training. This blog post was about leveraging the GPU during inference. We had to avoid expensive data transfers between the CPU and the GPU, and to achieve that had to convert non-trivial player movement code to computations amenable to the GPU. Going about that meant thinking about and structuring our code very differently, working across tensors and creating more work that the GPU could nonetheless complete faster.
This post was a great example of how code changes based on the scale you’re operating at. Peter van Hardenberg gives a great talk about similar concepts in Why Can’t We Make Simple Software?. How you think about a problem changes a lot based on problem scale and hardware. Now that we’re graduating from the CPU to processing many many boards, we have to think about the problem differently.
Our inference code has been GPU-ized, so we can leverage it to speed up validation and solution search. It was taking me 20 min to train a network but 30 min to run beam search on all boards in my validation set. This change avoids that sorry state of affairs.
In a previous post, I had covered a project in which I was trying to get a neural network to learn to play Sokoban. I trained a transformer that operated on encoded board positions and predicted both the next action and how many steps were left until a solution would be reached:
My training was all done with Flux.jl and my own transformer implementation, not because it was more efficient or anything, but because I like to learn by doing it myself.
This training all happened on my little personal laptop. I love my laptop, but it is not particularly beefy, and it does not have much of a GPU to speak of. Training a fairly simple transformer was taking a long time — around 8 hours — which is pretty long. That doesn’t leave much room for experimentation. I knew that all this training was happening on the CPU, and the best way to make it faster would be to move to the GPU.
Flux makes moving to the GPU incredibly easy:
using CUDA
policy = TransformerPolicy(
batch_size = BATCH_SIZE,
max_seq_len = MAX_SEQ_LEN,
encoding_dim = ENCODING_DIM,
dropout_prob=dropout_prob,
no_past_info=no_past_info) |> gpu
That’s it – you just use the CUDA package and pipe your model to the GPU.
I tried this, and… it didn’t work. Unfortunately my little laptop does not have a CUDA-capable GPU.
After going through a saga of trying to get access to GPUs in AWS, I put the project to the side. There it sat, waiting for me to eventually pick it back up again whenever I ultimately decided to get a machine with a proper GPU or try to wrangle AWS again.
Then, one fateful day, I happened to be doing some cleaning and noticed an older laptop that I no longer used. Said laptop is bigger, and, just perhaps, had a CUDA-capable GPU. I booted it up, and lo and behold, it did. Not a particularly fancy graphics card, but CUDA-capable nonetheless.
This post is about how I used said GPU to train my Sokoban models, and how I then set up a task system in order to run a variety of parameterizations.
Model Training with a GPU
I switched my training code to move as much stuff as possible over to the GPU. After some elbow grease, I kicked off the same training run as before and found that it ran in about 15 min – a 32x speed increase.
The next thing I tried was increasing the encoding from a length of 16 to 32. On the CPU, such an increase would at least double the training time. On the GPU, the training time remained the same.
How could that be?
Simply put, the GPU is really fast at crunching numbers, and the training runtime is dominated by using the CPU to unpack our training examples, sending data to and from the GPU, and running the gradient update on the CPU. In this case there seems to be a free lunch!
Here is a simple depiction of what was happening before (top timeline) vs. what is happening now:
We pay the upfront cost of sending the model to the GPU, but then can more efficiently shovel data into it to get results faster than computing them ourselves. There is nothing magical happening here, just literally passing data there, having it crunch the numbers, and receiving the results once it’s done.
Parameter Tuning
Now that we have faster training, it is nice to look at how various training and model parameters affect our metrics. Good parameters can easily make or break a deep learning project.
Our model parameters are:
board dimension – always \(8 \times 8\)
encoding dimension – the size of the transformer embedding
maximum sequence length – how long of a sequence the transformer can handle (always 64)
number of transformer layers
Our training parameters are:
learning rate – affects the size of the steps that our AdamW optimizer takes
AdamW 1st and 2nd momentum decay
AdamW weight decay
batch size – the number of training samples per optimization step
number of training batches – the number of optimization steps
number of training entries – the size of our training set
dropout probability – certain layers in the transformer have a chance of randomly dropping outputs for robustness purposes
gradient threshold – the gradient is clipped to this value to improve stability
That’s a lot of parameters. How are we going to go about figuring out the best settings?
The tried and true method that happens in practice is try-and-see, where humans just try things they think are reasonable and see how the model performs. That’s what I was doing originally, when each training run took 8 hours. While that’s okay, it’d be nice to do better.
The next simplest approach is grid search. Here we discretize all parameters and train a model on every possible parameter combination. In 2 dimensions this ends up looking like a grid, hence the name:
We have about 10 parameters. Even if we only consider 2 values for each of them, doing a grid search over them all would require training \(2^{10} = 1024\) models, which at 15 min per model is ~10.7 days. That’s both too long and pretty useless – we want higher granularity than that.
With grid search out, the next alternative is to conduct local searches for specific parameters. We already have a training parameterization that works pretty well, the one from the last blog post, and we can vary a single parameter and see how that affects training. That’s much cheaper – just the cost of the number of training points we’d like to evaluate per parameter. If we want to evaluate 5 values per parameter, that’s just \(5 \cdot 10 = 50\) models, or ~12.5 hours of training time. I could kick that off and come back to look at it the next day.
What I just proposed is very similar to cyclic coordinate search or coordinate descent, which is an optimization approach that optimizes one input at a time. It is quite simple, and actually works quite well. In fact, Sebastian Thrun himself has expressed his enthusiasm for this method to me.
There’s a whole realm of more complicated sampling strategies that could be followed. I rather like uniform projection plans and quasi-random space-filling sets like the Halton sequence. They don’t take that much effort to set up and do their best to fill the search space with a limited number of samples.
The method I most leaned on was ultimately a mixture of coordinate search and random sampling. Random sampling is what Andrej Karpathy recommended in CS231n, because it lets you evaluate far more independent values for specific parameters than something like grid search:
Here we see about 1/3 as many samples as grid search covering way more unique values for param 1.
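As a rough illustration (the parameter names and ranges here are placeholders rather than the exact values I searched over), random search just draws each parameter independently, typically log-uniformly for scale-like parameters:
using Random
# Draw one random hyperparameter configuration (illustrative ranges)
function sample_hyperparams(rng::AbstractRNG = Random.default_rng())
    return (
        learning_rate = 10.0 ^ rand(rng, -4.0:0.1:-2.5),
        weight_decay  = 10.0 ^ rand(rng, -6.0:0.1:-3.0),
        dropout_prob  = rand(rng) * 0.2,
        encoding_dim  = rand(rng, [16, 32, 64]),
        n_layers      = rand(rng, 1:6),
    )
end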
Okay, so we want a way to run random search and perhaps some other, more targeted search approaches. How would we go about doing that?
Tasks
In the next phase of my project I want to expand from just training a transformer policy to predict player up/down/left/right moves to more complicated models that may interact with this simpler model. I also want to use my models to discover solutions to random problems, and perhaps refine previously discovered solutions with better models. I thus don’t want to merely support training this one model; I want to be able to run more general tasks. The core of the task system is a runner loop:
function run_tasks()
done = false
while !done
task_file = get_next_task()
if !isa(task_file, String)
println("No tasks left!")
done = true
break
end
task_filepath = joinpath(FOLDER_TODO, task_file::String)
res = run_task(task_filepath)
dest_folder = res.succeeded ? FOLDER_DONE : FOLDER_TRIED
# name the task according to the time
task_dst_name = Dates.format(Dates.now(), "yyyymmdd_HHMMss") * ".task"
mv(task_filepath, joinpath(dest_folder, task_dst_name))
write_out_result(res, task_dst_name, dest_folder)
println(res.succeeded ? "SUCCEEDED" : "FAILED")
end
end
The task runner simply looks for its next task in the TODO folder, executes it, and when it is done either moves it to the DONE folder or the TRIED folder. It then writes out the task output it captured (which can contain errors that are useful for debugging failed tasks).
The task runner is its own Julia process, and it spawns a new Julia process for every task. This helps ensure that issues in a task don’t pollute other tasks. I don’t want an early segfault to waste an entire night’s worth of training time.
The task files are simply Julia files that I load and prepend with a common header:
function run_task(task_filepath::AbstractString)
res = TaskResult(false, "", time(), NaN)
try
content = read(task_filepath, String)
temp_file, temp_io = mktemp()
write(temp_io, TASK_HEADER)
write(temp_io, "\n")
write(temp_io, content)
close(temp_io)
output = read(`julia -q $(temp_file)`, String)
res.succeeded = true
res.message = output
catch e
# We failed to run the task
res.succeeded = false
res.message = string(e)
end
res.t_elapsed = time() - res.t_start
return res
end
This setup is quite nice. I can drop new task files in and the task runner will just dutifully run them as soon as it’s done with whatever came before. I can inspect the TRIED folder for failed tasks and look at the output for what went wrong.
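For a flavor of what a task file contains, here is a made-up example (train_policy and its keyword arguments are hypothetical names standing in for my actual training entry point):
# one point in a learning-rate sweep
train_policy(
    encoding_dim = 32,
    n_layers = 4,
    learning_rate = 5e-4,
    weight_decay = 1e-4,
    dropout_prob = 0.01,
)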
Results
I ran a bunch of training runs and then loaded and plotted the results to get some insight. Let’s take a look and see if we learn anything.
We’ve got a bunch of metrics, but I’m primarily concerned with the top-2 policy accuracy and the top-2 nsteps accuracy. Both of these measure how often the policy had the correct action (policy accuracy) or number of steps remaining (nsteps accuracy) in its top-2 most likely predictions. The bigger these numbers are the better, with optimal performance being 1.0.
First let’s look at the learning rate, the main parameter that everyone typically has to futz with. First, the top-2 policy accuracy:
We immediately see a clear division between training runs with terrible accuracies (50%) and training runs with reasonable performance. This tells us that some of our model training runs did pretty poorly. That’s good to know – the parameters very much matter and can totally derail training.
Let’s zoom in on the good results:
We don’t see a super-clear trend in learning rate. The best policy accuracy was obtained with a learning rate around 5e-4, but that one sample is somewhat of an outlier.
The nsteps accuracy also shows the bad models, so we’ll jump straight to the zoomed version:
Interestingly, the same learning rate of 5e-4 produces the best nsteps accuracy as well, which is nice for us. Also, the overall spread here tends to prefer larger learning rates, with values down at 1e-4 trending markedly worse.
Next let’s look at the dropout probability. Large enough values are sure to tank training performance, but when does that start happening?
We don’t really have great coverage on the upper end, but based on the samples here it seems that a dropout probability of about 0.01 (or 1%) performs best. The nsteps accuracy shows a similar result.
Next let’s look at the weight decay.
We’ve found our performance-tanking culprit! Weight decay values even moderately larger than zero appear to be highly correlated with terrible performance. It seems to drag the model down and prevent learning. Very small weight decay values appear to be fine, so we’ll have to be careful to just search those.
This is an important aspect of parameter tuning – parameters like the learning rate or weight decay can take on rather small values like 1e-4. It’s often more about finding the right exponent than finding the right decimal value.
With those learning parameters out of the way, let’s look at some model parameters. First, the encoding dimension:
Naively we would expect that bigger encoding dimensions would be better given infinite training examples and compute, but those are finite. We didn’t exhaustively evaluate larger encoding dimensions, but we find that the nsteps prediction doesn’t benefit all that much from going from 32 to 64 entries, whereas the policy does.
We can also look at the number of transformer layers:
Having more layers means we have a bigger model, with more room to perform operations on our token embeddings as they pass through the model. Bigger is often better, but is ultimately constrained by our training data and compute.
In this case the nsteps predictor can achieve peak performance across a variety of depths, whereas the policy seems to favor larger layer counts (but can still do pretty well even with a single layer).
The next question we might ask is whether the model size overall is predictive of performance. We can plot the total number of trainable parameters:
In terms of policy accuracy, we are seeing the best performance with the largest models, but the nsteps predictor doesn’t seem to need it as much. That is consistent with what we’ve already observed.
Let’s now identify the best model. How would we do that?
I’m going to consider the best model to be the one with the highest value of top-2 policy accuracy + top-2 nsteps accuracy. That’s the same as asking for the model most top-right in the following plot:
The two accuracies are highly correlated (which makes sense – it’s hard to predict how many steps it will take to reach the goal without also being a good Sokoban policy). The model that does the best has an encoding dim of 64 with 5 transformer layers, uses a learning rate of 0.0005, has weight decay = 0.0002, and a dropout probability of 0.01.
Conclusion
In this post I got excited about our ability to use the GPU to train our networks, and then I tried to capitalize on it by running a generic task runner. I did some sampling and collected a bunch of metrics in order to try to learn a thing or two about how best to parameterize my model and select the training hyperparameters.
Overall, I would say that the main takeaways are:
It’s worth spending a little time to help the computer do more work for you
It’s important to build insight into how the various parameters affect training
I am having fun figuring out how to write a 2D sidescroller from scratch, with 3D graphics. I’ve covered how posing and animating skeletal meshes works. This post extends that work with inverse kinematics, which allows for the automatic placement of hands or feet at specific locations.
A character’s pose is defined by the transforms of their skeleton. A skeleton is made up of bones, where each bone is translated, rotated, and scaled with respect to a parent bone: \(T_i = \text{translate}(\boldsymbol{\Delta}_i)\,\text{rotate}(\boldsymbol{q}_i)\,\text{scale}(\boldsymbol{s}_i)\),
where \(\boldsymbol{\Delta}_i\) is its translation vector, \(\boldsymbol{q}_i\) is its orientation quaternion, and \(\boldsymbol{s}_i\) is its scaling vector. The transforms are \(4\times 4\) matrices because we’re using homogeneous coordinates such that we can support translations. A 3D position is recovered by dividing each of the first three entries by the fourth entry.
The resulting transform converts homogeneous locations in the bone’s local coordinate frame into the coordinate frame of its parent:
This aggregate transform \(\mathbb{T}_i\) ends up being very useful. If we have a mesh vertex \(\boldsymbol{v}\) defined in mesh space (with respect to the root bone), we can transform it to the \(i\)th bone’s frame via \(\mathbb{S}^{-1}_i \boldsymbol{v}\) according to the aggregate transform of the original skeleton pose and then transform it back into mesh space via \(\mathbb{T}_i \mathbb{S}^{-1}_i \boldsymbol{v}\). That gives us the vertex’s new location according to the current pose.
The Problem
We can define keyframes and interpolate between them to get some nice animations. However, the keyframes are defined in mesh space, and have no way to react to the world around them. If we have a run animation, the player’s foot will always be placed in the same location, regardless of the ground height. If the animation has the player’s arms extend to grab a bar, it can be very easy for the player’s position to be off and the hands not quite be where the edge is.
Here one foot is hovering in the air and the other is sticking into the ground.
The animation system we have thus far produces hand and foot positions with forward kinematics — the positions are based on the successive application of transforms based on bones in the skeleton hierarchy. We want to do the opposite — find the transforms needed such that an end effector like a hand or a foot ends up where we want it to be.
There are several common ways to formulate this problem. I am going to avoid heuristic approaches like cyclic coordinate descent and FABRIK. This method will be closed-form rather than iterative.
Instead of supporting any number of bones, I will stick to 2-bone systems. For example, placing a wrist by moving an upper and lower arm, or placing a foot by moving an upper and lower leg.
Finally, as is typical, I will only allow rotations. We don’t want our bones to stretch or displace in order to reach their targets.
Our inverse kinematic problem thus reasons about four 3D vectors:
a fixed hip position \(\boldsymbol{a}\)
an initial knee position \(\boldsymbol{b}\)
an initial foot position \(\boldsymbol{c}\)
a target foot position \(\boldsymbol{t}\)
Our job is to adjust the rotation transform of the hip and knee bones such that the foot ends up as close as possible to the target position.
The image makes it easy to forget that this is all happening in 3D space.
Working it Out
We are going to break this down into two parts. First we will find the final knee and foot positions \(\boldsymbol{b}’\) and \(\boldsymbol{c}’\). Then we’ll adjust the bone orientations to achieve those positions.
Finding the New Positions
There are three cases:
we can reach the target location and \(\boldsymbol{c}’ = \boldsymbol{t}\)
the bone segments aren’t long enough to reach the target location
the bone segments are too long to reach the target location
Let’s denote the segment lengths as \(\ell_{ab}\) and \(\ell_{bc}\), and additionally compute the distance between \(\boldsymbol{a}\) and \(\boldsymbol{t}\), \(\ell_{at} = \|\boldsymbol{t} - \boldsymbol{a}\|\).
The target is too close if \(\ell_{at} < |\ell_{ab} – \ell_{bc}|\). When that happens, we extend the longer segment toward the target and the other segment away.
The target is too far if \(\ell_{at} > \ell_{ab} + \ell_{bc}\). When that happens, we extend the segments directly toward the target, to maximum extension.
The remaining, more interesting case is when the target can be reached exactly, and we know \(\boldsymbol{c}’ = \boldsymbol{t}\). In that case we still have to figure out where the knee \(\boldsymbol{b}’\) ends up.
The final knee position will lie on a circle that is the intersection of two spheres: the sphere of radius \(\ell_{ab}\) centered at \(\boldsymbol{a}\) and the sphere of radius \(\ell_{bc}\) centered at \(\boldsymbol{t}\):
We can calculate \(\theta\), the angle at \(\boldsymbol{a}\) between the upper segment and the direction to the target, from the law of cosines:
\[\cos\theta = \frac{\ell_{ab}^2 + \ell_{at}^2 - \ell_{bc}^2}{2\,\ell_{ab}\,\ell_{at}}\]
The circle then has radius \(r = \ell_{ab}\sin\theta\) and center \(\boldsymbol{m} = \boldsymbol{a} - \ell_{ab}\cos\theta\,\hat{\boldsymbol{n}}\), where \(\hat{\boldsymbol{n}}\) is the unit vector from \(\boldsymbol{t}\) to \(\boldsymbol{a}\).
We can place the knee anywhere we want on this circle using some unit vector \(\hat{\boldsymbol{u}}\) perpendicular to \(\hat{\boldsymbol{n}}\):
\[\boldsymbol{b}’ = \boldsymbol{m} + r \hat{\boldsymbol{u}}\]
For the inverse kinematics to look nice, I want to move the knee as little as possible. As such, let’s try to keep \(\hat{\boldsymbol{u}}\) close to the current knee position:
f32 len_ab = glm::length(b - a);
f32 len_bc = glm::length(c - b);
f32 len_at = glm::length(t - a);
// Unit vector from t to a.
glm::vec3 n = (a - t) / len_at;
if (len_at < 1e-3f) {
n = glm::vec3(1.0f, 0.0f, 0.0f);
}
bool reached_target = false;
glm::vec3 b_final, c_final; // The final positions of the knee and foot.
if (len_at < std::abs(len_ab - len_bc)) {
// The target is too close.
// Extend the longer appendage toward it and the other away.
int sign = (len_ab > len_bc) ? 1 : -1;
b_final = a - sign * len_ab * n;
c_final = b_final + sign * len_bc * n;
} else if (len_at <= len_ab + len_bc) {
// The target is reachable
reached_target = true;
c_final = t;
// The final knee position b_final will lie on a circle that is the intersection of two
// spheres: the sphere of radius len_ab centered at a and the sphere of radius len_bc
// centered at t.
// Cosine of the angle t - a - b_final
f32 cos_theta =
(len_bc * len_bc - len_ab * len_ab - len_at * len_at) / (-2 * len_ab * len_at);
f32 sin_theta = std::sqrt(1.0 - cos_theta * cos_theta);
// Radius of the circle that b_final must lie on.
f32 r = len_ab * sin_theta;
// The center of the circle that b_final must lie on.
glm::vec3 m = a - len_ab * cos_theta * n;
// Unit vector perpendicular to n and pointing at b from m.
// If b and m are coincident, we default to any old vector perpendicular to n.
glm::vec3 u = GetPerpendicularUnitVector(n, b - m);
// Compute the new knee position.
b_final = m + r * u;
} else {
// The target is too far.
// Reach out toward it to max extension.
b_final = a - len_ab * n;
c_final = b_final - len_bc * n;
}
Adjusting the Orientations
Now that we have the updated knee and foot positions, we need to adjust the bone transforms to rotate the segments to match them.
The bone transforms are all defined relative to their parents. Let our end effector (the hand or foot) be associated with bone \(i\), its parent (the elbow or knee) be associated with bone \(j\), and its parent (the hip or shoulder) be associated with bone \(k\). We should already have the aggregate transforms for these bones computed for the existing pose.
First we’ll rotate bone \(k\) to point toward \(\boldsymbol{b}’\) instead of \(\boldsymbol{b}\). We need to calculate both positions in the bone’s local frame: \(\mathbb{T}_k^{-1}\boldsymbol{b}\) and \(\mathbb{T}_k^{-1}\boldsymbol{b}’\).
We then normalize these vectors and find the rotation between them. A minimum rotation between two unit vectors \(\hat{\boldsymbol{u}}\) and \(\hat{\boldsymbol{v}}\) has an axis of rotation given by their normalized cross product, \(\hat{\boldsymbol{u}} \times \hat{\boldsymbol{v}}\), and a rotation angle of \(\arccos(\hat{\boldsymbol{u}} \cdot \hat{\boldsymbol{v}})\):
// Find the minimum rotation between the two given unit vectors.
// If the vectors are sufficiently similar, return the identity quaternion.
glm::quat GetRotationBetween(const glm::vec3& u, const glm::vec3& v, const f32 eps = 1e-3f) {
glm::vec3 axis = glm::cross(u, v);
f32 len_axis = glm::length(axis);
if (len_axis < eps) {
return glm::quat(1.0f, 0.0f, 0.0f, 0.0f);  // identity (glm::quat takes w, x, y, z)
}
axis /= len_axis;
f32 angle = std::acos(glm::dot(u, v));
return glm::angleAxis(angle, axis);
}
We then adjust the orientation of bone \(k\) using this axis and angle to get an updated transform, \(\boldsymbol{T}_k’\). We then propagate that effect down to the child transforms.
We do the same thing with bone \(j\), using the updated transform. The code for this is:
// Do the same with the next bone.
glm::mat4 TTj_new = (*bone_transforms)[parent_bone];
local_currt = glm::normalize(To3d(glm::inverse(TTj) * glm::vec4(c, 1.0f)));
local_final = glm::normalize(To3d(glm::inverse(TTj_new) * glm::vec4(c_final, 1.0f)));
frame->orientations[parent_bone] =
    frame->orientations[parent_bone] * GetRotationBetween(local_currt, local_final);
Conclusion
This approach to inverse kinematics is efficient and effective. I’ve been using it to refine various action states, such as placing hands and feet onto ladder rungs:
It works pretty well when the hand or foot doesn’t have to be adjusted all that much. For larger deltas it can end up picking some weird rotations because it does not have any notion of maximum joint angles. I could add that, of course, at the expense of complexity.
It’s really nice to be able to figure something out like this from the basic math foundations. There is something quite satisfying about being able to lay out the problem formulation and then systematically work through it.
This post expands on the previous post, which looked at collision detection and resolution for 2d polygons. By the end we were able to apply impulses to colliding bodies to properly get them to bounce and spin off of one another. Not only will we take it further with friction impulses and joints, but we’ll pay special attention to how these impulses are derived to emphasize that this line of reasoning is a tool that can be extended to new applications.
There is more math in this post than normal because I want to both provide a resource that I myself could have benefited from and show that this stuff can all be derived.
An impulse \(\boldsymbol{j}\) is an instantaneous change in momentum. Impulses are very similar to forces, which influence velocity over time: a force \(\boldsymbol{F}\) applied over a timestep \(\Delta t\) changes a body’s velocity by \(\boldsymbol{F} \Delta t / m\).
An impulse applied to a rigid body at a point \(\boldsymbol{r}\) relative to the body’s center of mass similarly affects its velocity, but without the need for the timestep: the linear velocity changes by \(\boldsymbol{j}/m\) and the angular velocity by \(I^{-1}(\boldsymbol{r} \times \boldsymbol{j})\).
This update models the angular velocity in three dimensions, treating it as a vector. We can work out the equivalent 2d update where we just keep track of a scalar \(\omega\):
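In 2D the cross product reduces to a scalar, and the standard form of the update (written to match the scalar \(I_{zz}\) used below) is:
\[\boldsymbol{v}’ = \boldsymbol{v} + \frac{1}{m}\boldsymbol{j}, \qquad \omega’ = \omega + \frac{1}{I_{zz}}\left(r_x j_y - r_y j_x\right)\]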
From here on out, if you see a scalar \(I\), it refers to \(I_{zz}\).
The Normal Impulse
In the last blog post, we used an impulse along the impact normal to cause two colliding rigid bodies to bounce off each other. We’re going to cover the same thing, but derive it instead of just taking the result for granted.
We have two bodies, A and B, and a contact point located at \(\boldsymbol{r}_A\) relative to body A’s center and at \(\boldsymbol{r}_B\) relative to body B’s center:
The velocity of the contact points on each body as they enter the collision is:
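In the usual notation (the subscripts here are my own), the contact-point velocities and their relative velocity are:
\[\boldsymbol{v}_{P,A} = \boldsymbol{v}_A + \omega_A \times \boldsymbol{r}_A, \qquad \boldsymbol{v}_{P,B} = \boldsymbol{v}_B + \omega_B \times \boldsymbol{r}_B, \qquad \boldsymbol{v}_\text{rel} = \boldsymbol{v}_{P,B} - \boldsymbol{v}_{P,A}\]
The collision is resolved by requiring the post-impulse relative velocity along the contact normal \(\hat{\boldsymbol{n}}\) to satisfy
\[\boldsymbol{v}_\text{rel}’ \cdot \hat{\boldsymbol{n}} = -e \left(\boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}}\right)\]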
This factor \(e\) is called the restitution. Setting it to zero produces a perfectly inelastic collision (because the resulting relative velocity in the normal direction is zero), whereas setting it to 1 produces a perfectly elastic collision that preserves the kinetic energy. Most objects use a value somewhere in-between.
We now have everything we need to derive the impulse. Let’s expand the left-hand side of the restitution equation:
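Substituting the post-impulse velocities (body B receives \(+j\hat{\boldsymbol{n}}\) and body A receives \(-j\hat{\boldsymbol{n}}\)) and solving for the scalar impulse gives the standard 2D result:
\[j = \frac{-(1 + e)\,\boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}}}{\frac{1}{m_A} + \frac{1}{m_B} + \frac{(\boldsymbol{r}_A \times \hat{\boldsymbol{n}})^2}{I_A} + \frac{(\boldsymbol{r}_B \times \hat{\boldsymbol{n}})^2}{I_B}}\]
where in 2D the cross product \(\boldsymbol{r} \times \hat{\boldsymbol{n}}\) is the scalar \(r_x n_y - r_y n_x\).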
The two formulations are equal, but assume that we’re working in 2D and that the moment of inertia tensor is diagonal with every entry equal to \(I\). The Wikipedia version is more general, and can handle 3D collisions and general inertia tensors.
After calculating it, we apply an impulse \(j \hat{\boldsymbol{n}}\) to body B and the negated impulse to body A. Note that we only apply an impulse if the relative velocity is causing the bodies to move into contact (\(\boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{n}} < 0\)), otherwise the contact point already has a separating velocity.
The restitution \(e\) will depend on the colliding bodies. One convenient way to get it is to assign a restitution to each rigid body, and then use the maximum between the two for the collision. That allows bounciness to win out:
\[e = \max\left(e_A, e_B\right)\]
Friction Impulse
The normal impulse takes care of the relative penetration speed, but does nothing in the tangent direction along the contact line. We typically want shapes that slide along each other to have friction — that is what the friction impulse, or tangent impulse, is for.
Wikipedia and other websites will at this point show you a fancy diagram for Coulomb friction depicting the friction cone. I don’t find that to be particularly illuminating.
Instead, I simply like to think about Coulomb friction as:
compute a tangent impulse that causes the relative tangent velocity to be zero (\(e=0\))
if the required tangent impulse is sufficiently small, apply it (static friction)
if the required tangent impulse is too large, cap it out and apply that instead (dynamic friction)
Let’s derive the update ourselves again.
The friction impulse acts in the tangent direction \(\hat{\boldsymbol{t}}\), which we can get by rotating the normal. We once again have a scalar impulse:
\[\boldsymbol{j} = j_t \, \hat{\boldsymbol{t}}\]
We compute the tangent impulse that causes a zero relative tangent velocity (\(e = 0\)).
These two conditions are structurally the same as the normal impulse, except \(e\) is fixed. We can directly apply our previous equations to get:
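That is, with the tangent \(\hat{\boldsymbol{t}}\) in place of the normal and \(e = 0\), the scalar friction impulse is:
\[j_t = \frac{-\,\boldsymbol{v}_\text{rel} \cdot \hat{\boldsymbol{t}}}{\frac{1}{m_A} + \frac{1}{m_B} + \frac{(\boldsymbol{r}_A \times \hat{\boldsymbol{t}})^2}{I_A} + \frac{(\boldsymbol{r}_B \times \hat{\boldsymbol{t}})^2}{I_B}}\]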
As long as this impulse is small enough (in magnitude), we apply it in order to prevent any tangential sliding. If it is too large, we switch from static friction to dynamic friction:
The static and dynamic friction values are typically multiples of the normal impulse. We multiply the normal impulse by \(\mu_\text{static}\) to get the static friction impulse threshold and by \(\mu_\text{dynamic}\) to get the dynamic friction impulse threshold. Like the restitution, these are properties of each body, and we can obtain values to use for a pair of bodies by combining their values. Here it is recommended to use the square root of the product, \(\mu = \sqrt{\mu_A \mu_B}\), which lets extremely low-friction (slippery) values dominate.
There are other things we can do with impulses. One of the main ones is to model attachment points between rigid bodies. These revolute joints allow rotation, but force the bodies to maintain a constant distance vector to the joint:
Revolute joints are useful for all sorts of things, including swinging:
I didn’t have a lot of luck finding information on how to enforce revolute joints between rigid bodies in 2D. Thankfully, we can derive the necessary impulse ourselves.
The property we want to enforce is that the contact velocity at the joint after the impulse be zero:
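Writing that out in 2D, with the impulse \(\boldsymbol{j} = (j_x, j_y)\) applied to body B at \(\boldsymbol{r}_B\) and \(-\boldsymbol{j}\) applied to body A at \(\boldsymbol{r}_A\), gives a linear system \(K \boldsymbol{j} = -\boldsymbol{v}_\text{rel}\) (the standard form; the derivation follows the same steps as the impulses above), where
\[K = \left(\frac{1}{m_A} + \frac{1}{m_B}\right)\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \frac{1}{I_A}\begin{bmatrix} r_{A,y}^2 & -r_{A,x} r_{A,y} \\ -r_{A,x} r_{A,y} & r_{A,x}^2 \end{bmatrix} + \frac{1}{I_B}\begin{bmatrix} r_{B,y}^2 & -r_{B,x} r_{B,y} \\ -r_{B,x} r_{B,y} & r_{B,x}^2 \end{bmatrix}\]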
We can write a simple function that can solve such a \(2 \times 2\) linear system:
// Solve a 2x2 system of linear equations
// [a b][x1] = [e]
// [c d][x2] = [f]
// This method assumes that it will succeed, which
// requires that the matrix be invertible.
template <typename T>
void Solve(T a, T b, T c, T d, T e, T f, T& x1, T& x2) {
T det = d * a - c * b;
x2 = (a * f - c * e) / det;
x1 = (e - b * x2) / a;
}
We can then apply the resulting impulse to both bodies and perform a positional correction. I weight the correction such that the target joint location is biased toward the heavier object (that way joints on infinite-mass objects never move).
Conclusion
This post set the mathematical foundation for impulses, then derived the normal, friction, and revolute joint impulses. In each case, we looked at what was being preserved, formulated that mathematically, and massaged the equations to back out the impulse. While the equations were long at times, they did not lead us astray and we were able to find our way to some solid physics.
In the previous post, I updated the sidescroller project with skeletal meshes — meshes that can be posed. In this post we animate those meshes.
As a fun fact, we’re about 4 months into this project and I’ve written about 5600 lines of code. It keeps piling up.
Animations
An animation defines where the \(m\) mesh vertices are as a function of time:
\[\vec{p}_i(t) \text{ for } i \text{ in } 1:m\]
That’s rather general, and would require specifying a whole lot of functions for a whole lot of mesh vertices. We’ll instead leverage the mesh skeletons from the last blog post, and define an animation using a series of poses, also called keyframes:
5 keyframes from the walk animation
A single keyframe defines all of the bone transforms that pose the model in a certain way. A series of them acts like a series of still frames in a movie – they act as snapshots of the mesh’s motion over time:
When we’re in-between frames, we have to somehow interpolate between the surrounding frames. The simplest and most common method is linear interpolation, where we linearly blend between the frames based on the fraction of time we are between them:
Here \(\alpha\) ranges from 0 at the previous frame to 1 at the next frame:
Each frame is a collection of bone transforms. We could interpolate between the transform matrices directly, but it is more common to decompose those into position \(\vec{p}\), orientation \(\vec{q}\), and scale \(\vec{s}\), and interpolate between those instead. If we’re a fraction \(\alpha\) between frames \(f^{(k)}\) and \(f^{(k+1)}\), then the transform components for the \(i\)th bone are:
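Concretely, for the \(i\)th bone:
\[\vec{p}_i = (1-\alpha)\,\vec{p}_i^{(k)} + \alpha\,\vec{p}_i^{(k+1)}, \qquad \vec{q}_i = \mathrm{slerp}\!\left(\vec{q}_i^{(k)}, \vec{q}_i^{(k+1)}, \alpha\right), \qquad \vec{s}_i = (1-\alpha)\,\vec{s}_i^{(k)} + \alpha\,\vec{s}_i^{(k+1)}\]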
You’ll notice that the position and scaling vectors use linear interpolation, but we’re using spherical linear interpolation for the rotation. This is why we decompose it – basic linear interpolation doesn’t work right for rotations.
Conceptually, animation is super simple. A keyframe just stores some components and a duration:
struct Keyframe {
f32 duration;
std::vector<glm::vec3> positions; // for each bone
std::vector<glm::quat> orientations;
std::vector<glm::vec3> scales;
};
and an Animation is just a named list of keyframes:
If you want a simple animation, you slap down some spaced-out frames and let interpolation handle what happens in-between. If you want a nice, detailed animation, you can fill out your animation with a higher density of frames, or define special interpolation methods or easing functions to give you a better effect.
Posing a mesh is as simple as interpolating the keyframes and then applying its transforms to the mesh. To do that, it’s useful to keep track of how long our animation has been playing, and which frames we’re between. To this end we define an AnimationIndex:
struct AnimationIndex {
f32 t_total; // elapsed time since the start of the animation
f32 t_frame; // elapsed time since the start of the current frame
int i_frame; // index of the current frame
};
Advancing the index adds the elapsed time to both timers and, whenever t_frame exceeds the current frame’s duration, subtracts that duration and moves i_frame to the next frame (wrapping around if the animation loops).
We can use our interpolated keyframe to pose our mesh:
void PoseMeshToAnimationFrame(
std::vector<glm::mat4>* bone_transforms_curr,
std::vector<glm::mat4>* bone_transforms_final,
const Mesh& mesh,
const Keyframe& frame)
{
size_t n_bones = bone_transforms_final->size();
for (size_t i_bone = 0; i_bone < n_bones; i_bone++) {
glm::vec3 pos = frame.positions[i_bone];
glm::quat ori = frame.orientations[i_bone];
glm::vec3 scale = frame.scales[i_bone];
// Compute the new current transform
glm::mat4 Sbone = glm::translate(glm::mat4(1.0f), pos);
Sbone = Sbone * glm::mat4_cast(ori);
Sbone = glm::scale(Sbone, scale);
// Incorporate the parent transform
if (mesh.bone_parents[i_bone] >= 0) {
const glm::mat4& Sparent =
(*bone_transforms_curr)[mesh.bone_parents[i_bone]];
Sbone = Sparent * Sbone;
}
// Store the result
const glm::mat4& Tinv = mesh.bone_transforms_orig[i_bone];
(*bone_transforms_curr)[i_bone] = Sbone;
(*bone_transforms_final)[i_bone] = Sbone * Tinv;
}
}
Adding Animations to the Game
Now that we have animations, we want to integrate them into the sidescroller game. We can do that!
These aren’t beautiful masterpieces or anything, since I clumsily made them myself, but hey – that’s scrappy do-it-to-learn-it gamedev.
We are now able to call StartAnimation on a RigComponent, specifying the name of the animation we want. For example, if the player loses contact with the ground, we can start either the falling or jumping animations with a code snippet like this:
if (!player_movable->support.is_supported) {
// Fall or Jump!
StartAnimation(player_rig, jump_pressed ? "jump" : "fall");
game->player_state_enum = PlayerActionState::AIRBORNE;
}
This looks up the requested animation and sets our animation index to its initial value.
There is one additional thing we need to get right. Instantaneously transitioning to a new animation can look jarring. Watch what happens here where I start punching in the air:
The knight’s legs suddenly snap to their idle positions!
What we’d like instead is to blend seamlessly from the pose the model is at when StartAnimation is called to the poses specified by the new animation. We can already blend between any two poses, so we just leverage that knowledge to blend between the initial pose and the animation pose.
We specify a fadeout time \(t_\text{fadeout}\) whenever we start a new animation. For the first \(t_\text{fadeout}\) seconds of our animation, we linearly interpolate between our initial pose and the animation’s pose based on how much time has elapsed. If we happen to start in a pose close to what the animation plays from, this effect won’t be all that noticeable. If we happen to start in a wildly different pose we’ll move the mesh over more naturally.
Editing Animations
Animations contain a lot of data. There are multiple frames and a bunch of things the bones can do per frame. I didn’t want to specify those transforms in code, nor manually in some sort of text file.
I made a basic animation editor using ImGUI and used that to craft my animations:
It’s fairly rudimentary, but gets the job done.
One could argue that I should have stuck with Blender’s animation tools. There are a few advantages to rolling my own editor:
Blender doesn’t run on my Linux laptop and keeps crashing
My model format has somewhat departed from Blender’s
I want to be able to edit hitspheres and hurtspheres down the road
My objective is to learn, not turn a profit or release a real game
The right trade-off will of course depend on your unique situation.
Conclusion
Going from pose-able meshes to animations was mainly about figuring out how to tie a bunch of poses together over time. Now that we have that, and a way to cleanly transition between animations, we’re pretty much set.
Animations in big games can be a lot more complicated. We might have overlapping animations for various parts of a mesh, such as facial animations that play simultaneously with body-level animations. We could implement inverse kinematics to help with automatically placing our feet at natural stepping locations (or when climbing to grasp at natural handholds). There are a bunch of fancier techniques that give you tighter control over the mesh too, but at the core of it, what we did here covers the fundamentals.
Next I plan to work on a basic combat system and get some other entities into the game. In order for this to be a proper level, we kind of need that. And maybe we’ll do some ladders.
This month I’m continuing the work on the side scroller and moving from static meshes to meshes with animations. This is a fairly sizeable jump as it involves a whole host of new concepts, including bones and armatures, vertex painting, and interpolation between keyframes. For reference, the Skeletal Animation post on learnopengl.com prints out to about 16 pages. That’s a lot to learn. Writing my own post about it helps me make sure I understand the fundamentals well enough to explain them to someone else.
Moving Meshes
Our goal is to get our 3D meshes to move. This primarily means having the ability to make our player mesh move in prescribed ways, such as jumping, crouching, attacking, and climbing ladders.
So far we’ve been working with meshes, i.e. vertex and face data that forms our player character. We want those vertices to move around. Our animations will specify how those vertices move around. That is, where the \(m\) mesh vertices are as a function of time:
\[\vec{p}_i(t) \text{ for } i \text{ in } 1:m\]
Unfortunately, a given mesh has a lot of vertices. The low-poly knight model I’m using has more than 1.5k of them, and it’s relatively tiny. Specifying where each individual vertex goes over time would be an excessive amount of work.
Not only would specifying an animation on a per-vertex level be a lot of work, it would be very slow. One of the primary advantages of a graphics card is that we can ship the data over at setup time and not have to send it all over again:
With the rendering we’re doing now, we just update a few small uniforms every frame to change the player position. The mesh data itself is already on the GPU, stored as a vertex buffer object.
So fully specifying the vertex positions over time is both extremely tedious and computationally expensive. What do we do instead?
When a character moves, many of their vertices typically move together. For example, if my character punches, we expect all of the vertices in their clenched fist to move together, and their forearm to follow. Their upper arm follows that, maybe with a bend, etc. Vertices thus tend to be grouped, and we can leverage the grouping for increased efficiency.
Bones
You know how artists sometimes have these wooden figures that they can bend and twist into poses? That’s the industry-standard way to do 3D animation.
We create an armature, which is a sort of rigid skeleton that we can use to pose the mesh. The armature is made up of rigid entities, bones, which can move around to produce the movement expected in our animation. There are far fewer bones than vertices. We then compute our vertex positions based on the bone positions – a vertex in the character’s fist is going to move with the fist bone.
This approach solves our problems. The bones are much easier to pose and thus build animations out of, and we can continue to send the mesh and necessary bone information to the GPU once at startup, and just send the updated bone poses whenever we render the mesh.
The bones in an armature have a hierarchical structure. Every bone has a single parent and any number of children, except the root bone, which has no parent.
Unlike my oh-so-fancy drawing, bones don’t actually take up space. They are actually just little reference frames. Each bone’s reference frame is given by a transformation \(T\) with respect to its parent. More specifically, \(T\) transforms a point in the bone’s frame to that of its parent.
For example, we can use this transform to get the “location” of a bone. We can get the position of the root bone by transforming the origin of its frame to its parent – the model frame. This is given by \(T_\text{root} \vec{o}\), where \(\vec{o}\) is the origin in homogeneous coordinates: \([0,0,0,1]\). Our transforms use 4×4 matrices so that we can represent translation as well as rotation, which is ubiquitous in 3D graphics.
Similarly, the position of one of the root bone’s children with transform \(T_\text{c}\) can be obtained by first computing its position in the root bone’s frame, \(T_\text{c} \vec{o}\), and then converting that to the model frame, \(T_\text{root} T_\text{c} \vec{o}\).
The order of operations matters! It is super important. It is what gives us the property that moving a leaf bone only affects that one bone, while moving a bone higher up in the hierarchy affects all child bones. If you ever can’t remember which order it is, and you can’t just quickly test it out in code, try composing two transforms together.
Let’s call the aggregated transform for bone \(c4\), the product of its ancestors’ transforms and its own, \(\mathbb{T}_{c4}\).
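In general (my notation here), a bone’s aggregate transform is its own local transform left-multiplied by the transforms of all of its ancestors, starting from the root:
\[\mathbb{T}_b = T_\text{root} \cdots T_{\text{parent}(b)} \, T_b\]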
It is these transformations that we’ll be changing in order to produce animations. Before we get to that, we have to figure out how the vertices move with the bones.
Transforming a Vertex
The vertices are all defined in our mesh. They have positions in model-space.
Each bone has a position (and orientation) in model space, as we’ve already seen.
Let’s consider what happens when we associate a vertex with a bone. That is, we “connect” the vertex to the bone such that, if the bone moves, the vertex moves with it. We expect the vertex to move along in the bone’s local frame:
Here the bone both translates to the right and up, and rotates counter-clockwise about 30 degrees. In the image, the vertex does the same.
The image above has the bone starting off at the origin, with its axis aligned with the coordinate axes. This means the vertex is starting off in the bone’s reference frame. To get the vertex into the bone frame (it is originally defined in the model frame), we have to multiply it by \(\mathbb{T}^{-1}\). Intuitively, if \(\mathbb{T} \vec{p}\) takes a point from bone space to model space then the inverse takes a point from model space to bone space.
The bone moved. That means it has a new transform relative to its parent, \(S\). Plus, that parent bone may have moved, and so forth. We have to compute a new aggregate bone transform \(\mathbb{S}\). The updated position of the vertex in model space is thus obtained by transforming it from its original model space into bone space with \(\mathbb{T}^{-1}\) and then transforming it back to model space according to the new armature configuration with \(\mathbb{S}\):
\[\vec{v}' = \mathbb{S} \mathbb{T}^{-1} \vec{v}\]
We can visualize this as moving a vertex in the original mesh pose to the bone-relative space, and then transforming it back to model-space based on the new armature pose:
This means we should store \(\mathbb{T}^{-1}\) for each bone in the armature – we’re going to need it a lot. In any given frame we’ll just have to compute \(\mathbb{S}\) by walking down the armature. We then compute \(\mathbb{S} \mathbb{T}^{-1}\) and pass that to our shader to properly pose the model.
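To make that concrete, here is a minimal sketch of that per-frame walk, assuming a hypothetical Bone struct in which parents are stored before their children. The struct, the names, and the glUniformMatrix4fv upload at the end are illustrative, not this project’s actual code:

#include <vector>
#include <glad/glad.h>          // for glUniformMatrix4fv (the project uses GLAD)
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp> // glm::value_ptr

// Hypothetical bone layout: sorted so that a parent always precedes its children.
struct Bone {
    int parent;                   // -1 for the root
    glm::mat4 local_transform;    // S: current pose relative to the parent
    glm::mat4 inv_bind_transform; // T^-1: inverse of the aggregate rest-pose transform
};

// Compute S * T^-1 for every bone and upload the results to the shader.
void UploadBoneTransforms(const std::vector<Bone>& bones, GLint uniform_location) {
    std::vector<glm::mat4> aggregate(bones.size()); // aggregate S per bone
    std::vector<glm::mat4> skinning(bones.size());  // S * T^-1 per bone
    for (size_t i = 0; i < bones.size(); ++i) {
        const Bone& bone = bones[i];
        aggregate[i] = (bone.parent < 0)
                           ? bone.local_transform
                           : aggregate[bone.parent] * bone.local_transform;
        skinning[i] = aggregate[i] * bone.inv_bind_transform;
    }
    glUniformMatrix4fv(uniform_location, (GLsizei)skinning.size(), GL_FALSE,
                       glm::value_ptr(skinning[0]));
}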
Bone Weights
The previous section let us move a vertex associated with a single bone. That works okay for very blocky models composed of rigid segments. For example, a stocky robot or simple car with spinning wheels. Most models are less rigid. Yes, if you punch someone you want the vertices in your fist to follow the hand bone, but as you extend your elbow the vertices near the joint will follow both the bone from the upper arm and the bone from the lower arm. We want a way to allow this to happen.
Instead of a vertex being associated with one bone, we allow a vertex to be associated with multiple bones. The final vertex position is then a weighted blend of the contributions from any number of bones:
where the nonnegative weights \(\vec w\) sum to one.
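Written out explicitly, this is the standard linear blend skinning formula, with \(\mathbb{S}_j \mathbb{T}_j^{-1}\) denoting bone \(j\)’s version of the skinning transform from the previous section and \(w_j\) its weight:
\[\vec{v}' = \sum_{j} w_j \, \mathbb{S}_j \mathbb{T}_j^{-1} \, \vec{v}\]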
In practice, most vertices are associated with only one or two bones. It is common to allow up to 4 bone associations, simply because that covers most needs and then we can use the 4-dimensional vector types supported by OpenGL.
Data-wise, what this means is:
In addition to the original mesh data, we also need to provide an armature, which is a hierarchy of bones and their relative transformations.
We also need to associate each vertex with some set of bones. This is typically done via a vec4 of weights and an ivec4 of bone indices. We store these alongside the other vertex data.
The vertex data is typically static and can be stored on the graphics card in the vertex array buffer.
We compute the inverse bone transforms for the original armature pose on startup.
When we want to render a model, we pose the bones, compute the current bone transforms \(\mathbb{S}\), and then compute and send \(\mathbb{S} \mathbb{T}^{-1}\) to the shader as a uniform.
Coding it Up
Alright, enough talk. How do we get this implemented?
We start by updating our definition for a vertex. In addition to the position, normal, and texture coordinates, we now also store the bone weights:
struct RiggedVertex {
    glm::vec3 position;  // location
    glm::vec3 normal;    // normal vector
    glm::vec2 tex_coord; // texture coordinate
    // Up to 4 bones can contribute to a particular vertex
    // Unused bones must have zero weight
    glm::ivec4 bone_ids;
    glm::vec4 bone_weights;
};
This means we have to update how we set up the vertex buffer object:
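The setup looks roughly like this (a sketch of the idea, with the attribute locations matching the shader below; it assumes the VAO and VBO are already bound and filled with RiggedVertex data):

// Positions, normals, and texture coordinates, as before:
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(RiggedVertex),
                      (void*)offsetof(RiggedVertex, position));
glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(RiggedVertex),
                      (void*)offsetof(RiggedVertex, normal));
glEnableVertexAttribArray(2);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(RiggedVertex),
                      (void*)offsetof(RiggedVertex, tex_coord));
// Bone ids use the I variant so they stay integers on the shader side:
glEnableVertexAttribArray(3);
glVertexAttribIPointer(3, 4, GL_INT, sizeof(RiggedVertex),
                       (void*)offsetof(RiggedVertex, bone_ids));
// Bone weights are ordinary floats:
glEnableVertexAttribArray(4);
glVertexAttribPointer(4, 4, GL_FLOAT, GL_FALSE, sizeof(RiggedVertex),
                      (void*)offsetof(RiggedVertex, bone_weights));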
Note the use of glVertexAttribIPointer instead of glVertexAttribPointer for the bone ids. That problem took me many hours to figure out. It turns out that glVertexAttribPointer accepts integers but has the card interpret them as floating point, which messes everything up if you intend to actually use integers on the shader side.
As far as our shader goes, we are only changing where the vertices are located, not how they are colored. As such, we only need to update our vertex shader (not the fragment shader). The new shader is:
#version 330 core
layout (location = 0) in vec3 aPos;
layout (location = 1) in vec3 aNormal;
layout (location = 2) in vec2 aTexCoord;
layout (location = 3) in ivec4 aBoneIds;
layout (location = 4) in vec4 aBoneWeights;
out vec3 pos_world;
out vec3 normal_world;
out vec2 tex_coords;
const int MAX_BONES = 100;
const int MAX_BONE_INFLUENCE = 4;
uniform mat4 model;      // transform from model to world space
uniform mat4 view;       // transform from world to view space
uniform mat4 projection; // transform from view space to clip space
uniform mat4 bone_transforms[MAX_BONES]; // S*Tinv for each bone
void main()
{
    // Accumulate the position over the bones, in model space
    vec4 pos_model = vec4(0.0f);
    vec3 normal_model = vec3(0.0f);
    for (int i = 0; i < MAX_BONE_INFLUENCE; i++)
    {
        int j = aBoneIds[i];
        float w = aBoneWeights[i];
        mat4 STinv = bone_transforms[j];
        pos_model += w*(STinv*vec4(aPos, 1.0f));
        normal_model += w*(mat3(STinv)*aNormal);
    }
    pos_world = vec3(model * pos_model);
    normal_world = mat3(model) * normal_model;
    gl_Position = projection * view * model * pos_model;
    tex_coords = vec2(aTexCoord.x, 1.0 - aTexCoord.y);
}
Note that the normal vector doesn’t get translated, so we only use mat3. The vertex’s texture coordinate is not affected.
Testing it Out
Let’s start with something simple, a 2-triangle, 2-bone system with 5 vertices:
The root bone is at the origin (via the identity transform) and the child bone is at [2,1,0]. The vertices of the left triangle are all associated only with the root bone and the rightmost two vertices are only associated with the child bone.
If we slap on a sin transform for the child bone, we wave the right triangle:
If we instead slap a sin transform on the root bone, we wave both triangles as a rigid unit:
We can apply a sin to both bones, in which case the motions compose like they do if you bend both your upper and lower arms:
We can build a longer segmented arm and move each segment a bit differently:
Posing a Model
Now that we’ve built up some confidence in our math, we can start posing our model (the low-poly knight by Davmid, available here on Sketchfab). The ideas are exactly the same – we load the model, define some bone transforms, and then use those bone transforms to render it in the corresponding pose.
I created an animation editing mode that lets us do this posing. As before, the UI stuff is all done with Dear ImGUI. It lets you select your model, add animations, add frames to those animations, and adjust the bone transforms:
A single frame represents an overall model pose by defining the transformations for each bone:
The position, orientation, and scale are stored separately, mainly because it is easier to reason about them when they are broken out like this. Later, when we do animation, we’ll have to interpolate between frames, and it is also easier to reason about interpolation between these components rather than between the overall transforms. A bone’s local transform \(S\) is simply the combination of its position, orientation, and scale.
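A bone’s local transform can then be assembled in the usual translate, then rotate, then scale order. Here is a sketch using glm; the field names are hypothetical:

#include <glm/glm.hpp>
#include <glm/gtc/quaternion.hpp>       // glm::quat, glm::mat4_cast
#include <glm/gtc/matrix_transform.hpp> // glm::translate, glm::scale

// Hypothetical per-bone pose data stored in an animation frame.
struct BonePose {
    glm::vec3 position;
    glm::quat orientation;
    glm::vec3 scale;
};

// Compose the local transform S relative to the parent bone:
// first scale, then rotate, then translate.
glm::mat4 LocalTransform(const BonePose& pose) {
    glm::mat4 S = glm::translate(glm::mat4(1.0f), pose.position);
    S = S * glm::mat4_cast(pose.orientation);
    S = glm::scale(S, pose.scale);
    return S;
}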
We calculate our final bone transforms \(\mathbb{S} \mathbb{T}^{-1}\) for our frame and then pass that off to the shader for rendering:
When you get down to it, the math behind rigged meshes isn’t all too bad. The main idea is that you have bones with transforms relative to their parent bone, and that vertices are associated with a combination of bones. You have to implement a rigged mesh type that stores your bones and vertex weights, a shader for rigged meshes, and a way to represent your model pose.
This main stuff doesn’t take too long to implement in theory, but this post took me way longer to develop than normal because I wrote a bunch of other supporting code (plus banged my head against the wall when things didn’t initially work – it’s hard to debug a garbled mess).
Adding my own animation editor meant I probably wanted to save those animations to disk. That meant I probably wanted to save my meshes to disk too, which meant saving the models, textures, and materials. Getting all that set up took time, and took its own fair share of debugging. (I’m using a WAD-like data structure like I did for TOOM).
The animation editor also took some work. Dear ImGUI makes writing UI code a lot easier, but adding all of the widgets and restructuring a lot of my code to be able to have separate application modes for things like gameplay and the animation editor simply took time. The largest time suck (and hit to motivation) for a larger project like this is knowing you have to refactor a large chunk of it to make forward progress. It’s great to reach the other end though and to have something that makes sense.
We’re all set up with posable models now, so the next step is to render animations. Once that’s possible we’ll be able to tie those animations to the gameplay — have the model walk, jump, and stab accordingly. We’ll then be able to attach other meshes to our models, e.g. a sword in its hand, and tie hitspheres and hurtspheres to the animation, which will form the core of our combat system.
In the previous post I introduced a new side scroller project, and laid out some of the things I wanted to accomplish with it. One of those points was to render with the graphics card. I had never worked directly with 3D assets and graphics cards before. Unless some high-level API was doing rendering for me under the hood, I wasn’t harnessing the power of modern graphics pipelines.
For TOOM I was not using the graphics card. In fact, the whole point of TOOM was to write a software renderer, which only uses the CPU for graphics. That was a great learning experience, but there is a reason that graphics cards exist, and that reason pretty much is that software renderers spend a lot of time rendering pixels when it’d be a lot nicer to hand that work off to something that can churn through it more efficiently.
Before starting, I thought that learning to render with the graphics card would more-or-less boil down to calling render(mesh). I assumed the bulk of the benefit would be faster calls. What I did not anticipate was learning how rich render pipelines can be, with customized shaders and data formats opening up a whole world of neat possibilities. I love it when building a basic understanding of something suddenly causes all sorts of other things to click. For example, CUDA used to be an abstract, scary performance thing that only super smart programmers would do. Now that I know a few things, I can conceptualize how it works. It is a pretty nice feeling.
Learning OpenGL
I learned OpenGL the way many people do. I went to learnopengl.com and followed their wonderful tutorials. It took some doing, and it was somewhat onerous, but it was well worth it. The chapters are well laid out and gradually build up concepts.
This process had me install a few additional libraries. First off, I learned that one does not commonly directly use OpenGL. It involves a bunch of function pointers corresponding to various OpenGL versions, and getting the right function pointers from the graphics driver can be rather tedious. Thus, one uses a service like GLAD to generate the necessary code for you.
As recommended by the tutorials, I am using stb_image by the legendary Sean Barrett for loading images for textures, and assimp for loading 3D asset files. I still find it hilarious that they can get away with basically calling that last one Butt Devil. Lastly, I’m using the OpenGL mathematics library (glm) for math types compatible with OpenGL. I am still using SDL2 but switched to the imgui_impl_opengl3 backend. These were all pretty easy to use and install — the tutorials go a long way and the rest is pretty self-explanatory.
I did run into an issue with OpenGL where I didn’t have a graphics driver new enough to support the version that learnopengl was using (3.3). I ended up being able to resolve it by updating my OS to Ubuntu 22.04. I had not upgraded in about 4 years on account of relying on this machine to compile the textbooks, but given that they’re out I felt I didn’t have to be so paranoid anymore.
In the end, updating to Jammy Jellyfish was pretty easy, except it broke all of my screen capture software. I previously used Shutter and Peek to take screenshots and capture GIFs. Apparently that all went away with Wayland, and now I need to give Flameshot permission every time I take a screenshot, and I didn’t have any solution for GIF recordings, even if I log in with Xorg.
In writing this post I buckled down and tried Kooha again, which tries to record but then hangs forever on save (like Peek). I ended up installing OBS Studio, recording a .mp4, and then converting that to a GIF in the terminal via ffmpeg:
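The conversion is the usual palette-based recipe (this is a representative command rather than my exact one; fps and scale are the knobs that control file size):

ffmpeg -i recording.mp4 \
       -vf "fps=15,scale=640:-1:flags=lanczos,split[a][b];[a]palettegen[p];[b][p]paletteuse" \
       -loop 0 output.gif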
This was only possible because of this excellent stack overflow answer and this OBS blog post. The resulting quality is actually a lot better than Peek, and I’m finding I have to adjust some settings to make the file size reasonable. I suppose it’s nice to be in control, but I really liked Peek’s convenience.
Moving to 3D
The side scroller logic was all done in a 2D space, but now rendering can happen in 3D. The side scroller game logic will largely remain 2D – the player entities will continue to move around 2D world geometry:
but that 2D world geometry all exists at depth \(z=0\) with respect to a perspective-enabled camera:
Conceptually, this is actually somewhat more complicated than simply calling DrawLine for each edge, which is what I was doing before via SDL2:
for edge in mesh
DrawLine(edge)
Rendering in 3D requires:
setting up a vertex buffer and filling it with my mesh data, including color
telling OpenGL how to interpret said data
using a vertex shader that takes the position and color attributes and transforms them via the camera transformations into screen space
using a dead-simple fragment shader that just passes on the received vertex colors
calling glDrawArrays with GL_LINES to execute all of this machinery
This is not hard to do per se, but it is more complicated. One nice consequence is that there is only a single draw call for all this mesh data rather than a bunch of independent line draw calls. The GPU then crunches it all in parallel:
The coordinate transforms for this are basically what learnopengl recommends:
an identity model matrix since the 2D mesh data is already in the world frame
a view matrix obtained via glm::lookAt based on a camera position and orientation
a projection matrix obtained via glm::perspective with a 45-degree viewing angle, our window’s aspect ratio, and \(z\) bounds between 0.1 and 100:
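In code, that amounts to something like the following (a sketch with hypothetical camera values; glm::perspective takes the field of view in radians):

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp> // glm::lookAt, glm::perspective

// Hypothetical camera state.
glm::vec3 camera_pos = {0.0f, 2.0f, 10.0f};
glm::vec3 camera_target = {0.0f, 2.0f, 0.0f};
glm::vec3 world_up = {0.0f, 1.0f, 0.0f};
float aspect_ratio = 16.0f / 9.0f;

glm::mat4 model = glm::mat4(1.0f); // the 2D mesh data is already in the world frame
glm::mat4 view = glm::lookAt(camera_pos, camera_target, world_up);
glm::mat4 projection = glm::perspective(glm::radians(45.0f), aspect_ratio, 0.1f, 100.0f);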
This setup worked great to get to parity with the debug line drawing I was doing before, but I wanted to be using OpenGL for some real mesh rendering. I ended up defining two shader programs in addition to the debug-line-drawing one: one for textured meshes and one for material-based meshes. I am now able to render 3D meshes!
This low-poly knight was created by Davmid, and is available here on Sketchfab. The rest of the assets I actually made myself in Blender and exported to FBX. I had never used Blender before. Yeah, it’s been quite a month of learning new tools.
The recording above also includes a flashlight effect. This was of course based on the learnopengl tutorials too.
Rendering with shaders adds a ton of additional options for effects, but also comes with a lot of decisions and tradeoffs around how data is stored. I would like to simplify my approach and end up with as few shaders as possible. For now it’s easiest for me to make models in Blender with basic materials, but maybe I can move the material definitions to a super-low-res texture (i.e., one pixel per material) and then only use textured models.
Architecture
It is perhaps worth taking a step back to talk about how this stuff is architected. That is, how the data structures are arranged. As projects get bigger, it is easier to get lost in the confusion, and a little organization goes a long way.
First off, the 3D stuff is mostly cosmetic. It exists to look good. It is thus largely separate from the underlying game logic:
I created a MeshManager class that is responsible for storing my mesh-related data. It currently keeps track of four things:
Materials – color properties for untextured meshes
Textures – diffuse and specular maps for textured meshes
Meshes – the 3D data associated with a single material or texture
Models – a named 3D asset that consists of 1 or more meshes
struct Material {
    std::string name;
    glm::vec3 diffuse;  // diffuse color
    glm::vec3 specular; // specular color
    f32 shininess;      // The exponent in the Phong specular equation
};

struct Texture {
    GLuint id;            // The OpenGL texture id
    std::string filepath; // This is essentially its id
};

struct Mesh {
    std::string name; // The mesh name
    std::vector<Vertex> vertices;
    // Indices into `vertices` that form our mesh triangles
    std::vector<u32> indices;
    // An untextured mesh has a single material, typemax if none.
    // If the original mesh has multiple materials,
    // the assimp import process splits them.
    u16 material_id;
    // A textured mesh has both a diffuse and a specular texture
    // (typemax if none)
    u16 diffuse_texture_id;
    u16 specular_texture_id;
    GLuint vao; // The OpenGL vertex array object id
    GLuint vbo; // The OpenGL vertex buffer id associated with `vertices`
    GLuint ebo; // The element array buffer associated with `indices`
};

struct Model {
    std::string filepath;      // The filepath for the loaded model
    std::string name;          // The name of the loaded model
    std::vector<u16> mesh_ids; // All mesh ids
};
Visually, this amounts to:
All of this data can be loaded from the FBX files via assimp. In the future, I’ll also be writing this data out and back in via my own game data format.
My game objects are currently somewhat of a mess. I have an EntityManager that maintains a set of entities:
struct Entity {
    u32 uid;   // Unique entity id
    u32 flags;
    common::Vec2f pos; // Position in gamespace
    // A dual quarter edge for the face containing this position
    core::QuarterEdgeIndex qe_dual;
    u16 movable_id;  // The movable id, or 0xFFFF otherwise
    u16 collider_id; // The collider id, or 0xFFFF otherwise
};
Entities have unique ids and contain ids for the various resources they might be associated with. For now this includes a movable id and a collider id. I keep those in their own managers as well.
A movable provides information about how an entity is moving. Right now it is just a velocity vector and some information about what the entity is standing on.
A collider is a convex polygon that represents the entity’s shape with respect to collision with the 2D game world:
The player is the only entity that currently does anything, and there is only one collider size from which the Delaunay mesh is built. I’ll be cleaning up the entity system in the future, but you can see where it’s headed – it is a basic entity component system.
Editing Assets
I want to use these meshes to decorate my game levels. While I could theoretically create a level mesh in Blender, I would much prefer to be able to edit the levels in-game using smaller mesh components.
I created a very basic version of this using ImGUI wherein one can create and edit a list of set pieces:
This is leaps and bounds better than editing the level using a text file, or worse, hard-coding it. At the same time, it is pretty basic. I don’t highlight the selected set piece, I don’t have mouse events hooked up for click and drag, and my camera controls aren’t amazing. I do have some camera controls though! And I can set the camera to orthographic, which helps a lot:
The inclusion of this set piece editor meant that I needed a way to save and load game data. (You can see the import and export buttons on the top-left). I went with the same WAD-file-like approach that I used for TOOM, which is a binary file consisting of a basic header, a list of binary blobs, and then a table of contents:
Each blob consists of a name and a bunch of data. It is up to me to define how to serialize or deserialize the data associated with a particular blob type. Many of them just end up being a u32 count followed by a list of structs.
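As an illustration of that count-plus-structs pattern, the export side of one blob might look something like this (a sketch, not the project’s actual serialization code):

#include <cstdint>
#include <vector>

// Append the raw bytes of a trivially-copyable value to a byte buffer.
template <typename T>
void WriteBytes(std::vector<uint8_t>& out, const T& value) {
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&value);
    out.insert(out.end(), p, p + sizeof(T));
}

// Serialize a blob as a u32 count followed by a packed list of structs.
template <typename T>
std::vector<uint8_t> SerializeBlob(const std::vector<T>& items) {
    std::vector<uint8_t> out;
    WriteBytes(out, static_cast<uint32_t>(items.size()));
    for (const T& item : items) {
        WriteBytes(out, item); // assumes T is plain-old-data (no pointers or strings)
    }
    return out;
}

Import then reads the u32 count back and copies out that many structs; anything containing strings or nested vectors gets a hand-written version instead.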
One nice result of using a WAD-like binary is that I don’t always have to copy the data out to another data structure. With TOOM, I was copying things out in the C++ editor but was referencing the data as-is directly from the binary in the C code that executed the game. Later down the road I’ll probably do something similar with a baked version of the assets that the game can refer to.
There are some tradeoffs. Binary files are basically impossible to edit manually. Adding new blob types doesn’t invalidate old save files, but changing how a blob type is serialized or deserialized does. I could add versions, but as a solo developer I haven’t found the need for it yet. I usually update the export method first, run my program and load using the old import method, export with the new method, and then update my import method. This is a little cumbersome but works.
With TOOM I would keep loading and overwriting a single binary file. This was fine, but I was worried that if I ever accidentally corrupted it, I would have a hard time recreating all of my assets. With this project I decided to export new files every time. I have a little export directory and append the timestamp to the filename, ex: sidescroller_2023_12_22_21_28_55.bin. I’m currently loading the most recent one every time, but I still have older ones available if I ever bork things.
Conclusion
This post was a little less targeted than normal. I usually talk about how a specific thing works. In this case, I didn’t want to re-hash how OpenGL works, and instead covered a somewhat broad range of things I learned and did while adding meshes to my game project. A lot of things in gamedev are connected, and by doing it I’m learning what the implications of various decisions are and oftentimes, why things are as they are.
There are a lot of concepts to learn and tools to use. The last month involved:
OpenGL
GLAD
assimp
stb_image
Blender
OBS Studio
textures
Phong lighting
shaders
…
I am often struck by how something foreign and arcane like using a GPU seems mysterious and hard to do from the outset, but once you dive in you can pull back the curtains and figure it out. Sure, I still have a ton to learn, but the shape of the thing is there, and I can already do things with it.
With TOOM ‘done’ (or rather, as done as I want it to be), I found myself wanting to try something new. So, I started a new vscode workspace and embarked on a new side project – a 2D side scroller.
My overarching goal, once again, is to explore and take it as far as I find it interesting. That being said, I want to overcome the challenge of building a stable game with general 2D physics. By that I mean, “vectorial” physics of the sort used by Braid, with arbitrary 2D level geometry:
One reason for this goal is that I’ve tried to noodle it out before, but was surprised to find a significant lack of resources. In fact, I had to laugh out loud when I read this excellent heavily cited blog post about 2D platformers. It covers tile-based and rastered platforms in great detail, only to leave vectorial platformers with a vague “you do it yourself” and a warning not to use a physics engine. Well, challenge accepted! Let’s get into it!
There are a few other things I want to learn from this adventure. I plan to eventually use 3D GPU-accelerated graphics. It’s worth knowing how that works, and the work on TOOM convinced me that drawing with the CPU is super slow. The aesthetic I have in mind is pretty low-poly, so while I’d normally go with pixel art because that’s all I can reasonably attempt myself, I nevertheless want to give 3D a try. The other big reason to go with 3D is that I can build an animation system and animate characters once rather than 8x for the various viewing angles. I wanted to build a skeletal animation system, but felt that 2D skeletal animation systems very easily feel flat without serious artistic effort.
The general image I have in my mind is a souls-like 2D platformer with movement and combat somewhat similar to playing Link in Super Smash, except with some additional movement mechanics like ladders and other climbable surfaces and no bubble shield. It’ll probably never be a full game but maybe I can make a proper level and boss at the end.
I’ll be using C++ and vscode for this, on Linux. I’ll try to minimize library use, but will leverage SDL2 as my platform library and probably OpenGL for 3D graphics. I love ImGui and will use that for dev UI.
What I Tried First Didn’t Work
When I first started, I tried to go with an object system where the world is made of convex polygonal colliders. Seems reasonable:
The player (and one enemy) have colliders represented by these six-sided shapes. The white polygons are static geometry. So far so good.
When an entity moves, it ticks forward by \(\Delta t\). I could have done this the boring way by:
Moving the entity its full distance
Checking for collisions
Shunting the entity out of any colliding polygons
I’ve written about why this is a bad idea. In short, this approach can cause tunneling or other weird physics effects, especially when the entity is moving fast or when there are multiple intersecting bodies.
The system I implemented sought to instead find the first collision. This is the same as taking the source polygon and sweeping it along the direction of travel and finding the longest sweep \(\delta t \in [0, \Delta t]\) that does not intersect anything:
This sweeping method is harder to formulate. We aren’t really colliding the player’s collider with each static collider. Instead, we are colliding the swept player collider with each static collider (we can use a broad collision phase to narrow down to a small number of local potential colliders), but we still want to back out the maximum step we can take in our direction of motion. So we’re effectively sweeping the player collider along \(\Delta t \cdot \boldsymbol{v}\) and then checking for collisions, and we have to find the largest \(\delta t\) such that the volume swept along \(\delta t \cdot \boldsymbol{v}\) does not collide.
The way I implemented that was to fall back to our trusty strategy of using the Minkowski sum (which is really a Minkowski difference). That is, we can expand the other polygon by the player’s collision volume first, and then all we have to do is check whether the line segment between the player’s starting location and that location shifted by \(\Delta t \cdot \boldsymbol{v}\) intersects with the polygon:
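Here is roughly what that segment test looks like against a convex, counter-clockwise polygon that has already been expanded by the player’s collider. This is a sketch of the idea rather than the actual implementation; it returns the earliest fraction of the step at which we would hit the polygon, or 1 if we never do:

#include <algorithm>
#include <cmath>
#include <vector>
#include <glm/glm.hpp>

// Earliest fraction t in [0, 1] along the step `delta` at which point `p` enters
// the expanded convex polygon `poly` (CCW vertices), or 1.0f if it never does.
float FirstHit(const glm::vec2& p, const glm::vec2& delta,
               const std::vector<glm::vec2>& poly) {
    float t_enter = 0.0f; // latest time we cross into any half-plane
    float t_exit = 1.0f;  // earliest time we cross out of any half-plane
    for (size_t i = 0; i < poly.size(); ++i) {
        const glm::vec2& a = poly[i];
        const glm::vec2& b = poly[(i + 1) % poly.size()];
        glm::vec2 edge = b - a;
        glm::vec2 normal(edge.y, -edge.x);    // outward normal for a CCW polygon
        float dist = glm::dot(normal, p - a); // > 0 means outside this half-plane
        float speed = glm::dot(normal, delta);
        if (std::abs(speed) < 1e-9f) {
            if (dist > 0.0f) return 1.0f;     // moving parallel while outside: never enters
        } else {
            float t = -dist / speed;
            if (speed < 0.0f) t_enter = std::max(t_enter, t); // crossing inward
            else              t_exit = std::min(t_exit, t);   // crossing outward
        }
    }
    return (t_enter <= t_exit) ? t_enter : 1.0f; // no overlap window means no hit
}

The maximum safe step is then just shy of that fraction of \(\Delta t \cdot \boldsymbol{v}\), minimized over the nearby expanded polygons.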
That mostly worked. It just had a lot of trouble with corner cases. For example, when the player is standing on a surface and we’re traveling along it, we don’t want floating point errors to accidentally place the player inside the supporting polygon. This was particularly problematic in the gameplay gif above, where the player ends up wedged between two squares and slides through one of them.
I had a lot of trouble with walking. I wanted the player to be able to walk across colliders along the floor without treating it like a ramp and launching off for a bit:
This means we have to do some hacky checks to figure out when we’re walking along a polygon and reach its end, and may need to transition to walking along a new polygon. I tried getting this all to work reliably, and got pretty far, but ultimately the number of hacks I had to add made the system untenable and left a bad taste in my mouth.
Redesign
The fundamental problem is that my previous system was fully continuous, but that continuity inherently had issues with floating point effects when the player is supported by geometry. In these cases, we want something more discrete that can concretely tell us which side of the geometry we are on. Furthermore, I wanted something that could more easily tell me whether, when walking along a surface, there was another surface to walk on after passing the end of the current segment. These thoughts led me to the trusty Delaunay mesh data structure.
I’ve written blog posts that use constrained Delaunay meshes (aka DCELs) a few times now. Now that I know this data structure exists, I keep finding uses for it.
Delaunay meshes are triangle meshes in which we try to keep the triangles as equilateral as possible (to avoid very fractured / long and narrow tiles). Constrained Delaunay meshes are the same thing, except we force certain edges to exist. This makes these meshes useful for things like representing map geometry, where we force walls to be edges in the mesh:
The E1M1 map from DOOM, after I’ve loaded it into a constrained Delaunay mesh
I actually advised that one use constrained Delaunay meshes for 2D player movement in this previous blog post. I hadn’t initially followed my own suggestion because the mesh is dependent on both the level geometry and the player collider (each polygon is expanded by the player collider via a Minkowski sum). If the player ducks and their collider shrinks, then we have to change our mesh. If an enemy has a different sized collider, then it should use a different mesh. If a piece of level geometry moves, such as a platform or door, then we need to change the mesh. All of these problems are still problems, but I’ll have to find workarounds for them. Nevertheless, I’m going with exactly what I spelled out in that earlier blog post.
My new plan was to keep track of both the player’s position and the triangle of the collision mesh that they find themselves in. The collision mesh already accounts for the player’s collision volume, so we need only keep track of their center position. When they move, we check to see if their motion would take them out of their containing triangle, and if it does, we handle collisions (if necessary) at the crossed edge.
Here we see the updated system, along with the collision mesh.
Building the Mesh
Since I’ve covered how the movement works before, I’d rather focus on how we get the mesh.
This mesh is obtained by expanding each polygon by the player’s collision volume, and then adding all resulting edges to the Delaunay mesh. These edges are constrained to ensure that they remain in the final mesh (otherwise, we might flip edges in order to make them more equilateral). The main difficulty arises in the fact that the edges of the expanded polygons may intersect, and we nevertheless have to support them in our final mesh:
For example, these expanded polygons overlap.
I had previously tackled this using polygon unions via Clipper2. I could do that again, but I want to avoid the loss of information that comes with replacing two or more overlapping polygons with a single polygon. For example, I might want to know which feature my player is standing on, or I may end up wanting to support removing objects. Plus, figuring out my own solution is a reward unto itself.
What I ended up doing was updating my method for constraining edges in a DCEL. During the development of TOOM, I had made it possible to load the DOOM level geometry and constrain the appropriate edges. Here we just added sides as we went, and constrained them after adding them:
for (Linedef linedef : linedefs) {
    // Insert vertex a and b.
    for (i16 i_vertex : {linedef.i_vertex_start, linedef.i_vertex_end}) {
        const common::Vec2f& v = doom_vertices[i_vertex];
        InsertVertexResult res = mesh.InsertVertex(v);
        if (res.category == IN_FACE || res.category == ON_EDGE) {
            mesh.EnforceLocallyDelaunay(res.i_qe);
            vertex_indices[i_vertex] = mesh.GetVertexIndex(res.i_qe);
        }
    }
    // Now ensure that there is a line between A and B.
    // If there is not, flip edges until there is.
    QuarterEdgeIndex qe_a =
        mesh.GetQuarterEdge(vertex_indices[linedef.i_vertex_start]);
    QuarterEdgeIndex qe_b =
        mesh.GetQuarterEdge(vertex_indices[linedef.i_vertex_end]);
    QuarterEdgeIndex qe_ab = mesh.EnforceEdge(qe_a, qe_b);
    mesh.ConstrainEdge(qe_ab);
    // sidedef and passability stuff...
}
For example, we might add the vertices for A and B, and then end up with a mesh that is locally Delaunay but does not have an edge between A and B:
The call to EnforceEdge then flips CD in order to get the edge we want:
This approach worked for the DOOM maps because the sides there never crossed over one another. For example, if CD is another enforced edge, there was no way to also enforce AB by flipping edges:
I needed a more general way to specify two points A and B and then adjust the mesh such that those two points are joined by a straight line, potentially via multiple colinear segments.
The updated algorithm has an outer loop that alternately tries tackling the problem from A to B and from B to A. As soon as it finds or produces a connection, it terminates. If it fails to make progress from both directions, it exits out.
Attempting to make progress from A to B starts by finding the edge from A that is as close as possible in a counter-clockwise / right-hand sense to the desired segment AB:
We’ll call the other side of this segment C. If C = B, we are done, as then AB already exists.
It is possible that C lies along AB. When this happens, we replace A with C and return to the outer loop:
If we haven’t returned yet, then AC is counter-clockwise of AB. We need to flip CD, where D is on the other side of AB from C. If CD is not constrained, we can flip it.
The most interesting case occurs when CD is constrained. We cannot flip CD because it is fixed – we want it to remain an edge in the final mesh. So we instead introduce a new point E at the intersection of AB and CD, split CD at E, and add AE and EB:
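In pseudocode-ish C++, the A-to-B half of that outer loop looks roughly like this. All of the helper names here are illustrative rather than the real DCEL API, and the alternation between the two directions is omitted:

// Ensure a straight connection between vertices A and B,
// possibly via several colinear constrained segments.
void EnforceEdgeBetween(Mesh& mesh, VertexIndex A, VertexIndex B) {
    while (A != B) {
        // Find the edge out of A that is closest, in the CCW sense, to segment AB.
        EdgeIndex AC = mesh.ClosestEdgeCcwOf(A, B);
        VertexIndex C = mesh.Destination(AC);
        if (C == B) return;                // AB already exists
        if (mesh.LiesOnSegment(C, A, B)) { // C is colinear with AB
            A = C;                         // continue from C
            continue;
        }
        EdgeIndex CD = mesh.EdgeCrossing(A, B, C); // the edge that crosses segment AB
        if (!mesh.IsConstrained(CD)) {
            mesh.FlipEdge(CD);             // flipping it makes progress toward AB
        } else {
            // CD must stay, so split it at the intersection point E
            // and enforce A-E and E-B instead.
            VertexIndex E = mesh.SplitEdgeAt(CD, mesh.IntersectionPoint(A, B, CD));
            EnforceEdgeBetween(mesh, A, E);
            A = E;
        }
    }
}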
We can see this in action in the TOOM mesh editor, updated with this method:
Here we have a constrained edge
Here we have to flip or intersect multiple edges
So far I’m just building a mesh with constrained edges matching those of the expanded collision polygons, and only have a single such mesh to reflect the player’s collision polygon. I do not yet track which polygon corresponds to which triangle, nor do edges otherwise have any special properties. Nevertheless, what I have is sufficient for basic 2D player movement.
The player (and any other entity) has a 2D position and a triangle face in which their position lies. The 2D position is continuous, but the triangle face is discrete. Having something discrete resolves the issue with floating point rounding – we know exactly which side of a line we are (supposed to be) on.
Note that the constrained Delaunay mesh does not actually contain the white collider polygons:
I’ll probably talk about this more next time, but movement physics currently get processed in two different ways. If the moving entity is airborne, then they move according to the standard laws of physics (apply acceleration, use velocity and delta t to get future position, sweep along the line segment, and check for triangle traversals). If we collide with a solid edge, we lose all velocity going into it and then continue on in our new direction. Colliding with a solid edge can result in the agent transitioning to the supported state.
If the moving entity is standing or walking on a surface, i.e. is supported, we handle things a bit differently. Here their velocity is projected along the supporting surface. Any time they move past the end of the supporting surface, we find the next supporting edge (if there is one) and allow them to transition over. If they are moving fast enough or there is no supporting edge to connect to, they transition to being airborne.
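The core of the supported update is just a projection of the velocity onto the supporting edge (a small sketch using glm types as a stand-in for my own vector type):

#include <glm/glm.hpp>

// Project the entity's velocity onto the supporting edge (a -> b) so that
// walking follows the surface instead of launching off of it.
glm::vec2 ProjectOntoSupport(const glm::vec2& velocity,
                             const glm::vec2& a, const glm::vec2& b) {
    glm::vec2 tangent = glm::normalize(b - a);
    return glm::dot(velocity, tangent) * tangent;
}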
The player walking across supporting surfaces
Conclusion
Constrained Delaunay meshes resolved many of the tricky issues that arose with tackling 2D player movement for arbitrary terrain geometry. Knowing which face a player is in is a discrete fact that removes the issues with numeric imprecision in general polygon-on-polygon collisions, without having to resort to shunting. It also naturally allows us to leverage the mesh to query things like the next supporting face.
The code for the updated Delaunay mesh is available here.
In the future I hope to support additional motion states, such as climbing on ladders and surfaces. I’ll want to add support for dropping down “through” a platform, wall slides and jumps, and some other basic sidescroller movement features. I want to add 3D GPU graphics, and with that, animations and combat. It turns out there is a lot that goes into a basic game.
One of the most interesting parts of the Convex Optimization lectures by Stephen Boyd was the realization that (continuous) optimization more-or-less boils down to repeatedly solving Ax = b. I found this quite interesting and thought I’d write a little post about it.
Constrained Optimization
An inequality-constrained optimization problem has the form:
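Writing the objective as \(f\) and the inequality constraints as \(g_i\) (my notation here), that is:
\[\begin{aligned}\underset{\boldsymbol x}{\text{minimize}} & & f(\boldsymbol x) \\ \text{subject to} & & g_i(\boldsymbol x) \leq 0, \quad i = 1, \dots, m\end{aligned}\]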
One popular algorithm for solving this problem is the interior point method, which applies a penalty function that prevents iterates \(\boldsymbol{x}^{(k)}\) from becoming infeasible. One common penalty function is the log barrier, which transforms our constrained optimization problem into an unconstrained optimization problem:
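With the log barrier, the surrogate problem becomes (matching the \(\frac{1}{\rho} p_\text{barrier}\) form used later in this post):
\[\underset{\boldsymbol x}{\text{minimize}} \quad f(\boldsymbol x) + \frac{1}{\rho} p_\text{barrier}(\boldsymbol x), \qquad p_\text{barrier}(\boldsymbol x) = -\sum_{i=1}^m \log\big(-g_i(\boldsymbol x)\big)\]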
The penalty scalar \(\rho > 0\) starts small and is increased as we iterate, reducing the penalty for approaching the constraint boundary.
We thus solve our constrained optimization problem by starting with some feasible \(\boldsymbol x^{(0)}\) and then solving a series of unconstrained optimizations. So the first takeaway is that constrained optimization is often solved with iterated unconstrained optimization.
Here we see an objective function (blue) constrained with \(x \geq 1\) (gray line). The red line shows the penalized surrogate objective. As we increase \(\rho\) and optimize, our solution moves closer and closer to \(x = 1\).
Unconstrained Optimization
We have found that constrained optimization can be solved by solving a series of unconstrained optimization problems.
An unconstrained optimization problem has the form:
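That is, we simply seek a minimizer of a single objective:
\[\underset{\boldsymbol x}{\text{minimize}} \quad f(\boldsymbol x)\]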
If \(f\) is continuous and twice-differentiable, we can solve this problem with Newton’s method. Here we fit a quadratic function using the local gradient and curvature and use its minimum (which we can solve for exactly) as the next iterate:
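Concretely, each iteration solves for the step using the local Hessian \(\boldsymbol H\) and gradient:
\[\boldsymbol x^{(k+1)} = \boldsymbol x^{(k)} - \boldsymbol H(\boldsymbol x^{(k)})^{-1} \nabla f(\boldsymbol x^{(k)})\]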
Continuous functions tend to become bowl-shaped close to their optima, so Newton’s method tends to converge extremely quickly once we get close. Here is how that looks on the banana function:
What this means is that unconstrained optimization is done by repeatedly calculating Newton steps. Calculating \(\boldsymbol H(\boldsymbol x^{(k)})^{-1} \nabla f(\boldsymbol x^{(k)})\) is the same as solving Ax = b:
\[\boldsymbol H\boldsymbol d = \nabla f\]
Well alrighty then, optimization is indeed a bunch of Ax = b.
What Does this Mean?
There are several takeaways.
One is that it’s nice that complicated things are just made up of simpler things that we can understand. One can take comfort in knowing that the world is understandable.
Another is that we should pay a lot of attention to how we compute Ax = b. Most of our time spent optimizing is spent solving Ax = b. If we can do that more efficiently, then we can optimize a whole lot faster.
Estimating Solve Times
How long does it take to solve Ax = b? Well, that depends. Let’s say we’re naive, and we have a random \(n \times n\) matrix \(\boldsymbol A\) and a random vector \(\boldsymbol b\). How fast can our computers solve that?
Well, it turns out that my laptop can solve a random \(1000 \times 1000\) linear system in about 0.35s:
We can plot the computation time vs. matrix size:
At first glance, what we see is that yes, computation grows as we increase \(n\). Things get a little fishy though. That trend line is \(n^2/10^7\), which is not the \(O(n^3)\) growth that we expected. What is going on?
While Julia itself is running single-threaded, the underlying BLAS library can use up to 8 threads:
Our system is sneakily making things more efficient! If we force BLAS to be single-threaded, we get the \(O(n^3)\) growth we expected after about \(n=200\) (Thank you Julia Discourse!):
What we’ve learned is pretty important – there are speedups to be had that go beyond traditional big-O notation. If you have an optimization problem, chances are that multithreading and SIMD can speed things up for you. If you have a really large optimization problem, maybe you can distribute it across multiple machines (Proximal methods, for example, are often amenable to this).
We also learned how mind-bogglingly fast linear algebra is, even on a single thread. Think about how long it would take you to manually solve Ax = b for \(n=2\) with pen and paper, and how fast the same operation would take on your machine.
If it takes 100 Newton iterates to solve an unconstrained optimization problem, then you can solve a problem with a thousand variables in 3.5s. Multiply that by another 100 for a constrained optimization problem, and you’re looking at solving a constrained optimization problem with \(n=1000\) in under 6 minutes (and often in practice, much faster).
We live in an era where we can solve extremely large problems (GPT-4 is estimated to have about 1.7 trillion parameters), but we can also solve smaller problems really quickly. As engineers and designers, we should feel empowered to leverage these capabilities to improve everything.
Special Structure
Appendix C of Convex Optimization talks a lot about how to solve Ax = b. For starters, generally solving \(\boldsymbol A \boldsymbol x = \boldsymbol b\) for an arbitrary \(\boldsymbol A \in \mathbb{R}^{n \times n}\) and \(\boldsymbol b \in \mathbb{R}^n\) is \(O(n^3)\). When we’re using Newton’s method, we know that the Hessian is symmetric and positive definite, so we can compute the Cholesky decomposition \(\boldsymbol A = \boldsymbol F^\top \boldsymbol F\). Overall, this takes about half as many flops as Gaussian elimination [source].
There are a bunch of other types of structure that we can take advantage of when we know that it exists. The most obvious is when \(\boldsymbol A\) is diagonal, in which case solving Ax = b is \(O(n)\).
One particularly interesting example is when \(\boldsymbol A\) has arrow structure:
When this happens, we have to solve:
\[\begin{bmatrix} \boldsymbol A & \boldsymbol B \\ \boldsymbol B^\top & \boldsymbol C \end{bmatrix} \begin{bmatrix} \boldsymbol x \\ \boldsymbol y \end{bmatrix} = \begin{bmatrix}\boldsymbol b \\ \boldsymbol c\end{bmatrix}\] where \( \boldsymbol A\) is diagonal, block-diagonal, or banded; \(\boldsymbol B\) is long and narrow; and \(\boldsymbol C\) is dense but comparatively small.
Solving a system with arrow structure can be done by first solving \(\boldsymbol A \boldsymbol u = \boldsymbol b\) and \(\boldsymbol A \boldsymbol V = \boldsymbol B\), which are both fast because we can exploit \( \boldsymbol A\)’s structure. Expanding the top row and solving gives us:
\[\boldsymbol x = \boldsymbol A^{-1} \left( \boldsymbol b - \boldsymbol B \boldsymbol y \right) = \boldsymbol u - \boldsymbol V \boldsymbol y\]
We can substitute this into the bottom row to get:
\[\begin{aligned} \boldsymbol B^\top \boldsymbol x + \boldsymbol C \boldsymbol y & = \boldsymbol c \\ \boldsymbol B^\top \left( \boldsymbol u - \boldsymbol V \boldsymbol y \right) + \boldsymbol C \boldsymbol y & = \boldsymbol c \\ \left( \boldsymbol C - \boldsymbol B^\top \boldsymbol V\right) \boldsymbol y & = \boldsymbol c - \boldsymbol B^\top \boldsymbol u\end{aligned}\]
Having solved that dense system of equations for \(\boldsymbol y\), we can back out \(\boldsymbol x\) from the first equation and we’re done.
You might think that arrow structure is a niche nicety that never actually happens. Well, if you look at how packages like Convex.jl, cvx, and cvxpy solve convex optimization problems, you’ll find that they expand and transform their input problems into a partitioned canonical form:
\[\begin{aligned}\underset{\boldsymbol x}{\text{minimize}} & & d + \boldsymbol c^\top \boldsymbol x + \frac{1}{\rho} p_\text{barrier}(\boldsymbol x) \\ \text{subject to} & & \boldsymbol A^\top \boldsymbol x = \boldsymbol b\end{aligned}\]
For problems in this form, we can apply a primal-dual version of Newton’s method that finds a search direction for both \(\boldsymbol x\) and the dual variables \(\boldsymbol \mu\) by solving:
\[\begin{bmatrix} \nabla^2 p_\text{barrier}(\boldsymbol x) & \boldsymbol A^\top \\ \boldsymbol A & \boldsymbol 0\end{bmatrix} \begin{bmatrix}\Delta \boldsymbol x \\ \boldsymbol \mu \end{bmatrix} = -\begin{bmatrix}\rho \boldsymbol c + \nabla p_\text{barrier}(\boldsymbol x) \\ \boldsymbol A \boldsymbol x - \boldsymbol b\end{bmatrix}\]
Wait, what is that we see here? Not only does this have arrow structure (The Hessian matrix \(\nabla^2 p_\text{barrier}(\boldsymbol x)\) is block-diagonal), but the tip of the arrow is zero! We can thus solve for the Newton step very efficiently, even for very large \(n\).
So this esoteric arrow-structure thing turns out to be applicable to general convex optimization.
Conclusion
If you take a look at university course listings, there is a clear pattern of having classes for “Linear X” and “Nonlinear X”. For example, “EE263: Introduction to Linear Dynamical Systems” and “E209B Advanced Nonlinear Control”. You learn the linear version first. Why? Because we’re really good at solving linear problems, so whenever a problem can be framed or approximated in a linear sense, we’ve either solved it that way first or found it easiest to teach it that way first. Anything can be locally approximated by a 1st-order Taylor series, and then we can just work with a linear system of equations.
I love that insight, but I also love the implication that it has. If we can understand how to work with linear systems, how to exploit their structure, and how to solve them quickly to sufficient accuracy, then we can speed up the solution of pretty much any problem we care about. That means we can solve pretty much any problem we care about better, more often, and to a higher degree of precision, and thereby make the world a better place.