Alg4Opt – Exporting Alt Text

May 3, 2026 by timw

Mykel Kochenderfer and I are finishing up the second edition of Algorithms for Optimization with MIT Press. The first edition took about five years, and the second edition has *only* taken… another five years. You can read it here.

What started as a simple extension with three new chapters became much more extensive. We overhauled many existing chapters, added new sections to others, and generally touched up the entire book. It has earned the title of second edition.

Modern technical textbooks include alt text: short descriptions attached to images that screen readers can read aloud to visually impaired readers. MIT Press authors provide alt text for every graphic element by submitting a spreadsheet in addition to the manuscript. Having alt text definitions live in a different place than our LaTeX source immediately felt like a bad idea. We don’t want to rely on memory to keep the alt text document up to date every time we change a figure or introduce a new one.

I decided very early on that I wanted the alt text descriptions to be alongside each graphical element in the LaTeX source. I defined a macro, \alttext{...}, and use that in every graphical element:

\begin{marginfigure}
    \centering
    \begin{tikzpicture}[
    	->, >=stealth',
    	level/.style={sibling distance = 1.5cm/#1, level distance = 1cm},
    	mynode/.style={circle, minimum size=7mm, align=center},
    	terminal/.style={mynode, draw=black, fill=none},
    	nonterminal/.style={mynode, draw=pastelBlue, fill=pastelBlue},
    	]
    	\node [terminal] {$+$}
    	child{ node[terminal] {$x$} }
    	child{ node[terminal] {$\ln$}
    		child{ node[terminal] {$2$}
    		}
    	};
    \end{tikzpicture}
    \alttext{A directed acyclic graph with a root plus node, a left-side x node, and a right-side natural log node with a two-node child.}
	\caption{
        \label{fig:expr-ln}
        The expression $x + \ln 2$ represented as a tree.
    }
\end{marginfigure}

These annotations let us update the alt text any time the figure is changed. The macro evaluates into nothing (\newcommand{\alttext}[1]{}), so it doesn’t affect document compilation. Great!

Unfortunately, MIT Press still needed a spreadsheet containing the figure numbers, captions, and their corresponding alt text. I needed an automated way to export the alt text annotations from the LaTeX source.

The First Attempt: Per-Line Scanning

My initial approach was to write a Julia script that scanned each LaTeX source file line-by-line and used regex matching to identify graphical elements and the \alttext macros. This script served a few purposes. First, it could identify which graphical elements still lacked alt text entries. Second, it allowed us to automatically check that our alt text entries adhered to MIT Press standards, such as avoiding certain characters and staying below a length limit. Finally, it packaged them up and exported the data to the required spreadsheet.

The first challenge was figure identification. The script looked for environments like \begin{tikzpicture} and commands like \plot. Unfortunately, there was more nuance here than one would have initially thought. TikzPicture environments can be nested, and in a few cases we define a new command containing a TikzPicture environment, only to invoke it later with a \protect to inject it safely into a caption. We also have \begin{ignore} environments, and have to ignore graphical elements in those.

After graphical elements were identified, I searched within them for an alt text entry. This was done with a simple lookahead search. Not ideal, but it worked fairly well.

function find_alttext(lines, line_index_lo::Int, line_index_hi::Int)::String
    for line in lines[line_index_lo:line_index_hi]
        m = match(r"\\alttext\{([^}]+)\}", line)
        if isa(m, RegexMatch)
            return strip(m[1])
        end
    end
    return ""
end

MIT Press wanted the figure numbers for each alt text entry. That is, if a figure has a caption with, e.g., “Figure 5.7”, we wanted 5.7 exported in the spreadsheet. LaTeX does all of the numbering at compile time, which is extremely convenient, but that makes it harder to infer from the source. The script had to keep track of figure numbers, incrementing them over time, but only for graphical elements that had such captions. The script would try to figure out whether a caption existed, but would only assign a new figure number if the graphical element was actually a figure or margin figure.

This script was used for the first version of the spreadsheet that we sent to MIT Press. It worked really well with respect to identifying figures that didn’t have alt text, and verified the content of the entries. Unfortunately, it did have some issues with figure number identification, and I ended up having to manually scan through all figures in the book and manually update the spreadsheet – exactly what I had wanted to avoid doing.

A Better Way: Lexing and Parsing

LaTeX source code is, well, code. And code is fundamentally hierarchical. Thinking of source code as a list of lines has a lot of downsides. Instead, to extract the data accurately, the script needed to understand the structure of the document more like the LaTeX compiler does. I started over, writing a proper lexer and parser, thereby providing a simple Abstract Syntax Tree (AST).

A lexer’s job is to convert the source code – a long byte array – into a sequence of manageable tokens. Reasoning about a character sequence like "\alttext{this is a figure}" is a lot harder than reasoning about <command><brace_open><text><brace_close>.

Each Token is effectively an enum that applies to a range of characters:

@enum TokenKind begin
    TOKEN_TEXT        # Raw text content
    TOKEN_COMMAND     # A LaTeX command starting with \, e.g. \alttext
    TOKEN_BRACE_OPEN  # An opening curly brace {
    TOKEN_BRACE_CLOSE # A closing curly brace }
end

struct Token
    kind::TokenKind
    i_byte_lo::Int
    i_byte_hi::Int
end

The next step, parsing, does the work of relating tokens to one another to form a tree. For example, in the running example, the text is an input into the \alttext command, so it becomes a child:

This tree lets us represent larger structures more efficiently. For example, the marginfigure code sample at the top ends up being represented as a tree:

This tree structure makes it much easier to tell whether a caption is defined for the margin figure, or whether there is an alt text command, or whether the marginfigure itself is inside an ignore environment.

Nodes are more complicated than Tokens, but not by much:

@enum NodeKind begin
    NODE_ROOT    # The root node of the abstract syntax tree (AST)
    NODE_TEXT    # A text node
    NODE_GROUP   # A group of nodes enclosed in curly braces
    NODE_COMMAND # A command node, optionally with arguments as children
    NODE_ENV     # An environment like \begin{...} ... \end{...}
end

struct Node
    kind::NodeKind
    i_byte_lo::Int      # index into the original byte array of the first byte corresponding to this node
    i_byte_hi::Int      # index into the original byte array of the last byte corresponding to this node
    i_token_lo::Int     # index into the tokens array of the first token corresponding to this node
    i_token_hi::Int     # index into the tokens array of the last token corresponding to this node
    i_name_lo::Int      # index into the original byte array of the first byte of the command name
    i_name_hi::Int      # index into the original byte array of the last byte of the command name
    i_parent::Int       # parent node index, or zero otherwise
    i_first_child::Int  # first child node index, or zero otherwise
    i_sibling_next::Int # next sibling node index - circular list
    i_sibling_prev::Int # previous sibling node index - circular list
end

(This struct definition is very similar to the Large Array of Things struct from last month.)

It supports five types of nodes. A true LaTeX compiler would of course do more, but we only need these. The root node is the root of the abstract syntax tree for the file being processed. Text nodes are basic text. Group nodes are formed from nodes enclosed by curly braces. Command nodes are commands like \alttext. Environment nodes represent a \begin{the_env}...\end{the_env} pair.

Using an AST made the alt text export logic much cleaner and more robust. This helped with the numbering issue, as it was easier to detect when a caption was defined, and whether a graphical element was inside an example or table.

We were able to detect when a graphical element was defined inside a \newcommand call, and added some basic tracking to find invocations of the defined command (typically after \protect). These deferred macros were really hard to do with the previous system.

Conclusion

Yay! We built a mini-compiler to avoid filling out a spreadsheet. That might sound wasteful, but for a five-year project, spending a few days to ensure long-term maintainability is an easy decision.

What started as a simple string-matching script evolved into a LaTeX AST parser. Using the right data structure for the problem at hand made the exporting task a lot simpler. This process overall gives us the best of both worlds – our alt text lives next to the LaTeX for the graphical elements and it can be automatically exported into the spreadsheet that our publisher needs.

Into Depth – Large Array of Things

March 31, 2026 by timw

A depiction of the change in the data structure used to represent game entities, going from type-specific lists with a top-level entity table to a single large array of things.

Last month was a big overhaul of the underlying grid representation in order to be able to handle portals. This month ended up being another major refactor, but instead of the grid, I overhauled how game entities are stored.

I wanted to get Pickups back into the game in order to get the fundamental game loop: Heroes enter levels, gather Pickups, and exit to unlock items that make them more powerful. Once Pickups were working, I wanted to add other core mechanics like ropes (for climbing) and buckets (which hold Pickups and can be hoisted via ropes).

All of this logic made it very clear that my current way of interacting with entities was too cumbersome. I had a top-level Entity type that could be looked up by any entity’s ID, and then would itself contain a type enum and a type-specific ID in order to look up the specialized data in the type-specific list. That looked something like:

// Grab the hero.
const Entity* entity = Get(stage.pool_entity, entity_id);
ASSERT(entity != nullptr && entity->type == EntityType::HERO);
Hero* hero = GetMutable(stage.pool_hero, entity->type_id);

After this change, there is no indirection. All entities are in a single array:

// Grab the hero.
const StageThing* thing = Get(stage.things, entity_id);
ASSERT(thing->type == EntityType::HERO);
// Now we can directly access `thing->hero` and do stuff.

Making the transition to a Large Array of Things data structure removed this annoying indirection. More importantly, it simplified how concepts are shared across entities (e.g., all StageThings now have a cell index). That shared context streamlined my event system – an AttachTo event no longer needs special logic for buckets vs. ropes vs. heroes – and drastically simplified my undo/redo system.

With this new architecture, I was finally able to implement some basic gameplay:

This video shows the state of the game at the time of writing. We see the rope being equipped before entering a basic level. The knight then deploys the rope, which the sorceress can then use to climb down, pick up a relic, and then climb back out. Graphics and UI are very rough first passes.

The rest of this post will cover why I used the original approach, what the Large Array of Things is, and how it led to these improvements.

Old Method: Type-Specific Lists

I started with type-specific lists for every entity type in my game, following what I understood was Billy Basso’s approach on Animal Well. Every entity type gets a struct, and we maintain an array of each struct type that has some max capacity, and we keep the array densely packed as entities are added or removed:

Keeping the live entities packed in the front makes iteration over the entities really fast. We only need to iterate over the first N elements, and they are all sequential in memory.

However, keeping the entities packed means that if an entity is deleted, we will move other entities around:

The result is that we can’t reliably point to an entity with a pointer, since its location in memory can change. To solve this, each ObjectPool maintained an extra layer of indirection. Each entity was referred to by an ID, which indexed into a lookup table to find the struct’s actual array index. All of this was managed internally by the pool, so the user never had to track it. I didn’t find this problematic at all.
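A minimal sketch of that kind of pool, assuming a fixed capacity and swap-with-last removal. The names here (PackedPool, Add, Remove, Get) are made up for illustration and are not the game's actual ObjectPool:

```cpp
#include <cassert>

// Hypothetical fixed-capacity pool: items stay densely packed, and a lookup
// table maps each stable ID to the item's current array index.
template <typename T, int kCapacity>
struct PackedPool {
    T items[kCapacity];          // live items occupy [0, count)
    int id_of_index[kCapacity];  // which ID owns each packed slot
    int index_of_id[kCapacity];  // current slot of each ID, or -1 if unused
    int count = 0;

    PackedPool() {
        for (int i = 0; i < kCapacity; ++i) index_of_id[i] = -1;
    }

    // Returns a stable ID, or -1 if the pool is full.
    int Add(const T& item) {
        if (count == kCapacity) return -1;
        for (int id = 0; id < kCapacity; ++id) {
            if (index_of_id[id] != -1) continue;
            items[count] = item;
            id_of_index[count] = id;
            index_of_id[id] = count;
            ++count;
            return id;
        }
        return -1;
    }

    // Fills the hole with the last live item so the array stays packed.
    void Remove(int id) {
        if (id < 0 || id >= kCapacity || index_of_id[id] < 0) return;
        int index = index_of_id[id];
        int last = count - 1;
        items[index] = items[last];
        id_of_index[index] = id_of_index[last];
        index_of_id[id_of_index[index]] = index;  // re-point the moved item's ID
        index_of_id[id] = -1;                     // then retire the removed ID
        --count;
    }

    // Returns nullptr for IDs that are out of range or no longer live.
    T* Get(int id) {
        if (id < 0 || id >= kCapacity || index_of_id[id] < 0) return nullptr;
        return &items[index_of_id[id]];
    }
};
```

The caller only ever holds IDs, so it does not matter that Remove shuffles items around in memory.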

At the beginning of game development, I was iterating through my lists of types quite a lot. Rendering was effectively:

for (hero in heroes) {
    RenderHero(hero)
}
for (rope in ropes) {
    RenderRope(rope)
}
// etc.

That approach was fine before I introduced portals. Now, the same tile might show up in multiple places on the screen. I was no longer able to just loop through the entities to render them; the entities themselves didn’t know where their cells were located in the final view. The iteration had to go the other way:

for (x in view.lo.x to view.hi.x) {
    for (y in view.lo.y to view.hi.y) {
        cell = grid.cells[x][y]
        for (entity in cell) {
            RenderEntityAt(entity, x, y)
        }
    }
}

With spatial iteration taking over, the primary benefit of densely packed lists completely disappeared. Yet, I was still paying the cost of indirection—which, honestly, was more about code line-count overhead than runtime performance—every single time I accessed an entity.

Old Method: Top-Level Entity List

The type-specific lists were not sufficient when it came to entities referring to other entities. For example, a rope might refer to the entity holding it. Initially, maybe I knew that would always be a hero, and I could use the hero’s pool index. However, if I later wanted to additionally allow tying a rope to, say, a piton, I’d have to use some cumbersome union. The same issue applied to held objects. A hero might hold a rope, a bucket, or a pickup. If I used a union to store the reference, I would have to update that union—and all the branching logic tied to it—every time I added a new type of holdable item to the game. I needed a more general entity reference.

The solution I went with was another list, this time of a generic Entity struct that would contain both the entity’s type enum and its type-specific ID. I also stored additional IDs that let me refer to entities generically, across levels (a feature that was never fully fleshed out). The resulting struct was small, and just enough to support the double-lookup:

struct Entity {
    // The type of entity enum
    EntityType type;
    // The identifier for the entity among all entities in a stage
    EntityId id;
    // The entity's index into the object pool for its type
    EntityId type_id;
    // A globally unique ID for the entity for
    // referring to an entity in the game across all levels.
    // All entities defined for a level have a UID.
    // For other entities (like those spawned temporarily in a stage),
    // this is kInvalidEntityUid.
    EntityUid uid;
};

As discussed earlier, this system required that, given an EntityId, I first look up the Entity struct and then use the entity’s type to identify the type-specific pool and access the type-specific data via the type ID. In addition, actions on entities had to be specialized to each entity type, which made the event system complicated. Finally, I wasn’t taking advantage of the tightly packed object lists, so there wasn’t much keeping me on the current system other than momentum.

Large Array of Things

I also learned about the Large Array of Things from the Wookash podcast, in an episode with Anton Mikhailov. This data structure is basically an array of fat structs – rather than having a different collection for every entity type, you define _one_ struct that all entities, or Things, rather, use. The data structure is extremely simple – just an array of Things:

If we want lists, we can hook Things together using intrusive pointers. Any Thing can “point” to another Thing via its integer index. We get a list by maintaining an external index to the head Thing:

Iteration through the list simply starts with the first Thing and goes until the index is invalid:

int index_next = index_first_hero;
while (index_next != kInvalidThingIndex) {
   Thing& thing = things[index_next];
   DoSomething(thing);
   index_next = thing.index_same_type_next;
}

That works just fine, but it has two hazards: (1) we have to make sure to initialize all Things with index_same_type_next = kInvalidThingIndex, and (2) if we accidentally index the array with an invalid index_same_type_next, we go out of bounds. Anton recommended making all such reference lists circular:

This means thing.index_same_type_next should always be valid. If a Thing is the only one of its type, it will point to itself.

Iteration is very similar, and has the advantage that one can start from anywhere in the list:

int index_next = index_first_hero;
do {
   Thing& thing = things[index_next];
   DoSomething(thing);
   index_next = thing.index_same_type_next;
} while (index_next != index_first_hero);

In addition to lists, indexing can also be used to represent trees:

Each Thing has a head index to a list of children, which then reference each other using sibling indices.

It makes less sense here for the head index, index_first_child, to simply point back at the same Thing, since a parent-child relationship is not a circular list. Anton Mikhailov recommended using zero for the invalid index rather than a truly invalid index like -1 in order to prevent crashes. Yes, you should never access a Thing at an invalid index, but at least your game doesn’t explode in prod if a rare code path does. Plus, this means zero-initialization defaults those indices to invalid.

As a side effect, the very first Thing is never used. That is a small price to pay, especially if you’re likely going to allocate 10,000.

Lastly, Anton Mikhailov said it is often useful to use doubly-linked lists instead of singly-linked lists. With a prev index, a Thing can be unlinked from the middle of a list in O(1), without walking the list to find its predecessor. For trees, it is very useful for a child to know its parent.
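Putting those recommendations together, here is a rough sketch of O(1) insertion and removal on a circular, doubly-linked, index-based list. The Thing here is a hypothetical stand-in with only the list links, and ThingList, ListInsert, ListRemove, and ListLength are illustrative names:

```cpp
#include <cassert>

constexpr int kMaxThings = 16;

// Hypothetical, stripped-down Thing with only the same-type list links.
// Index 0 is reserved as the invalid index.
struct Thing {
    int index_same_type_next = 0;
    int index_same_type_prev = 0;
};

struct ThingList {
    Thing things[kMaxThings];
    int head = 0;  // 0 means the list is empty
};

// Links things[index] in just before the head, i.e. at the tail of the ring.
void ListInsert(ThingList& list, int index) {
    Thing& thing = list.things[index];
    if (list.head == 0) {
        // A ring of one points to itself.
        thing.index_same_type_next = index;
        thing.index_same_type_prev = index;
        list.head = index;
        return;
    }
    int next = list.head;
    int prev = list.things[next].index_same_type_prev;
    thing.index_same_type_next = next;
    thing.index_same_type_prev = prev;
    list.things[prev].index_same_type_next = index;
    list.things[next].index_same_type_prev = index;
}

// Unlinks things[index] without walking the list, since prev is stored.
void ListRemove(ThingList& list, int index) {
    Thing& thing = list.things[index];
    int next = thing.index_same_type_next;
    int prev = thing.index_same_type_prev;
    if (next == index) {
        list.head = 0;  // it was the only element
    } else {
        list.things[prev].index_same_type_next = next;
        list.things[next].index_same_type_prev = prev;
        if (list.head == index) list.head = next;
    }
    thing.index_same_type_next = 0;
    thing.index_same_type_prev = 0;
}

// Counts elements by walking the ring once from the head.
int ListLength(const ThingList& list) {
    if (list.head == 0) return 0;
    int n = 0;
    int index = list.head;
    do {
        ++n;
        index = list.things[index].index_same_type_next;
    } while (index != list.head);
    return n;
}
```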

So what does a Thing struct look like? Here is my current implementation:

enum class StageThingType : u8 {
    FREE         = 0, // An inactive stage thing
    HERO         = 1, // aka Player Character
    INTERACTABLE = 2,
    ROPE_SEGMENT = 3,
    COUNT        = 4, // Sentinel value
};

// A thing whose state we track in the stage.
// This follows the Large Array of Things approach.
struct StageThing {
    // The type of thing it is.
    StageThingType type;

    // The generational index of the thing in the array of things.
    ThingId id;

    // The index of the next thing of the same type.
    // This is a circular list.
    int index_same_type_next;
    int index_same_type_prev;

    // Which cell in the stage it is in.
    CellIndex cell_index;

    // The index of the next thing in the same cell.
    // This is a circular list.
    int index_same_cell_next;
    int index_same_cell_prev;

    Direction facing_dir;

    // Tree links.
    // Parents and children are always in the same cell index.
    // The sibling list is circular.
    int index_parent;
    int index_first_child;
    int index_next_sibling;
    int index_prev_sibling;

    union {
        StageThingHero hero;
        StageThingInteractable interactable;
        StageThingRopeSegment rope_segment;
    };
};

I went with a union of type-specific structures for now, since it was the most natural conversion of what I was translating from. However, team fat struct may end up convincing me. We’ll see.

We need an additional struct to store the array and the list heads:

struct LargeArrayOfStageThings {
    StageThing things[MAX_NUM_THINGS];
    int count;

    // An array of heads for doubly-linked lists.
    // Index 0 (StageThingType::FREE) is the free list.
    int type_list_heads[(int)StageThingType::COUNT];

    // An array of heads for doubly-linked lists of things in each cell.
    int cell_first_thing[GRID_MAX_X][GRID_MAX_Y];
};

As we can see, the StageThing type zero is for free / inactive Things, letting us use the same list to quickly grab the next available Thing any time we spawn a new one, or stitch it into the list if a Thing is no longer needed. Then there are a bunch of methods that maintain invariants, such as initialization, getting a Thing, attaching a child to a parent, etc.
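As an example of one such invariant-maintaining method, here is a sketch of attaching a child to a parent. It assumes a cut-down Thing with only the tree links, and AttachChild and CountChildren are hypothetical helper names rather than the game's real API:

```cpp
#include <cassert>

constexpr int kMaxThings = 16;

// Hypothetical Thing holding only the tree links from StageThing.
// Index 0 is the reserved invalid index, so zero means "no link".
struct Thing {
    int index_parent = 0;
    int index_first_child = 0;
    int index_next_sibling = 0;
    int index_prev_sibling = 0;
};

// Makes things[child] the last child of things[parent].
// Assumes the child is currently detached.
void AttachChild(Thing* things, int parent, int child) {
    assert(parent != 0 && child != 0 && things[child].index_parent == 0);
    Thing& p = things[parent];
    Thing& c = things[child];
    c.index_parent = parent;
    if (p.index_first_child == 0) {
        // First child: the sibling ring of one points to itself.
        p.index_first_child = child;
        c.index_next_sibling = child;
        c.index_prev_sibling = child;
        return;
    }
    int first = p.index_first_child;
    int last = things[first].index_prev_sibling;
    c.index_next_sibling = first;
    c.index_prev_sibling = last;
    things[last].index_next_sibling = child;
    things[first].index_prev_sibling = child;
}

// Walks the circular sibling list once, starting from the first child.
int CountChildren(const Thing* things, int parent) {
    int first = things[parent].index_first_child;
    if (first == 0) return 0;
    int n = 0;
    int index = first;
    do {
        ++n;
        index = things[index].index_next_sibling;
    } while (index != first);
    return n;
}
```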

Indices maintained by the LargeArrayOfStageThings itself, such as the linked type list, can be simple integers. Other indices, especially external references, shouldn’t use direct indices since the underlying Thing might be removed in the interim. (This is the same problem as keeping a pointer reference to an Entity.)

The standard mitigation, which I was already employing in my ObjectPools previously, is to use a generational index. Instead of a bare index, it is an index paired with a generation counter:

struct ThingId {
    int index;  // index in the array of things
    u32 gen;    // generation counter
};

Retrieving a Thing via such an ID simply looks at the Thing at the given index, but also checks to make sure the generation counter matches. If it does, we return it. If the Thing has been deleted, or if it has been re-used, the generation counter will have been incremented, and the counter will not match. Since we’re using a u32, there is basically zero risk of wrap around. (Though you can always use u64 if that’s a problem for you.)
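A minimal sketch of that lookup, assuming each slot stores its own generation counter and a liveness flag. The Thing, Things, Spawn, Despawn, and Get definitions below are illustrative stand-ins, and free-list management is left out:

```cpp
#include <cassert>
#include <cstdint>

using u32 = std::uint32_t;

constexpr int kMaxThings = 8;

struct ThingId {
    int index;  // index in the array of things
    u32 gen;    // generation counter
};

// Hypothetical minimal Thing: liveness, a generation counter, and a payload.
struct Thing {
    bool alive = false;
    u32 gen = 0;
    int payload = 0;
};

struct Things {
    Thing things[kMaxThings];
};

// Claims the slot at `index` and returns an ID carrying its current generation.
ThingId Spawn(Things& arr, int index, int payload) {
    Thing& t = arr.things[index];
    t.alive = true;
    t.payload = payload;
    return ThingId{index, t.gen};
}

// Returns nullptr for out-of-range, deleted, or reused slots.
// Index 0 is treated as invalid, matching the reserved first Thing.
Thing* Get(Things& arr, ThingId id) {
    if (id.index <= 0 || id.index >= kMaxThings) return nullptr;
    Thing& t = arr.things[id.index];
    if (!t.alive || t.gen != id.gen) return nullptr;
    return &t;
}

// Frees the slot and bumps the generation so outstanding IDs go stale.
void Despawn(Things& arr, ThingId id) {
    Thing* t = Get(arr, id);
    if (t == nullptr) return;
    t->alive = false;
    ++t->gen;
}
```

An ID taken before a Despawn keeps failing the generation check even after the slot is reused, which is the whole point of the counter.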

Sharing Concepts

Having the shared StageThing struct means we can share concepts across all entity types. The first example is the cell index. Every StageThing can be placed in a cell. This doesn’t mean I have to use it, but so far I am. The direct result is that I can have a single codepath for PlaceInCell, which doesn’t have to have branching logic for heroes vs. ropes vs. enemies, and the LargeArrayOfStageThings can automatically update the internal list of things in each cell.

This drastically simplified my event system. Before this change, the AttachTo event required distinct code paths for every single entity type. Now, attaching is universal for all entities—it is simply a parent/child relationship.

You’ll notice that the StageThingType enum does not contain Pickup or Bucket types. I actually consolidated those into a single type – Interactable. This happened when I started considering how a Pickup could be set down or picked up, but so too could a Bucket. The Bucket simply had additional capabilities, such as allowing other things to be placed in it.

The aha moment here is recognizing that we care about functionalities rather than what a thing is categorically. So, the Interactable has a bitmask that identifies what the thing can do – can it be placed, hung, acquired, used to contain other things, etc? When it comes to the code logic, that is ultimately what it needs to know. Do I or do I not run the logic for this capability?

This is exactly the insight the fat struct folks advocate for. Seeing this simplification, I am seriously tempted to go all-in on the fat struct approach.
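As a rough sketch of the capability bitmask idea, assuming four made-up flags (the flag names, the Interactable layout, and the Has helper are all hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical capability flags; the names are illustrative only.
enum InteractableFlags : std::uint8_t {
    kCanBePlaced   = 1 << 0,
    kCanBeHung     = 1 << 1,
    kCanBeAcquired = 1 << 2,
    kCanContain    = 1 << 3,
};

struct Interactable {
    std::uint8_t flags;
};

// True if the interactable has the given capability bit set.
constexpr bool Has(const Interactable& it, std::uint8_t flag) {
    return (it.flags & flag) != 0;
}

// A pickup and a bucket are the same type; they differ only in capabilities.
constexpr Interactable kPickup{kCanBePlaced | kCanBeAcquired};
constexpr Interactable kBucket{kCanBePlaced | kCanBeHung | kCanContain};
```

Game logic then asks Has(thing, kCanContain) instead of checking whether the thing is a bucket.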

The final part of simplification was now recognizing that all entities are contained in at most a single cell. Prior to this, my rope entity spanned multiple cells. I now have a RopeSegment type instead, which is exactly one cell and has links to the RopeSegment it is connected to above or below. This works really well with the centralized cell lists and handles portals seamlessly.

Here we see a rope being lowered through a portal. The sorceress then climbs down it.

Undo and Redo

I want undo and redo to be supported from the get-go, and to “just work”. Every action must be reversible, and after being reversed, we should be able to re-commit it.

In an earlier post, I covered the event system. Any action builds out a schedule of events that can then be played out. I supported undo and redo by collapsing the schedule into just the set of events needed to produce the overall state change. For example, if a hero moves three times, which are three cell index change events, I could collapse that into just one cell index change event for undo / redo purposes.

Unfortunately, the logic to do this got rather involved. Any new field that I added to my entities might require a new custom event that I would then have to detect and inject. I also had several “larger” events like hoisting a rope up or down that manifested as a multitude of smaller events. Moving a rope up means adjusting all RopeSegments in the chain. I’d then have to iterate over all of those to see if there were diffs, and figure out the right events to add to the undo / redo sets to get the right result. A big headache.

Instead, I shifted from a delta-based approach to a snapshot approach. I implemented a single, special master event that lets me replace an entire StageThing. That’s it. Now, undo / redo entirely consist of these snapshot events, which overwrite the entire struct (plus the pass turn event – that isn’t in the large array of things). Executing an undo or redo simply applies these state-overwrites and then triggers a rebuild of type_list_heads and cell_first_thing in the LargeArrayOfStageThings. Incredibly simple. Since we’re not undoing and redoing all that often, the cost of that rebuild is insignificant.
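Here is a small sketch of the snapshot idea, under the assumption that the type lists can be re-derived purely from each Thing's type field. The cut-down Thing, the SnapshotEvent struct, and the function names are hypothetical, and only the type lists (not the per-cell lists) are rebuilt:

```cpp
#include <cassert>

constexpr int kMaxThings = 8;
constexpr int kNumTypes = 3;  // hypothetical: FREE = 0 plus two live types

// Hypothetical cut-down StageThing: a type, one data field, and list links.
struct Thing {
    int type = 0;
    int cell = 0;
    int next = 0;
    int prev = 0;
};

struct Things {
    Thing things[kMaxThings];
    int type_list_heads[kNumTypes] = {};
};

// A snapshot event: overwrite one whole Thing with a saved copy.
struct SnapshotEvent {
    int index;
    Thing saved;
};

// Re-derives every circular type list from the `type` fields alone, so stale
// links inside overwritten Things do not matter. Index 0 stays reserved.
void RebuildTypeLists(Things& arr) {
    for (int t = 0; t < kNumTypes; ++t) arr.type_list_heads[t] = 0;
    for (int i = 1; i < kMaxThings; ++i) {
        Thing& thing = arr.things[i];
        int& head = arr.type_list_heads[thing.type];
        if (head == 0) {
            head = i;
            thing.next = i;
            thing.prev = i;
        } else {
            int tail = arr.things[head].prev;
            thing.next = head;
            thing.prev = tail;
            arr.things[tail].next = i;
            arr.things[head].prev = i;
        }
    }
}

// Undo or redo: apply the state overwrites, then rebuild derived structure.
void ApplySnapshots(Things& arr, const SnapshotEvent* events, int n) {
    for (int i = 0; i < n; ++i) arr.things[events[i].index] = events[i].saved;
    RebuildTypeLists(arr);
}
```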

Conclusion

I’m happy to have this entity refactor behind me, but I also feel like it was a great architectural experience and showed how simplicity can compound. This new foundation feels like a good one to build out from. I plan to start expanding on these core mechanics and perhaps touch up some of those rough first-pass graphics. As always, no promises on specifics! We will let the project dictate how it should evolve next.

Into Depth – Action System

February 2, 2026 by timw

Current title screen using text rendering with the Arial TrueType fonts.

Last month we covered text rendering, which was necessary for getting the scaffolding up that supports the over-arching gameplay loop. We have a title screen, a level select screen, codex screens for seeing information about unlocked heroes and relics, and a level results screen that summarizes what was gained when completing a level. All of these are rudimentary first stabs, but ya got to make it exist first.

A lot happened in the last month, some of which I might cover in future posts. I’m not going to list everything every time, but it is interesting to see just how much stuff goes into making some sort of usable interactive experience when you’re doing a lot from scratch.

Added basic hero avatars to display when the hero is selected.
Added a tweak file system for rapidly tuning parameters.
Expanded the hero struct to include a hero level state to distinguish between in-level heroes and those that have not yet been deployed and those that have exited a level.
Added new screens.
Generalized my UI panel work to have fancier button logic that properly detects button presses, accounting for when the player clicks elsewhere but releases over the button, or presses down on the button but releases elsewhere.
Moved the game interface to simply receive one large memory buffer that the game itself then chops up into whatever smaller allocations and arenas it needs.
Introduced local mesh assets that can be directly rendered via the triangle shader from the previous post, used for basic quads and things like the selection outline.
Introduced the grid view. More on that in a bit.
Added a way to quickly identify which entities are in which cells in the stage, now more necessary due to the grid view.
Introduced schedules and simulating consequences up front rather than live.
Added the action selection panel and action sub-selection interfaces for hero deployment, move selection, and turn passing. The focus of this blog post.
Removed a bunch of earlier code pre-move selection where the active hero would move one tile or perform one action per key press.

The game now looks like this:

Obviously, all art, layout, UI, etc. is an extremely early first cut and likely not the final version. We do, however, see the basic framework for a game.

Action Panel

The main thing I want to talk about this week is the fledgling action system, starting with the action panel:

The action panel shows the active hero’s avatar, their name, their level state, and has a series of action buttons.

You’ll notice that the panel has the same color as the background. Eventually, when the field of view includes shadow casting, it should just blend with the shadows. Here is my Google Slides mockup of what I’m roughly working toward:

Eventually we’ll see the whole party, along with whatever status bars we need to see, and the actions available to the hero will look fancier. Most notably, I intend to render little equilateral triangles to represent the action points available for various moves. I’m not 100% settled on how that would work, but something like that will happen.

The game loop inside a level tracks which actor is active, and then loops through four states:

Generate Actions
Select Action
Play Schedule
Done

The first state is where the game looks at the current state and generates the actions available to the actor. For heroes, these then show up as options in the action panel. The user can then select an action and get access to its dedicated UI, and use that to determine the details.

Here we’ve selected the DEPLOY action and we get a user interface for selecting which cell to deploy the hero to.

Once the user commits to an action, the action generates a schedule. This is the sequence of events that represent the outcome of the action. For example, passing the turn simply produces an event that moves the game to the next actor. Selecting a move to a cell produces a more complicated schedule that traverses multiple cells in sequence, and may involve posture changes like changing from standing to climbing.

Most importantly, a schedule contains the complete outcome of the action. Previously, if the actor planned to move, I was having the game check live, as the player moved, for triggered events like ending up over an empty shaft and then moving into a falling state. This fragments the logic and makes it harder to test planning and consequence code, as we don’t really know the consequence of an action until it is tediously simulated out over many iterations.

Instead, the schedule does all necessary consequence simulation during construction, and the game then just needs to play that out until it has completed. Given that we have the schedule, it is also quite easy to undo the entire action without having to figure out some weird reverse simulation.
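The four-state loop can be sketched as a tiny state machine. The state names come from the list above, but the transition conditions, the TurnInputs struct, and the Step function are assumptions made for illustration:

```cpp
#include <cassert>

// The four states named above.
enum class TurnState { GENERATE_ACTIONS, SELECT_ACTION, PLAY_SCHEDULE, DONE };

// Hypothetical inputs that drive the transitions.
struct TurnInputs {
    bool action_committed;   // the actor has committed to an action
    bool schedule_finished;  // every event in the schedule has been played
};

// Advances the turn state by one step.
TurnState Step(TurnState state, const TurnInputs& in) {
    switch (state) {
        case TurnState::GENERATE_ACTIONS:
            return TurnState::SELECT_ACTION;
        case TurnState::SELECT_ACTION:
            return in.action_committed ? TurnState::PLAY_SCHEDULE
                                       : TurnState::SELECT_ACTION;
        case TurnState::PLAY_SCHEDULE:
            return in.schedule_finished ? TurnState::DONE
                                        : TurnState::PLAY_SCHEDULE;
        case TurnState::DONE:
            // Assumed here: start over for whichever actor is active next.
            return TurnState::GENERATE_ACTIONS;
    }
    return state;
}
```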

This doesn’t look terribly complicated to implement, but it is actually the system that flummoxed me the most in my previous project attempt, particularly when it came to changing which actions were available based on the actor’s equipment and in enabling undo. I am much happier with how this latest rendition is set up.

The schedule is, at its core, just a DAG of events stored in a topological order:

struct Schedule {
    // Events are in a topological order
    u16 n_timeline;
    Event* timeline; // allocated on the Active linear allocator

    // The net outcome.
    u16 n_outcome;
    Event* outcomes; // allocated on the Active linear allocator

    // The opposite of the net outcome, since events are not inherently reversible.
    u16 n_undo;
    Event* undos; // allocated on the Active linear allocator
};

The schedule contains the full timeline, plus a compressed net outcome that only contains the events necessary to encode the overall delta. For something like a pass action, the timeline and the outcome are the same. For a move, where the actor traverses multiple cells, the timeline contains the sequence of cell traversals but the outcome only contains one cell change – from source to dest.

The schedule also contains the set of events needed to undo the action. This is very similar to the outcome, just reversed. The game events don’t all contain the information necessary to be reversed, so we construct the undo events separately.

A simple schedule for moving a hero. The timeline contains multiple cell transitions, but the outcome and undo each consist of a single event.
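To make the undo construction concrete, here is a minimal sketch of why the undo events have to be built separately. It assumes a reduced, hypothetical `MiniStage` and a set-cell event that only stores the new cell; `MakeUndo` and `Apply` are illustrative names, not the game's actual functions.

```cpp
#include <cstdint>

typedef uint16_t u16;

struct CellIndex { u16 x; u16 y; };

// A delta that only stores the new value, so it cannot be reversed on its own.
struct EventSetCellIndex {
    u16 hero_id;
    CellIndex cell_index;
};

// Hypothetical, heavily reduced stage: just the cells of up to four heroes.
struct MiniStage {
    CellIndex hero_cells[4];
};

void Apply(MiniStage* stage, const EventSetCellIndex& event) {
    stage->hero_cells[event.hero_id] = event.cell_index;
}

// Builds the undo event by reading the value that the outcome is about to overwrite.
// Must be called before the outcome is applied to the stage.
EventSetCellIndex MakeUndo(const MiniStage& stage, const EventSetCellIndex& outcome) {
    EventSetCellIndex undo;
    undo.hero_id = outcome.hero_id;
    undo.cell_index = stage.hero_cells[outcome.hero_id];
    return undo;
}
```

Applying the outcome and then the undo event returns the hero to its original cell, without any reverse simulation of the timeline.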

Events are small, composable game state deltas:

struct Event {
    u16 index;  // The event's index in the schedule. (Events are in a topological order)
    u16 beat;   // The beat that this event should be executed on. Events sharing a beat happen concurrently.
    EventType type;

    union {
        EventEndTurn end_turn;
        EventSetHeroLevelState set_hero_level_state;
        EventMove move;
        EventSetCellIndex set_cell_index;
        EventSetFacingDir set_facing_dir;
        EventSetHeroPosture set_hero_posture;
        EventAttachTo attach_to;
        EventOnBelay on_belay;
        EventOffBelay off_belay;
        EventHaulRope haul_rope;
        EventLowerRope lower_rope;
        ... etc.
    };
};

Each event knows its event type and then contains type-specific data. This is a pretty straightforward way to interleave them without annoying object-oriented inheritance code.
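As a rough illustration of how such a tagged union gets consumed, here is a minimal sketch with only two event types. The payload fields (`i_actor_next`, `i_actor`) and the `MiniStage` struct are assumptions made for the example; the real events carry different data.

```cpp
#include <cstdint>

typedef uint16_t u16;

enum EventType { EVENT_END_TURN, EVENT_SET_CELL_INDEX };

struct CellIndex { u16 x; u16 y; };

// Hypothetical payloads, one per event type.
struct EventEndTurn { u16 i_actor_next; };
struct EventSetCellIndex { u16 i_actor; CellIndex cell_index; };

struct Event {
    u16 index;
    u16 beat;
    EventType type;
    union {
        EventEndTurn end_turn;
        EventSetCellIndex set_cell_index;
    };
};

// Hypothetical, heavily reduced game state: two actors and a turn marker.
struct MiniStage {
    CellIndex actor_cells[2];
    u16 i_actor;
};

// Applies one event by switching on its type tag and reading the matching union member.
void ApplyEvent(MiniStage* stage, const Event& event) {
    switch (event.type) {
        case EVENT_END_TURN:
            stage->i_actor = event.end_turn.i_actor_next;
            break;
        case EVENT_SET_CELL_INDEX:
            stage->actor_cells[event.set_cell_index.i_actor] = event.set_cell_index.cell_index;
            break;
    }
}
```

All of the per-type logic sits in one switch statement, which is the part that inheritance would otherwise spread across subclasses.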

Events also contain beats. The game logic is discrete, and in order to have events run concurrently, we store them in the same beat:

Whenever we advance a beat, we apply all events in that beat and trigger any animations or whatnot and use that to determine how long to wait until we start the next beat. This keeps the event system clean (it doesn’t need to know how long a given animation will take, or even what animation is associated with an event), and gives us one centralized place for triggering animations and sounds.
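A minimal sketch of beat playback might look like the following. It assumes that events sharing a beat are contiguous in the timeline, and it uses a hypothetical `AnimationDuration` lookup owned by the playback layer, so the events themselves stay free of timing data. The `beat_duration` field is also an assumption added for the example.

```cpp
#include <cstdint>

typedef uint16_t u16;
typedef float f32;

enum EventType { EVENT_MOVE, EVENT_SET_FACING_DIR, EVENT_END_TURN };

struct Event {
    u16 index;
    u16 beat;
    EventType type;
};

struct PlaybackState {
    u16 i_event;        // next event in the timeline to apply
    f32 time_in_beat;   // seconds elapsed in the current beat
    f32 beat_duration;  // seconds to wait before the next beat starts (assumed field)
};

// Hypothetical animation table owned by the playback layer, not by the events.
f32 AnimationDuration(EventType type) {
    switch (type) {
        case EVENT_MOVE:           return 0.25f;
        case EVENT_SET_FACING_DIR: return 0.10f;
        case EVENT_END_TURN:       return 0.0f;
    }
    return 0.0f;
}

// Applies every event that shares the next beat and records how long to wait.
// Returns the number of events applied (0 once the timeline is exhausted).
u16 AdvanceBeat(const Event* timeline, u16 n_timeline, PlaybackState* playback) {
    if (playback->i_event >= n_timeline) return 0;

    const u16 beat = timeline[playback->i_event].beat;
    u16 n_applied = 0;
    playback->beat_duration = 0.0f;
    while (playback->i_event < n_timeline && timeline[playback->i_event].beat == beat) {
        const Event& event = timeline[playback->i_event];
        // (Apply the event to the game state and trigger its animation here.)
        const f32 duration = AnimationDuration(event.type);
        if (duration > playback->beat_duration) playback->beat_duration = duration;
        playback->i_event++;
        n_applied++;
    }
    playback->time_in_beat = 0.0f;
    return n_applied;
}
```

The beat lasts as long as its slowest animation, so a turn and a move that share a beat finish together before the next beat begins.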

When the schedule is fully played out, the game enters the done state and checks to see if the level is done. If not, it goes back to action generation.

The state for active levels thus includes the core game state (the stage), which actor’s turn it is, the active schedule (if any), the schedule playback state (event index, time in beat), and data for all of this turn’s actions:

struct ScreenState_Active {
    // The active playable area and the entities in it.
    Stage stage;

    // Stores data for the active actor's turn.
    Turn turn;

    // Stores the events that are scheduled to run.
    Schedule schedule;

    // The playback state of the schedule.
    PlaybackState playback_state;

    // The index of the active actor.
    int i_actor = 0;
    int i_actor_next;

    // Where we allocate the action data.
    // This allocator is only reset when actions are regenerated.
    LinearAllocator action_allocator;
};

Pretty clean when it comes down to it.

The Turn contains the actions generated for the current actor. Each action has a name, which then makes it easy to render the buttons for the actions in the action panel.

Actions

An action is a discrete state change available to an actor, such as passing the turn, moving through the environment, or attacking another actor. The available actions depend on the game state — an actor that is not yet deployed has a deploy action but no move action, and an actor with a bow should get an attack action with a ranged UI whereas an actor with a sword can only select the tiles within melee range.

To handle these things flexibly, actions have methods that can be specialized on a per-action basis. In modern C++, one would probably use objects and inheritance to implement a bunch of action subclasses. I am avoiding classes and am not using inheritance at all, so instead we have a basic struct with some function pointers:

// A function pointer called to run the action UI for the action,
// once the action has been selected in the menu.
typedef void (*FuncRunActionUI)(GameState* game, RenderCommandBuffer* command_buffer, void* data, const AppState& app_state);

// A function pointer for a method that determines whether the action was committed.
typedef bool (*FuncIsActionCommitted)(const void* data);

// A function pointer for a method that builds a schedule from the action.
typedef bool (*FuncBuildSchedule)(GameState* game, void* data);

struct Action {
    char name[16]; // null-terminated

    // The key to press to perform this action
    char shortkey;

    // Data associated with the action, specialized per action type
    void* data;

    // Function pointers
    FuncRunActionUI run_action_ui;
    FuncIsActionCommitted is_action_committed;
    FuncBuildSchedule build_schedule;
};

We use the action name when displaying its button in the action panel, and its shortkey is available if the user doesn’t want to have to click the button with the mouse.

Each action also has a void* data member, which can be populated when the action is created and then used in the member functions. We’ll see an example of that shortly.
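To show how the three function pointers fit together, here is a minimal sketch of one frame of the selected action. The signatures are deliberately simplified (no GameState or render buffer), and the `Confirm` action, `ConfirmData`, and `TickSelectedAction` are hypothetical names invented for the example.

```cpp
// Simplified, hypothetical signatures: the real ones also take the game state.
typedef void (*FuncRunActionUI)(void* data, bool clicked);
typedef bool (*FuncIsActionCommitted)(const void* data);
typedef bool (*FuncBuildSchedule)(void* data, int* n_events_out);

struct Action {
    char name[16];
    char shortkey;
    void* data;
    FuncRunActionUI run_action_ui;
    FuncIsActionCommitted is_action_committed;
    FuncBuildSchedule build_schedule;
};

// Per-action data for a hypothetical action that commits on the first click.
struct ConfirmData {
    bool confirmed;
};

void RunActionUI_Confirm(void* data, bool clicked) {
    if (clicked) ((ConfirmData*)data)->confirmed = true;
}

bool IsActionCommitted_Confirm(const void* data) {
    return ((const ConfirmData*)data)->confirmed;
}

bool BuildSchedule_Confirm(void* data, int* n_events_out) {
    (void)data;
    *n_events_out = 1;  // stand-in for writing a one-event schedule
    return true;
}

// One frame of the selected action: run its UI, and once the user has
// committed, build the schedule. Returns true when playback can start.
bool TickSelectedAction(Action* action, bool clicked, int* n_events_out) {
    action->run_action_ui(action->data, clicked);
    if (!action->is_action_committed(action->data)) return false;
    return action->build_schedule(action->data, n_events_out);
}
```

The caller never needs to know which action it is driving; the void* data pointer is only ever interpreted by the functions that were installed alongside it.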

Generating the available actions is conceptually straightforward; just run a method for every action in the game that checks if that action is available, and if it is, allocates it, constructs it, and adds it to the action list:

void GenerateActions(GameState* game, const Hero& hero) {
    Turn& turn = game->screen_state_active.turn;
    turn.n_actions = 0;
    turn.i_action_selected = -1;

    if (MaybeGenerateAction_Deploy(&turn.actions[turn.n_actions], game, hero)) {
        turn.n_actions++;
    }
    if (MaybeGenerateAction_Move(&turn.actions[turn.n_actions], game, hero)) {
        turn.n_actions++;
    }
    if (MaybeGenerateAction_Pass(&turn.actions[turn.n_actions], game, hero)) {
        turn.n_actions++;
    }
    ...
}

This may seem too simple, but I think it is actually quite an advantage. I had previously been considering having various pieces of equipment be responsible for determining which actions they are associated with, and then having a way to store that metadata on the equipment, save it to disk, etc. Messy! Instead, I can just run all of these methods, every time, and if any require specific equipment, they can check for it and just quickly return false if it isn’t there.
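As a sketch of the equipment case, an action generator can simply look through the hero's gear and bail out early. The `SHOOT` action, the `EquipmentType` enum, and the equipment array on `Hero` are all assumptions for illustration; the game's real hero struct is not shown in this post.

```cpp
#include <string.h>

// Hypothetical equipment representation.
enum EquipmentType { EQUIPMENT_NONE, EQUIPMENT_SWORD, EQUIPMENT_BOW };

struct Hero {
    EquipmentType equipment[4];
    int n_equipment;
};

// Reduced action: just the fields needed for the action panel.
struct Action {
    char name[16];  // null-terminated
    char shortkey;
};

bool HasEquipment(const Hero& hero, EquipmentType type) {
    for (int i = 0; i < hero.n_equipment; i++) {
        if (hero.equipment[i] == type) return true;
    }
    return false;
}

// Generates the (hypothetical) shoot action only if the hero carries a bow.
// Leaves the action untouched and returns false otherwise.
bool MaybeGenerateAction_Shoot(Action* action, const Hero& hero) {
    if (!HasEquipment(hero, EQUIPMENT_BOW)) {
        return false;
    }
    strncpy(action->name, "SHOOT", sizeof(action->name));
    action->shortkey = 's';
    return true;
}
```

No metadata has to live on the equipment itself; the check is a handful of comparisons per generator per turn.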

Every action currently requires at least four methods: action generation, running the action-specific UI, a simple method that determines whether the user committed to the action, and schedule generation. For the simple pass action, this doesn’t even need the void* data member:

// ------------------------------------------------------------------------------------------------
bool MaybeGenerateAction_Pass(Action* action, GameState* game, const Hero& hero) {

    strncpy(action->name, "PASS", sizeof(action->name));
    action->shortkey = 'p';
    action->data = nullptr; // The pass action carries no per-action data.
    action->run_action_ui = RunActionUI_Pass;
    action->is_action_committed = IsActionCommitted_Pass;
    action->build_schedule = BuildSchedule_Pass;

    return true;
}

// ------------------------------------------------------------------------------------------------
void RunActionUI_Pass(GameState* game, RenderCommandBuffer* command_buffer, void* data, const AppState& app_state) {
    // Nothing to do here.
}

// ------------------------------------------------------------------------------------------------
bool IsActionCommitted_Pass(const void* data) {
    return true;
}

// ------------------------------------------------------------------------------------------------
bool BuildSchedule_Pass(GameState* game, void* data) {
    ScreenState_Active& active = game->screen_state_active;

    // Build the schedule, which consists of just an end turn event.
    Schedule* schedule = &(game->screen_state_active.schedule);
    
    schedule->n_timeline = 1;
    schedule->timeline = (Event*)Allocate(&(active.action_allocator), sizeof(Event));
    ASSERT(schedule->timeline != nullptr, "BuildSchedule_Pass: Failed to allocate timeline!");
    
    CreateEventEndTurn(schedule->timeline, /*index=*/0, /*beat=*/0);

    // The outcome is the same.
    schedule->n_outcome = 1;
    schedule->outcomes = schedule->timeline;

    // The undo action: TODO

    return true;
}

The deploy action is more involved, and it does allocate a custom data struct:

struct DeployActionData {
    // List of legal entry cells for the hero.
    u16 n_entries;
    CellIndex entries[STAGE_MAX_NUM_ENTRIES];

    // The index of the entry we are looking at.
    int targeted_entry;

    // Whether the entry has been selected.
    bool entry_selected;
};

// ------------------------------------------------------------------------------------------------
void InitActionData_Deploy(DeployActionData* data) {
    data->n_entries = 0;
    data->targeted_entry = 0;
    data->entry_selected = false;
}

// ------------------------------------------------------------------------------------------------
bool MaybeGenerateAction_Deploy(Action* action, GameState* game, const Hero& hero) {

    if (hero.level_state != HERO_LEVEL_STATE_UNDEPLOYED) {
        // Hero does not need to be deployed
        return false;
    }

    ScreenState_Active& active = game->screen_state_active;
    const Stage& stage = active.stage;
    
    // Allocate the data for the action
    action->data = Allocate(&(active.action_allocator), sizeof(DeployActionData));
    ASSERT(action->data != nullptr, "MaybeGenerateAction_Deploy: Failed to allocate action data!");
    DeployActionData* data = (DeployActionData*)action->data;
    InitActionData_Deploy(data);

    // Run through all stage entries and find the valid places to deploy
    for (u16 i_entry = 0; i_entry < stage.n_entries; i_entry++) {
        CellIndex cell_index = stage.entries[i_entry];
        
        ASSERT(!IsSolid(stage, cell_index), "Entry cell is solid!");

        // Ensure that the cell is not occupied by another hero
        if (IsHeroInCell(stage, cell_index)) {
            continue;
        }

        // Add the entry.
        data->entries[data->n_entries++] = cell_index;
    }

    if (data->n_entries == 0) {
        // No valid entries
        return false;
    }
    
    
    strncpy(action->name, "DEPLOY", sizeof(action->name));
    action->shortkey = 'd';
    // action.sprite_handle_icon = // TODO
    action->run_action_ui = RunActionUI_Deploy;
    action->is_action_committed = IsActionCommitted_Deploy;
    action->build_schedule = BuildSchedule_Deploy;

    return true;
}

// ------------------------------------------------------------------------------------------------
void RunActionUI_Deploy(GameState* game, RenderCommandBuffer* command_buffer, void* data, const AppState& app_state) {
    
    DeployActionData* action_data = (DeployActionData*)data;

    const TweakStore* tweak_store = &game->tweak_store;
    const f32 kSelectItemFlashMult = TWEAK(tweak_store, "select_item_flash_mult", 2.0f);
    const f32 kSelectItemReticuleAmplitude = TWEAK(tweak_store, "select_item_reticule_amplitude", 0.25f);
    const f32 kSelectItemFlashAlphaLo = TWEAK(tweak_store, "select_item_flash_alpha_lo", 0.1f);
    const f32 kSelectItemFlashAlphaHi = TWEAK(tweak_store, "select_item_flash_alpha_hi", 0.9f);
    const f32 kSelectItemArrowOffsetHorz = TWEAK(tweak_store, "select_item_arrow_offset_horz", 1.0f);
    const f32 kSelectItemArrowScaleLo = TWEAK(tweak_store, "select_item_arrow_scale_lo", 1.0f);
    const f32 kSelectItemArrowScaleHi = TWEAK(tweak_store, "select_item_arrow_scale_hi", 1.0f);

    // TODO: This is probably lagged by one frame.
    const glm::mat4 clip_to_world = CalcClipToWorld(command_buffer->render_setup.projection, command_buffer->render_setup.view);
    const glm::vec2 mouse_world = CalcMouseWorldPos(app_state.pos_mouse, 0.0f, clip_to_world, app_state.window_size);

    bool deploy_hero_to_target_cell = false;

    // Set the hero's location to one tile above the entry we are looking at.
    // This is a hacky way to handle the fact that the hero is not in the level yet
    // and that the camera is centered on the hero.
    ScreenState_Active& active = game->screen_state_active;
    Hero* hero = active.stage.pool_hero.GetMutableAtIndex(active.i_actor);
    const CellIndex cell_index = action_data->entries[action_data->targeted_entry];
    MoveHeroToCell(&active.stage, hero, {cell_index.x, (u16)(cell_index.y + 1)});
    hero->offset = {0.0f, 0.0f};

    // Render the entrance we are currently looking at.
    {
        const f32 unitsine = UnitSine(kSelectItemFlashMult * game->t);

        RenderCommandLocalMesh& local_mesh = *GetNextLocalMeshRenderCommand(command_buffer);
        local_mesh.local_mesh_handle = game->local_mesh_id_quad;
        local_mesh.screenspace = false;
        local_mesh.model = glm::translate(glm::mat4(1.0f), glm::vec3(0.0f, -1.0f, RENDER_Z_FOREGROUND));
        local_mesh.color = kColorGold;
        local_mesh.color.a = Lerp(kSelectItemFlashAlphaLo, kSelectItemFlashAlphaHi, unitsine);

        {
            RenderCommandLocalMesh& reticule = *GetNextLocalMeshRenderCommand(command_buffer);
            reticule.local_mesh_handle = game->local_mesh_id_corner_brackets;
            reticule.screenspace = false;
            reticule.model = glm::translate(glm::mat4(1.0f), glm::vec3(0.0f, -1.0f, RENDER_Z_FOREGROUND + 0.1f));
            reticule.model = glm::scale(reticule.model, glm::vec3(unitsine * kSelectItemReticuleAmplitude + 1.0f));
            reticule.color = kColorWhite;
        }

        // Run button logic on the targeted entry.
        const Rect panel_ui_area = {
                .lo = glm::vec2(- 0.5f, - 1.5f),
                .hi = glm::vec2(+ 0.5f, - 0.5f)};
        const UiButtonState button_state = UiRunButton(&game->ui, panel_ui_area, mouse_world);

        if (button_state != UiButtonState::NORMAL) {
            local_mesh.color.x = Lerp(local_mesh.color.x, kColorWhite.x, unitsine);
            local_mesh.color.y = Lerp(local_mesh.color.y, kColorWhite.y, unitsine);
            local_mesh.color.z = Lerp(local_mesh.color.z, kColorWhite.z, unitsine);
        }

        if (button_state == UiButtonState::TRIGGERED) {
            deploy_hero_to_target_cell = true;
        }
    }

    bool pressed_left = IsNewlyPressed(app_state.keyboard, 'a');
    bool pressed_right = IsNewlyPressed(app_state.keyboard, 'd');

    // Render two arrow-like triangles that let us switch between options.
    {
        const f32 unitsine = UnitSine(game->t);
        const f32 scale = Lerp(kSelectItemArrowScaleLo, kSelectItemArrowScaleHi, unitsine);

        const glm::vec2 halfdims = glm::vec2(0.433f * scale, 0.5f * scale); // NOTE: Rotated

        { // Left triangle
            const glm::vec2 pos = glm::vec2(-kSelectItemArrowOffsetHorz, -1.0f);
            
            const Rect panel_ui_area = {
                    .lo = pos - halfdims,
                    .hi = pos + halfdims};
            const UiButtonState button_state = UiRunButton(&game->ui, panel_ui_area, mouse_world);

            RenderCommandLocalMesh& local_mesh = *GetNextLocalMeshRenderCommand(command_buffer);
            local_mesh.local_mesh_handle = game->local_mesh_id_triangle;
            local_mesh.screenspace = false;
            local_mesh.model =
            glm::scale(
                glm::rotate(
                    glm::translate(glm::mat4(1.0f), glm::vec3(pos.x, pos.y, RENDER_Z_FOREGROUND)),
                    glm::radians(90.0f),
                    glm::vec3(0.0f, 0.0f, 1.0f)
                ),
                glm::vec3(scale, scale, 1.0f)
            );
            local_mesh.color = kColorGold;

            if (button_state != UiButtonState::NORMAL) {
                local_mesh.color = glm::mix(local_mesh.color, kColorWhite, unitsine);
            }

            // Check for pressing the button.
            if (button_state == UiButtonState::TRIGGERED) {
                pressed_left = true;
            }
        }
        { // Right triangle
            const glm::vec2 pos = glm::vec2(kSelectItemArrowOffsetHorz, -1.0f);
            
            const Rect panel_ui_area = {
                    .lo = pos - halfdims,
                    .hi = pos + halfdims};
            const UiButtonState button_state = UiRunButton(&game->ui, panel_ui_area, mouse_world);

            RenderCommandLocalMesh& local_mesh = *GetNextLocalMeshRenderCommand(command_buffer);
            local_mesh.local_mesh_handle = game->local_mesh_id_triangle;
            local_mesh.screenspace = false;
            local_mesh.model = 
            glm::scale(
                glm::rotate(
                    glm::translate(glm::mat4(1.0f), glm::vec3(pos.x, pos.y, RENDER_Z_FOREGROUND)),
                    glm::radians(-90.0f),
                    glm::vec3(0.0f, 0.0f, 1.0f)
                ),
                glm::vec3(scale, scale, 1.0f)
            );
            local_mesh.color = kColorGold;

            if (button_state != UiButtonState::NORMAL) {
                local_mesh.color = glm::mix(local_mesh.color, kColorWhite, unitsine);
            }

            // Check for pressing the button.
            if (button_state == UiButtonState::TRIGGERED) {
                pressed_right = true;
            }
        }
    }


    // Process the presses, which can come from keys or clicking the arrows.
    if (pressed_left) {
        CircularDecrement(action_data->targeted_entry, (int)action_data->n_entries);
    } else if (pressed_right) {
        CircularIncrement(action_data->targeted_entry, (int)action_data->n_entries);
    }

    if (deploy_hero_to_target_cell) {
        action_data->entry_selected = true;
    }
}

// ------------------------------------------------------------------------------------------------
bool IsActionCommitted_Deploy(const void* data) {
    const DeployActionData* action_data = (const DeployActionData*)data;
    return action_data->entry_selected;
}

// ------------------------------------------------------------------------------------------------
bool BuildSchedule_Deploy(GameState* game, void* data) {
    ScreenState_Active& active = game->screen_state_active;

    const Hero* hero = active.stage.pool_hero.GetAtIndex(active.i_actor);
    const CellIndex src = hero->cell_index;
    const CellIndex dst = {src.x, (u16)(src.y - 1)};

    // Build the schedule, which changes the hero's level state and moves them to the entry (1 cell down).
    Schedule* schedule = &(game->screen_state_active.schedule);
    
    {
        schedule->n_timeline = 2;
        schedule->timeline = (Event*)Allocate(&(active.action_allocator), schedule->n_timeline * sizeof(Event));
        ASSERT(schedule->timeline != nullptr, "BuildSchedule_Deploy: Failed to allocate timeline!");
        
        CreateEventSetHeroLevelState(schedule->timeline, /*index=*/0, /*beat=*/0, hero->id, HERO_LEVEL_STATE_IN_LEVEL);
        CreateEventMove(schedule->timeline + 1, /*index=*/1, /*beat=*/0, hero->id, Direction::DOWN, src, dst);
    }

    // The outcome is the same as the timeline, so it must cover both events.
    schedule->n_outcome = schedule->n_timeline;
    schedule->outcomes = schedule->timeline;

    // Undo: TODO

    return true;
}

Move Actions

The move action is significantly more complicated than the deploy or pass actions.

The move action highlights all valid target cells, allowing the player to select one to move to. Once selected, the shortest path to that cell is taken by the hero, and all consequences are simulated (e.g. falling, triggering a trap) and built into a schedule.

The move action UI, which here has three valid target cells (since falling ends the search). The tile under the mouse cursor is highlighted in yellow and framed by corner brackets, and the shortest path is shown as a trail of equilateral triangles.

In order to support these features, the move action logic needs to know what the reachable cells are and what the shortest paths to them are. This is achieved by running Dijkstra’s algorithm from the actor’s initial state. The state space is not merely cell positions, but also includes the actor’s facing direction and their posture (standing, on ladder, etc.).

There is one additional complication: we want the game to reason only about cells that are visible from the hero’s current vantage point, and we eventually want to support non-Euclidean connections like portals:

In this mockup, the key is visible twice because the hallway has a portal loop-back connection.

In order to achieve this, we introduce a new representation of the level geometry visible to an actor, the GridView:

struct GridView {
    // The grid tile the view is centered on
    CellIndex center;

    CellIndex cell_indices[GRID_VIEW_MAX_X][GRID_VIEW_MAX_Y];
    u8 flags[GRID_VIEW_MAX_X][GRID_VIEW_MAX_Y];
};

This view has a finite size, much smaller than the overall level grid, that is big enough to fit the screen. The view is always centered on an actor, and the cells in the view are then indices into the cells in the underlying level grid. If a level grid cell is connected by a non-Euclidean portal to another cell, the view can just index into the correct cells on either side of the portal. Constructing the grid view is a straightforward breadth-first search from the center tile.
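Here is a minimal sketch of that breadth-first construction. It makes several assumptions: a tiny fixed-size level, a single `wraps_horizontally` flag standing in for a portal pair that joins the left and right edges, and a flood fill through non-solid cells rather than a real visibility test. `StepInLevel`, `BuildGridView`, and `VIEW_FLAG_VISIBLE` are illustrative names.

```cpp
#include <cstdint>

typedef uint8_t u8;
typedef uint16_t u16;

const int LEVEL_W = 4;
const int LEVEL_H = 3;
const int GRID_VIEW_MAX_X = 7;
const int GRID_VIEW_MAX_Y = 3;
const u8 VIEW_FLAG_VISIBLE = 1;

struct CellIndex { u16 x; u16 y; };
struct ViewIndex { int x; int y; };

struct Level {
    bool solid[LEVEL_W][LEVEL_H];
    bool wraps_horizontally;  // hypothetical stand-in for a portal loop-back
};

struct GridView {
    CellIndex center;
    CellIndex cell_indices[GRID_VIEW_MAX_X][GRID_VIEW_MAX_Y];
    u8 flags[GRID_VIEW_MAX_X][GRID_VIEW_MAX_Y];
};

// Steps one cell in the level, following the wrap-around "portal" if present.
bool StepInLevel(const Level& level, CellIndex from, int dx, int dy, CellIndex* to) {
    int x = (int)from.x + dx;
    int y = (int)from.y + dy;
    if (level.wraps_horizontally) x = (x + LEVEL_W) % LEVEL_W;
    if (x < 0 || x >= LEVEL_W || y < 0 || y >= LEVEL_H) return false;
    to->x = (u16)x;
    to->y = (u16)y;
    return true;
}

// Breadth-first search outward from the center view cell. Each view cell
// records which level cell it shows, so one level cell may appear twice.
void BuildGridView(const Level& level, CellIndex center, GridView* view) {
    const CellIndex zero = {0, 0};
    for (int x = 0; x < GRID_VIEW_MAX_X; x++) {
        for (int y = 0; y < GRID_VIEW_MAX_Y; y++) {
            view->cell_indices[x][y] = zero;
            view->flags[x][y] = 0;
        }
    }
    view->center = center;

    ViewIndex queue[GRID_VIEW_MAX_X * GRID_VIEW_MAX_Y];
    int head = 0;
    int tail = 0;
    const ViewIndex start = {GRID_VIEW_MAX_X / 2, GRID_VIEW_MAX_Y / 2};
    view->cell_indices[start.x][start.y] = center;
    view->flags[start.x][start.y] = VIEW_FLAG_VISIBLE;
    queue[tail++] = start;

    const int dxs[4] = {1, -1, 0, 0};
    const int dys[4] = {0, 0, 1, -1};
    while (head < tail) {
        const ViewIndex v = queue[head++];
        for (int i = 0; i < 4; i++) {
            const ViewIndex n = {v.x + dxs[i], v.y + dys[i]};
            if (n.x < 0 || n.x >= GRID_VIEW_MAX_X || n.y < 0 || n.y >= GRID_VIEW_MAX_Y) continue;
            if (view->flags[n.x][n.y] & VIEW_FLAG_VISIBLE) continue;
            CellIndex next;
            if (!StepInLevel(level, view->cell_indices[v.x][v.y], dxs[i], dys[i], &next)) continue;
            if (level.solid[next.x][next.y]) continue;
            view->cell_indices[n.x][n.y] = next;
            view->flags[n.x][n.y] = VIEW_FLAG_VISIBLE;
            queue[tail++] = n;
        }
    }
}
```

With a four-cell corridor that loops back on itself and a seven-cell-wide view, the same level cell ends up referenced by two different view cells, which is the effect the key mockup illustrates.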

Note: I am actually temporarily making this simpler than it really is. To do this properly, we’ll need a fancier data structure that is not a grid but can handle sectors, because one view cell may actually contain view sectors of multiple level cells:

This view cell is visible twice, once with a cell containing a key, with this portal (magenta) set up.

We thus run Dijkstra’s algorithm in this grid view, starting from the center cell where the actor is. The same cell may be visible multiple times in the grid view, and we will correctly be able to route to that cell via multiple paths.

The search assigns a cost for each state change, and only searches up to a maximum cost. Very soon, actors will have action points to spend per turn, and it won’t be possible to move farther than the currently available action points allow.
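The cost-capped search can be sketched on a much smaller state space. The example below assumes a single row of cells, states made of (cell, facing direction) only, and made-up costs for turning and stepping. It uses a simple array scan to pick the cheapest open state instead of a priority queue, which is fine for a state space this small.

```cpp
#include <cstdint>

typedef uint16_t u16;

const int ROW_W = 8;
const int NUM_DIRS = 2;  // 0 = LEFT, 1 = RIGHT
const u16 COST_UNREACHED = 0xFFFF;
const u16 COST_TURN = 1;  // hypothetical cost of turning around
const u16 COST_STEP = 2;  // hypothetical cost of stepping one cell forward

struct MoveSearch {
    u16 costs[ROW_W][NUM_DIRS];
    bool closed[ROW_W][NUM_DIRS];
};

// Records a cheaper way to reach (x, dir), unless it exceeds the budget.
void Relax(MoveSearch* search, int x, int dir, u16 cost, u16 max_cost) {
    if (cost > max_cost) return;
    if (cost < search->costs[x][dir]) search->costs[x][dir] = cost;
}

// Dijkstra's algorithm over (cell, facing) states, capped at max_cost.
// Assumes max_cost is well below COST_UNREACHED.
void RunMoveSearch(const bool* solid, int x_start, int dir_start, u16 max_cost, MoveSearch* search) {
    for (int x = 0; x < ROW_W; x++) {
        for (int d = 0; d < NUM_DIRS; d++) {
            search->costs[x][d] = COST_UNREACHED;
            search->closed[x][d] = false;
        }
    }
    search->costs[x_start][dir_start] = 0;

    for (;;) {
        // Pick the cheapest state that has been reached but not yet expanded.
        int best_x = -1;
        int best_d = -1;
        u16 best = COST_UNREACHED;
        for (int x = 0; x < ROW_W; x++) {
            for (int d = 0; d < NUM_DIRS; d++) {
                if (!search->closed[x][d] && search->costs[x][d] < best) {
                    best = search->costs[x][d];
                    best_x = x;
                    best_d = d;
                }
            }
        }
        if (best_x < 0) break;  // nothing left within budget
        search->closed[best_x][best_d] = true;

        // State change 1: turn around in place.
        Relax(search, best_x, 1 - best_d, (u16)(best + COST_TURN), max_cost);

        // State change 2: step forward in the facing direction.
        const int next_x = best_x + (best_d == 1 ? 1 : -1);
        if (next_x >= 0 && next_x < ROW_W && !solid[next_x]) {
            Relax(search, next_x, best_d, (u16)(best + COST_STEP), max_cost);
        }
    }
}
```

States whose cheapest cost would exceed the cap keep the `COST_UNREACHED` marker, so the UI can treat them as non-targets.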

The move action custom data struct is:

struct MoveActionData {
    // We can only ever move to a visible tile, so for now, we can
    // just allocate a grid's worth of potential targets.

    // The cheapest cost to reach each move state.
    u16 costs[GRID_VIEW_MAX_X][GRID_VIEW_MAX_Y][/*num directions=*/2][kNumHeroPostures];

    // The parent state on the shortest path that arrives at the given state.
    // I.E., if [1][2][LEFT][STANDING] contains {2,2,LEFT,STANDING}, then we took a step over.
    // The root state points to itself.
    MoveState parents[GRID_VIEW_MAX_X][GRID_VIEW_MAX_Y][/*num directions=*/2][kNumHeroPostures];

    // Used to track whether a view cell has been visited.
    u32 visit_frame;
    u32 visits[GRID_VIEW_MAX_X][GRID_VIEW_MAX_Y];
    
    ViewIndex view_index_target;  // Target view cell index for the move.
    bool entry_selected;
};

Having completed the search, we are able to render all reachable tiles, render a reticle over the cell the user’s mouse is over, and if the cell is reachable, we can backtrack over the cell’s parents to render the cells traversed to get there. (Since states include more than just cell changes, several states on a path can share a cell. We store a u32 visit frame that marks the cells we have already rendered to, so each cell is rendered only once.)
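The parent backtracking and the visit stamp can be sketched together. This example assumes a one-row state space of (cell, facing direction) and a zero-initialized `visits` array; `PathData` and `CollectPathCells` are hypothetical names for the illustration.

```cpp
#include <cstdint>

typedef uint16_t u16;
typedef uint32_t u32;

const int ROW_W = 8;
const int NUM_DIRS = 2;  // 0 = LEFT, 1 = RIGHT

struct MoveState { u16 x; u16 dir; };

struct PathData {
    // The parent state on the shortest path. The root state points to itself.
    MoveState parents[ROW_W][NUM_DIRS];

    // A cell counts as visited only if its stamp equals the current frame.
    u32 visit_frame;
    u32 visits[ROW_W];
};

// Walks the parents from the target back to the root and writes each distinct
// cell once into cells_out. Returns the number of cells written.
int CollectPathCells(PathData* data, MoveState target, u16* cells_out, int max_cells) {
    // Bumping the frame invalidates all earlier stamps without clearing the array.
    data->visit_frame++;

    int n = 0;
    MoveState state = target;
    for (;;) {
        if (data->visits[state.x] != data->visit_frame) {
            data->visits[state.x] = data->visit_frame;
            if (n < max_cells) cells_out[n++] = state.x;
        }
        const MoveState parent = data->parents[state.x][state.dir];
        if (parent.x == state.x && parent.dir == state.dir) break;  // reached the root
        state = parent;
    }
    return n;
}
```

A path that turns around in place visits the same cell in two states, but the stamp check means the cell is only emitted once.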

Finally, when the user clicks on a cell to commit the action, we compute the schedule by:

Extracting the shortest path by traversing back to the source node.
Writing the shortest path out one state change at a time.
Simulating consequences (like falling) after every step, and if any consequences do take place, ending the planned schedule there and appending all consequence events.

This process makes a copy of the current state and applies all changes there. Making a copy, while taking memory, has the advantage of not polluting the actual game state and giving us a second Stage to compare the current Stage to in order to get the overall schedule delta.

Conclusion

With the action system in place, the foundation is laid for actual gameplay. Next time, we’ll look at introducing some core entities back in (ropes, buckets, relics) and managing the overall gameplay cycle.

A TrueType Font

November 30, 2025 by timw

TrUeTyPe FoNt

Early in 2025, I had stumbled on concept images of a font with triangular glyphs. Something about it was appealing. I ended up crafting a similar concept, but extending it so that the equilateral triangles could tile together using both upward and downward facing triangles:

The upward and downward facing glyphs for the triangle font, from A to Z.

I was implementing OpenGL font rendering in C++, using Sean Barrett’s TrueType library to load fonts and construct a font atlas. Somewhere in this whole process I decided it would be fun to implement a font I had conceptualized months ago for real, so to speak, as a TrueType font.

TrueType doesn’t support conditional logic for alternating glyph directions like this, so in practice I achieve the effect by making all uppercase characters be upward facing triangles and all lowercase characters be downward facing triangles. The text below is “HeLlO wOrLd”:

HeLlO wOrLd
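Since the alternation lives in the text rather than in the font, a small helper can prepare strings for rendering. This is a minimal sketch; `AlternateCase` is a hypothetical helper, and it assumes that non-letters are left alone and do not advance the alternation, which matches the “HeLlO wOrLd” example.

```cpp
#include <ctype.h>

// Rewrites text in place so that letters alternate upper, lower, upper, ...
// Uppercase letters map to upward triangles and lowercase to downward ones.
// Non-letters are skipped and do not advance the alternation (an assumption).
void AlternateCase(char* text) {
    bool upper = true;
    for (char* c = text; *c != '\0'; c++) {
        const unsigned char uc = (unsigned char)*c;
        if (!isalpha(uc)) continue;
        *c = (char)(upper ? toupper(uc) : tolower(uc));
        upper = !upper;
    }
}
```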

Having my own game that could load and use the font was helpful for debugging, once the binary was loadable, since I could step through it with a debugger. I also found fontdrop and a hex editor useful.

You can download the font here.

Architecture

There are two fundamental responsibilities: defining the font in memory and serializing it to .ttf:

FontDefinition font = CreateFont();
bool success = ExportFont(&font);

Separating font definition from binary export keeps the system reusable and avoids hard-coded constants leaking into the writer.

CreateFont populates a builder struct:

FontDefinition font;
assert(InitFontDefinition(&font,
    /*units_per_em=*/1000, /*ascent=*/866,
    /*descent=*/0, /*line_gap=*/0,
    /*family_name=*/"TeSsElLaTe",
    /*subfamily_name=*/"Regular",
    /*unique_name=*/"TeSsElLaTe rEgUlAr v1.0",
    /*full_name=*/"TeSsElLaTe rEgUlAr")
  && "Failed to initialize font");

// Create the missing glyph (.notdef) - a simple square
// This glyph is shown when a character is not found in the font
const GlyphId missing_glyph = StartGlyph(&font, /*codepoint=*/0, advance_width, left_side_bearing);
assert(missing_glyph != 0xFFFF && "Failed to start missing glyph");

// Create a square contour with 4 points
assert(StartContour(&font) && "Failed to start contour for missing glyph");
assert(AddPoint(&font, 0, 0, 1) && "Failed to add point 0");
assert(AddPoint(&font, 1000, 0, 1) && "Failed to add point 1");
assert(AddPoint(&font, 1000, 1000, 1) && "Failed to add point 2");
assert(AddPoint(&font, 0, 1000, 1) && "Failed to add point 3");
assert(EndContour(&font) && "Failed to end contour for missing glyph");
assert(EndGlyph(&font) && "Failed to end missing glyph");

The font definition is kept in a truetype.hpp header, along with builder helpers like StartContour and AddPoint. After defining all of the glyphs, we can also add kerning pairs. The default xadvance is set for subsequent equilateral triangles, and I reduce that for pairs that nest together.

ExportFont needs to open a file and write the binary .ttf file. The format consists of a directory listing the tables and their offsets, followed by the tables themselves, each padded to 4 bytes. Each table has a checksum, and the data is exported as big-endian (fwrite writes values in host byte order, which is little-endian on most desktop machines).
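The byte-order and padding rules are small enough to sketch directly. The helper names here (`EncodeU16BE`, `EncodeU32BE`, `PaddingTo4Bytes`) are hypothetical and write into a byte buffer rather than a file, to keep the example self-contained.

```cpp
#include <cstdint>

typedef uint8_t u8;
typedef uint16_t u16;
typedef uint32_t u32;

// Encodes the value most-significant byte first, regardless of host byte order.
void EncodeU16BE(u8* out, u16 value) {
    out[0] = (u8)(value >> 8);
    out[1] = (u8)(value & 0xFF);
}

void EncodeU32BE(u8* out, u32 value) {
    out[0] = (u8)(value >> 24);
    out[1] = (u8)((value >> 16) & 0xFF);
    out[2] = (u8)((value >> 8) & 0xFF);
    out[3] = (u8)(value & 0xFF);
}

// Number of zero bytes needed to bring a table of the given length
// up to the next multiple of 4.
u32 PaddingTo4Bytes(u32 length) {
    return (4 - (length & 3)) & 3;
}
```

Shifting and masking avoids any dependence on the host's endianness, so the same code produces identical files on any machine.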

I wanted to calculate the table checksums as I wrote to disk to avoid multiple passes over the data. This requirement was realized via a TableWriter struct along with some basic helper methods like WriteU16BE(&writer, value), which exports the value in big-endian and updates the checksum.

struct TableWriter {
    FILE* file;
    u32 checksum;

    u8 word_buffer[4];
    u32 word_buffer_index; // 0-4
};

All WriteXXBE helper methods call down to a void FeedBytesToChecksum(TableWriter* writer, const u8* data, u32 size); method, which appropriately updates the checksum and writes the data to the file.
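Here is a minimal sketch of how such a streaming checksum could work, not the exporter's actual code. A TrueType table checksum is the wrapping sum of the table's big-endian 32-bit words, with the final partial word zero-padded, so the writer buffers bytes until a full word is available. As an assumption for the example, a null `file` skips the disk write so the checksum logic can be exercised on its own.

```cpp
#include <cstdint>
#include <cstdio>

typedef uint8_t u8;
typedef uint16_t u16;
typedef uint32_t u32;

struct TableWriter {
    FILE* file;
    u32 checksum;

    u8 word_buffer[4];
    u32 word_buffer_index;  // number of buffered bytes, 0-3 between calls
};

// Adds the bytes to the running checksum one 32-bit word at a time,
// then writes them to the file (skipped when file is null; an assumption).
void FeedBytesToChecksum(TableWriter* writer, const u8* data, u32 size) {
    for (u32 i = 0; i < size; i++) {
        writer->word_buffer[writer->word_buffer_index++] = data[i];
        if (writer->word_buffer_index == 4) {
            const u32 word = ((u32)writer->word_buffer[0] << 24) |
                             ((u32)writer->word_buffer[1] << 16) |
                             ((u32)writer->word_buffer[2] << 8) |
                             (u32)writer->word_buffer[3];
            writer->checksum += word;  // unsigned overflow wraps, as the format expects
            writer->word_buffer_index = 0;
        }
    }
    if (writer->file != nullptr) {
        fwrite(data, 1, size, writer->file);
    }
}

void WriteU16BE(TableWriter* writer, u16 value) {
    const u8 bytes[2] = {(u8)(value >> 8), (u8)(value & 0xFF)};
    FeedBytesToChecksum(writer, bytes, 2);
}

// Feeds zero bytes until the buffered word is complete.
void PadTo4ByteAlignment(TableWriter* writer) {
    const u8 zero = 0;
    while (writer->word_buffer_index != 0) {
        FeedBytesToChecksum(writer, &zero, 1);
    }
}
```

In this sketch, a trailing partial word only reaches the checksum once the padding has been fed, so the checksum should be read after `PadTo4ByteAlignment` whenever a table's length is not a multiple of 4.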

After writing placeholder values for the directory, the exporter writes all of the tables, in alphabetical order. Each table is written as follows:

// Write cmap table
writer.checksum = 0; // Reset.
TableInfo* table_info = &table_infos[table_index];
table_info->offset = (u32)ftell(writer.file);
table_info->length = WriteCmapTable(&writer, font);
table_info->checksum = writer.checksum;
PadTo4ByteAlignment(&writer);
printf("Wrote %s table (%u bytes, offset %u, checksum 0x%08X)\n",
    table_tags[table_index], table_info->length, table_info->offset, table_info->checksum);
table_index++;

Offset values are captured directly from the file stream as each table is written to disk, enabling a single streaming pass without rewinding or recomputing. Each table has its own WriteXXXXTable method, all of which were kept in a truetype_export.hpp header. This made it easy to iterate on the code.

We then go back and update the table directory. Easy peasy.

Finishing a Vertical Slice

I didn’t know whether the font exporter was working until I was able to load the resulting .ttf file in another program. A vertical slice lets you de-risk binary exporters early. I prioritized this by not immediately defining all glyphs and instead just defined the undefined glyph (.notdef), space, and ‘A’. I implemented the minimum necessary set of TrueType tables that could be loaded by tools like stb_truetype and the browser inspector, helping me confirm correctness before scaling out to A–Z and a-z.

Having a testable vertical slice is essential when coding on any project, whether solo or on a team. Coding without knowing whether your code is working is the same as flying blind. A hex editor provides directional confidence that the structure is forming correctly, but real validation only comes when another program successfully loads the font.

Bugs

Sometimes it is interesting to look at what sorts of issues you run into. Understanding how a process fails tells you where to focus on improvements. The bugs that I spent time investigating were:

Glyph alignment (segfault). Each glyph must be word-aligned. IMO, this is not at all obvious from the glyf table documentation.
Malformed contours as a result of exporting glyph vertices rather than relative vertices. This was clearly laid out in the documentation.

Image from developer.apple.com.
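The relative encoding is simple once spelled out: the first point of a glyph is stored relative to (0,0), and every later point is stored as a delta from the point before it. A minimal sketch, with `AbsoluteToRelative` as a hypothetical helper that takes all of a glyph's points in order:

```cpp
#include <cstdint>

typedef int16_t i16;

struct Point { i16 x; i16 y; };

// Converts absolute glyph points into the deltas stored in the glyf table.
// The first point is relative to (0,0); each later point is relative to
// the point before it.
void AbsoluteToRelative(const Point* points, int n_points, Point* deltas_out) {
    Point prev = {0, 0};
    for (int i = 0; i < n_points; i++) {
        deltas_out[i].x = (i16)(points[i].x - prev.x);
        deltas_out[i].y = (i16)(points[i].y - prev.y);
        prev = points[i];
    }
}
```

For the square .notdef contour with corners (0,0), (1000,0), (1000,1000), and (0,1000), the stored deltas are (0,0), (1000,0), (0,1000), and (-1000,0).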

That was enough to get it working with stb_truetype and my font rendering. In order to get the Chrome console to be happy, I additionally had to:

Fix the cmap length, which was miscalculated. Code like this is not ideal:

// Calculate subtable length
// Format 4 header: 14 bytes (format, length, language, segCountX2, searchRange, entrySelector, rangeShift)
// Data: reservedPad (2) + 4 arrays of seg_count u16 values (seg_count * 8)
const u16 subtable_length = 14 + 2 + (seg_count * 8);

Export the OS/2 table, which includes metrics needed by Windows. The Chrome inspector straight up told me to add this.
Fix table alignment (browser load failure). All tables must be 4-byte aligned. This was in the top-level docs:

Image from developer.apple.com.

Lastly, kerning was working in my font rendering but not for all characters in Chrome. It turned out that almost all of my glyphs start at (0,0), but a few, like ‘d’, did not. Chrome was rendering these further to the left. I had to set the left side bearing for those glyphs.
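One way to avoid setting these by hand is to derive the left side bearing from the outline. This is a minimal sketch that assumes the bearing should equal the smallest x coordinate of the glyph's points, which is the usual convention; `ComputeLeftSideBearing` is a hypothetical helper, not part of the exporter described here.

```cpp
#include <cstdint>

typedef int16_t i16;

struct Point { i16 x; i16 y; };

// Returns the smallest x coordinate over all points of a glyph, which is
// used here as the left side bearing. Returns 0 for an empty glyph.
i16 ComputeLeftSideBearing(const Point* points, int n_points) {
    if (n_points == 0) return 0;
    i16 x_min = points[0].x;
    for (int i = 1; i < n_points; i++) {
        if (points[i].x < x_min) x_min = points[i].x;
    }
    return x_min;
}
```

A glyph whose leftmost point sits at x = 250 gets a bearing of 250, while glyphs that touch x = 0 keep a bearing of 0.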

So two counts of failing to use alignment. Seems like a trend I can be more aware of. Despite chasing binary alignment and metrics, I ran into no memory safety issues like null pointers or leaks — problems typically cited as risks of low-level code.

Initializer Lists

I am trying to write in a Muratori-inspired minimal C++ style. That is, C++ without a lot of the C++ features. Avoid classes, macros, templates, and the standard libraries. Why? Ostensibly because simpler code is sufficient, and then easier to understand and faster to compile / execute. Though doing it because I think it is interesting and I admire people like Casey and attitudes like this one from Chris Wellons is also a perfectly valid reason.

I was able to author everything to adhere to this style, but found that I really did want to use initializer lists to simplify glyph creation:

assert(AddGlyph(&font, 'A', advance_width, left_side_bearing,
        {{A, D, N, O, E, B, C}, {S, L, R}}) && "Failed to add A glyph");

I totally could do that without them:

const GlyphId glyph = StartGlyph(&font, 'A', advance_width, left_side_bearing);
assert(glyph != 0xFFFF && "Failed to start glyph");

assert(StartContour(&font) && "Failed to start contour");
assert(AddPoint(&font, font.units_per_em*A.x, font.units_per_em*A.y, 1) && "Failed to add point");
assert(AddPoint(&font, font.units_per_em*D.x, font.units_per_em*D.y, 1) && "Failed to add point");
assert(AddPoint(&font, font.units_per_em*N.x, font.units_per_em*N.y, 1) && "Failed to add point");
assert(AddPoint(&font, font.units_per_em*O.x, font.units_per_em*O.y, 1) && "Failed to add point");
assert(AddPoint(&font, font.units_per_em*E.x, font.units_per_em*E.y, 1) && "Failed to add point");
assert(AddPoint(&font, font.units_per_em*B.x, font.units_per_em*B.y, 1) && "Failed to add point");
assert(AddPoint(&font, font.units_per_em*C.x, font.units_per_em*C.y, 1) && "Failed to add point");
assert(EndContour(&font) && "Failed to end contour");

assert(StartContour(&font) && "Failed to start contour");
assert(AddPoint(&font, font.units_per_em*S.x, font.units_per_em*S.y, 1) && "Failed to add point");
assert(AddPoint(&font, font.units_per_em*L.x, font.units_per_em*L.y, 1) && "Failed to add point");
assert(AddPoint(&font, font.units_per_em*R.x, font.units_per_em*R.y, 1) && "Failed to add point");
assert(EndContour(&font) && "Failed to end contour");

assert(EndGlyph(&font) && "Failed to end glyph");

You can see why I wanted to save myself the typing.

Unfortunately, in addition to including <initializer_list>, I also ended up including <vector> because I was calling AddGlyph with a transform for all downward facing glyphs:

assert(AddGlyph(&font, 'f', advance_width, left_side_bearing,
        ReflectToDownwardGlyph({{A,B,F,J,N,U,V,S,G,C}}, t))
        && "Failed to add f glyph");

I couldn’t have ReflectToDownwardGlyph modify an initializer list and produce a new one, so it has to return a std::vector instead. Oh well.

There probably are reasonable ways to do this in a minimal style. If you happen to know, please send me a message!
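One candidate, offered as a sketch rather than a known-good answer: fixed-capacity plain structs can be brace-initialized as aggregates and returned by value, which avoids both <initializer_list> and <vector>. Everything here is hypothetical (`Vec2`, `Contour`, `GlyphOutline`, `ReflectAboutY`, the capacities), and the reflection about a horizontal line is only a stand-in for whatever transform the real ReflectToDownwardGlyph applies.

```cpp
#include <assert.h>

// Hypothetical stand-ins for the project's types; names and sizes are illustrative.
struct Vec2 { float x, y; };

enum { kMaxContourPoints = 16, kMaxContours = 4 };

struct Contour {
    int n_points;
    Vec2 points[kMaxContourPoints];
};

struct GlyphOutline {
    int n_contours;
    Contour contours[kMaxContours];
};

// Mirrors every point about the horizontal line y = axis_y and reverses the
// point order so that the winding direction is preserved. The result is
// returned by value, so no heap allocation is involved.
static GlyphOutline ReflectAboutY(const GlyphOutline& src, float axis_y) {
    GlyphOutline dst = {};
    dst.n_contours = src.n_contours;
    for (int c = 0; c < src.n_contours; ++c) {
        const Contour& in = src.contours[c];
        Contour& out = dst.contours[c];
        out.n_points = in.n_points;
        for (int p = 0; p < in.n_points; ++p) {
            const Vec2 v = in.points[in.n_points - 1 - p];
            out.points[p] = Vec2{v.x, 2.0f * axis_y - v.y};
        }
    }
    return dst;
}
```

A call site would then read `GlyphOutline g = {1, {{3, {A, B, C}}}};` followed by `ReflectAboutY(g, 0.5f)`. The cost is a fixed upper bound on points and contours, plus explicit counts that have to be kept in agreement with the lists.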

Conclusion

It was fun to work on a self-contained, somewhat artistic project. I got a chance to try both the Cursor and Antigravity agentic IDEs, both of which worked quite well.

I’m not sure that I’ll be able to author posts monthly, but hopefully this is a good start to returning to my creative outlet. Happy Holidays!

Hiatus

April 4, 2025 by timw

After several years of monthly posts, I am stepping back from the blog for a bit. We’re having a baby!

In the meantime, please stay creative, engaged, and curious.

Grandmother Cells and Black Swans

March 6, 2025 by timw

When I was first learning about deep learning, the teacher brought up an issue with image classifiers and black swans. I would call this the black swan problem, but it turns out that has a related but different meaning, so let’s go with swan classifier generalization problem. It goes like this:

If you train a classifier to identify objects in images, and one of those categories is swans, that classifier will tend to be trained exclusively or predominantly on white swans. At deployment it is likely to misclassify images of Australian black swans.

This isn’t all that surprising. Generalization in deep learning is hard.

What is surprising is:

Most humans familiar with swans would be able to recognize an Australian black swan as a swan.

In other words, humans are good at generalization and are able to do it better than many of our traditional deep learning tools, at least in comparison to older deep image classifiers such as those trained on ImageNet.

This post is an exploration into deep learning, classification, and generalization, and about why I think feature embeddings go a long way to building a better knowledge representation. None of this is particularly new or insightful – I just find it useful to work it out and tie it all together.

Grandmother Cells

A deep image classifier takes as input an image and produces as output a softmax distribution over a discrete set of categories:

Conventional knowledge is that deeper levels reason about more sophisticated, higher-level features. Very early convolutional layers typically only have access to small neighborhoods of pixels, so they learn local features like edges and textures. Deeper layers might piece together larger structures like a bird beak or ripples in a lake. Finally, all of this information is brought together to produce the final classification.

Under such an approach, you literally have one, single neuron at the end responsible for firing when it thinks the image has a swan. The higher its output value, the more confident the swan prediction.
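To make that last layer concrete, here is a minimal sketch of the softmax step: one logit per category goes in, one probability per category comes out, and the "swan neuron" is simply one index into that array. The function names are hypothetical, and a real classifier would do this on batched tensors rather than a plain float array.

```cpp
#include <assert.h>
#include <math.h>

// Converts n logits into probabilities: probs[i] = exp(l_i) / sum_j exp(l_j).
// Subtracting the largest logit first keeps expf() from overflowing.
static void Softmax(const float* logits, float* probs, int n) {
    assert(n >= 1);
    float max_logit = logits[0];
    for (int i = 1; i < n; ++i) {
        if (logits[i] > max_logit) max_logit = logits[i];
    }
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        probs[i] = expf(logits[i] - max_logit);
        sum += probs[i];
    }
    for (int i = 0; i < n; ++i) {
        probs[i] /= sum;
    }
}

// Index of the most probable category.
static int ArgMax(const float* values, int n) {
    int best = 0;
    for (int i = 1; i < n; ++i) {
        if (values[i] > values[best]) best = i;
    }
    return best;
}
```

Whatever the network knows about the image is squeezed through these n numbers, and the final decision is a single `ArgMax` over them.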

There have historically been two opposing views on the relationship between brains and behavior — the localist view that specific brain regions are responsible for specific behaviors, and the holistic view that neural activity is spread out throughout the nervous system. A critic of the localist view might think it absurd that there is a single neuron somewhere in your head that fires over the concept of “Grandmother”.

A grandmother cell is just that — a neuron exclusively dedicated to one high-level but specific concept. (Funny enough, there was a lot of hubbub in 2005 about recordings that suggested that a single neuron had been found that triggers only for Jennifer Aniston.)

By and large, researchers do not believe that the best way to represent knowledge is through a 1:1 representation such as grandmother cells. As such, it may not come as a surprise when a traditional convolutional deep image classifier, like those trained on ImageNet, struggles to classify black swans.

Sparse Encodings

A traditional image classifier is structured to prefer sparse outputs. A confident prediction should produce a high value in the appropriate category and very low values elsewhere:

If the model is uncertain, it is forced to make the appropriate trade-off by assigning some probability mass to the other potential categories. That might mean assigning some likelihood to other black birds:

This sort of representation might be convenient for the output of a classifier, but it isn’t all that useful for reasoning. If I am trying to think about what it means for an object to be a swan, I don’t want to simply know it is a swan, I want to think about where I might find it (e.g. on a lake), what it might do (e.g. honk at me), and sure, what it looks like (e.g. tends to be white-feathered).

A reasoning network that receives the fact that there is a swan would thus have to unpack this discrete bit of knowledge into these myriad facts:

Worse yet, if you have an uncertain input, you have to unpack all contributors and figure out how to combine them:

Working directly with the discrete bit of knowledge is fragile. If \(P(\text{swan})\) is low, then the network just can’t associate the object to a swan. However, if we’re working with the distributed properties and associations of a swan, it’s a whole lot easier to get to swan even if one property (color) is unusual.

The third thing going on here is that sparse representations don’t use the state space as efficiently. If my reasoning network receives a \(128\)-dimensional vector, and we’re working with one-hot encodings where everything but one dimension is zero, then we can only represent 128 different concepts. In contrast, if we’re willing to use the whole state space, we can represent more or less any number of concepts.

A 2D embedding for the MNIST digits. The left-side shows how digits are mapped to the embedding, and the right shows how samples from the space produce digit images. Images from Algorithms for Decision Making.

Discrete representations are thus hard to work with, fragile, and wasteful.

Transformer Embeddings

You might think that transformer models suffer from this same problem of discrete reasoning, as they operate on sequences of one-hot tokens. However, these discrete tokens are immediately mapped to a rich embedding vector:

The encoder literally has a separate high-dimensional embedding vector for each discrete token. If we have a vocabulary of \(m\) unique tokens and an \(n\)-dimensional embedding space, then our embeddings are given by an \(n \times m\) matrix. Multiplying by the one-hot token extracts the embedding vector:

\[\boldsymbol{e}^{(i)} = \begin{bmatrix}\boldsymbol{e}^{(1)} & \boldsymbol{e}^{(2)} & \cdots & \boldsymbol{e}^{(i)} & \cdots & \boldsymbol{e}^{(m)} \end{bmatrix} \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}\]
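In code, multiplying by a one-hot vector is the same as reading out one column, which is why implementations typically index into the matrix instead of multiplying. A small sketch, assuming a column-major \(n \times m\) matrix stored in a flat float array (the names `MatVec` and `EmbeddingColumn` are made up for this example):

```cpp
#include <assert.h>

// E is an n x m matrix stored column-major: column j holds the embedding of token j.
// Computes the full matrix-vector product out = E * x.
static void MatVec(const float* E, const float* x, float* out, int n, int m) {
    for (int r = 0; r < n; ++r) {
        out[r] = 0.0f;
    }
    for (int c = 0; c < m; ++c) {
        for (int r = 0; r < n; ++r) {
            out[r] += E[c * n + r] * x[c];
        }
    }
}

// Equivalent lookup when x is one-hot: a pointer to the column for `token`.
static const float* EmbeddingColumn(const float* E, int n, int token) {
    return E + token * n;
}
```

With \(n = 2\), \(m = 3\), and `E = {1, 2, 3, 4, 5, 6}`, both the product with the one-hot vector `{0, 1, 0}` and `EmbeddingColumn(E, 2, 1)` give the second column, `{3, 4}`.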

Transformers learn what values to assign to these embedding vectors. As such, they can pack a lot of meaning into those \(n\) dimensions, far more than would be available if they were stuck operating on \(m\) discrete categories.

I talk about transformers in Transformers, How and Why They Work, but gloss over what these embedding vectors really give us. Transformer layers are best thought of as taking the input embedding vectors, which each point in some direction, and incrementally rotating them to point in other directions. (There’s a great video of this by 3Blue1Brown.)

Keep in mind that these are actually very high-dimensional feature spaces.

The initial embedding value is the one the raw token is associated with. The final embedding value is the output of all of the transformer layers, right before a final set of affine layers to go from an \(n\)-dimensional embedding to an \(m\)-dimensional set of logits for the next token.

That means:

the final embedding should have all of the information for predicting the next token
the initial embedding value should have the superposition of all meanings of the general token

Those are two very different things – hence the need for all those transformer layers and incremental updates.

A final embedding that predicts the next word in “the cat sat on the” needs to capture all the things that a cat might sit on (e.g. laps, mats), as well as adjectives of places cats might sit (e.g. warm laps), and who knows what else people append to that sentence. There is no way that we could capture that superposition of meaning with only \(m\) discrete options.

Interestingly, an initial embedding value also needs to represent a superposition of concepts. The embedding for “mat” for example, might mean a nice place for a cat to sit in one context, but could also be a large concrete slab, or a thick wad of hair.

That means the initial embedding for “mat” should lie in a similar direction as those other concepts. At the very least, it would likely have subcomponents of its \(n\)-dimensional feature space that lie in similar directions. A part of “mat” and “slab” will align for the meaning they share, and likely a different part of “mat” and “matted hair” would align for the meaning those two words share.

This all shows how Transformers are able to assign juxtapositions of meanings using embedding vectors. They can do this in part because the continuous feature space allows for cramming a lot of meanings into the same number of dimensions, allows for smoothly interpolating between meanings, and because a single direction can be made up of sub-directions that have their own meanings. That is exactly what we’re looking for when we want to predict a black swan when we know we need a word for a feathered creature with a big beak on a lake that honks and happens to be colored black.

Why do Transformers learn all this? Because they have to in order to predict the next token accurately. Language is incredibly rich and carries all of these layered meanings.

More Holistic than Local?

One of the big takeaways for me is that knowledge representation using transformer-like dense embeddings is able to pack in and superimpose many concepts, and ends up being less fragile as a result. If enough of the concepts point to a swan, we can still deduce that we need a swan.

I am not a neuroscientist, but I find it highly likely that true grandmother neurons are quite rare. Instead, meaning is more likely to be found packed into subspaces represented by groups of neural firing patterns.

Similarly, I am reminded of writing software for robotics applications. If you’re writing code yourself, then you’re likely basing your reasoning on a comparatively small set of concepts. Take a sidewalk delivery robot for example. A reasonable programmer might try to enforce that the sidewalk delivery robot never cross a light on red. However, said reasonable programmer might get sad when they run into a case where an intersection is under construction and the light is red, but the path is open to pedestrians and delivery robots:

The world is a complicated place, and we quickly find that the number of cases our coded heuristics can handle is quite small. We can easily set ourselves up for failure if we code like grandmother cells. At the very least, code that makes declarative statements needs to be very careful to make declarative statements about what it really is judging — whether the contextual scene is appropriate for crossing or not rather than whether the light is red. As we’ve already learned – getting hung up on color is what motivated this blog post in the first place.

Can we rewrite robotics logic to use embedding vectors? Perhaps. Unfortunately, transformer embeddings are fairly inscrutable to anyone other than the transformer. Interpretable machine learning is still a nascent field.

I think transformers are incredible, but I also think some fundamental properties are missing. They don’t really understand the world yet, not really. AI fails in embarrassing ways. We as humans can write code to reason about \(m\) discrete things. We have a much harder time reasoning about all of the overlapping subtleties of the real world.

Conclusion

This post doesn’t really have a decisive answer. In fact, I hope it serves as food for thought.

I think it is worth pondering these questions as we move forward in this bold new world of Software 2.0, where everything is a transformer and we can do more than we ever could before but don’t really understand how it works or just how far it can take us.

I think this is an incredibly exciting time. It seems like we are very close to cracking the nut of “how we think”. Heck, even John Carmack started working on AI because he feels similarly.

There is likely still something to be learned from how the human brain works. It is the reference model that we have, the irrefutable evidence that there are systems that can reliably learn and then reliably perform well on real-world tasks. Intelligences that know that crossing at red is fine when the road is blocked to cars. Intelligences that more-or-less do the right thing, with very few training examples. Those intelligences may not have figured it out just yet, but perhaps by searching more within, they can finally get to the bottom of things.

Coding your own Tools

February 1, 2025 by timw

We’re working on finalizing the 2nd Edition of Algorithms for Optimization. We originally set out to add three new chapters, but in addition to that overhauled most of the book. It’s pretty exciting!

These projects are so big, and so long, that you’ll inevitably run into all sorts of new challenges because time has passed, some dependencies are no longer supported, you try to do things better / a different way, or the MIT Press requirements have changed. That last one is the inspiration for this blog post.

I ended up writing some new tooling to support alttext. Short for alternative text for images, alttext is textual content that can be associated with an image and read out to someone using a screen reader. It wasn’t part of the submission materials when we wrote Algorithms for Optimization v1 and Algorithms for Decision Making, but this time around, MIT Press asked that we supply alttext for every figure in a big spreadsheet. New challenge!

Mykel and I are somewhat different when it comes to being textbook authors. Most authors submit large Word documents with disparate images and let the MIT Press team handle the final text layout. Not us. We provide the final printable PDF.

Our setup is quite nice. It is all under source control, we have a ton of control over how everything looks, and we have everything for the book in one place.

When I saw the ask to supply this additional spreadsheet, I instantly became worried that having a separate sheet could cause problems. That sheet needs to be kept in-sync with the textbook — if any figures are added or removed, we want to make sure they are also added or removed from the sheet. The sheet is also a somewhat inconvenient place to write the alttext. Ideally it would be defined in the LaTeX documents, alongside the figure that it describes. Most importantly, we need to know if we’re missing any alttext.

Storing Tests by Code

We already have some nice technology in our textbook-writing workflow that lets us use the algorithms that we present to the reader to both generate figures and author + execute unit tests.

We present our algorithms using Pythontex and algorithm environments:

\begin{algorithm}
  \begin{juliaverbatim}
    diff_forward(f, x; h=1e-9) = (f(x+h) - f(x))/h
    diff_central(f, x; h=1e-9) = (f(x+h/2) - f(x-h/2))/h
    diff_backward(f, x; h=1e-9) = (f(x) - f(x-h))/h
  \end{juliaverbatim}
  \caption{...}
\end{algorithm}

The juliaverbatim blocks get typeset, but they aren’t executed.

We have a script that parses our source files for algorithm blocks and exports the juliaverbatim contents into a big Julia source file belonging to an Alg4Opt.jl Julia package. We can then load this package when executing Pythontex blocks that do execute, for generating our figures.

We have had unit testing since the beginning. When we first wrote Algorithms for Optimization, we had the unit tests in a separate directory, written in the test files for the Alg4Opt.jl Julia package we exported to. That worked, but the tests were written in an entirely different place than the methods. Sound like storing alttext somewhere other than the figures?

We ended up defining a no-op LaTeX environment:

\excludecomment{juliatest}

and then added those after every algorithm block:

\begin{juliatest}
let
  for (f,x,∂) in [(x->x,    0.0, 1.0),
                  (x->x,    1.0, 1.0),
                  (x->x,    1.0, 1.0),
                  (x->x^2,  0.0, 0.0),
                  (x->x^2,  1.0, 2.0),
                  (x->x^2, -1.0,-2.0)]
    @test isapprox(diff_forward(f, x), ∂, atol=1e-6)
    @test isapprox(diff_central(f, x), ∂, atol=1e-6)
    @test isapprox(diff_backward(f, x), ∂, atol=1e-6)
  end
end
\end{juliatest}

We then parse the LaTeX source files in the same way we do for the algorithm blocks, and export the contents of any juliatest block as unit tests.

Storing the tests next to the algorithms makes things a lot nicer.

Pulling Alttext

I decided that I could do something very similar for alttext.

I defined a dummy command that, like juliatest, does nothing when compiling the book, but lets us put the alttext content into it:

\newcommand{\alttext}[1]{}

We can then use it in the source code to define the alttext alongside the figure:

\caption{
  A one-dimensional optimization problem.
  Note that the minimum is merely the best in the feasible set---lower points may exist outside the feasible region.
  \label{fig:one-d-opt-prob}
  \alttext{A line chart with a single undulating curve and an interval 
           containing a local minimum identified as the feasible set.}
}

I then wrote a script that runs through our source files, finds all figure and marginfigure blocks, and searches each one for such a command. If it finds one, great: we can pull out the alttext content and export it to that spreadsheet we need. If not, we can print out a warning that the figure (whose \label ID we also extract) is missing alttext. A nice, simple scripted solution.
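The core of that search is pulling the argument out of an \alttext{...} command. A regex that stops at the first closing brace works for simple descriptions, but counting brace depth also copes with nested groups and with arguments that wrap across lines. Here is a sketch of that variant in plain C++; `ExtractAltText` is a hypothetical helper, not part of the script, and it deliberately ignores escaped braces and LaTeX comments.

```cpp
#include <assert.h>
#include <string.h>

// Copies the argument of the first \alttext{...} in src into out, which has room
// for out_cap chars including the terminating zero. Braces are matched by depth,
// so nested groups such as $x_{1}$ survive and the argument may span lines.
// Returns the number of characters written, or -1 if no complete \alttext{...}
// was found or the buffer is too small.
static int ExtractAltText(const char* src, char* out, int out_cap) {
    const char* key = "\\alttext{";
    if (out_cap <= 0) return -1;
    const char* p = strstr(src, key);
    if (p == 0) return -1;
    p += strlen(key);

    int depth = 1;  // we are just past the opening brace
    int n = 0;
    for (; *p != '\0'; ++p) {
        if (*p == '{') depth += 1;
        if (*p == '}') {
            depth -= 1;
            if (depth == 0) {
                out[n] = '\0';
                return n;
            }
        }
        if (n + 1 >= out_cap) return -1;  // keep room for the terminator
        out[n++] = *p;
    }
    return -1;  // ran off the end without finding the matching brace
}
```

Running this over the text of each figure block gives either the description or a clear "missing" signal, which is all the spreadsheet export needs.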

The full script:

using Printf

mutable struct FigureEntry
    file_index::Int    # Index into chapter files
    line_index_lo::Int # Index into the chapter's lines at which the \begin resides
    line_index_hi::Int # Index into the chapter's lines at which the \end resides
    label::String      # Figure label, as defined by \label command (or empty)
                       # Figures may not have labels if they are in solutions or examples.
    alttext::String    # Alt text, as given by an \alttext command (or empty)
                       # Every figure is expected to have alttext for the final deliverable.
end

function is_start_of_block(str, block)
    return startswith(str, "\\begin{$block}")
end

function is_end_of_block(str, block)
    return startswith(str, "\\end{$block}")
end

function get_files(; chapter_regex::Regex = r"include\{chapter")
    retval = String[]
    for line in readlines("optimization-chapter.tex")
        if occursin(chapter_regex, line)
            m = match(r"chapter/\S*(?=\})", line)
            @assert isa(m, RegexMatch)
            push!(retval, m.match*".tex")
        end
    end
    return retval
end

function find_matching_paren(str::String, starting_index::Int=something(findfirst(isequal('('), str), 0))
    @assert str[starting_index] == '('
    nopen = 1
    i = starting_index
    n = lastindex(str)
    while nopen > 0 && i < n
        i = nextind(str,i)
        nopen += str[i] == '('
        nopen -= str[i] == ')'
    end
    return nopen == 0 ? i : -1
end

"""
Find the text for a label, such as "fig:gradient_descent_rosenbrock" from
\\label{fig:gradient_descent_rosenbrock}

There should only ever be one \\label entry. In the event that there are multiple,
this method returns the first one.
If no label is found, this method returns an empty string.
"""
function find_label(lines, line_index_lo::Int, line_index_hi::Int)::String
    for line in lines[line_index_lo:line_index_hi]
        m = match(r"\\label\{([a-zA-Z0-9_:\\-]+)\}", line)
        if isa(m, RegexMatch)
            return m[1]
        end
    end
    return ""
end

"""
Find the alttext for a figure, which is contained inside an \\alttext{} command.
There should only ever be one \\alttext entry per figure. In the event that there are multiple,
this method returns the first one.
If no alttext is found, this method returns an empty string.
"""
function find_alttext(lines, line_index_lo::Int, line_index_hi::Int)::String
    # Join the block first so that an \alttext{...} spanning several lines still matches.
    block = join(strip.(lines[line_index_lo:line_index_hi]), " ")
    m = match(r"\\alttext\{([^}]+)\}", block)
    if isa(m, RegexMatch)
        return m[1]
    end
    return ""
end

function pull_figures()
    is_start_of_ignore = str -> is_start_of_block(str, "ignore")
    is_start_of_figure = str -> is_start_of_block(str, "figure")
    is_start_of_marginfigure = str -> is_start_of_block(str, "marginfigure")
    is_start_of_relevant_block = str -> is_start_of_figure(str) || is_start_of_marginfigure(str) || is_start_of_ignore(str)

    figures = FigureEntry[]
    for (file_index, filepath) in enumerate(get_files())
        filename = splitext(splitdir(filepath)[2])[1]

        println("\treading ", filename)
        lines = [replace(line, "\n"=>"") for line in open(readlines, filepath, "r")]

        counter = 0
        i = something(findfirst(is_start_of_relevant_block, lines), 0)
        while i != 0
            block = is_start_of_ignore(lines[i]) ? "ignore" :
                    is_start_of_figure(lines[i]) ? "figure" :
                                                   "marginfigure"
            j = findnext(str -> is_end_of_block(str, block), lines, i+1)

            if block != "ignore"
                label = find_label(lines, i, j)
                alttext = find_alttext(lines, i, j)
                push!(figures, FigureEntry(file_index, i, j, label, alttext))
            end

            i = something(findnext(is_start_of_relevant_block, lines, j), 0)
        end
    end
    return figures
end

# Find all figure and marginfigure blocks
println("Pulling all figures")
figures = pull_figures()
for (i_figure, figure) in enumerate(figures)
    @printf "%3d %2d [%04d:%04d] %s\n" i_figure figure.file_index figure.line_index_lo figure.line_index_hi figure.label
    println("      $(figure.alttext)")
end

n_figures_missing_alttext = count(fig -> fig.alttext == "", figures)
if n_figures_missing_alttext > 0
    println("MISSING ALT TEXT!")
    files = get_files()
    for (i_figure, figure) in enumerate(figures)
        if figure.alttext != ""
            continue # only report the figures that are missing alt text
        end
        label_text = figure.label
        if label_text == ""
            label_text = "UNLABELED"
        end
        @printf "%2d %s in %s\n" i_figure label_text files[figure.file_index]
    end
end

println("")
println("$(length(figures) - n_figures_missing_alttext) / $(length(figures)) figures have alt text")
println("Good job!")

The Joy of Coding your own Tools

That’s what this blog post is really about. The fact that you can dig in and code your own solution. We spend so much time coding for big company projects, that it is easy to forget that we can code small, useful things for ourselves.

The coding we did here is not particularly clever, nor particularly difficult, nor particularly large. That isn’t the point. The point is that we had a problem, and we were able to solve it ourselves with software. Our tools of the trade were brought to bear on our own problem.

I don’t often use coding to solve my own problems, but it does happen every so often. I used coding to create placecards for my wedding, for example, and to create the wedding website. I’ve written code to generate .svg files for CNC laser cutters, in order to craft a loved one a nice birthday present. In high school, I wrote a basic notecard program for practicing my French vocab. That one was super useful.

I am a big fan of Casey Muratori of Handmade Hero (and Computer, Enhance!), which gave rise to the handmade movement. The ideas there are very similar — there is joy to be had from building things yourself, and you are smart enough to dive into something and learn how it works.

Anyhow, I think it’s nice to be reminded of all this from time to time. Happy coding!

Emscripten is Neat

January 1, 2025 by timw

I spent most of December on vacation, so no big post. However, I did learn about Emscripten, a C++ compiler that produces an executable in WebAssembly. I tried it out and have to say, it was remarkably easy to get something up and running.

I made a very simple crossword game. You can try it out here.

Best on desktop. While it does load in the browser, it is looking for key events, which is really cumbersome on mobile.

This was written with SDL2 and Dear ImGui.

Compilation is pretty straightforward. Instead of running gcc <srcs> <libs> you run emcc <srcs> <libs>. The compiler does some magic to reinterpret use of OpenGL with WebGL, and spits out a .wasm binary blob, a .js script that can execute it, and a .html page that runs it all. My game link above is simply to that .html page.

To test it locally without uploading it to a web page, you can simply start a Python server in your dev directory:

python -m http.server

and then open your .html file in the browser:

http://localhost:8000/your_thing.html

I spend a lot of time tinkering with games and other programs in C++, but I’m developing on a Linux machine. As a result, I can’t really share my games with my friends. Emscripten might be a nice way to achieve that.

Happy 2025 everyone. Hope it’s a good one.

Rollouts on the GPU

December 1, 2024 by timw

Last month I wrote about moving the Sokoban policy training code from CPU to GPU, yielding massive speedups. That significantly shortened both training time and the time it takes to compute basic validation metrics. It has not, unfortunately, significantly changed how long it takes to run rollouts, and relatedly, how long it takes to run beam search.

The Bottleneck

The training that I’ve done so far has all been with teacher forcing, which allows all inputs to be passed to the net at once:

When we do a rollout, we can’t pass everything in at once. We start with our initial state and use the policy to discover where we end up:

The problem is that the left-side of that image, the policy call, is happening on the GPU, but the right side, the state advancement, is happening on the CPU. If a rollout involves 62 player steps, then instead of one data transfer step like we have for training, we’re doing 61 transfers! Our bottleneck is all that back-and-forth communication:

Let’s move everything to the GPU.

CPU Code

So what is currently happening on the CPU?

At every state, we are:

Sampling an action for each board from the action logits
Applying that action to each board to advance the state

Sampling from the actions is pretty straightforward to run on the GPU. That’s the bread and butter of transformers and RL in general.

# policy_logits are [a×s×b] (a=actions, s=sequence length, b = batch size)
policy_logits, nsteps_logits = policy(inputs)

# Sample from the logits using the Gumbel-max trick
sampled_actions = argmax(policy_logits .+ gumbel_noise, dims=1)

where we use the Gumbel-max trick and the Gumbel noise is sampled in advance and passed to the GPU like the other inputs:

using Distributions
# One Gumbel(0, 1) sample per logit, matching the [a×s×b] shape of policy_logits
gumbel_noise = rand(Gumbel(0, 1), a, s, b)
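For reference, the Gumbel-max trick itself fits in a few lines: add independent Gumbel(0, 1) noise to each logit and take the argmax, which yields index \(i\) with probability \(\text{softmax}(\text{logits})_i\). The sketch below is a CPU-side illustration in plain C++ rather than the GPU code; the xorshift generator, the function names, and the 16-action cap are all assumptions made for the example.

```cpp
#include <assert.h>
#include <math.h>
#include <stdint.h>

// Small xorshift32 generator standing in for a real RNG. state must be nonzero.
struct Rng { uint32_t state; };

static uint32_t NextU32(Rng* rng) {
    uint32_t x = rng->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng->state = x;
    return x;
}

// Uniform sample strictly inside (0, 1), so both logs below stay finite.
static double NextUniform(Rng* rng) {
    return ((double)(NextU32(rng) >> 8) + 0.5) / 16777216.0;
}

// Standard Gumbel(0, 1) sample via the inverse CDF: g = -log(-log(u)).
static double SampleGumbel(Rng* rng) {
    return -log(-log(NextUniform(rng)));
}

// Index of the largest value of logits[i] + noise[i].
static int ArgMaxWithNoise(const double* logits, const double* noise, int n) {
    int best = 0;
    for (int i = 1; i < n; ++i) {
        if (logits[i] + noise[i] > logits[best] + noise[best]) best = i;
    }
    return best;
}

// Draws one action index distributed as softmax(logits). Sketch limit: n <= 16.
static int SampleAction(const double* logits, int n, Rng* rng) {
    double noise[16];
    assert(n >= 1 && n <= 16);
    for (int i = 0; i < n; ++i) {
        noise[i] = SampleGumbel(rng);
    }
    return ArgMaxWithNoise(logits, noise, n);
}
```

Splitting the noise generation from the argmax mirrors the setup above: the noise can be drawn ahead of time, and the only work left at sampling time is an elementwise add and an argmax.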

Advancing the board states is more complicated. Here is the CPU method for a single state:

function maybe_move!(board::Board, dir::Direction)::Bool
    □_player::TileValue=find_player_tile(board)
    step_fore = get_step_fore(board, dir)

    □ = □_player # where the player starts
    ▩ = □ + step_fore # where the player potentially ends up

    if is_set(board[▩], WALL)
        return false # We would be walking into a wall
    end

    if is_set(board[▩], BOX)
        # We would be walking into a box.
        # This is only a legal move if we can push the box.
        ◰ = ▩ + step_fore # where box ends up
        if is_set(board[◰],  WALL + BOX)
            return false # We would be pushing the box into a box or wall
        end

        # Move the box
        board[▩] &= ~BOX # Clear the box
        board[◰] |= BOX # Add the box
    end

    # At this point we have established this as a legal move.
    # Finish by moving the player
    board[□] &= ~PLAYER # Clear the player
    board[▩] |= PLAYER # Add the player

    return true
end

There are many ways to represent board states. This representation is a simple Matrix{UInt8}, so an 8×8 board is just an 8×8 matrix. Each tile is a bitfield with components that can be set for whether that tile has/is a wall, box, floor, or player.

Moving the player has 3 possible paths:

successful step: the destination tile is empty and we just move the player to it
successful push: the destination tile has a box, and the next one over is empty, so we move both the player and the box
failed move: otherwise, this is an illegal move and the player stays where they are

Moving this logic to the GPU has to preserve this flow, use the GPU’s representation of the board state, and handle a tensor’s worth of board states at a time.

GPU Representation

The input to the policy is a tensor of size \([h \times w \times f \times s \times b]\), where 1 board is encoded as a sparse \((h = \text{height}) \times (w=\text{width}) \times (f = \text{num features} = 5)\) tensor:

and we have board sequences of length \(s\) and \(b\) sequences per batch of them:

I purposely chose 4-step boards here, but sequences can generally be much longer and of different lengths, and the first state in each sequence is the goal state.

Our actions will be the \([4\times s \times b]\) actions tensor — one up/down/left/right action per board state.
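
Here is a rough sketch of how one bitfield board could be expanded into an \(h \times w \times f\) Boolean tensor. The feature order (wall, floor, goal, box, player) and the helper encode_board are assumptions for illustration; only the box and player positions at features 4 and 5 are suggested by the code later in this post.

```julia
# Assumed feature order: wall, floor, goal, box, player (bit i ↔ feature i).
const N_FEATURES = 5

# Expand a bitfield board into a Bool tensor of size [h, w, f].
function encode_board(board::Matrix{UInt8})
    h, w = size(board)
    features = falses(h, w, N_FEATURES)
    for f in 1:N_FEATURES
        bit = 0x01 << (f - 1)
        features[:, :, f] .= (board .& bit) .!= 0x00
    end
    return features
end

board = UInt8[0x01 0x01 0x01;
              0x01 0x12 0x01;   # center tile: floor + player
              0x01 0x01 0x01]
x = encode_board(board)
size(x)  # (3, 3, 5)
```

Stacking such tensors along two more dimensions (sequence index, then batch index) gives the full \([h \times w \times f \times s \times b]\) layout.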

Shifting Tensors

The first fundamental operation we’re going to need is to be able to check tile neighbors. That is, instead of doing this:

□ = □_player # where the player starts
▩ = □ + step_fore # where the player potentially ends up

we’ll be shifting all tiles over and checking that instead:

is_player_dests = shift_tensor(is_players, 0, 1, false) # d_row=0, d_col=1

The shift_tensor method takes in a tensor and shifts it by the given number of rows and columns, padding in new values:

We pass in the number of rows or columns to shift, figure out what that means in terms of padding, and then leverage NNlib’s pad_constant method to give us a new tensor that we clamp to a new range:

function shift_tensor(
    tensor::AbstractArray,
    d_row::Integer,
    d_col::Integer,
    pad_value)

    pad_up    = max( d_row, 0)
    pad_down  = max(-d_row, 0)
    pad_left  = max( d_col, 0)
    pad_right = max(-d_col, 0)

    tensor_padded = NNlib.pad_constant(
        tensor,
        (pad_up, pad_down, pad_left, pad_right, 
            (0 for i in 1:2*(ndims(tensor)-2))...),
        pad_value)

    dims = size(tensor_padded)
    row_lo = 1 + pad_down
    row_hi = dims[1] - pad_up
    col_lo = 1 + pad_right
    col_hi = dims[2] - pad_left

    return tensor_padded[row_lo:row_hi, col_lo:col_hi,
                         (Colon() for d in dims[3:end])...]
end

This method works on tensors with varying numbers of dimensions, and always operates on the first two dimensions as the row and column dimensions.
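
To make the shift semantics concrete, here is a small Base-only sketch for plain matrices. shift_matrix is a hypothetical stand-in that is meant to mirror what shift_tensor does to the first two dimensions; it is not the NNlib-based implementation above.

```julia
# Shift a matrix by (d_row, d_col), filling vacated cells with pad_value.
# Positive d_row moves content down; positive d_col moves it right.
function shift_matrix(A::AbstractMatrix, d_row::Integer, d_col::Integer, pad_value)
    h, w = size(A)
    out = fill(convert(eltype(A), pad_value), h, w)
    for c in 1:w, r in 1:h
        r2, c2 = r + d_row, c + d_col
        if 1 <= r2 <= h && 1 <= c2 <= w
            out[r2, c2] = A[r, c]
        end
    end
    return out
end

A = [1 2 3;
     4 5 6]

shift_matrix(A, 0, 1, 0)   # [0 1 2; 0 4 5]
shift_matrix(A, -1, 0, 0)  # [4 5 6; 0 0 0]
```

Values that fall off the edge are discarded, and the output always has the same size as the input.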

Taking Actions

If we know the player move, we can use the appropriate shift direction to get the “next tile over”. Our player moves can be reflected by the following row and column shift values:

UP = (d_row=-1, d_col= 0)
LEFT = (d_row= 0, d_col=-1)
DOWN = (d_row=+1, d_col= 0)
RIGHT = (d_row= 0, d_col=+1)

This lets us convert the CPU-movement code into a bunch of Boolean tensor operations:

function advance_boards(
    inputs::AbstractArray{Bool}, # [h,w,f,s,b]
    d_row::Integer,
    d_col::Integer)

    boxes  = inputs[:,:,DIM_BOX,   :,:]
    player = inputs[:,:,DIM_PLAYER,:,:]
    walls  = inputs[:,:,DIM_WALL,  :,:]

    player_shifted = shift_tensor(player, d_row, d_col, false)
    player_2_shift = shift_tensor(player_shifted, d_row, d_col, false)

    # A move is valid if the player destination is empty
    # or if it's a box and the next space over is empty
    not_box_or_wall = .!(boxes .| walls)

    # 1 if it is a valid player destination tile for a basic player move
    move_space_empty = player_shifted .& not_box_or_wall

    # 1 if the tile is a player destination tile containing a box
    move_space_isbox = player_shifted .& boxes

    # 1 if the tile is a player destination tile whose next one over
    # is a valid box push receptor
    push_space_empty = player_shifted .& shift_tensor(not_box_or_wall, -d_row, -d_col, false)

    # 1 if it is a valid player move destination
    move_mask = move_space_empty

    # 1 if it is a valid player push destination
    # (which also means it currently has a box)
    push_mask = move_space_isbox .& push_space_empty

    # new player location
    mask = move_mask .| push_mask
    player_new = mask .| (player .* shift_tensor(.!mask, -d_row, -d_col, false))

    # new box location
    box_destinations = shift_tensor(boxes .* push_mask, d_row, d_col, false)
    boxes_new = (boxes .* (.!push_mask)) .| box_destinations

    return player_new, boxes_new
end

The method appropriately moves any player tile that has an open space in the neighboring tile, or any player tile that has a neighboring pushable box. We create both a new player tensor and a new box tensor.
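
The mask logic is easier to follow on a single row. The sketch below replays the same Boolean steps for a move to the right along a one-dimensional corridor. The helpers shift_r, shift_l, and step_right are hypothetical and written only for this illustration.

```julia
# Shift a Bool vector one step right or left, padding with false.
shift_r(v) = vcat(false, v[1:end-1])
shift_l(v) = vcat(v[2:end], false)

function step_right(walls, player, boxes)
    player_shifted  = shift_r(player)              # the player's destination tile
    not_box_or_wall = .!(boxes .| walls)
    move_mask = player_shifted .& not_box_or_wall  # plain step is possible
    push_mask = player_shifted .& boxes .& shift_l(not_box_or_wall)  # push is possible
    mask = move_mask .| push_mask
    player_new = mask .| (player .& shift_l(.!mask))  # stay put if the move failed
    boxes_new  = (boxes .& (.!push_mask)) .| shift_r(boxes .& push_mask)
    return player_new, boxes_new
end

# Corridor: wall, player, box, floor, wall. The push succeeds.
walls  = Bool[1, 0, 0, 0, 1]
player = Bool[0, 1, 0, 0, 0]
boxes  = Bool[0, 0, 1, 0, 0]
player_new, boxes_new = step_right(walls, player, boxes)
# player_new == [0, 0, 1, 0, 0], boxes_new == [0, 0, 0, 1, 0]
```

If the tile behind the box is a wall instead, both masks come out empty and the player and box stay where they are.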

This may seem extremely computationally expensive — we’re operating on all tiles rather than on just the ones we care about. But GPUs are really good at exactly this, and it is much cheaper to let the GPU churn through that than wait for the transfer to/from the CPU.

The main complication here is that we’re using the same action across all boards. In a given instance, there are \(s\times b\) boards in our tensor. We don’t want to be using the same action in all of them.

Instead of sharding different actions to different boards, we’ll compute the results of all 4 actions and then index into the resulting state that we need:

Working with GPUs sure makes you think differently about things.

function advance_boards(
    inputs::AbstractArray{Bool}, # [h,w,f,s,b]
    actions::AbstractArray{Int}) #       [s,b]

    # Each call returns (player_new, boxes_new), both [h,w,s,b]
    succ_u = advance_boards(inputs, -1,  0)
    succ_l = advance_boards(inputs,  0, -1)
    succ_d = advance_boards(inputs,  1,  0)
    succ_r = advance_boards(inputs,  0,  1)

    # Insert a singleton action dimension so the results can be stacked
    size_u = size(succ_u[1])
    target_dims = (size_u[1], size_u[2], 1, size_u[3:end]...)
    player_news = cat(
        reshape(succ_u[1], target_dims),
        reshape(succ_l[1], target_dims),
        reshape(succ_d[1], target_dims),
        reshape(succ_r[1], target_dims), dims=3) # [h,w,a,s,b]
    box_news = cat(
        reshape(succ_u[2], target_dims),
        reshape(succ_l[2], target_dims),
        reshape(succ_d[2], target_dims),
        reshape(succ_r[2], target_dims), dims=3) # [h,w,a,s,b]

    actions_onehot = onehotbatch(actions, 1:4) # [a,s,b]
    actions_onehot = reshape(actions_onehot, (1,1,size(actions_onehot)...)) # [1,1,a,s,b]

    boxes_new = any(actions_onehot .& box_news, dims=3)
    player_new = any(actions_onehot .& player_news, dims=3)

    return cat(inputs[:,:,1:3,:,:], boxes_new, player_new, dims=3)
end
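
The "compute everything, then select" trick can be sketched on a tiny example. This version builds the one-hot mask with plain broadcasting instead of onehotbatch, so it is a stand-in for the idea rather than the code above.

```julia
# Candidate results for 4 actions across 3 boards: [a, b].
# Each column holds the outcome of every action for one board.
candidates = Bool[1 0 0;
                  0 0 1;
                  0 1 1;
                  1 1 0]

actions = [2, 4, 3]  # chosen action per board

# One-hot mask [a, b]: true where the row index equals the chosen action.
onehot = (1:4) .== actions'

# Mask out the unchosen actions and collapse the action dimension.
selected = vec(any(onehot .& candidates, dims=1))  # [b]
```

This only works because the data is Boolean: masking turns every unchosen entry into false, so an OR-reduction over the action dimension recovers exactly the chosen entry.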

We’re almost there. This advances every board, but each successor still sits at the same sequence index as the board it came from. To get the new inputs tensor, we want to shift our boards in the sequence dimension, propagating successor boards to the next sequence index. However, we can’t just shift the entire tensor. We want to keep the goals and the initial states:

The code for this amounts to a cat operation and some indexing:

function advance_board_inputs(
    inputs::AbstractArray{Bool}, # [h,w,f,s,b]
    actions::AbstractArray{Int}) #       [s,b]

    inputs_new = advance_boards(inputs, actions)

    # Right shift and keep the goal and starting state
    return cat(inputs[:, :, :, 1:2, :],
               inputs_new[:, :, :, 2:end-1, :], dims=4) # [h,w,f,s,b]
end
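
To see what that indexing does, here is a small sketch in which integer labels stand in for boards. Adding 100 is a made-up placeholder for "this board has been advanced".

```julia
# Sequence slots: goal (0), start (10), then later states (11, 12, 13).
inputs     = reshape([0, 10, 11, 12, 13], 1, 1, 1, 5, 1)  # [h,w,f,s,b] with s = 5
inputs_new = inputs .+ 100                                 # stand-in for advanced boards

# Keep the goal and start, then place each successor one slot to the right.
shifted = cat(inputs[:, :, :, 1:2, :],
              inputs_new[:, :, :, 2:end-1, :], dims=4)

vec(shifted)  # [0, 10, 110, 111, 112]
```

The successor of the start state (110) lands in slot 3, the successor of slot 3 lands in slot 4, and the successor of the final slot is dropped.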

And with that, we’re processing actions across entire batches!

Rollouts on the GPU

We can leverage this new propagation code to propagate our inputs tensor during a rollout. The policy and the inputs have to be on the GPU, which in Flux.jl can be done with gpu(policy). Note that this requires a CUDA-compatible GPU.

A single iteration is then:

# Run the model
# policy_logits are [4 × s × b]
# nsteps_logits are [7 × s × b]
policy_logits_gpu, nsteps_logits_gpu = policy0(inputs_gpu)

# Sample from the action logits using the Gumbel-max trick
actions_gpu = argmax(policy_logits_gpu .+ gumbel_noise_gpu, dims=1)
actions_gpu = getindex.(actions_gpu, 1) # Int64[1 × s × b]
actions_gpu = dropdims(actions_gpu, dims=1) # Int64[s × b]

# Apply the actions
inputs_gpu = advance_board_inputs(inputs_gpu, actions_gpu)
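
For readers unfamiliar with the Gumbel-max trick: adding independent Gumbel noise to the logits and taking the argmax draws a sample from the softmax distribution. Below is a minimal CPU sketch. The helpers sample_gumbel and gumbel_max_sample are hypothetical names, and the noise is generated on the fly rather than passed in as a precomputed tensor.

```julia
using Random

# Draw standard Gumbel noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
sample_gumbel(rng, dims...) = -log.(-log.(rand(rng, Float32, dims...)))

# Sample one category per column of a [k × n] logits matrix.
function gumbel_max_sample(rng, logits::AbstractMatrix)
    noisy = logits .+ sample_gumbel(rng, size(logits)...)
    return [idx[1] for idx in vec(argmax(noisy, dims=1))]
end

rng = MersenneTwister(0)
logits = Float32[2.0 0.0;
                 0.0 0.0;
                 0.0 0.0;
                 0.0 3.0]
actions = gumbel_max_sample(rng, logits)  # one action in 1:4 per column
```

The appeal on the GPU is that sampling becomes an elementwise add followed by an argmax, with no per-board loop.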

The overall rollout code just throws this into a loop and does some setup:

function rollouts!(
    inputs::Array{Bool, 5},      # [h×w×f×s×b]
    gumbel_noise::Array{Float32, 3}, # [4×s×b]
    policy0::SokobanPolicyLevel0,
    s_starts::Vector{Board}, # [b]
    s_goals::Vector{Board}) # [b]

    policy0 = gpu(policy0)

    h, w, f, s, b = size(inputs)

    @assert length(s_starts) == b
    @assert length(s_goals) == b

    # Fill the goals into the first sequence channel
    for (bi, s_goal) in enumerate(s_goals)
        set_board_input!(inputs, s_goal, 1, bi)
    end

    # Fill the start states in the second sequence channel
    for (bi, s_start) in enumerate(s_starts)
        set_board_input!(inputs, s_start, 2, bi)
    end

    inputs_gpu = gpu(inputs)
    gumbel_noise_gpu = gpu(gumbel_noise)

    for si in 2:s-1

        # Run the model
        # policy_logits are [4 × s × b]
        # nsteps_logits are [7 × s × b]
        policy_logits_gpu, nsteps_logits_gpu = policy0(inputs_gpu)

        # Sample from the action logits using the Gumbel-max trick
        actions_gpu = argmax(policy_logits_gpu .+ gumbel_noise_gpu, dims=1)
        actions_gpu = getindex.(actions_gpu, 1) # Int64[1 × s × b]
        actions_gpu = dropdims(actions_gpu, dims=1) # Int64[s × b]

        # Apply the actions
        inputs_gpu = advance_board_inputs(inputs_gpu, actions_gpu)
    end

    return cpu(inputs_gpu)
end

There are several differences:

The code is simpler. We only have a single loop, over the sequence length (number of steps to take). The content of that loop is pretty compact.
The code does more work. We’re processing more stuff, but because it happens in parallel on the GPU, it’s okay. We’re also propagating all the way to the end of the sequence whether we need to or not. (The CPU code would check whether all boards had already finished.)

If we time how long it takes to do a batch’s worth of rollouts before and after moving to the GPU, we get about a \(60\times\) speedup. Our efforts have been worth it!

Beam Search on the GPU

Rollouts aren’t the only thing we want to speed up. I want to use beam search to explore the space using the policy and try to find solutions. Rollouts might happen to find solutions, but beam search should be a lot better.

The code ends up being basically the same, except a single goal and board is used to seed the entire batch (giving us a number of beams equal to the batch size), and we have to do some work to score the beams and then select which ones to keep:

function beam_search!(
    inputs::Array{Bool, 5},      # [h×w×f×s×b]
    policy0::SokobanPolicyLevel0,
    s_start::Board,
    s_goal::Board)

    policy0 = gpu(policy0)

    h, w, f, s, b = size(inputs)

    # Fill the goals and starting states into the first two sequence channels
    for bi in 1:b
        set_board_input!(inputs, s_goal, 1, bi)
        set_board_input!(inputs, s_start, 2, bi)
    end

    # The scores all start at zero
    beam_scores = zeros(Float32, 1, b) |> gpu # [1, b]

    # Keep track of the actual actions
    actions = ones(Int, s, b) |> gpu # [s, b]

    inputs_gpu = gpu(inputs)

    # Advance the games in parallel
    for si in 2:s-1

        # Run the model
        # policy_logits are [4 × s × b]
        # nsteps_logits are [7 × s × b]
        policy_logits, nsteps_logits = policy0(inputs_gpu)

        # Compute the probabilities
        action_probs = softmax(policy_logits, dims=1) # [4 × s × b]
        action_logls = log.(action_probs) # [4 × s × b]

        # The beam scores are the running log likelihoods
        action_logls_si = action_logls[:, si, :]  # [4, b]
        candidate_beam_scores = action_logls_si .+ beam_scores # [4, b]
        candidate_beam_scores_flat = vec(candidate_beam_scores) # [4b]

        # Get the top 'b' beams
        topk_indices = partialsortperm(candidate_beam_scores_flat, 1:b; rev=true)

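        # Convert flat indices back to action and beam indices.
        # vec() is column-major, so flat index = action + 4*(beam - 1).
        selected_actions = (topk_indices .- 1) .% 4 .+ 1  # [b] action indices (1 to 4)
        selected_beams   = (topk_indices .- 1) .÷ 4 .+ 1  # [b] beam indices (1 to b)
        selected_scores  = candidate_beam_scores_flat[topk_indices]  # [b]

        # Carry the surviving beams forward, along with their
        # action histories and running scores
        inputs_gpu = inputs_gpu[:,:,:,:,selected_beams]
        actions = actions[:, selected_beams]
        beam_scores = reshape(selected_scores, 1, b) # [1, b]

        actions[si,:] = selected_actions

        # Apply the actions to the selected beams
        inputs_gpu = advance_board_inputs(inputs_gpu, actions)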
    end

    return (cpu(inputs_gpu), cpu(actions))
end

This again results in what looks like way simpler code. The beam scoring and such is all done on tensors, rather than a bunch of additional for loops. It all happens on the GPU, and it is way faster (\(23\times\)).
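
One detail worth spelling out is how a flat top-k index maps back to an (action, beam) pair. Julia arrays are column-major, so flattening a [4 × b] score matrix with vec makes the action index vary fastest. A small sketch with made-up scores:

```julia
scores = Float32[0.1 0.9 0.3;
                 0.8 0.2 0.4;
                 0.0 0.7 0.5;
                 0.6 0.0 0.0]          # [4 actions × 3 beams]
n_actions, n_beams = size(scores)

flat = vec(scores)                      # column-major: action varies fastest
topk = partialsortperm(flat, 1:n_beams; rev=true)

# Decode flat indices into (action, beam) pairs.
sel_actions = (topk .- 1) .% n_actions .+ 1
sel_beams   = (topk .- 1) .÷ n_actions .+ 1

# Cross-check against Julia's own index conversion.
cart = CartesianIndices(scores)[topk]
```

Here the three best candidates are actions 1, 2, and 3, coming from beams 2, 1, and 2 respectively, which matches what CartesianIndices reports.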

Conclusion

The previous blog post was about leveraging the GPU during training. This blog post was about leveraging the GPU during inference. We had to avoid expensive data transfers between the CPU and the GPU, and to achieve that had to convert non-trivial player movement code to computations amenable to the GPU. Going about that meant thinking about and structuring our code very differently, working across tensors and creating more work that the GPU could nonetheless complete faster.

This post was a great example of how code changes based on the scale you’re operating at. Peter van Hardenberg gives a great talk about similar concepts in Why Can’t We Make Simple Software?. How you think about a problem changes a lot based on problem scale and hardware. Now that we’re graduating from the CPU to processing many many boards, we have to think about the problem differently.

Our inference code has been GPU-ized, so we can leverage it to speed up validation and solution search. It was taking me 20 min to train a network but 30 min to run beam search on all boards in my validation set. This change avoids that sorry state of affairs.

Tuning a Sokoban Policy Net

November 5, 2024 by timw

In a previous post, I covered a project in which I was trying to get a neural network to learn to play Sokoban. I trained a transformer that operated on encoded board positions and predicted both the next action and how many steps were left until a solution would be reached:

My training was all done with Flux.jl and my own transformer implementation, not because it was more efficient or anything, but because I like to learn by doing it myself.

This training all happened on my little personal laptop. I love my laptop, but it is not particularly beefy, and it does not have much of a GPU to speak of. Training a fairly simple transformer took around 8 hours, which doesn’t leave much room for experimentation. I knew that all this training was happening on the CPU, and the best way to make it faster would be to move to the GPU.

Flux makes moving to the GPU incredibly easy:

using CUDA
policy = TransformerPolicy(
    batch_size = BATCH_SIZE,
    max_seq_len = MAX_SEQ_LEN,
    encoding_dim = ENCODING_DIM,
    dropout_prob=dropout_prob,
    no_past_info=no_past_info) |> gpu

That’s it – you just use the CUDA package and pipe your model to the GPU.

I tried this, and… it didn’t work. Unfortunately my little laptop does not have a CUDA-capable GPU.

After going through a saga of trying to get access to GPUs in AWS, I put the project to the side. There it sat, waiting for me to pick it back up again whenever I ultimately decided to get a machine with a proper GPU or try to wrangle AWS again.

Then, one fateful day, I happened to be doing some cleaning and noticed an older laptop that I no longer used. Said laptop is bigger, and just perhaps it had a CUDA-capable GPU. I booted it up, and lo and behold, it did. Not a particularly fancy graphics card, but CUDA-capable nonetheless.

This post is about how I used said GPU to train my Sokoban models, and how I then set up a task system in order to run a variety of parameterizations.

Model Training with a GPU

I switched my training code to move as much stuff as possible over to the GPU. After some elbow grease, I kicked off the same training run as before and found that it ran in about 15 min – a 32x speed increase.

The next thing I tried was increasing the encoding from a length of 16 to 32. On the CPU, such an increase would at least double the training time. On the GPU, the training time remained the same.

How could that be?

Simply put, the GPU is really fast at crunching numbers, and the training runtime is dominated by using the CPU to unpack our training examples, sending data to and from the GPU, and running the gradient update on the CPU. In this case there seems to be a free lunch!

Here is a simple depiction of what was happening before (top timeline) vs. what is happening now:

We pay the upfront cost of sending the model to the GPU, but then can more efficiently shovel data into it to get results faster than computing them ourselves. There is nothing magical happening here, just literally passing data there, having it crunch it, and receiving it once it’s done.

Parameter Tuning

Now that we have faster training, it is nice to look at how various training and model parameters affect our metrics. Good parameters can easily make or break a deep learning project.

Our model parameters are:

board dimension – always \(8 \times 8\)
encoding dimension – the size of the transformer embedding
maximum sequence length – how long of a sequence the transformer can handle (always 64)
number of transformer layers

Our training parameters are:

learning rate – affects the size of the steps that our AdamW optimizer takes
AdamW 1st and 2nd momentum decay
AdamW weight decay
batch size – the number of training samples per optimization step
number of training batches – the number of optimization steps
number of training entries – the size of our training set
dropout probability – certain layers in the transformer have a chance of randomly dropping outputs for robustness purposes
gradient threshold – the gradient is clipped to this value to improve stability

That’s a lot of parameters. How are we going to go about figuring out the best settings?

The tried and true method that happens in practice is try-and-see, where humans just try things they think are reasonable and see how the model performs. That’s what I was doing originally, when each training run took 8 hours. While that’s okay, it’d be nice to do better.

The next simplest approach is grid search. Here we discretize all parameters and train a model on every possible parameter combination. In 2 dimensions this ends up looking like a grid, hence the name:

We have about 10 parameters. Even if we only consider 2 values for each of them, doing a grid search over them all would require training \(2^{10} = 1024\) models, which at 15 min per model is ~10.7 days. That’s both too long and pretty useless – we want higher granularity than that.

With grid search out, the next alternative is to conduct local searches for specific parameters. We already have a training parameterization that works pretty well, the one from the last blog post, and we can vary a single parameter and see how that affects training. That’s much cheaper – just the cost of the number of training points we’d like to evaluate per parameter. If we want to evaluate 5 values per parameter, that’s just \(5 \cdot 10 = 50\) models, or ~12.5 hours of training time. I could kick that off and come back to look at it the next day.

What I just proposed is very similar to cyclic coordinate search or coordinate descent, which is an optimization approach that optimizes one input at a time. It is quite simple, and actually works quite well. In fact, Sebastian Thrun himself has expressed his enthusiasm for this method to me.
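
As a rough sketch of that one-parameter-at-a-time sweep: the baseline and candidate values below are made up for illustration and are not the ones I actually ran, and one_at_a_time is a hypothetical helper.

```julia
# Hypothetical baseline and candidate values; not the ones used in this post.
baseline = (learning_rate=5e-4, weight_decay=1e-4, dropout_prob=0.01)
candidates = (
    learning_rate = [1e-4, 2e-4, 5e-4, 1e-3, 2e-3],
    weight_decay  = [0.0, 1e-5, 1e-4, 1e-3, 1e-2],
    dropout_prob  = [0.0, 0.01, 0.05, 0.1, 0.2],
)

# Vary one parameter at a time while holding the others at the baseline.
function one_at_a_time(baseline::NamedTuple, candidates::NamedTuple)
    configs = typeof(baseline)[]
    for name in keys(candidates), value in candidates[name]
        push!(configs, merge(baseline, NamedTuple{(name,)}((value,))))
    end
    return configs
end

configs = one_at_a_time(baseline, candidates)
length(configs)  # 15 = 3 parameters × 5 values
```

The number of runs grows linearly with the number of parameters, rather than exponentially as with grid search.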

There’s a whole realm of more complicated sampling strategies that could be followed. I rather like uniform projection plans and quasi-random space-filling sets like the Halton sequence. They don’t take that much effort to set up and do their best to fill the search space with a limited number of samples.
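
For the curious, a Halton sequence really is only a few lines. This is a generic textbook-style sketch, not code from my training setup; radical_inverse and halton are names chosen for illustration.

```julia
# i-th element (i ≥ 1) of the van der Corput sequence in the given base.
function radical_inverse(i::Integer, base::Integer)
    result, f = 0.0, 1.0 / base
    while i > 0
        result += f * (i % base)
        i ÷= base
        f /= base
    end
    return result
end

# Halton point: one coprime base per dimension.
halton(i::Integer, bases=(2, 3)) = [radical_inverse(i, b) for b in bases]

points = [halton(i) for i in 1:8]
```

Each point lies in the unit square, so a sample u can be mapped onto a parameter range with lo + u * (hi - lo).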

The method I most leaned on was ultimately a mixture of coordinate search and random sampling. Random sampling is what Andrej Karpathy recommended in CS231n, because it lets you evaluate far more independent values for specific parameters than something like grid search:

Here we see about 1/3 as many samples as grid search covering way more unique values for param 1.
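
A random search sampler can be sketched as follows. The ranges are hypothetical rather than the ones behind the results in this post, and sampling the learning rate and weight decay in log space is my assumption about a sensible default for parameters that span several orders of magnitude.

```julia
using Random

# Sample uniformly in log10 space between lo and hi (both > 0).
log_uniform(rng, lo, hi) = 10.0 ^ (log10(lo) + rand(rng) * (log10(hi) - log10(lo)))

# Hypothetical search ranges; not the ones used for the results in this post.
function sample_config(rng)
    return (
        learning_rate = log_uniform(rng, 1e-5, 1e-2),
        weight_decay  = log_uniform(rng, 1e-6, 1e-2),
        dropout_prob  = rand(rng) * 0.2,
        encoding_dim  = rand(rng, [16, 32, 64]),
        n_layers      = rand(rng, 1:5),
    )
end

rng = MersenneTwister(42)
configs = [sample_config(rng) for _ in 1:20]
```

Twenty samples give twenty distinct learning rates, whereas a grid with twenty points would typically revisit the same handful of values.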

Okay, so we want a way to run random search and perhaps some other, more targeted search approaches. How would we go about doing that?

Tasks

In the next phase of my project I want to expand from just training a transformer policy to predict player up/down/left/right moves to more complicated models that may interact with this simpler model. I also want to use my models to discover solutions to random problems, and perhaps refine previously discovered solutions with better models. I thus don’t want to merely support training this one model, I want to be able to run more general tasks.

function run_tasks()

    done = false
    while !done
        
        task_file = get_next_task()
        if !isa(task_file, String)
            println("No tasks left!")
            done = true
            break
        end

        task_filepath = joinpath(FOLDER_TODO, task_file::String)
        res = run_task(task_filepath)
        dest_folder = res.succeeded ? FOLDER_DONE : FOLDER_TRIED

        # name the task according to the time
        task_dst_name = Dates.format(Dates.now(), "yyyymmdd_HHMMSS") * ".task"
        mv(task_filepath, joinpath(dest_folder, task_dst_name))
        write_out_result(res, task_dst_name, dest_folder)
        println(res.succeeded ? "SUCCEEDED" : "FAILED")
    end
end

The task runner simply looks for its next task in the TODO folder, executes it, and when it is done either moves it to the DONE folder or the TRIED folder. It then writes out additional task text that it captured (which can contain errors that are useful for debugging failed tasks).
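
The get_next_task helper is not shown above. A minimal sketch of one way it could work is below. This version is hypothetical: it takes the folder as an argument and simply picks the alphabetically first .task file, which may not match the real implementation.

```julia
# Hypothetical implementation: return the alphabetically first ".task" file
# in the given folder, or `nothing` if there is none.
function get_next_task(folder::AbstractString)
    task_files = sort(filter(f -> endswith(f, ".task"), readdir(folder)))
    return isempty(task_files) ? nothing : first(task_files)
end

todo = mktempdir()
touch(joinpath(todo, "002_train_big.task"))
touch(joinpath(todo, "001_train_small.task"))
touch(joinpath(todo, "notes.txt"))

get_next_task(todo)  # "001_train_small.task"
```

Returning nothing when the folder is empty lines up with the isa(task_file, String) check in the runner loop, and numbering the file names gives a cheap way to control execution order.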

The task runner is its own Julia process, and it spawns a new Julia process for every task. This helps ensure that issues in a task don’t pollute other tasks. I don’t want an early segfault to waste an entire night’s worth of training time.

The task files are simply Julia files that I load and prepend with a common header:

function run_task(task_filepath::AbstractString)
    res = TaskResult(false, "", time(), NaN)

    try
        content = read(task_filepath, String)

        temp_file, temp_io = mktemp()
        write(temp_io, TASK_HEADER)
        write(temp_io, "\n")
        write(temp_io, content)
        close(temp_io)

        output = read(`julia -q $(temp_file)`, String)

        res.succeeded = true
        res.message = output
    catch e
        # The task failed to run (e.g. the Julia process exited with an error)
        res.succeeded = false
        res.message = string(e)
    end

    res.t_elapsed = time() - res.t_start

    return res
end

This setup is quite nice. I can drop new task files in and the task runner will just dutifully run them as soon as it’s done with whatever came before. I can inspect the TRIED folder for failed tasks and look at the output for what went wrong.

Results

I ran a bunch of training runs and then loaded and plotted the results to get some insight. Let’s take a look and see if we learn anything.

We’ve got a bunch of metrics, but I’m primarily concerned with the top-2 policy accuracy and the top-2 nsteps accuracy. Both of these measure how often the policy had the correct action (policy accuracy) or number of steps remaining (nsteps accuracy) in its top-2 most likely predictions. The bigger these numbers are the better, with optimal performance being 1.0.
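
To pin down what top-2 accuracy means, here is a small sketch that computes top-k accuracy from a logits matrix. The function topk_accuracy and the toy numbers are illustrative, not my actual evaluation code.

```julia
# Fraction of columns whose true label is among the k largest logits.
function topk_accuracy(logits::AbstractMatrix, labels::AbstractVector{<:Integer}, k::Integer)
    n_correct = 0
    for (j, label) in enumerate(labels)
        topk = partialsortperm(view(logits, :, j), 1:k; rev=true)
        n_correct += label in topk
    end
    return n_correct / length(labels)
end

logits = [2.0 0.1 0.3;
          1.0 0.2 0.9;
          0.5 3.0 0.1;
          0.1 1.0 0.5]    # [4 classes × 3 examples]
labels = [2, 4, 1]

topk_accuracy(logits, labels, 1)  # 0.0
topk_accuracy(logits, labels, 2)  # 2/3
```

In this toy example the correct label is never the single most likely prediction, but it is the runner-up for two of the three examples, which is the kind of near miss that top-2 accuracy gives credit for.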

First let’s look at the learning rate, the main parameter that everyone typically has to futz with. First, the top-2 policy accuracy:

We immediately see a clear division between training runs with terrible accuracies (50%) and training runs with reasonable performance. This tells us that some of our model training runs did pretty poorly. That’s good to know – the parameters very much matter and can totally derail training.

Let’s zoom in on the good results:

We don’t see a super-clear trend in learning rate. The best policy accuracy was obtained with a learning rate around 5e-4, but that one sample is somewhat of an outlier.

The nsteps accuracy also shows the bad models, so we’ll jump straight to the zoomed version:

Interestingly, the same learning rate of 5e-4 produces the best nsteps accuracy as well, which is nice for us. Also, the overall spread here tends to prefer larger learning rates, with values down at 1e-4 trending markedly worse.

Next let’s look at the dropout probability. Large enough values are sure to tank training performance, but when does that start happening?

We don’t really have great coverage on the upper end, but based on the samples here it seems that a dropout probability of about 0.01 (or 1%) performs best. The nsteps accuracy shows a similar result.

Next let’s look at the weight decay.

We’ve found our performance-tanking culprit! Weight decay values even moderately larger than zero appear to be highly correlated with terrible performance. It seems to drag the model down and prevent learning. Very small weight decay values appear to be fine, so we’ll have to be careful to just search those.

This is an important aspect of parameter tuning – parameters like the learning rate or weight decay can take on rather small values like 1e-4. It’s often more about finding the right exponent than finding the right decimal value.

With those learning parameters out of the way, let’s look at some model parameters. First, the encoding dimension:

Naively we would expect that bigger encoding dimensions would be better given infinite training examples and compute, but those are finite. We didn’t exhaustively evaluate larger encoding dimensions, but find that the nsteps prediction doesn’t benefit all that much from going from 32 to 64 entries, whereas the policy does.

We can also look at the number of transformer layers:

Having more layers means we have a bigger model, with more room to perform operations on our token embeddings as they pass through the model. Bigger is often better, but is ultimately constrained by our training data and compute.

In this case the nsteps predictor can achieve peak performance across a variety of depths, whereas the policy seems to favor larger layer counts (but can still do pretty well even with a single layer).

The next question we might ask is whether the model size overall is predictive of performance. We can plot the total number of trainable parameters:

In terms of policy accuracy, we are seeing the best performance with the largest models, but the nsteps predictor doesn’t seem to need it as much. That is consistent with what we’ve already observed.

Let’s now identify the best model. How would we do that?

I’m going to consider the best model to be the one with the highest value of top-2 policy accuracy + top-2 nsteps accuracy. That’s the same as asking for the model most top-right in the following plot:

The two accuracies are highly correlated (which makes sense – it’s hard to predict how many steps it will take to reach the goal without also being a good Sokoban policy). The model that does the best has an encoding dim of 64 with 5 transformer layers, uses a learning rate of 0.0005, has weight decay = 0.0002, and a dropout probability of 0.01.

Conclusion

In this post I got excited about our ability to use the GPU to train our networks, and then I tried to capitalize on it by running a generic task runner. I did some sampling and collected a bunch of metrics in order to try to learn a thing or two about how best to parameterize my model and select the training hyperparameters.

Overall, I would say that the main takeaways are:

It’s worth spending a little time to help the computer do more work for you
It’s important to build insight into how the various parameters affect training

That’s all folks. Happy coding.
