Rollouts on the GPU

Last month I wrote about moving the Sokoban policy training code from CPU to GPU, yielding massive speedups. That significantly shortened both training time and the time it takes to compute basic validation metrics. It has not, unfortunately, significantly changed how long it takes to run rollouts, and relatedly, how long it takes to run beam search.

The Bottleneck

The training that I’ve done so far has all been with teacher forcing, which allows all inputs to be passed to the net at once:

When we do a rollout, we can’t pass everything in at once. We start with our initial state and use the policy to discover where we end up:

The problem is that the left-side of that image, the policy call, is happening on the GPU, but the right side, the state advancement, is happening on the CPU. If a rollout involves 62 player steps, then instead of one data transfer step like we have for training, we’re doing 61 transfers! Our bottleneck is all that back-and-forth communication:

Let’s move everything to the GPU.

CPU Code

So what is currently happening on the CPU?

At every state, we are:

  1. Sampling an action for each board from the action logits
  2. Applying that action to each board to advance the state

Sampling actions from the logits is pretty straightforward to run on the GPU. That’s the bread and butter of transformers and RL in general.

# policy_logits are [a×s×b] (a=actions, s=sequence length, b = batch size)
policy_logits, nsteps_logits = policy(inputs)

# Sample from the logits using the Gumbel-max trick
sampled_actions = argmax(policy_logits .+ gumbel_noise, dims=1)

where the Gumbel noise is sampled in advance and passed to the GPU like the other inputs:

using Distributions
gumbel_noise = Float32.(rand(Gumbel(0, 1), a, s, b)) # [a × s × b]
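
As a quick sanity check (my own illustration, not part of the training code), Gumbel-max sampling reproduces the softmax distribution over the logits:

using Distributions, Statistics

logits = [1.0, 2.0, 0.5, 0.0]
probs = exp.(logits) ./ sum(exp.(logits))

# Draw many Gumbel-max samples and compare the empirical frequencies to probs
samples = [argmax(logits .+ rand(Gumbel(0, 1), 4)) for _ in 1:100_000]
empirical = [mean(samples .== i) for i in 1:4]
# empirical ≈ probs, up to sampling noise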

Advancing the board states is more complicated. Here is the CPU method for a single state:

function maybe_move!(board::Board, dir::Direction)::Bool
    □_player::TileValue=find_player_tile(board)
    step_fore = get_step_fore(board, dir)

    □ = □_player # where the player starts
    ▩ = □ + step_fore # where the player potentially ends up

    if is_set(board[▩], WALL)
        return false # We would be walking into a wall
    end

    if is_set(board[▩], BOX)
        # We would be walking into a box.
        # This is only a legal move if we can push the box.
        ◰ = ▩ + step_fore # where box ends up
        if is_set(board[◰],  WALL + BOX)
            return false # We would be pushing the box into a box or wall
        end

        # Move the box
        board[▩] &= ~BOX # Clear the box
        board[◰] |= BOX # Add the box
    end

    # At this point we have established this as a legal move.
    # Finish by moving the player
    board[□] &= ~PLAYER # Clear the player
    board[▩] |= PLAYER # Add the player

    return true
end

There are many ways to represent board states. This representation is a simple Matrix{UInt8}, so an 8×8 board is just an 8×8 matrix. Each tile is a bitfield with bits for whether that tile has/is a wall, floor, goal, box, or player.
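
For concreteness, here is a minimal sketch of what such a bitfield could look like. The specific bit values and the is_set helper are my own illustration; the actual constants in the codebase may differ.

# Hypothetical tile flags, one bit per component
const WALL   = 0x01
const FLOOR  = 0x02
const GOAL   = 0x04
const BOX    = 0x08
const PLAYER = 0x10

# Assuming Board is just an alias for the matrix representation described above
const Board = Matrix{UInt8}

# True if any of the given flag bits are set on the tile
is_set(tile::UInt8, flags) = (tile & flags) != 0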

Moving the player has 3 possible paths:

  • successful step: the destination tile is empty and we just move the player to it
  • successful push: the destination tile has a box, and the next one over is empty, so we move both the player and the box
  • failed move: otherwise, this is an illegal move and the player stays where they are

Moving this logic to the GPU has to preserve this flow, use the GPU’s representation of the board state, and handle a tensor’s worth of board states at a time.

GPU Representation

The input to the policy is a tensor of size \([h \times w \times f \times s \times b]\), where 1 board is encoded as a sparse \((h = \text{height}) \times (w=\text{width}) \times (f = \text{num features} = 5)\) tensor:

and we have board sequences of length \(s\) and \(b\) sequences per batch of them:

I purposely chose 4-step boards here, but sequences can generally be much longer and of different lengths, and the first state in each sequence is the goal state.

Our actions will be the \([4\times s \times b]\) actions tensor — one up/down/left/right action per board state.
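
As a sketch of how a single board might get written into this tensor at sequence index si and batch index bi, reusing the flag sketch from above (the feature-channel order and the set_board_input! signature are assumptions based on how they are used later in this post):

# Hypothetical feature channel indices
const DIM_WALL, DIM_FLOOR, DIM_GOAL, DIM_BOX, DIM_PLAYER = 1, 2, 3, 4, 5

function set_board_input!(inputs::Array{Bool,5}, board::Board, si::Int, bi::Int)
    for col in 1:size(board, 2), row in 1:size(board, 1)
        tile = board[row, col]
        inputs[row, col, DIM_WALL,   si, bi] = is_set(tile, WALL)
        inputs[row, col, DIM_FLOOR,  si, bi] = is_set(tile, FLOOR)
        inputs[row, col, DIM_GOAL,   si, bi] = is_set(tile, GOAL)
        inputs[row, col, DIM_BOX,    si, bi] = is_set(tile, BOX)
        inputs[row, col, DIM_PLAYER, si, bi] = is_set(tile, PLAYER)
    end
    return inputs
end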

Shifting Tensors

The first fundamental operation we’re going to need is to be able to check tile neighbors. That is, instead of doing this:

□ = □_player # where the player starts
▩ = □ + step_fore # where the player potentially ends up

we’ll be shifting all tiles over and checking that instead:

is_player_dests = shift_tensor(is_players, 0, 1, false) # d_row = 0, d_col = 1

The shift_tensor method takes in a tensor and shifts it by the given number of rows and columns, padding in new values:

We pass in the number of rows or columns to shift, figure out what that means in terms of padding, and then leverage NNlib’s pad_constant method to give us a new tensor that we clamp to a new range:

function shift_tensor(
    tensor::AbstractArray,
    d_row::Integer,
    d_col::Integer,
    pad_value)

    pad_up    = max( d_row, 0)
    pad_down  = max(-d_row, 0)
    pad_left  = max( d_col, 0)
    pad_right = max(-d_col, 0)

    tensor_padded = NNlib.pad_constant(
        tensor,
        (pad_up, pad_down, pad_left, pad_right, 
            (0 for i in 1:2*(ndims(tensor)-2))...),
        pad_value)

    dims = size(tensor_padded)
    row_lo = 1 + pad_down
    row_hi = dims[1] - pad_up
    col_lo = 1 + pad_right
    col_hi = dims[2] - pad_left

    return tensor_padded[row_lo:row_hi, col_lo:col_hi,
                         (Colon() for d in dims[3:end])...]
end

This method works on tensors with varying numbers of dimensions, and always operates on the first two dimensions as the row and column dimensions.
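
As a quick sanity check of my own, shifting a small matrix by one column pads in the new values on the left and drops the last column:

A = [1 2;
     3 4]

shift_tensor(A, 0, 1, 0)
# 2×2 Matrix{Int64}:
#  0  1
#  0  3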

Taking Actions

If we know the player move, we can use the appropriate shift direction to get the “next tile over”. Our player moves can be reflected by the following row and column shift values:

UP = (d_row=-1, d_col= 0)
LEFT = (d_row= 0, d_col=-1)
DOWN = (d_row=+1, d_col= 0)
RIGHT = (d_row= 0, d_col=+1)

This lets us convert the CPU-movement code into a bunch of Boolean tensor operations:

function advance_boards(
    inputs::AbstractArray{Bool}, # [h,w,f,s,b]
    d_row::Integer,
    d_col::Integer)

    boxes  = inputs[:,:,DIM_BOX,   :,:]
    player = inputs[:,:,DIM_PLAYER,:,:]
    walls  = inputs[:,:,DIM_WALL,  :,:]

    player_shifted = shift_tensor(player, d_row, d_col, false)

    # A move is valid if the player destination is empty
    # or if it's a box and the next space over is empty
    not_box_or_wall = .!(boxes .| walls)

    # 1 if it is a valid player destination tile for a basic player move
    move_space_empty = player_shifted .& not_box_or_wall

    # 1 if the tile is a player destination tile containing a box
    move_space_isbox = player_shifted .& boxes

    # 1 if the tile is a player destination tile whose next one over
    # is a valid box push receptor
    push_space_empty = player_shifted .& shift_tensor(not_box_or_wall, -d_row, -d_col, false)

    # 1 if it is a valid player move destination
    move_mask = move_space_empty

    # 1 if it is a valid player push destination
    # (which also means it currently has a box)
    push_mask = move_space_isbox .& push_space_empty

    # new player location
    mask = move_mask .| push_mask
    player_new = mask .| (player .* shift_tensor(.!mask, -d_row, -d_col, false))

    # new box location
    box_destinations = shift_tensor(boxes .* push_mask, d_row, d_col, false)
    boxes_new = (boxes .* (.!push_mask)) .| box_destinations

    return player_new, boxes_new
end

The method appropriately moves any player tile that has an open space in the neighboring tile, or any player tile that has a neighboring pushable box. We create both a new player tensor and a new box tensor.

This may seem extremely computationally expensive — we’re operating on all tiles rather than on just the ones we care about. But GPUs are really good at exactly this, and it is much cheaper to let the GPU churn through that than wait for the transfer to/from the CPU.

The main complication here is that this version of advance_boards applies the same action to every board. In a given call, there are \(s\times b\) boards in our tensor, and we don’t want the same action applied to all of them.

Instead of sharding different actions to different boards, we’ll compute the results of all 4 actions and then index into the resulting state that we need:

Working with GPUs sure makes you think differently about things.

function advance_boards(
    inputs::AbstractArray{Bool}, # [h,w,f,s,b]
    actions::AbstractArray{Int}) #       [s,b]

    succ_u = advance_boards(inputs, -1,  0) # [h,w,s,b], [h,w,s,b]
    succ_l = advance_boards(inputs,  0, -1)
    succ_d = advance_boards(inputs,  1,  0)
    succ_r = advance_boards(inputs,  0,  1)

    size_u = size(succ_u[1])
    target_dims = (size_u[1], size_u[2], 1, size_u[3:end]...)
    player_news = cat(
        reshape(succ_u[1], target_dims),
        reshape(succ_l[1], target_dims),
        reshape(succ_d[1], target_dims),
        reshape(succ_r[1], target_dims), dims=3) # [h,w,a,s,b]
    box_news = cat(
        reshape(succ_u[2], target_dims),
        reshape(succ_l[2], target_dims),
        reshape(succ_d[2], target_dims),
        reshape(succ_r[2], target_dims), dims=3) # [h,w,a,s,b]

    actions_onehot = onehotbatch(actions, 1:4) # [a,s,b]
    actions_onehot = reshape(actions_onehot, (1,1,size(actions_onehot)...)) # [1,1,a,s,b]

    boxes_new = any(actions_onehot .& box_news, dims=3)
    player_new = any(actions_onehot .& player_news, dims=3)

    return cat(inputs[:,:,1:3,:,:], boxes_new, player_new, dims=3)
end

We’re almost there. This computes each board’s successor at its own sequence index. To get the new inputs tensor, we want to shift our boards along the sequence dimension, propagating successor boards to the next sequence index. However, we can’t just shift the entire tensor. We want to keep the goals and the initial states:

The code for this amounts to a cat operation and some indexing:

function advance_board_inputs(
    inputs::AbstractArray{Bool}, # [h,w,f,s,b]
    actions::AbstractArray{Int}) #       [s,b]

    inputs_new = advance_boards(inputs, actions)

    # Right shift and keep the goal and starting state
    return cat(inputs[:, :, :, 1:2, :],
               inputs_new[:, :, :, 2:end-1, :], dims=4) # [h,w,f,s,b]
end

And with that, we’re processing actions across entire batches!

Rollouts on the GPU

We can leverage this new propagation code to propagate our inputs tensor during a rollout. The policy and the inputs have to be on the GPU, which in Flux.jl can be done with gpu(policy). Note that this requires a CUDA-compatible GPU.

A single iteration is then:

# Run the model
# policy_logits are [4 × s × b]
# nsteps_logits are [7 × s × b]
policy_logits_gpu, nsteps_logits_gpu = policy0(inputs_gpu)

# Sample from the action logits using the Gumbel-max trick
actions_gpu = argmax(policy_logits_gpu .+ gumbel_noise_gpu, dims=1)
actions_gpu = getindex.(actions_gpu, 1) # Int64[1 × s × b]
actions_gpu = dropdims(actions_gpu, dims=1) # Int64[s × b]

# Apply the actions
inputs_gpu = advance_board_inputs(inputs_gpu, actions_gpu)

The overall rollout code just throws this into a loop and does some setup:

function rollouts!(
    inputs::Array{Bool, 5},      # [h×w×f×s×b]
    gumbel_noise::Array{Float32, 3}, # [4×s×b]
    policy0::SokobanPolicyLevel0,
    s_starts::Vector{Board}, # [b]
    s_goals::Vector{Board}) # [b]

    policy0 = gpu(policy0)

    h, w, f, s, b = size(inputs)

    @assert length(s_starts) == b
    @assert length(s_goals) == b

    # Fill the goals into the first sequence channel
    for (bi, s_goal) in enumerate(s_goals)
        set_board_input!(inputs, s_goal, 1, bi)
    end

    # Fill the start states in the second sequence channel
    for (bi, s_start) in enumerate(s_starts)
        set_board_input!(inputs, s_start, 2, bi)
    end

    inputs_gpu = gpu(inputs)
    gumbel_noise_gpu = gpu(gumbel_noise)

    for si in 2:s-1

        # Run the model
        # policy_logits are [4 × s × b]
        # nsteps_logits are [7 × s × b]
        policy_logits_gpu, nsteps_logits_gpu = policy0(inputs_gpu)

        # Sample from the action logits using the Gumbel-max trick
        actions_gpu = argmax(policy_logits_gpu .+ gumbel_noise_gpu, dims=1)
        actions_gpu = getindex.(actions_gpu, 1) # Int64[1 × s × b]
        actions_gpu = dropdims(actions_gpu, dims=1) # Int64[s × b]

        # Apply the actions
        inputs_gpu = advance_board_inputs(inputs_gpu, actions_gpu)
    end

    return cpu(inputs_gpu)
end

There are several differences:

  1. The code is simpler. We only have a single loop, over the sequence length (number of steps to take). The content of that loop is pretty compact.
  2. The code does more work. We’re processing more stuff, but because it happens in parallel on the GPU, it’s okay. We’re also propagating all the way to the end of the sequence whether we need to or not. (The CPU code would check whether all boards had finished already.)

If we time how long it takes to do a batch’s worth of rollouts before and after moving to the GPU, we get about a \(60\times\) speedup. Our efforts have been worth it!
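
For reference, a comparison like this can be timed with BenchmarkTools.jl; rollouts_cpu! below is just a stand-in name for the original CPU implementation:

using BenchmarkTools

# Time the CPU and GPU rollout implementations on the same batch
@btime rollouts_cpu!($inputs, $gumbel_noise, $policy0, $s_starts, $s_goals)
@btime rollouts!($inputs, $gumbel_noise, $policy0, $s_starts, $s_goals)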

Beam Search on the GPU

Rollouts aren’t the only thing we want to speed up. I want to use beam search to explore the space using the policy and try to find solutions. Rollouts might happen to find solutions, but beam search should be a lot better.

The code ends up being basically the same, except a single goal and board is used to seed the entire batch (giving us a number of beams equal to the batch size), and we have to do some work to score the beams and then select which ones to keep:

function beam_search!(
    inputs::Array{Bool, 5},      # [h×w×f×s×b]
    policy0::SokobanPolicyLevel0,
    s_start::Board,
    s_goal::Board)

    policy0 = gpu(policy0)

    h, w, f, s, b = size(inputs)

    # Fill the goals and starting states into the first two sequence channels
    for bi in 1:b
        set_board_input!(inputs, s_goal, 1, bi)
        set_board_input!(inputs, s_start, 2, bi)
    end

    # The scores all start at zero
    beam_scores = zeros(Float32, 1, b) |> gpu # [1, b]

    # Keep track of the actual actions
    actions = ones(Int, s, b) |> gpu # [s, b]

    inputs_gpu = gpu(inputs)

    # Advance the games in parallel
    for si in 2:s-1

        # Run the model
        # policy_logits are [4 × s × b]
        # nsteps_logits are [7 × s × b]
        policy_logits, nsteps_logits = policy0(inputs_gpu)

        # Compute the probabilities
        action_probs = softmax(policy_logits, dims=1) # [4 × s × b]
        action_logls = log.(action_probs) # [4 × s × b]

        # The beam scores are the running log likelihoods
        action_logls_si = action_logls[:, si, :]  # [4, b]
        candidate_beam_scores = action_logls_si .+ beam_scores # [4, b]
        candidate_beam_scores_flat = vec(candidate_beam_scores) # [4b]

        # Get the top 'b' beams
        topk_indices = partialsortperm(candidate_beam_scores_flat, 1:b; rev=true)

        # Convert flat (column-major) indices back to action and beam indices
        selected_actions = (topk_indices .- 1) .% 4 .+ 1  # [b] action indices (1 to 4)
        selected_beams   = (topk_indices .- 1) .÷ 4 .+ 1  # [b] beam indices (1 to b)
        selected_scores  = candidate_beam_scores_flat[topk_indices]  # [b]

        # Keep the selected beams, along with their action histories and running scores
        inputs_gpu = inputs_gpu[:,:,:,:,selected_beams]
        actions = actions[:, selected_beams]
        beam_scores = reshape(selected_scores, 1, :)

        actions[si,:] = selected_actions

        # Apply the actions to the selected beams
        inputs_gpu = advance_board_inputs(inputs_gpu, actions)
    end

    return (cpu(inputs_gpu), cpu(actions))
end

This again results in much simpler-looking code. The beam scoring and such is all done on tensors, rather than with a bunch of additional for loops. It all happens on the GPU, and it is way faster (\(23\times\)).

Conclusion

The previous blog post was about leveraging the GPU during training. This blog post was about leveraging the GPU during inference. We had to avoid expensive data transfers between the CPU and the GPU, and to achieve that had to convert non-trivial player movement code to computations amenable to the GPU. Going about that meant thinking about and structuring our code very differently, working across tensors and creating more work that the GPU could nonetheless complete faster.

This post was a great example of how code changes based on the scale you’re operating at. Peter van Hardenberg gives a great talk about similar concepts in Why Can’t We Make Simple Software?. How you think about a problem changes a lot based on problem scale and hardware. Now that we’re graduating from processing one board at a time on the CPU to processing many, many boards at once, we have to think about the problem differently.

Our inference code has been GPU-ized, so we can leverage it to speed up validation and solution search. It was taking me 20 min to train a network but 30 min to run beam search on all boards in my validation set. This change avoids that sorry state of affairs.

A Transformer Sokoban Policy

I’ve posted about Sokoban before. It is a problem I keep coming back to.

Sokoban is very simple to specify, but its difficulty can be readily scaled up. It is easy to accidentally push blocks into an inescapable kerfuffle, and more interestingly, it isn’t always obvious when you’ve done it.

Moreover, I think people solving Sokoban often beautifully exhibit hierarchical reasoning. The base game is played with up/down/left/right actions, but players very quickly start thinking about sequences of block pushes. We learn to “see” deadlock configurations and reason about rooms and packing orders. This ability to naturally develop higher-order thinking and execute on it is something I want to understand better. So I keep coming back to it.

In this blog post, I’ll recount training a Sokoban policy using deep learning, and more specifically, using the transformer architecture. I started pretty small, developing something that I hope to flesh out more in the future.

The Problem

This blog post is about policy optimization. A policy \(\pi(s)\) maps states to actions. These states and actions must exist in some sort of problem — i.e. a search problem or Markov decision process. It is often a good idea to define the problem before talking about how to structure the policy.

Our problem, of course, is Sokoban. The state space is the space of Sokoban boards. Each board is a grid, and each tile in the grid is either a wall or a floor; each floor tile may be a goal, and may contain either a box or the player. For this post I’m working with 8×8 boards.

\[\text{state space}\enspace\mathcal{S} = \text{Sokoban boards}\]

Normal Sokoban has an action space of up/down/left/right. At least, that is what we’re told. I think that isn’t quite true. In normal Sokoban, the player can get stuck in a deadlock. When that happens, they typically reset the level or hit undo. So I’m going to use an action space of up/down/left/right/undo. Trajectories with undos are of course never optimal, but you nevertheless might need to use them when you’re exploring.

\[\text{action space}\enspace\mathcal{A} = \left\{\text{up}, \text{down}, \text{left}, \text{right}, \text{undo}\right\}\]

Notice that we’re working directly with player movements, not with box pushes. Pretty much no-one ever works directly with player movements, because the search space gets so large so quickly. That being said, hierarchical reasoning is the point of my investigations, so I’m starting at the bottom of the reasoning hierarchy.

Our goal is to solve Sokoban problems. For now, let’s just say that we get a reward of 1 whenever we solve a board (all boxes are on goals), and zero reward otherwise. Given two action sequences that solve a board, we prefer the shorter one.

\[R(s,a) = \begin{cases}1 & \text{if } \texttt{issolved}(s) \\ 0 & \text{otherwise}\end{cases}\]

We won’t directly make use of this reward function in this blog post, but it will come into play later.

The Policy

A typical policy would take the current state and try to find the best action that leads toward having all boxes on goals. Its architecture would broadly look like:

The output is a probability distribution over the actions. There are 5 actions (up, down, left, right, and undo), so this is just a categorical distribution over 5 values. That’s the bread and butter of ML and we can easily produce that with a softmax output layer.

The input is a board state. I decided to represent boards as \(8\times 8\times 5\) tensors, where the first two dimensions are the board length and width, and then we have 5 channels that indicate whether a given tile is a wall space, a floor space, has a goal, has a player, and has a box. Altogether that is 320 scalars, which is somewhat long and sparse. However, it contains a lot of structured spatial information that we need to make use of.

The first architectural decision is to run the board state through an encoder head to produce a latent vector. It’s basically a 5-channel image, so I’m using convolutional layers, just like you would for an image input. These can learn consistent features that get applied in parallel across the input.
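
As a rough sketch of what such an encoder head could look like in Flux.jl (the layer sizes and depth here are illustrative assumptions, not the actual architecture):

using Flux

# Encode a batch of [8, 8, 5, N] board tensors into one latent vector per board
encoder = Chain(
    Conv((3, 3), 5 => 16, relu; pad=1),   # [8, 8, 5, N]  -> [8, 8, 16, N]
    Conv((3, 3), 16 => 16, relu; pad=1),  # [8, 8, 16, N] -> [8, 8, 16, N]
    Flux.flatten,                         # -> [8*8*16, N]
    Dense(8 * 8 * 16 => 16),              # -> [16, N] latent vectors
)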

This could work. We’d just stick some feedforward layers with maybe some residual connections into that trunk, and we’d be able to train a policy to place all boxes on goals. I didn’t quite want to go that route though, because I wanted the network to be slightly more configurable.

My eventual goal is to do hierarchical reasoning, so I want to be able to have another, higher-level policy use this lower-level policy as a tool. As such, I want an input that it can provide in order to steer the policy. To this end, I’m providing an additional goal input that tells the neural network what we want to happen:

In Sokoban it’s pretty important for the player to know when a board just isn’t solvable. It’s really easy to end up in a deadlock state:

We want the model to be able to learn when this happens, and to be able to report that it thinks this is the case.

I thus include an additional output, which I’m calling the nsteps output. It predicts the number of steps left in the solution of the puzzle. If the network thinks there are an infinite number of steps left, then the board is not solvable.

In Stop Regressing, the authors suggest avoiding (continuous) regression and training over discrete targets instead. As such, my nsteps output will actually output a discrete value:

nsteps_bin = Int(clamp(round(log(nsteps+1)), 0, 5)) + 1

which can take on up to 6 values:

and a 7th bin for unsolvable.
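
To make the binning concrete, here are a few worked examples of my own (natural log of nsteps + 1, rounded, clamped to 0 through 5, then shifted to be 1-based):

# nsteps =   0 -> log(1)   = 0.00 -> bin 1
# nsteps =   2 -> log(3)   ≈ 1.10 -> bin 2
# nsteps =   7 -> log(8)   ≈ 2.08 -> bin 3
# nsteps =  20 -> log(21)  ≈ 3.04 -> bin 4
# nsteps =  80 -> log(81)  ≈ 4.39 -> bin 5
# nsteps = 200 -> log(201) ≈ 5.30 -> bin 6 (clamped)
# unsolvable   ->                    bin 7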

That leaves our network as (with the softmaxes implied):

This architecture is a perfectly reasonable policy network. In fact, we’ll compare against this network later on. However, I was also very interested in working with transformers, so I built a transformer version of this network:

This architecture is roughly the same, except you have the massively parallel transformer that:

  • can train with supervised learning more efficiently, by receiving an entire sequence at once
  • can use past information to inform future decisions
  • can easily be scaled up or down based on just a few parameters
  • lends itself nicely to beam search for exploration

I am initially working with relatively small models consisting of 3-layer transformers with 16-dimensional embedding spaces, a maximum sequence length of 32, and 8-headed attention.

Training

This initial model is trained with supervised learning. In order to do that, we need supervised data.

Typical supervised learning would provide a starting state, the goal state, and an optimal set of actions to reach it. My training examples are basically that, but they also include examples that cannot be solved, as well as examples with bad moves and a training mask that lets me mask those moves out during training. (This theoretically lets the network learn undos and to recover from bad moves).

start:
########
#      #
#      #
# . bp #
#      #
#      #
#      #
########

goal:
########
#      #
#      #
# B    #
#      #
#      #
#      #
########

solved
moves: LL
isbad: --

solved
moves: xLL
isbad: x--


solved
moves: UxLL
isbad: xx--

I have a few handcrafted training entries, but most of them are randomly generated problems solved with A*. I generate 1-box problems by starting with an 8×8 grid of all-walls, producing two random rectangular floor sections (rooms), randomly spawning the goal, box, and the player, and then speckling in some random walls. That produces a nice diversity that has structure, but not too much structure.

Most of these boards aren’t solvable, so when I generate training examples I set a target ratio for solved and unsolved boards and keep generating until I get enough of each type. I have 4000 generated problems with solutions and 4000 without, for 8000 generated training examples overall.
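
A sketch of that generate-until-balanced loop; generate_random_board and solve_astar are stand-in names for the actual generation and A* solver routines:

function generate_balanced(n_solvable::Int, n_unsolvable::Int)
    solvable, unsolvable = [], []
    while length(solvable) < n_solvable || length(unsolvable) < n_unsolvable
        board = generate_random_board()   # stand-in: random 8×8 one-box board
        solution = solve_astar(board)     # stand-in: returns nothing if unsolvable
        if solution !== nothing && length(solvable) < n_solvable
            push!(solvable, (board, solution))
        elseif solution === nothing && length(unsolvable) < n_unsolvable
            push!(unsolvable, board)
        end
    end
    return solvable, unsolvable
end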

I was able to cheaply boost the size of the training set by recognizing that a problem can be rotated 4 ways and can be transposed. Incorporating these transforms (and applying the necessary adjustments to the solution paths) automatically increases the number of training examples by 8x.
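
A sketch of that symmetry expansion on the board side; rotr90 and transpose act directly on the board matrices, while the matching remapping of the solution move strings is left out here:

# Produce the 8 symmetric variants of a board: 4 rotations, each optionally transposed
function board_symmetries(board::Board)
    rotations = [board, rotr90(board), rotr90(board, 2), rotr90(board, 3)]
    return vcat(rotations, [collect(transpose(b)) for b in rotations])
end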

I train with a standard Flux.jl loop using the default Adam optimizer. I shuffle the training set with each epoch, and train with a batch size of 32. I’m using 1% dropout.
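
A minimal sketch of what that loop looks like; loss_fn and make_batches are stand-ins, and the real code also computes and logs the separate policy and nsteps losses:

using Flux
using Random: shuffle

opt_state = Flux.setup(Adam(), model)
for epoch in 1:n_epochs
    for (x, y_policy, y_nsteps, mask) in make_batches(shuffle(train_data); batchsize=32)
        grads = Flux.gradient(model) do m
            loss_fn(m, x, y_policy, y_nsteps, mask)
        end
        Flux.update!(opt_state, model, grads[1])
    end
end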

The training loop is set up to create a new timestamped directory with each run and save a .json file with basic metrics. I compute both the policy and nsteps loss on each batch and log these for later printing. This gives me nice training curves, effectively using the code from the Flux quick start guide:

This plot uses a log-scale x-axis, which gives it some distortion. Training curves are often plotted this way because a lot of stuff happens early on and more updates have to be made later on to get similar movement.

I’ve been consistently seeing the plots play out like they do above. At first the loss drops very quickly, only to plateau between 100 and 1000 batches, and then to suddenly start dropping again. The model seems to suddenly understand something around the 1000-batch mark, and based on how flat the policy loss is during that plateau, the jump seems related to the policy head. Maybe it learns some reliable heuristics around spatial reasoning and walking to boxes.

The curves suggest that I could continue to train for longer and maybe get more improvement. Maybe that’s true, and maybe I’ll look into that in the future.

Validation

I have two types of validation – direct supervised metrics and validation metrics based on using the policy to attempt to solve problems. Both of these are run on a separate validation dataset of 4000 randomly generated problems (with a 50-50 split of solvable and not solvable).

The supervised metrics are:

  • accuracy at predicting whether the root position is solvable
  • top-1 policy accuracy – whether the highest-likelihood action matches the supervised example
  • top-2 policy accuracy – whether the supervised example action is in the top two choices
  • top-1 nsteps accuracy – whether the highest-likelihood nsteps bucket matches the example
  • top-2 nsteps accuracy – ditto, but top two choices

Note that a Sokoban problem can have multiple solutions, so we shouldn’t expect a perfect policy accuracy. (i.e. going up and then left is often the same as going left and then up).

I then use the model to solve each problem in the validation dataset using beam search. This lets me actually test efficacy. Beam search was run with a beam width of 32, which means I’m always going to find the solution if it takes two steps or fewer. The search keeps the top candidates based on the running sum of the policy log likelihood. (I can try other formulations, like scoring candidates based on the predicted nsteps values).

Results

I trained two models – a transformer model and a second, baseline model that does not have access to past data. The baseline is also a transformer, but with a mask that hides everything except the current and goal states.

The metrics are:

Metric                   Transformer   Baseline
Solvability Accuracy     0.977         0.977
Top-1 Policy Accuracy    0.867         0.831
Top-2 Policy Accuracy    0.982         0.969
Top-1 Nsteps Accuracy    0.900         0.898
Top-2 Nsteps Accuracy    0.990         0.990
Solve Rate               0.921         0.879
Mean Solution Length     6.983         6.597

Overall, that’s pretty even. The baseline policy that does not have access to past information is ever so slightly worse with respect to top-1 accuracy. Our takeaway is that for this problem, so far, the past doesn’t matter all that much and we can be pretty Markovian.

A 97.7% solvability accuracy is pretty darn good. I am happy with that.

The top-k policy and nsteps accuracies are pretty good too. The full transformer outperforms the baseline a bit here, but only by a bit.

The primary difference comes out when we look at the solve rate, which we obtain by actually using the policies to solve problems. The full transformer succeeds 92.1% of the time, whereas the history-free baseline only succeeds 87.9% of the time. That’s only a 4.2-point spread, but it is noticeable. I can only hypothesize, but I suspect that past information is useful for longer traversals, and that simply being able to pull in more information gives the model more space to compute with.