No. 04 · Evaluation

Project Pigeon: a small model that holds long context

Three early checkpoints, v5 to v7, from our post-transformer line.

01

Overview

We present three checkpoints of Project Pigeon, v5 through v7. Pigeon is a post-transformer line: a sub-quadratic sequence model trained with a lookahead objective, so its training target includes the trajectory of upcoming output, not only the next token. v7 is the smallest of the three. At 865M parameters it trains in 6.7 GB of GPU memory, and it is the only checkpoint that passes the long-context retrieval test.

Key features

  • Post-transformer architecture: a sub-quadratic sequence model rather than an attention-only Transformer.
  • Lookahead objective: the training target includes the trajectory of upcoming output, not only next-token prediction.
  • Long-context retrieval at small scale: at 865M parameters, v7 retrieves a needle from contexts of roughly 2k–5k tokens.
  • Low footprint: v7 trains in 6.7 GB of GPU memory, about a fifth of v5's 34.4 GB.
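
The lookahead objective is described here only at a high level. As a rough illustration of what a target covering more than the next token can look like, the sketch below builds, for each position, a window of the next `horizon` tokens. The window form, the `horizon` parameter, the `PAD` id, and the function name are assumptions made for illustration; none of them are taken from the Pigeon training code.

```python
from typing import List

PAD = -1  # hypothetical padding id for positions past the end of the sequence


def lookahead_targets(tokens: List[int], horizon: int) -> List[List[int]]:
    """For each position t, return the next `horizon` tokens (t+1 .. t+horizon).

    With horizon=1 this reduces to ordinary next-token targets.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    targets = []
    for t in range(len(tokens)):
        window = tokens[t + 1 : t + 1 + horizon]
        # Pad windows that run off the end so every row has the same length.
        window = window + [PAD] * (horizon - len(window))
        targets.append(window)
    return targets


if __name__ == "__main__":
    for row in lookahead_targets([10, 11, 12, 13], horizon=2):
        print(row)
```

How the actual objective weights or encodes the upcoming trajectory is not stated in this note, so the sketch stops at target construction.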

Key highlights

  • v7, at 865M parameters, trails v5 (9.34B) by 1.6 to 6.9 points on four of the five shared benchmarks and leads it by 5.1 points on Winogrande.
  • v7 passes 10 of 10 long-context needle-in-haystack trials; v5 passes 0 and v6 passes 1 (context ~2k–5k tokens).
  • Validation perplexity falls from 9.91 (v5) to 8.17 (v6); v7 reaches 8.94 at 865M parameters and 5B tokens.

Table 1

| | PigeonV5 | PigeonV6 | PigeonV7 |
|---|---|---|---|
| Params | 9.34B | 7.31B | 865M |
| Tokens seen | 8B | 12B | 5B |
| GPU memory | 34.4 GB | 27.2 GB | 6.7 GB |
| Val perplexity | 9.91 | 8.17 | 8.94 |

Training and evaluation basics for the three checkpoints.

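
The size comparisons in this section follow directly from Table 1. The sketch below recomputes them as a quick check; the dictionary layout and function names are illustrative, and the tokens-per-parameter figure is a derived quantity that the note itself does not report.

```python
# Figures copied from Table 1; parameters and tokens in billions, memory in GB.
CHECKPOINTS = {
    "v5": {"params_b": 9.34, "tokens_b": 8.0, "gpu_gb": 34.4},
    "v6": {"params_b": 7.31, "tokens_b": 12.0, "gpu_gb": 27.2},
    "v7": {"params_b": 0.865, "tokens_b": 5.0, "gpu_gb": 6.7},
}


def ratio_to(name: str, baseline: str, key: str) -> float:
    """Value of `key` for `name` as a fraction of the baseline checkpoint's value."""
    return CHECKPOINTS[name][key] / CHECKPOINTS[baseline][key]


def tokens_per_param(name: str) -> float:
    """Training tokens seen per model parameter."""
    c = CHECKPOINTS[name]
    return c["tokens_b"] / c["params_b"]


if __name__ == "__main__":
    print(f"v7 / v5 parameters: {ratio_to('v7', 'v5', 'params_b'):.3f}")
    print(f"v7 / v5 GPU memory: {ratio_to('v7', 'v5', 'gpu_gb'):.3f}")
    for name in CHECKPOINTS:
        print(f"{name} tokens per parameter: {tokens_per_param(name):.2f}")
```

On these figures v7 has about 9% of v5's parameters and uses about 19% of its GPU memory. It has also seen more tokens per parameter than either larger checkpoint: about 5.8, against 0.9 for v5 and 1.6 for v6.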
02

Benchmarks

All three checkpoints report results on the same five tasks. v6 scores highest on four of them (PIQA, ARC-Easy, HellaSwag, and ARC-Challenge). v7, at roughly a tenth of v5's parameters, trails v5 by 1.6 to 6.9 points on those four and leads both larger checkpoints on Winogrande.

Figure 1

| Benchmark | V5 | V6 | V7 |
|---|---|---|---|
| PIQA | 69.1 | 70.7 | 64.0 |
| ARC-Easy | 54.9 | 57.8 | 48.0 |
| Winogrande | 51.4 | 53.4 | 56.5 |
| HellaSwag | 41.6 | 46.9 | 40.0 |
| ARC-Challenge | 32.8 | 34.4 | 29.5 |

Accuracy (%) on the five shared benchmarks.

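
The gaps between checkpoints can be read off Figure 1 by subtraction. The sketch below does that arithmetic and names the leader on each task; the helper names `gap` and `leader` are illustrative.

```python
# Accuracy (%) copied from Figure 1.
SCORES = {
    "PIQA": {"v5": 69.1, "v6": 70.7, "v7": 64.0},
    "ARC-Easy": {"v5": 54.9, "v6": 57.8, "v7": 48.0},
    "Winogrande": {"v5": 51.4, "v6": 53.4, "v7": 56.5},
    "HellaSwag": {"v5": 41.6, "v6": 46.9, "v7": 40.0},
    "ARC-Challenge": {"v5": 32.8, "v6": 34.4, "v7": 29.5},
}


def gap(benchmark: str, a: str, b: str) -> float:
    """Score of checkpoint `a` minus score of checkpoint `b`, in points."""
    return round(SCORES[benchmark][a] - SCORES[benchmark][b], 1)


def leader(benchmark: str) -> str:
    """Checkpoint with the highest score on a benchmark."""
    row = SCORES[benchmark]
    return max(row, key=row.get)


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: v7 - v5 = {gap(name, 'v7', 'v5'):+.1f}, leader = {leader(name)}")
```

The v7 minus v5 gaps come out at -5.1 (PIQA), -6.9 (ARC-Easy), +5.1 (Winogrande), -1.6 (HellaSwag), and -3.3 (ARC-Challenge).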
03

Long-context retrieval

On a needle-in-haystack test, v7 passes all ten trials; v5 passes none and v6 one. The case for sub-quadratic sequence models rests on long context, and here the smallest checkpoint is the one that delivers it. Whether the result holds at longer contexts or across a full training run is still open.
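
The note does not describe how its needle-in-haystack trials were built. For readers unfamiliar with the format, a minimal sketch of one common setup follows: a single fact is hidden at a random depth in filler text, and the trial passes if the model's answer contains it. Everything here is hypothetical, including the filler sentence, the prompt wording, the pass criterion, and the `generate` callable that stands in for a model.

```python
import random
import re
from typing import Callable

FILLER = "The pigeons circled the square and settled on the roof again."


def build_prompt(needle_code: str, n_filler: int, depth: float) -> str:
    """Place one needle sentence at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    sentences = [FILLER] * n_filler
    position = int(depth * n_filler)
    sentences.insert(position, f"The secret code is {needle_code}.")
    question = "What is the secret code?"
    return " ".join(sentences) + "\n\n" + question


def run_trials(generate: Callable[[str], str], n_trials: int, n_filler: int, seed: int = 0) -> int:
    """Return how many trials the model answered with the correct code."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_trials):
        code = str(rng.randint(100000, 999999))
        prompt = build_prompt(code, n_filler, depth=rng.random())
        if code in generate(prompt):
            passed += 1
    return passed


def oracle(prompt: str) -> str:
    """Stand-in for a model: reads the code straight out of the prompt."""
    match = re.search(r"The secret code is (\d+)\.", prompt)
    return match.group(1) if match else ""


if __name__ == "__main__":
    print(run_trials(oracle, n_trials=10, n_filler=300))
```

In this sketch `n_filler` controls the context length; how many tokens that corresponds to depends on the tokenizer.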

Figure 2

| Checkpoint | Trials passed |
|---|---|
| PigeonV5 · 9.34B | 0/10 |
| PigeonV6 · 7.31B | 1/10 |
| PigeonV7 · 865M | 10/10 |

Needle-in-haystack pass rate, 10 trials per version, context ~2k–5k tokens.

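
Ten trials per checkpoint is a small sample, so it helps to see how wide the plausible range around each pass rate is. The sketch below computes a Wilson score interval for each count. It assumes the trials are independent, which the note does not state, and the choice of the Wilson interval is ours rather than the authors'.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    for name, passes in [("v5", 0), ("v6", 1), ("v7", 10)]:
        low, high = wilson_interval(passes, 10)
        print(f"{name}: {passes}/10, interval {low:.2f} to {high:.2f}")
```

Under these assumptions, 10/10 is consistent with a true pass rate from about 72% to 100%, 1/10 with about 2% to 40%, and 0/10 with 0% to about 28%. The v7 interval does not overlap the other two, but the exact rates remain loosely pinned down.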
04

Full results

v5 and v6 add MMLU; v7 adds SciQ, OpenBookQA, and LAMBADA. A dash marks a benchmark that a checkpoint did not report.

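Table 2

| Benchmark | PigeonV5 (9.34B) | PigeonV6 (7.31B) | PigeonV7 (865M) |
|---|---|---|---|
| PIQA | 69.1 | 70.7 | 64.0 |
| ARC-Easy | 54.9 | 57.8 | 48.0 |
| Winogrande | 51.4 | 53.4 | 56.5 |
| HellaSwag | 41.6 | 46.9 | 40.0 |
| ARC-Challenge | 32.8 | 34.4 | 29.5 |
| MMLU (5-shot) | 26.0 | 27.3 | - |
| SciQ | - | - | 74.5 |
| OpenBookQA | - | - | 30.5 |
| LAMBADA | - | - | 22.5 |
| Val perplexity | 9.91 | 8.17 | 8.94 |

Full reported results across the three checkpoints.
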
05

Reading the results

These are early checkpoints: 5 to 12 billion training tokens, far short of a full run, with v7 the least trained at 5 billion. Read the scores against chance and the token budget, not against finished models.
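
One way to read the scores against chance is to subtract a random-guess baseline from each. The sketch below does this for the five shared benchmarks. The chance levels are assumptions based on the standard formats of these tasks (two answer options for PIQA and Winogrande, four for HellaSwag, and approximately four for the ARC sets); the note does not state the evaluation setup.

```python
# Assumed random-guess accuracy (%) under the standard task formats.
CHANCE = {
    "PIQA": 50.0,
    "ARC-Easy": 25.0,
    "Winogrande": 50.0,
    "HellaSwag": 25.0,
    "ARC-Challenge": 25.0,
}

# Accuracy (%) as reported for the three checkpoints.
SCORES = {
    "PIQA": {"v5": 69.1, "v6": 70.7, "v7": 64.0},
    "ARC-Easy": {"v5": 54.9, "v6": 57.8, "v7": 48.0},
    "Winogrande": {"v5": 51.4, "v6": 53.4, "v7": 56.5},
    "HellaSwag": {"v5": 41.6, "v6": 46.9, "v7": 40.0},
    "ARC-Challenge": {"v5": 32.8, "v6": 34.4, "v7": 29.5},
}


def margin_over_chance(benchmark: str, checkpoint: str) -> float:
    """Reported accuracy minus the assumed chance level, in points."""
    return round(SCORES[benchmark][checkpoint] - CHANCE[benchmark], 1)


if __name__ == "__main__":
    for name in SCORES:
        margins = ", ".join(f"{ck} {margin_over_chance(name, ck):+.1f}" for ck in ("v5", "v6", "v7"))
        print(f"{name}: {margins}")
```

Read this way, several scores sit close to the floor. v5's 51.4 on Winogrande is 1.4 points above the assumed 50% baseline, and v7's 29.5 on ARC-Challenge is 4.5 points above the assumed 25%.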

Project Pigeon is active research. These checkpoints are internal and have not been released. The architecture direction is described in the research note.

Questions and corrections are welcome at contact@auxerta.com.