Project Pigeon: a small model that holds long context

Overview

We present three checkpoints of Project Pigeon, v5 through v7. Pigeon is a post-transformer line: a sub-quadratic sequence model trained with a lookahead objective, so its training target includes the trajectory of upcoming output, not only the next token. v7 is the smallest of the three. At 865M parameters it trains in 6.7 GB of GPU memory, and it is the only checkpoint that passes the long-context retrieval test.

Key features

Post-transformer architecture: a sub-quadratic sequence model rather than an attention-only Transformer.
Lookahead objective: the training target includes the trajectory of upcoming output, not only next-token prediction.
Long-context retrieval at small scale: v7 retrieves a needle from contexts of roughly 2k–5k tokens at 865M parameters.
Low footprint: v7 trains in 6.7 GB of GPU memory, about a fifth of v5's 34.4 GB.

Key highlights

v7, at 865M parameters, trails v5 (9.34B) by 1.6 to 6.9 points on four of the five shared benchmarks and leads it by 5.1 points on Winogrande.
v7 passes 10 of 10 long-context needle-in-haystack trials; v5 passes 0 and v6 passes 1 (context ~2k–5k tokens).
Validation perplexity falls from 9.91 (v5) to 8.17 (v6); v7 reaches 8.94 at 865M parameters and 5B tokens.

Table 1

	PigeonV5 (9.34B)	PigeonV6 (7.31B)	PigeonV7 (865M)
Params	9.34B	7.31B	865M
Tokens seen	8B	12B	5B
GPU memory	34.4 GB	27.2 GB	6.7 GB
Val perplexity	9.91	8.17	8.94

Training and evaluation basics for the three checkpoints.
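
The ratios quoted in this note ("about a fifth" of the memory, "roughly a tenth" of the parameters) can be checked directly from Table 1. The sketch below is only a convenience for that arithmetic: the dictionary layout and the function names are illustrative and not part of any Pigeon tooling, and tokens per parameter is added as one possible way to read the token budget, not a metric the note itself reports.

```python
# Figures copied from Table 1. The dictionary layout and function names are
# illustrative only and are not part of any Pigeon tooling.
CHECKPOINTS = {
    "v5": {"params_b": 9.34, "tokens_b": 8.0, "gpu_gb": 34.4, "val_ppl": 9.91},
    "v6": {"params_b": 7.31, "tokens_b": 12.0, "gpu_gb": 27.2, "val_ppl": 8.17},
    "v7": {"params_b": 0.865, "tokens_b": 5.0, "gpu_gb": 6.7, "val_ppl": 8.94},
}


def ratio(field, small="v7", large="v5"):
    """Return the small/large ratio for one Table 1 column."""
    return CHECKPOINTS[small][field] / CHECKPOINTS[large][field]


def tokens_per_param(name):
    """Training tokens seen per model parameter (both in billions)."""
    c = CHECKPOINTS[name]
    return c["tokens_b"] / c["params_b"]


for name in CHECKPOINTS:
    print(f"{name}: {tokens_per_param(name):.2f} tokens per parameter")
print(f"v7/v5 GPU memory: {ratio('gpu_gb'):.3f}")
print(f"v7/v5 parameters: {ratio('params_b'):.3f}")
```

On these figures v7 uses a little under a fifth of v5's memory and has a little under a tenth of its parameters. It has also seen the most tokens per parameter of the three, even though its absolute token count is the lowest.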

Benchmarks

All three checkpoints report five tasks in common. v6 is the strongest on four of them (PIQA, ARC-Easy, HellaSwag, and ARC-Challenge); v7, at roughly a tenth of v5's parameters, trails v5 by 1.6 to 6.9 points on those four and leads both larger checkpoints on Winogrande.

Figure 1

Benchmark	V5	V6	V7
PIQA	69.1	70.7	64.0
ARC-Easy	54.9	57.8	48.0
Winogrande	51.4	53.4	56.5
HellaSwag	41.6	46.9	40.0
ARC-Challenge	32.8	34.4	29.5

Accuracy (%) on the five shared benchmarks. v7 (865M) shown in accent.
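
The per-task gaps behind the comparison above are simple subtractions, and a short script makes them easy to recheck. This is a minimal sketch using the five shared scores as reported; the helper names `gap`, `leader`, and `mean_score` are hypothetical and exist only for this example.

```python
# Accuracy (%) on the five shared benchmarks, as reported in this note.
SCORES = {
    "PIQA": {"v5": 69.1, "v6": 70.7, "v7": 64.0},
    "ARC-Easy": {"v5": 54.9, "v6": 57.8, "v7": 48.0},
    "Winogrande": {"v5": 51.4, "v6": 53.4, "v7": 56.5},
    "HellaSwag": {"v5": 41.6, "v6": 46.9, "v7": 40.0},
    "ARC-Challenge": {"v5": 32.8, "v6": 34.4, "v7": 29.5},
}


def gap(task, a, b):
    """Score of checkpoint `a` minus checkpoint `b` on `task`, in points."""
    return round(SCORES[task][a] - SCORES[task][b], 1)


def leader(task):
    """Name of the checkpoint with the highest score on `task`."""
    return max(SCORES[task], key=SCORES[task].get)


def mean_score(version):
    """Unweighted mean accuracy of one checkpoint over the five tasks."""
    return round(sum(row[version] for row in SCORES.values()) / len(SCORES), 2)


for task in SCORES:
    print(f"{task}: v7 - v5 = {gap(task, 'v7', 'v5'):+.1f}, leader = {leader(task)}")
```

An unweighted mean is only one way to summarise five tasks with different chance levels, so `mean_score` should be read as a rough aggregate.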

Long-context retrieval

On a needle-in-haystack test, v7 passes all ten trials; v5 passes none and v6 one. The case for sub-quadratic sequence models rests on long context, and here the smallest checkpoint is the one that delivers it. Whether the result holds at longer contexts or across a full training run is still open.
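
This note does not spell out the exact test protocol, so the following is only a generic sketch of how a needle-in-haystack trial is commonly set up: a short fact is buried at a random depth in filler text and the model is asked to recall it. Everything here is an assumption made for illustration, including the filler sentence, the "secret code" needle, the prompt wording, the pass criterion, and the `generate` callable that stands in for a model. Sentence counts are a rough stand-in for context length and are not calibrated to the 2k–5k token range.

```python
import random
import re

# Hypothetical filler text; a real test would use varied distractor documents.
FILLER = "The sky was grey and the pigeons kept to the rooftops."


def build_prompt(needle, n_filler, depth):
    """Bury `needle` among filler sentences at relative depth 0.0 (start) to 1.0 (end)."""
    sentences = [FILLER] * n_filler
    sentences.insert(int(depth * n_filler), needle)
    return " ".join(sentences) + "\nQuestion: What is the secret code?\nAnswer:"


def run_trials(generate, n_trials=10, seed=0):
    """Count trials in which the text returned by `generate` contains the code."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(n_trials):
        code = str(rng.randint(100000, 999999))
        needle = f"The secret code is {code}."
        prompt = build_prompt(needle, rng.randint(200, 500), rng.random())
        if code in generate(prompt):
            passes += 1
    return passes


def oracle(prompt):
    """Stub 'model' that always finds the needle (upper bound for the harness)."""
    return re.search(r"secret code is (\d+)", prompt).group(1)


def forgetful(prompt):
    """Stub 'model' that never answers (lower bound for the harness)."""
    return ""


print(run_trials(oracle), run_trials(forgetful))
```

The two stubs only confirm that the harness itself scores correctly; a real run would replace them with a call into the checkpoint being tested.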

Figure 2

PigeonV5 · 9.34B: 0/10
PigeonV6 · 7.31B: 1/10
PigeonV7 · 865M: 10/10

Needle-in-haystack pass rate, 10 trials per version, context ~2k–5k tokens.
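
With only 10 trials per version, each pass rate carries wide uncertainty. One standard way to express that is a Wilson score interval for a binomial proportion; the sketch below applies it to the three reported counts. The choice of the Wilson interval and the 95% level (z = 1.96) are assumptions made here for illustration, not part of the reported evaluation.

```python
import math


def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (default: about 95%)."""
    p_hat = passes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Pass counts as reported: 10 trials per version.
for name, passes in [("v5", 0), ("v6", 1), ("v7", 10)]:
    low, high = wilson_interval(passes, 10)
    print(f"{name}: {passes}/10, interval {low:.2f} to {high:.2f}")
```

Under these assumptions the interval for 10/10 starts at roughly 0.72 and the interval for 0/10 ends at roughly 0.28, so the gap between v7 and v5 is large relative to the sampling noise, while the exact pass rates remain loosely determined.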

Full results

Beyond the five shared benchmarks, v5 and v6 also report MMLU, and v7 also reports SciQ, OpenBookQA, and LAMBADA. A dash marks a benchmark that a checkpoint did not report.

Table 2

Benchmark	PigeonV5 (9.34B)	PigeonV6 (7.31B)	PigeonV7 (865M)
PIQA	69.1	70.7	64.0
ARC-Easy	54.9	57.8	48.0
Winogrande	51.4	53.4	56.5
HellaSwag	41.6	46.9	40.0
ARC-Challenge	32.8	34.4	29.5
MMLU (5-shot)	26.0	27.3	—
SciQ	—	—	74.5
OpenBookQA	—	—	30.5
LAMBADA	—	—	22.5
Val perplexity	9.91	8.17	8.94

Full reported results across the three checkpoints.

Reading the results

These are early checkpoints: 5 to 12 billion training tokens, far short of a full run, with v7 the least trained at 5 billion. Read the scores against chance and the token budget, not against finished models.
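
Reading a score against chance can be made concrete by subtracting the accuracy a random guesser would get. The sketch below does this for the multiple-choice benchmarks in Table 2. The chance levels are assumptions based on the usual formats of these tasks (two options for PIQA and Winogrande, four for the others); the note does not state the evaluation setup, and ARC contains a small number of questions with a different option count, so 25% is nominal there. LAMBADA is left out because it is not a multiple-choice task.

```python
# Nominal chance accuracy (%) per task. These are ASSUMED from the usual
# multiple-choice formats, not stated in the note.
CHANCE = {
    "PIQA": 50.0, "Winogrande": 50.0,
    "ARC-Easy": 25.0, "ARC-Challenge": 25.0, "HellaSwag": 25.0,
    "MMLU (5-shot)": 25.0, "SciQ": 25.0, "OpenBookQA": 25.0,
}

# Accuracy (%) from Table 2; None marks a benchmark a checkpoint did not report.
SCORES = {
    "PIQA": {"v5": 69.1, "v6": 70.7, "v7": 64.0},
    "ARC-Easy": {"v5": 54.9, "v6": 57.8, "v7": 48.0},
    "Winogrande": {"v5": 51.4, "v6": 53.4, "v7": 56.5},
    "HellaSwag": {"v5": 41.6, "v6": 46.9, "v7": 40.0},
    "ARC-Challenge": {"v5": 32.8, "v6": 34.4, "v7": 29.5},
    "MMLU (5-shot)": {"v5": 26.0, "v6": 27.3, "v7": None},
    "SciQ": {"v5": None, "v6": None, "v7": 74.5},
    "OpenBookQA": {"v5": None, "v6": None, "v7": 30.5},
}


def above_chance(task, version):
    """Points above nominal chance, or None if the score was not reported."""
    score = SCORES[task][version]
    if score is None:
        return None
    return round(score - CHANCE[task], 1)


for task in SCORES:
    row = {v: above_chance(task, v) for v in ("v5", "v6", "v7")}
    print(task, row)
```

Under these assumed chance levels, several scores sit only a few points above random guessing, such as MMLU for v5 and v6 and Winogrande for v5, which is the sense in which the scores reflect early checkpoints.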

Project Pigeon is active research. These checkpoints are internal and have not been released. The architecture direction is described in the research note.