Research note v1.0 · 2026

Post-transformer architectures for foundation models.

Status · in progress

The dominant approach to foundation models is a Transformer trained on next-token prediction, then scaled in data and parameters. It works well, but it carries structural costs. The cost of self-attention grows quadratically with sequence length, the context window is bounded, and inference gets expensive in exactly the regimes where deployment needs it cheap.

We work on post-transformer architectures. The hypothesis has two parts: sub-quadratic sequence models can hold Transformer-level quality at scale, and a shared trunk can learn representations that are not bound to next-token prediction. Both parts are unproven. If they hold, three properties follow.

  • (i) Longer effective context becomes affordable, because compute no longer grows quadratically with sequence length.
  • (ii) Representations transfer across tasks and modalities, because the trunk is not shaped by next-token prediction alone.
  • (iii) Adaptation and inference both get cheaper: a new task is a head trained on the shared trunk, and serving runs at sub-quadratic cost.
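To make the scaling argument behind (i) and (iii) concrete, the sketch below compares rough multiply-add counts for sequence mixing in self-attention and in a diagonal linear recurrence. It is a back-of-the-envelope illustration, not a measurement of any model in this note: the function names, the per-token cost constants, and the sizes `d_model = 1024` and `state_dim = 16` are all assumptions chosen for the example.

```python
def attention_mixing_macs(seq_len: int, d_model: int) -> int:
    """Approximate multiply-adds for the two (n x n) products in self-attention.

    Counts QK^T and the attention-weighted sum over V. Ignores projections,
    softmax, and multi-head bookkeeping.
    """
    return 2 * seq_len * seq_len * d_model


def recurrent_mixing_macs(seq_len: int, d_model: int, state_dim: int) -> int:
    """Approximate multiply-adds for a diagonal linear recurrence.

    Assumes each of d_model channels keeps state_dim scalars and spends about
    three multiply-adds per state per token (decay, input, readout).
    """
    return 3 * seq_len * d_model * state_dim


if __name__ == "__main__":
    d_model, state_dim = 1024, 16  # illustrative sizes, not taken from the note
    for n in (1_024, 8_192, 65_536):
        attn = attention_mixing_macs(n, d_model)
        rec = recurrent_mixing_macs(n, d_model, state_dim)
        print(f"n={n:>6}  attention={attn:.2e}  recurrent={rec:.2e}  ratio={attn / rec:.1f}")
```

Under these assumptions, doubling the sequence length quadruples the attention count and doubles the recurrent count, so the gap between the two widens linearly with context length.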

Where this sits

The precedent is real. State-space models (S4, Mamba), linear and recurrent attention, and Transformer–SSM hybrids have all shown that sub-quadratic sequence models can be competitive at scale. The open question is whether those results carry into the pretraining of general foundation models rather than isolated benchmark tasks. That is the question we work on.
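For readers unfamiliar with the model family, here is a minimal sketch of the kind of computation these sub-quadratic layers share: a diagonal linear recurrence that carries a fixed-size state across the sequence. It is a toy with fixed scalar parameters, not an implementation of S4 or Mamba (both add structured parameterizations, and Mamba makes the parameters input-dependent). The function name and the parameters `a`, `b`, `c` are assumptions for illustration.

```python
from typing import List, Sequence


def diagonal_recurrence(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c . h_t over a scalar sequence.

    a, b, c each have one entry per state dimension. Work per token is
    proportional to the state size, so total cost is linear in len(x).
    """
    h = [0.0] * len(a)  # fixed-size state, independent of sequence length
    y = []
    for x_t in x:
        # Elementwise state update: decay the old state, add the new input.
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        # Readout: project the state down to one output value.
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


if __name__ == "__main__":
    outputs = diagonal_recurrence(
        x=[1.0, 0.0, 2.0, -1.0],
        a=[0.9, 0.5],   # per-state decay, |a| < 1 keeps the state bounded
        b=[1.0, 2.0],
        c=[1.0, -1.0],
    )
    print(outputs)
```

The state `h` is the only memory the layer carries forward, which is why inference cost per token stays constant as the context grows. It is also why the memory mechanism is a central design question for this model family.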

What we are building

A post-transformer pretraining stack in which the architecture itself is the experiment. We treat the sequence-mixing layer, the memory mechanism, and the interface between trunk and heads as first-class and measurable, because those choices decide quality and transfer.
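One way to picture the trunk-and-heads interface is a frozen feature function with a small trainable head per task. The sketch below is hypothetical throughout: `toy_trunk`, `LinearHead`, `fit_head`, and the hand-written features are stand-ins invented for illustration and say nothing about the actual stack. The point is only that adapting to a new task updates the head's weights and leaves the trunk untouched.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[float]
Example = Tuple[Sequence[int], float]


def toy_trunk(tokens: Sequence[int]) -> Features:
    """Hypothetical frozen trunk: maps a token sequence to a fixed-size vector."""
    n = max(len(tokens), 1)
    return [1.0, sum(tokens) / n, len(tokens) / 10.0]


class LinearHead:
    """Task head: the only trainable parameters in this sketch."""

    def __init__(self, dim: int) -> None:
        self.w = [0.0] * dim

    def predict(self, feats: Features) -> float:
        return sum(w_i * f_i for w_i, f_i in zip(self.w, feats))

    def sgd_step(self, feats: Features, target: float, lr: float) -> None:
        # Gradient of 0.5 * (prediction - target)^2 with respect to w.
        err = self.predict(feats) - target
        self.w = [w_i - lr * err * f_i for w_i, f_i in zip(self.w, feats)]


def mean_squared_error(
    trunk: Callable[[Sequence[int]], Features], head: LinearHead, data: List[Example]
) -> float:
    return sum((head.predict(trunk(t)) - y) ** 2 for t, y in data) / len(data)


def fit_head(
    trunk: Callable[[Sequence[int]], Features],
    head: LinearHead,
    data: List[Example],
    epochs: int = 200,
    lr: float = 0.05,
) -> None:
    """Train the head on trunk features; the trunk is called but never updated."""
    for _ in range(epochs):
        for tokens, target in data:
            head.sgd_step(trunk(tokens), target, lr)


if __name__ == "__main__":
    # Synthetic task: predict the mean token value.
    data = [([1, 2, 3], 2.0), ([4, 4], 4.0), ([0, 1], 0.5), ([2, 2, 2, 2], 2.0)]
    head = LinearHead(dim=3)
    print("before:", mean_squared_error(toy_trunk, head, data))
    fit_head(toy_trunk, head, data)
    print("after: ", mean_squared_error(toy_trunk, head, data))
```

Because the head sees only what the trunk emits, how well a small head can be fit to an unseen task is one way to measure the quality of the interface.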

How we would know we are wrong

Two outcomes would settle it against us: a post-transformer model that trails a strong Transformer baseline on quality or transfer at matched compute, or efficiency gains that evaporate once training stability is priced in. We hold ourselves to matched-compute comparisons and to transfer measured on tasks the model never saw. Demos do not count.
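A matched-compute comparison can be written down as a small check that refuses to rank two runs unless their training budgets agree. The sketch below is one possible shape for that check, not the evaluation harness used for this work: the `Run` record, the 5% tolerance, and the use of a single scalar loss are all assumptions, and the numbers in the usage example are placeholders, not results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    train_flops: float  # total training compute
    eval_loss: float    # held-out loss, lower is better


def is_matched_compute(a: Run, b: Run, rel_tol: float = 0.05) -> bool:
    """True when the two training budgets differ by at most rel_tol (relative)."""
    return abs(a.train_flops - b.train_flops) <= rel_tol * max(a.train_flops, b.train_flops)


def compare(candidate: Run, baseline: Run, rel_tol: float = 0.05) -> str:
    """Rank the candidate against the baseline only if compute is matched."""
    if not is_matched_compute(candidate, baseline, rel_tol):
        return "not comparable"
    if candidate.eval_loss > baseline.eval_loss:
        return "candidate trails"
    return "candidate holds"


if __name__ == "__main__":
    # Placeholder values for illustration only; these are not measurements.
    baseline = Run("transformer-baseline", train_flops=1.00e21, eval_loss=2.50)
    candidate = Run("post-transformer", train_flops=1.02e21, eval_loss=2.55)
    print(compare(candidate, baseline))
```

The same pattern extends to transfer: swap `eval_loss` for a metric computed on tasks held out of training, and keep the compute gate unchanged.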

Research that clears that bar ships as domain-specialized models through Neognathae. Kestrel, a text classifier, is the first. It is already released.

Cite as

Auxerta Labs. Post-transformer architectures for foundation models. Research note v1.0, 2026. auxerta.com/research