§ Notes
Research notes.
Project Pigeon: a small model that holds long context
Results from three internal checkpoints. The 865M v7 stays within a few points of the much larger v5 on most shared benchmarks, leads on Winogrande, and passes a long-context retrieval test that both larger checkpoints fail.
Adversarial robustness in domain-specific models: red-teaming beyond the generic benchmark.
Generic robustness benchmarks miss the failure modes that matter in specialized domains. Robustness for these models should be measured in the learned representations, not the outputs alone.
Interlocking specialized models: routing and merging domain experts for compound AI systems.
A compound system built from a shared pretrained trunk and interchangeable expert heads. The open problems sit in the routing and merging that keep the composition stable.
Improving synthetic data generation bounds via constrained decoding.
A decode-time bound on the error of synthetically generated training data. Constraining the generator, rather than filtering its output, keeps a growing corpus on-distribution.