Most large language models train the same way: predict the next token, scale up the data, scale up the model. We don't think that's the whole story.
We're training models on objectives that separate how they learn internal representations from how they produce tokens. The bet is that representations come out cleaner when they aren't bent to fit a specific output format. Three things follow if it works:
- (i) Representations that don't carry the marks of next-token prediction. Better odds of transferring to reasoning, vision, and audio.
- (ii) One trunk, many heads. Swap the decoder for a different modality without retraining the base; a sketch of what that looks like follows this list.
- (iii) Cheaper to change behavior at inference. Fine-tune the output side; leave the foundation alone.
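
Here's a rough sketch of the split in PyTorch. Everything in it is made up for illustration: the module names, the toy representation loss, the sizes. What matters is the structure: the trunk never sees a token loss, and the head trains against frozen features, so swapping or retuning the head leaves the foundation alone.

```python
# Minimal sketch of "one trunk, many heads" with decoupled objectives.
# All names and the toy rep_loss are illustrative, not our actual recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Trunk(nn.Module):
    """Shared encoder trained on a representation objective, never on token output."""
    def __init__(self, vocab=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ids):
        return self.encoder(self.embed(ids))          # (batch, seq, d)

class TokenHead(nn.Module):
    """Swappable head that maps trunk features to a token distribution."""
    def __init__(self, d=64, vocab=1000):
        super().__init__()
        self.proj = nn.Linear(d, vocab)

    def forward(self, feats):
        return self.proj(feats)                       # (batch, seq, vocab)

def rep_loss(feats, feats_other_view):
    """Placeholder representation objective: pull two views of the same input
    together in feature space. Stands in for whatever the trunk actually optimizes."""
    return F.mse_loss(feats, feats_other_view)

trunk, head = Trunk(), TokenHead()
ids = torch.randint(0, 1000, (2, 16))
targets = torch.randint(0, 1000, (2, 16))

# Stage 1: the trunk learns representations; no token loss touches it.
feats = trunk(ids)
loss_rep = rep_loss(feats, trunk(ids))                # second pass stands in for an augmented view

# Stage 2: train (or later swap) the head against frozen features.
# Changing output behavior means fine-tuning only this part.
logits = head(feats.detach())                         # detach: no gradient flows into the trunk
loss_tok = F.cross_entropy(logits.reshape(-1, 1000), targets.reshape(-1))
```

A head for a different modality would replace `TokenHead` with its own output layer while reusing the same frozen trunk features, which is the whole point of keeping the two objectives apart.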