Improving synthetic data generation bounds via constrained decoding

The fidelity gap in generated corpora

The appeal of synthetic data is straightforward: produce training text programmatically rather than collecting and labeling it by hand. As public, high-quality web text becomes scarcer relative to the compute available to consume it, model-generated data is widely expected to make up a growing share of pretraining mixtures. The open question is what that data does to the model that learns from it.

Unconstrained generation suffers from what is often called a fidelity gap: the statistical patterns a model has internalized do not coincide with the domain invariants real data satisfies. The output looks right but breaks rules that matter.

A clinical note prescribes a dose that sits comfortably inside the model's learned distribution yet exceeds the approved maximum for the stated indication.
A legal passage cites a case in flawless format, but the case does not exist.
An engineering spec lists tolerances that are internally consistent but physically impossible for the named alloy.

These are not hallucinations in the dramatic sense. They are plausible errors: outputs that survive a casual read and fail domain validation. A model pretrained on enough of them does not merely underperform. It acquires confidently wrong behaviors that are hard to trace back to their source in the data.

There is a second, compounding concern. A line of work beginning with Shumailov and colleagues has documented model collapse: when models train recursively on their own output, rare but important tails of the distribution thin out with each generation, and the corpus narrows. Constraints that tie generated data to the structure of real data do not remove that risk, but they should help slow the drift.

Constrained decoding: shaping the output space

The idea is to restrict generation so outputs conform to predefined rules at every step. Instead of sampling freely from the full vocabulary distribution, the decoder masks tokens that would lead to structurally or semantically invalid text, then samples from what remains.
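The masking step itself is small. A minimal sketch, assuming a toy vocabulary of integer token ids and a caller that already knows which ids are valid at this step; the function name `masked_sample` and its arguments are illustrative, and a real decoder would apply the same idea to logit tensors.

```python
import math
import random

def masked_sample(logits, allowed, rng):
    """Sample one token id from `logits`, restricted to the ids in `allowed`."""
    ids = sorted(allowed)
    if not ids:
        raise ValueError("constraint leaves no valid token")
    # Softmax over the allowed ids only; every masked id gets probability zero.
    peak = max(logits[i] for i in ids)
    weights = [math.exp(logits[i] - peak) for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# Usage: token 0 has the highest logit, but the constraint excludes it.
rng = random.Random(0)
token = masked_sample([5.0, 1.0, 0.5, 3.0], allowed={1, 2}, rng=rng)
```

Whatever the model prefers, the sampled id can only come from the allowed set. An empty set is treated as an error rather than silently falling back to unconstrained sampling.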

The contrast is with post-hoc filtering, the generate-then-validate approach most pipelines still default to. Filtering generates broadly, runs checks, and discards whatever fails; in tightly constrained domains the rejected fraction can be large. Constrained decoding instead prevents outputs that violate the encoded rules from forming at all. That class of errors is ruled out by construction rather than by sampling luck.

Grammar-guided decoding

The first layer enforces structure at the token level. At each step a grammar, typically a context-free grammar, regular expression, or JSON schema, determines which tokens are valid continuations; the rest are masked from the distribution before sampling. Schema-constrained "structured output" modes in commercial APIs and open-source libraries such as Outlines and Guidance make this routine.
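To make the mechanism concrete, here is a hedged sketch of grammar-guided decoding with a hand-written finite automaton. Everything in it is an assumption for illustration: the grammar (one to three digits, a space, then "mg"), the character-level "tokens", and the `score` callback standing in for the model. Libraries such as Outlines and Guidance compile real grammars over real tokenizers; this does not reproduce their APIs.

```python
import random
import string

# Hypothetical toy grammar: one to three digits (no leading zero), a space, "mg".
# TRANSITIONS[state] maps each allowed character to the next state.
DIGITS = string.digits
TRANSITIONS = {
    0: {d: 1 for d in DIGITS if d != "0"},
    1: {**{d: 2 for d in DIGITS}, " ": 4},
    2: {**{d: 3 for d in DIGITS}, " ": 4},
    3: {" ": 4},
    4: {"m": 5},
    5: {"g": 6},
    6: {},  # accepting state, no further characters
}
ACCEPTING = {6}

def generate(score, rng):
    """Decode one string, sampling only among characters the automaton allows.

    `score(prefix, ch)` stands in for the model and must return a positive weight.
    """
    state, out = 0, []
    while state not in ACCEPTING:
        allowed = sorted(TRANSITIONS[state])
        weights = [score("".join(out), ch) for ch in allowed]
        ch = rng.choices(allowed, weights=weights, k=1)[0]
        out.append(ch)
        state = TRANSITIONS[state][ch]
    return "".join(out)

# Usage with a uniform stand-in model.
sample = generate(lambda prefix, ch: 1.0, random.Random(1))
```

Every string this produces matches the grammar, whatever the scoring function does. What the grammar cannot tell you is whether the number is a sensible dose.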

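The subtlety is that naive masking distorts the model's distribution: masking and renormalizing at each step reweights the surviving continuations, so whole sequences are sampled with probabilities that differ from the model's distribution conditioned on the grammar, and quality can degrade even as validity is guaranteed. Grammar-aligned decoding (Park et al., NeurIPS 2024) addresses this. Its sampling schemes aim to match the model's distribution conditioned on the grammar rather than the locally renormalized one, so validity need not come at the expense of model quality.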
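The distortion is easy to see in a toy case. The sketch below assumes a made-up two-token model and a grammar that only requires the sequence to end in "b"; all probabilities are invented for illustration. It compares the distribution that step-by-step masking induces with the model's own distribution restricted to valid sequences. It illustrates the problem only and is not an implementation of grammar-aligned decoding.

```python
# Hypothetical two-token model: FIRST[t] = P(t), NEXT[s][t] = P(t | s).
FIRST = {"a": 0.5, "b": 0.5}
NEXT = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.1, "b": 0.9}}
VALID = {"ab", "bb"}  # toy grammar: the two-token sequence must end in "b"

def conditioned():
    """Model distribution renormalized over whole valid sequences."""
    mass = {s: FIRST[s[0]] * NEXT[s[0]][s[1]] for s in VALID}
    total = sum(mass.values())
    return {s: p / total for s, p in mass.items()}

def naive_masked():
    """Distribution induced by masking and renormalizing at every step."""
    out = {}
    firsts = sorted({s[0] for s in VALID})
    z1 = sum(FIRST[t] for t in firsts)
    for t in firsts:
        seconds = sorted({s[1] for s in VALID if s[0] == t})
        z2 = sum(NEXT[t][u] for u in seconds)
        for u in seconds:
            out[t + u] = (FIRST[t] / z1) * (NEXT[t][u] / z2)
    return out

target = conditioned()   # "ab" gets about 0.1, "bb" about 0.9
naive = naive_masked()   # "ab" and "bb" each get 0.5
```

The model considers "ab" unlikely, because "b" rarely follows "a". Naive masking ignores that: once "a" is sampled, "b" is the only token left and is taken with certainty, so "ab" ends up five times more frequent than the model would have it.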

Knowledge-grounded constraints

Grammar handles syntax, not truth. The second layer cross-references generated content against trusted sources during decoding, so the test shifts from "is this valid JSON" to "does this drug–dose pair appear in the approved labeling." This means wiring the decoder to authoritative references: drug databases for clinical text, case indices for legal text, materials databases for engineering specs. The literature on knowledge-grounded and retrieval-constrained generation is active and uneven, and grounding adds real latency, but it targets exactly the plausible-but-false failures grammar cannot catch.
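A grounding check can be as simple as a lookup that narrows the candidate set before sampling. The sketch below assumes an in-memory reference table; the drug names, indications, and limits are placeholders rather than clinical data, and `allowed_doses` is a hypothetical helper. A production system would query an authoritative database instead.

```python
# Hypothetical reference table: (drug, indication) -> maximum daily dose in mg.
# Names and numbers are placeholders, not clinical data.
MAX_DAILY_MG = {
    ("drug_a", "indication_x"): 40,
    ("drug_a", "indication_y"): 80,
    ("drug_b", "indication_x"): 500,
}

def allowed_doses(drug, indication, candidates):
    """Keep only the candidate doses that the reference table permits."""
    limit = MAX_DAILY_MG.get((drug, indication))
    if limit is None:
        return []  # unknown pair: nothing is grounded, so nothing is allowed
    return [dose for dose in candidates if 0 < dose <= limit]

# Usage: the decoder would sample the dose only from the surviving candidates.
permitted = allowed_doses("drug_a", "indication_x", [20, 40, 60, 80])
```

Treating an unknown pair as "nothing allowed" is a deliberate choice in this sketch: the generator stops or backs off instead of inventing a dose the reference cannot confirm.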

Domain-specific validators

The third layer applies field-specific rules neither the grammar nor the knowledge base can express: unit consistency, cross-field relationships, temporal coherence across a record. Type-constrained code generation is a clean illustration: enforcing type-system rules during decoding tends to reduce compilation errors in generated code, and the same idea extends to any domain with a formal notion of validity.
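As an illustration of the rules this layer covers, the sketch below checks a single record for temporal coherence, unit consistency, and one cross-field relationship. The record layout and field names are assumptions made for the example, not a real schema.

```python
from datetime import date

def validate_record(record):
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    # Temporal coherence: discharge cannot precede admission.
    if record["discharged"] < record["admitted"]:
        errors.append("discharge precedes admission")
    # Unit consistency: every dose must use the unit declared for the record.
    for name, (amount, unit) in record["doses"].items():
        if unit != record["dose_unit"]:
            errors.append(f"{name}: unit {unit!r} differs from {record['dose_unit']!r}")
    # Cross-field relationship: the stated total must equal the sum of the parts.
    total = sum(amount for amount, _ in record["doses"].values())
    if total != record["total_dose"]:
        errors.append("total_dose does not match the sum of doses")
    return errors

# Usage with a hypothetical record that satisfies all three rules.
example = {
    "admitted": date(2024, 1, 1),
    "discharged": date(2024, 1, 3),
    "dose_unit": "mg",
    "doses": {"morning": (10, "mg"), "evening": (20, "mg")},
    "total_dose": 30,
}
problems = validate_record(example)
```

Returning the list of violations rather than a bare boolean lets a caller regenerate only the field that failed.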

Why this matters: bounding the error distribution

The intuition is geometric. Unconstrained generation samples from a wide region of output space; some samples land inside the valid set, many do not. Filtering draws a boundary and discards everything outside it. Constrained decoding instead reshapes the generating distribution so that probability mass concentrates inside the valid region from the start. Three consequences follow.

Higher effective yield. Fewer invalid samples are produced, so less is discarded. Usable data accumulates faster even when each decoding step costs a little more.
More predictable error. For the error classes the constraints encode, the error rate is set by the specification rather than by the luck of the draw, and a pretraining pipeline can plan around that.
Preserved diversity. Aggressive filtering can collapse the output distribution. Well-designed constraints keep variation inside the valid region, a hedge against recursive drift.

An open-ended error rate becomes a bounded one.
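The yield argument can be put in rough numbers. The sketch below assumes fields fail independently, which is a simplification, and assumes constrained decoding keeps every sample at a fixed per-sample overhead; both function names and all figures are hypothetical.

```python
def cost_filtering(cost_per_sample, field_validity, n_fields):
    """Expected cost per kept record under generate-then-filter.

    A record passes only if all fields pass, so the acceptance rate is
    field_validity ** n_fields; the mean number of attempts per accepted
    record is the reciprocal of that rate (geometric distribution).
    """
    acceptance = field_validity ** n_fields
    return cost_per_sample / acceptance

def cost_constrained(cost_per_sample, overhead):
    """Cost per kept record when every sample satisfies the encoded rules.

    `overhead` is the assumed fractional slowdown from the constraint checks.
    """
    return cost_per_sample * (1.0 + overhead)

# Usage: ten fields, each valid 90% of the time, versus a 20% decoding overhead.
filtering = cost_filtering(1.0, 0.9, 10)
constrained = cost_constrained(1.0, 0.2)
```

With these made-up inputs only about 35% of unconstrained records pass, so filtering spends just under three generations per kept record, while the constrained path spends 1.2. The comparison flips if the overhead is large or the unconstrained validity is already high, which is why the numbers need measuring per domain.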

Implementation: a cascading architecture

The cheapest checks run first, and the expensive ones see only what survives.

Token-level masks, applied at each decoding step. These enforce structural validity and are cheap because they operate on the logits before sampling; grammar-to-automaton compilation keeps the overhead small, and careful subword alignment can make it nearly free.
Segment-level validators, applied after each logical chunk: a field, a sentence, a record. These check domain invariants and trigger regeneration of just the offending segment rather than the whole output.
Document-level consistency checks, applied to a finished record to verify cross-field relationships. This is the most expensive layer, but the first two keep most errors from ever reaching it.
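The three stages above can be wired together in a few lines. This is a sketch under stated assumptions: `build_record` and its callback signatures are hypothetical, the token-level masking is assumed to live inside `generate_field`, and a fixed retry budget stands in for whatever regeneration policy a real pipeline would use.

```python
def build_record(field_names, generate_field, segment_ok, document_ok, max_retries=3):
    """Assemble a record field by field, regenerating only the fields that fail.

    generate_field(name, record, attempt) -> value  (stage 1: assumed to mask tokens internally)
    segment_ok(name, value, record) -> bool         (stage 2: per-field invariants)
    document_ok(record) -> bool                     (stage 3: cross-field checks)
    Returns the finished record, or None if a stage cannot be satisfied.
    """
    record = {}
    for name in field_names:
        for attempt in range(max_retries):
            value = generate_field(name, record, attempt)
            if segment_ok(name, value, record):
                record[name] = value
                break
        else:
            return None  # the segment never passed within the retry budget
    # The most expensive check runs once, on a record whose fields already passed.
    return record if document_ok(record) else None

# Usage with deterministic stand-ins: the first attempt fails, the second passes.
result = build_record(
    ["low", "high"],
    generate_field=lambda name, record, attempt: attempt * 10,
    segment_ok=lambda name, value, record: value >= 10,
    document_ok=lambda record: record["low"] <= record["high"],
)
```

Only the failing field is regenerated, and the document-level check sees a record only after every segment has cleared its own validator.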

The constraint specification problem

The difficulty is not the decoding. It is the specification: someone has to define what "valid" means for a domain, precisely enough to compile into decoding rules. For JSON, code, and tabular data this is comparatively easy, since schemas and type systems supply the constraints almost for free. For clinical, legal, and engineering text it takes domain experts who can articulate the invariants that separate valid data from plausible errors. Evaluation is downstream of the same work: you cannot measure compliance with a constraint nobody has written down.
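Once the invariants are written down as named rules, measuring compliance is mechanical. A minimal sketch, assuming each rule is a plain predicate over one record; the rule names and the threshold in the usage example are invented for illustration.

```python
def compliance(records, rules):
    """Fraction of records that pass each named rule.

    `rules` maps a rule name to a predicate that takes one record.
    """
    if not records:
        raise ValueError("need at least one record")
    return {
        name: sum(1 for record in records if check(record)) / len(records)
        for name, check in rules.items()
    }

# Usage with hypothetical rules over a toy corpus.
corpus = [{"dose": 10}, {"dose": 50}, {"dose": -1}, {"dose": 20}]
report = compliance(
    corpus,
    {
        "dose_positive": lambda record: record["dose"] > 0,
        "dose_under_cap": lambda record: record["dose"] <= 15,
    },
)
```

The same rule table can drive both the segment validators and the evaluation report, so the specification is written once.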

Trade-offs

Diversity versus accuracy. Tighter constraints reduce variety, and there is evidence that overly strict formatting spends model capacity on satisfying the format rather than on content. Separating a reasoning pass from a formatting pass helps, at the cost of more moving parts.
Coverage. Formal constraints catch only formally specifiable errors. Tone, nuance, and subtle reasoning mistakes fall outside what a grammar or knowledge base can express, so human review still has a role.
Distribution distortion. Naive masking shifts the model's distribution; distribution-aligned methods address this in principle, but applying them in practice remains an active engineering concern.
Portability. Constraints do not transfer between domains. A pipeline tuned for clinical text needs different rules for legal synthesis, and each new domain requires fresh expert input.

Outlook

As model-generated text takes up more of the pretraining mixture, the quality of that text increasingly sets the quality of the model trained on it. The bottleneck is no longer producing more data; it is producing data a foundation model can safely learn from. Constrained decoding gives a way to define what "safe to learn from" means in formal terms and to enforce it while the data is written: grammar for structure, grounding for fact, validators for semantics. An open-ended error rate becomes a bounded one.

At Auxerta we read this as a pretraining problem first. Our work centers on post-transformer architectures for foundation models, and the corpus those models train on sets the ceiling on what they can reach. Bounding training-data error at decode time is one of the cleaner levers on that ceiling: a way to grow a corpus deliberately rather than scrape it, and to keep generated text anchored to the structure of the data it stands in for. The hard part has not moved. It is still deciding what "valid" means, one domain at a time. The architecture direction is described in the research note.