Hao AI Lab · UC San Diego

JetSpec

Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Lanxiang Hu¹  Zhaoxiang Feng¹  Yulun Wu²  Haoran Yuan³  Yujie Zhao¹  Yu-Yang Qian⁴  Bojun Wang⁵  Peng Zhao⁴  Daxin Jiang⁵  Yibo Zhu⁵  Tajana Rosing¹  Hao Zhang¹

¹UC San Diego   ²Zhejiang University   ³UIUC   ⁴Nanjing University   ⁵StepFun

Speculative decoding hits a scaling ceiling: a larger draft budget helps only while acceptance stays high and drafting stays cheap. Prior draft heads face a causality-efficiency dilemma. Autoregressive drafters condition on the path, but their cost grows with tree depth; block-diffusion drafters draft in one pass yet score branches independently, forming trees whose tokens are individually plausible but mutually inconsistent. JetSpec trains a causal parallel draft head over fused hidden states from the frozen target, so a candidate tree’s scores align with the target’s own autoregressive factorization and the full target verifies the whole tree in one forward pass, losslessly. A larger budget then becomes a longer accepted prefix: 9.64× on MATH-500 and 4.58× on open-ended chat (Qwen3-8B, H100, greedy, budget 256), and these gains carry into real single-stream serving on JetSpec’s own engine.

Race bar
Same prompt, decoded three ways on Qwen3-8B. Each panel streams its tokens; the bar under it tracks progress to the finish. Each lane's throughput, token count, accepted length and answer text are measured on B200 (GSM8K, Qwen3-8B, greedy). Token-level pacing within a lane is reconstructed from the measured throughput and slowed uniformly for legibility; it is not a per-token wall-clock recording.

Each panel reaches the same text autoregressive decoding would; JetSpec simply commits more tokens per verified step. The measured speedups across benchmarks are in the results table below.

Tree quality

Tree drafting & verification

Both heads spend the same draft budget; what differs is the quality of the tree that budget buys. JetSpec’s causal head helps in two ways the paper isolates: it keeps the top-ranked branch, the one the verifier follows, faithful to the target, and it does so without tuning the loss weighting.

  • accepted path
  • drafter’s top pick (not taken)
  • rejected
  • free bonus

One JetSpec drafting round on a MATH-500 problem. At every position the draft head keeps the top few continuations in one pass, fanning above and below the running text, not just the greedy guess. Twice here the target writes a token the drafter ranked second (“told” over “given” 0.36, “is” over “equals” 0.39, both marked in violet as the drafter’s top pick); because the tree had drafted both, the accepted path (green) keeps going where a single greedy chain would have stopped. The target verifies the whole tree in one pass and accepts the longest matching prefix; the first uncovered token is rejected, and the target’s own token is taken as a free bonus. Illustrative round (representative tokens and shape); on real runs JetSpec’s mean accepted length reaches τ=10.76 tokens per round (MATH-500, budget 256).
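The acceptance walk in the round above can be sketched in a few lines. This is a minimal illustration of greedy tree verification, not JetSpec's engine code: the `Node` class, `verify_greedy`, and the `target_next` callable are hypothetical names, and the example tokens are invented to mirror the figure. In a real engine the target's choices for all tree nodes come from one batched forward pass rather than repeated calls.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str
    children: list["Node"] = field(default_factory=list)


def verify_greedy(root_children, target_next):
    """Follow the target's greedy choices through the draft tree.

    root_children: draft nodes proposed for the first new position.
    target_next:   hypothetical callable mapping the accepted prefix
                   (a tuple of tokens) to the target's greedy next token.
    Returns (accepted_tokens, bonus_token).
    """
    accepted = []
    frontier = root_children
    while True:
        want = target_next(tuple(accepted))
        match = next((n for n in frontier if n.token == want), None)
        if match is None:
            # No drafted child covers the target's token: stop here and
            # keep the target's own token as the free bonus.
            return accepted, want
        accepted.append(match.token)
        frontier = match.children


# Toy tree: the drafter ranks "given" and "equals" first, but also drafts
# the second-ranked "told" and "is", so the accepted path keeps going.
leaf_is = Node("is")
leaf_equals = Node("equals")
x = Node("x", [leaf_equals, leaf_is])
that = Node("that", [x])
tree = [Node("given"), Node("told", [that])]

target_sequence = ["told", "that", "x", "is", "3"]
accepted, bonus = verify_greedy(tree, lambda prefix: target_sequence[len(prefix)])
print(accepted, bonus)
```

A single greedy chain would have stopped at the first position, where the drafter's top pick "given" differs from the target's "told"; the tree commits four drafted tokens plus the bonus.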

Branch faithfulness

A block-diffusion head produces all positions from one shared hidden state with no causal mask between depths, so the depth-2 distribution never conditions on the token chosen at depth 1. When two positions independently favour tokens that cannot follow each other, the surrogate still scores their composition highly and promotes it to the top of the tree. The causal head masks between depths, anchoring each position to the model’s own choice at the previous one, so its rank-1 branch carries a score the target agrees with.
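A toy example makes the failure mode concrete. The two-token distribution below is invented for illustration, and the two scoring functions are simplified stand-ins for an independent per-position surrogate and a chain-rule (causal) score, not the actual heads.

```python
from collections import defaultdict
from itertools import product

# Hypothetical "true" distribution over two-token continuations.
joint = {
    ("New", "York"): 0.4,
    ("Los", "Angeles"): 0.3,
    ("Port", "Angeles"): 0.3,
}

# Per-position marginals: all an independent scorer has to work with.
first = defaultdict(float)
second = defaultdict(float)
for (a, b), p in joint.items():
    first[a] += p
    second[b] += p


def independent_score(a, b):
    """Score depth 2 without conditioning on the token chosen at depth 1."""
    return first[a] * second[b]


def causal_score(a, b):
    """Chain rule: P(a) * P(b | a)."""
    return first[a] * (joint.get((a, b), 0.0) / first[a])


pairs = list(product(first, second))
top_independent = max(pairs, key=lambda ab: independent_score(*ab))
top_causal = max(pairs, key=lambda ab: causal_score(*ab))

print("independent rank-1:", top_independent, "true prob:", joint.get(top_independent, 0.0))
print("causal rank-1:     ", top_causal, "true prob:", joint.get(top_causal, 0.0))
```

The independent scorer promotes "New Angeles" because each token is the most likely at its own position, even though the pair never occurs; conditioning depth 2 on depth 1 puts "New York" on top.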

Rank-1 branch faithfulness (50 MATH-500 prompts, no loss weighting)
                               Causal   Diffusion
Faithful rank-1 (<+5 nats)        42%          6%
Extreme gap (≥+80 nats)            0%         26%
Mean accepted length             9.46        4.84
Across 50 MATH-500 prompts the causal head keeps its top-ranked branch aligned with the target. Without loss weighting the diffusion surrogate is miscalibrated, accepting a mean 4.84 tokens per round against the causal head’s 9.46. The gap is the drafter’s log-probability difference, in nats (natural-log units), between its top-ranked branch and the target’s preferred continuation; a small gap means the tree leads with the branch the target will accept.
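One way to write the gap down, as a sketch of the definition in the caption. The symbols are notation introduced here rather than the paper's: q is the score the drafter assigns to a branch, b̂ is the drafter's top-ranked branch, and b* is the target's preferred continuation, assumed to be compared at the same depth.

```latex
\Delta \;=\; \log q(\hat{b}) \;-\; \log q(b^{\star}),
\qquad
\hat{b} \;=\; \arg\max_{b}\, q(b)
```

Under this reading Δ is non-negative by construction and equals zero exactly when the drafter's top branch is the continuation the target prefers; the natural logarithm gives the result in nats.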

No loss-weighting tuning

Bidirectional diffusion heads recover some of this with an exponential loss weighting, but only near a tuned setting: the diffusion head peaks at a single weighting and collapses at the extremes, while the causal head holds 8.3–8.5× across the whole range. The causal mask removes the need to tune any weighting at all.

MATH-500 speedup by loss weighting
                     γ=0    γ=3    γ=7    γ=15
Causal (JetSpec)    8.29   8.50   8.40    8.41
Diffusion head      5.46   8.16   8.36    6.17
MATH-500 end-to-end speedup (× over autoregressive decoding) by loss weighting γ. γ exponentially downweights draft positions far from each anchor token; the causal head holds across the whole range and needs no such tuning.

Serving

The engine

JetSpec runs on its own standalone inference engine, with no dependency on an external serving stack. It takes the minimalist philosophy of nano-vLLM as a starting point but is otherwise its own engine. The target verifies every speculative tree node in one forward pass under a tree-attention mask, and the acceptance rule is lossless by construction, preserving the target’s output exactly. The mean accepted length is therefore a property of the model and draft tree rather than the hardware, while the end-to-end speedup and throughput depend on the GPU’s compute and bandwidth. We implement paged FlashAttention kernels in both Triton and NVIDIA CuTe DSL that apply the tree mask directly inside the attention computation, without materializing a dense per-request mask.
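The tree mask itself is simple to state: each draft node may attend to itself and to its ancestors, never to siblings or cousins. The sketch below builds that relation as a dense boolean matrix purely for illustration; the function name and the parent-pointer encoding are assumptions, and the engine described above applies the mask inside the attention kernel without materializing it. Attention from every node to the already committed prefix is omitted.

```python
def tree_attention_mask(parents):
    """Build an ancestor-or-self mask for a draft tree.

    parents[i] is the index of node i's parent, or -1 for a node attached
    directly to the committed prefix. Parents must precede their children.
    Returns mask where mask[i][j] is True if node i may attend to node j.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, p in enumerate(parents):
        mask[i][i] = True
        if p >= 0:
            # Inherit everything the parent can see (its own ancestors).
            for j in range(n):
                if mask[p][j]:
                    mask[i][j] = True
    return mask


# Five-node tree: nodes 0 and 1 are depth-1 candidates, nodes 2 and 3 are
# children of node 0, and node 4 is a child of node 2.
parents = [-1, -1, 0, 0, 2]
mask = tree_attention_mask(parents)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Because every row encodes exactly one root-to-node path, a single forward pass under this mask scores each node as if its branch were the only sequence being decoded.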

up to 1456 tok/s

JetSpec on its own engine, with no external serving dependency, on a single B200 (Qwen3-8B, budget 127), generating 12 chained MATH-500 problems. Per-problem throughput rises and falls with acceptance: peaking at 1456 tok/s and sustaining ~1000 tok/s on average, at τ=9.29 accepted tokens per round. Real JetSpec e3-engine generation; per-round constant-time pacing reconstructed from the measured throughput.

JetSpec generating 12 chained MATH-500 solutions on a single B200, paced to the measured per-round throughput. The per-problem rate varies with acceptance, peaking at 1456 tok/s, averaging ~1000 tok/s, at τ=9.29 per round. Real engine generation (answer text + accepted tokens); per-token pacing reconstructed from the measured throughput, not a per-token wall-clock recording.

Approach

How JetSpec works

Head-based speculative decoding faces a causality-efficiency dilemma: an autoregressive draft head conditions each token on its path but pays a forward pass per depth, while a block-diffusion head drafts a whole block in one pass yet scores positions independently, composing branches the target rejects. JetSpec keeps the one-pass efficiency and recovers the conditioning. Over fused hidden states from the frozen target, deep features already shown to encode several future tokens, it trains a causal parallel draft head: a single forward pass emits a scored candidate tree whose branch scores follow the target’s own autoregressive factorization rather than an independent per-position surrogate. The frozen target then verifies the entire tree in one pass and commits the longest path it agrees with, leaving the target’s output distribution exactly unchanged.

One JetSpec round: fused target features feed the causal-parallel draft head, which emits a scored candidate tree; the frozen target verifies every node in one forward pass under the tree-causal attention mask and commits the longest matching path. Figure from the paper (causal parallel draft head).

Training the head

Only the head is trained; the target stays frozen, so JetSpec attaches to a model already in production without touching its weights. The head reads fused multi-layer features from the target and is supervised against the target’s own next-token distributions rather than hard labels, so it inherits the relative preferences among candidate tokens that a tree needs in order to rank branches. We train with a causal mask over selected anchor positions, with each anchor expanded into a 16-token draft block. Each draft position attends to preceding positions within its block, enabling a single parallel pass to produce an internally consistent draft tree.
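The within-block attention pattern can be sketched as a block-diagonal causal mask. This is an illustration under stated assumptions, not the training code: it assumes slot 0 of each block is the anchor, that draft positions do not attend across blocks, and it leaves out how the head sees the fused target features or the preceding context. The function name is hypothetical.

```python
def block_causal_mask(num_anchors, block_size):
    """Return mask[q][k] == True when query slot q may attend to key slot k.

    Slots are laid out block after block; within a block, each slot sees
    itself and the slots before it, and nothing outside its own block.
    """
    n = num_anchors * block_size
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        block_start = (q // block_size) * block_size
        for k in range(block_start, q + 1):
            mask[q][k] = True
    return mask


# Two anchors, each expanded into a 16-token block as in the text.
mask = block_causal_mask(num_anchors=2, block_size=16)
print(sum(mask[15]))  # last slot of block 0 sees all 16 slots of its block
print(sum(mask[16]))  # first slot of block 1 (its anchor) sees only itself
```

All blocks go through the head in the same forward pass, which is what lets one parallel pass supervise every draft position at once.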

Block-wise supervision. Within each sampled block the anchor token carries no loss; the draft positions it spawns are each supervised against the target, so the head learns to score whole branches rather than isolated next tokens. Figure from the paper (block-wise supervision).
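As a rough sketch of this supervision, the function below computes a soft-label cross-entropy over one block and skips the anchor slot. The exact objective (cross-entropy versus KL, any per-position weighting) is not specified here, so the loss form, the function name, and the toy two-token vocabulary are all assumptions.

```python
import math


def soft_label_block_loss(head_logits, target_probs):
    """Mean cross-entropy between the target's next-token distributions and
    the head's predictions over one block. Slot 0 (the anchor) carries no loss.

    head_logits:  list of per-slot logit vectors from the draft head.
    target_probs: list of per-slot probability vectors from the frozen target.
    """
    total = 0.0
    count = 0
    for slot, (logits, probs) in enumerate(zip(head_logits, target_probs)):
        if slot == 0:
            continue  # anchor token: excluded from supervision
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += -sum(p * (x - log_z) for p, x in zip(probs, logits))
        count += 1
    return total / count


# One anchor slot followed by two draft slots, vocabulary of size 2.
head_logits = [[5.0, -5.0], [0.0, 0.0], [0.0, 0.0]]
target_probs = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
loss = soft_label_block_loss(head_logits, target_probs)
print(loss)
```

Training against full distributions rather than one-hot labels is what carries the target's relative preferences among runner-up tokens into the head, which a tree needs in order to rank sibling branches.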

At budget 256 (greedy), JetSpec leads DDTree on every benchmark: MATH-500 9.64× vs 8.78×, GSM8K 7.82× vs 7.04×, HumanEval 7.12× vs 6.31×. The full grid:

Method   Budget  GSM8K         MATH-500      AIME25        HumanEval     MBPP          LCB           MT-Bench
                   Spd      τ    Spd      τ    Spd      τ    Spd      τ    Spd      τ    Spd      τ    Spd      τ
Temperature = 0 (greedy)
EAGLE-3      64   2.53   4.31   2.36   4.13   2.35   4.04   2.49   4.26   2.22   3.81   2.09   3.62   2.19   3.88
DDTree       64   5.63   6.18   6.51   7.16   6.40   6.96   5.08   5.57   4.99   5.49   5.47   6.06   3.74   4.51
DDTree      128   6.63   7.31   8.27   9.19   7.93   8.66   5.93   6.52   5.70   6.28   6.42   7.25   4.12   5.14
DDTree      256   7.04   7.77   8.78   9.81   8.33   9.24   6.31   6.96   6.09   6.70   6.75   7.72   4.26   5.41
JetSpec      64   5.98   6.56   6.76   7.42   6.47   7.00   5.53   6.06   5.34   5.88   5.95   6.59   3.97   4.77
JetSpec     128   7.34   8.05   8.93   9.95   8.26   9.10   6.66   7.28   6.31   6.95   7.29   8.21   4.37   5.52
JetSpec     256   7.82   8.62   9.64  10.76   8.78   9.82   7.12   7.78   6.73   7.43   7.67   8.79   4.58   5.94
Temperature = 1 (sampling)
EAGLE-3      64   2.35   4.17   2.23   4.00   2.11   3.73   2.35   4.13   2.13   3.70   1.99   3.49   1.96   3.63
DDTree       64   5.28   5.77   5.68   6.26   4.94   5.46   4.65   5.09   4.65   5.11   5.28   5.82   3.50   4.22
DDTree      128   6.11   6.72   6.79   7.60   5.40   5.98   5.29   5.76   5.22   5.75   5.99   6.71   3.66   4.59
DDTree      256   6.41   7.17   7.10   8.13   5.26   6.20   5.43   6.03   5.49   6.12   6.26   7.17   3.81   4.88
JetSpec      64   5.63   6.19   5.97   6.60   4.95   5.48   5.02   5.52   5.00   5.51   5.76   6.38   3.63   4.37
JetSpec     128   6.76   7.49   7.39   8.36   5.76   6.49   5.82   6.39   5.78   6.39   6.84   7.74   3.99   5.01
JetSpec     256   7.16   8.03   7.83   9.01   5.94   7.06   6.19   6.85   6.11   6.83   7.25   8.29   4.06   5.22
End-to-end speedup (Spd, × over autoregressive decoding) and mean accepted length (τ, tokens per round) on Qwen3-8B; budget is the draft-token count. EAGLE-3’s autoregressive head saturates past 64 draft tokens, while JetSpec leads every benchmark at every budget under both greedy (T=0) and sampling (T=1). Numbers reproduced from the paper’s main results table.

Co-optimizing drafting cost and quality through causality

A speculative decoder’s speedup is the product of two terms: how many drafted tokens the target accepts, and how cheaply those tokens were drafted. Most prior work improves one term at the expense of the other. Autoregressive draft heads such as Medusa and EAGLE-3 follow the target’s own factorization and accept well, but they draft one token at a time, so throughput stays bounded by sequential drafting. Parallel block-diffusion heads draft a whole block in a single pass and pay almost nothing per token, but their positions are scored independently, so deeper tree branches drift out of agreement with the target. Retrieval drafters skip a learned model entirely, at the cost of leaning on lexical overlap or repeated text. JetSpec keeps both terms at once: it drafts an entire tree in one parallel pass, at a diffusion head’s drafting cost, while conditioning every position on its branch prefix, with a causal head’s acceptance. That is why a larger draft budget keeps turning into a longer accepted prefix instead of saturating.
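The two terms can be separated with numbers already on this page. The sketch below assumes the simplified model speedup = τ / round cost, where round cost is one draft-and-verify round measured in plain target decoding steps; that model is an approximation introduced here, not a formula from the paper. The (speedup, τ) pairs are copied from the results table (MATH-500, greedy).

```python
# (speedup, tau) pairs from the results table: MATH-500, greedy decoding.
rows = {
    "EAGLE-3 (budget 64)": (2.36, 4.13),
    "DDTree (budget 256)": (8.78, 9.81),
    "JetSpec (budget 256)": (9.64, 10.76),
}


def implied_round_cost(speedup, tau):
    """Cost of one draft-and-verify round, in units of one plain target
    decoding step, under the ASSUMED model speedup = tau / round_cost."""
    return tau / speedup


costs = {name: implied_round_cost(s, t) for name, (s, t) in rows.items()}
for name, cost in costs.items():
    print(f"{name}: {cost:.2f} target-step equivalents per round")
```

Read this way, an EAGLE-3 round costs roughly 1.75 target steps while both one-pass heads sit near 1.12, so at equal budget JetSpec's margin over DDTree comes from the longer accepted prefix rather than from cheaper drafting.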

Cost and limitations

The speedups above are end-to-end and lossless, but not uniform. Sampling (T=1) margins are smaller than greedy decoding, and on the hardest reasoning sets such as AIME25 the lead over the strongest diffusion baseline narrows. Wall-clock serving speedups also run below the paper’s algorithmic speedups, because real serving adds kernel-launch and host overhead that the accepted-length ratio does not capture. The optimal draft budget is itself a trade-off: a larger tree raises accepted length but also the per-round verification cost, so the best operating point depends on the model and the hardware.

BibTeX

@misc{hu2026jetspec,
  title  = {JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting},
  author = {Hu, Lanxiang and Feng, Zhaoxiang and Wu, Yulun and Yuan, Haoran and Zhao, Yujie and
            Qian, Yu-Yang and Wang, Bojun and Zhao, Peng and Jiang, Daxin and Zhu, Yibo and Rosing, Tajana and Zhang, Hao},
  year   = {2026},
  eprint = {2606.18394},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url    = {https://arxiv.org/abs/2606.18394}
}

Preview build: the live replay is paced from measured B200 throughput (GSM8K); the results tables are the paper’s measured numbers. §3 tree is illustrative.