Four agents on real HMMT 2026 runs (GPT-5.4-medium).
Random sample speedup: Graph 1.92× · Chain 1.84× · Tree 1.82×.
Multi-agent reasoning systems adopt a generate-then-transfer paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency.
Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalise both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio.
Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026 with Claude Opus 4.6-high).
We further uncover a step-level scaling law: increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
When context arrives matters more than how much context arrives.
Serial: Agent² waits until Agent¹ finishes all steps, then reads the full chain, including the error-prone tail, and inherits its mistakes.
Stream: Agent² starts reasoning after step 1; by the time the bad tail arrives, it has formed its own trajectory and the tail's impact is diluted.
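The timing difference between the two protocols can be illustrated with a toy schedule. The sketch below is not the paper's implementation: it assumes every step takes one time unit, ignores prefill and transfer overhead, and assumes a streaming agent needs only the matching step of its upstream neighbour before it can proceed. The function name `makespan` is hypothetical.

```python
def makespan(num_agents: int, num_steps: int, streaming: bool) -> int:
    """Finish time of the last agent's last step in a chain (unit-time steps)."""
    # finish[a][s] = time at which agent a completes its step s (0-indexed).
    finish = [[0] * num_steps for _ in range(num_agents)]
    for a in range(num_agents):
        for s in range(num_steps):
            # An agent always needs its own previous step first.
            own_prev = finish[a][s - 1] if s > 0 else 0
            if a == 0:
                upstream = 0  # the first agent has no upstream dependency
            elif streaming:
                # Stream (assumed rule): wait only for the upstream agent's step s.
                upstream = finish[a - 1][s]
            else:
                # Serial: wait for the upstream agent's entire chain.
                upstream = finish[a - 1][num_steps - 1]
            finish[a][s] = max(own_prev, upstream) + 1
    return finish[-1][-1]


serial_time = makespan(4, 4, streaming=False)  # A * S time units
stream_time = makespan(4, 4, streaming=True)   # S + A - 1 time units
print(serial_time, stream_time)
```

Under these assumptions the serial chain takes A·S time units while the streamed chain takes S + A − 1, which is why latency stops scaling linearly with pipeline depth.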
The first joint analysis of Stream, Serial and Single protocols.
One ordering for effectiveness, one upper bound for speed, one ratio for cost.
Six regimes; predicts when Stream / Serial / Single wins.
Depending on how $\bar{p}$, $p_{\mathrm{head}}$, $p_{\mathrm{tail}}$ compare to $p^*$, the sCorr ordering falls into six cases:
Closed-form speedup upper bound; 26.9× measured at A=S=64 (83% of the theoretical limit).
When decode ≫ prefill ($r_{v_{dp}}\!\to\!0$), the bound reduces to $\mathrm{Speedup} \le \tfrac{AS}{S+A-1}$.
S=64, A=64 → 32.3× theoretical, 26.9× measured (83%).
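The figures quoted above follow directly from the decode-dominated bound. The short check below is plain arithmetic on that formula; the function name `speedup_upper_bound` is hypothetical, and the measured value is copied from the text rather than reproduced.

```python
def speedup_upper_bound(num_agents: int, num_steps: int) -> float:
    """Decode-dominated bound AS / (S + A - 1) from the text above."""
    return num_agents * num_steps / (num_steps + num_agents - 1)


bound = speedup_upper_bound(64, 64)  # 4096 / 127
measured = 26.9                      # measured speedup reported in the text
print(f"{bound:.1f}x bound, {measured / bound:.0%} attained")
```

For A = S the bound is roughly A/2, so further gains require raising both the agent count and the step count together.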
Exact cost formula; Stream saves ≈7.5% at ρ=1.
Claude Opus 4.6 pricing: $5 / $25 / $0.5 per MTok (input / output / cache), A=S=4.
Bound = 0.925ρ with full KV-cache — saves ≈7.5% even at ρ=1.
Same upstream output, two failure modes: corrupting the tail leaves Stream untouched; corrupting the head trips it.
The asymmetry is exactly what Theorem 1 predicts.
| Model | Topo | Method | AIME25 | AIME26 | HMMT26 | GPQA-D | HLE | LCB-G | LCB-E | LCB-T | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (high) | — | Single | 67.50 | 60.00 | 48.11 | 83.67 | 18.60 | 90.25 | 77.94 | 84.31 | 66.30 |
| | Chain | Serial | 80.42 | 72.08 | 63.26 | 85.86 | 23.90 | 91.33 | 78.64 | 92.38 | 73.48 |
| | | StreamMA | 92.50 | 89.58 | 85.61 | 87.37 | 26.97 | 91.50 | 84.41 | 95.63 | 81.70 |
| | Tree | Serial | 86.25 | 86.25 | 75.00 | 85.18 | 24.82 | 91.92 | 88.45 | 97.59 | 79.43 |
| | | StreamMA | 93.34 | 87.92 | 82.20 | 85.86 | 25.07 | 94.00 | 94.57 | 99.55 | 82.81 |
| | Graph | Serial | 77.92 | 71.67 | 61.75 | 85.69 | 22.17 | 90.08 | 75.78 | 98.27 | 72.92 |
| | | StreamMA | 95.83 | 87.92 | 82.58 | 86.53 | 27.68 | 92.17 | 95.27 | 98.72 | 83.34 |
| GPT-5.4 (medium) | — | Single | 55.83 | 71.25 | 40.53 | 77.95 | 12.08 | 91.08 | 92.48 | 96.68 | 67.24 |
| | Chain | Serial | 60.00 | 70.42 | 54.55 | 75.08 | 14.66 | 90.08 | 97.43 | 99.02 | 70.16 |
| | | StreamMA | 61.25 | 72.50 | 59.10 | 80.30 | 14.94 | 91.17 | 99.30 | 99.47 | 72.25 |
| | Tree | Serial | 59.17 | 75.83 | 56.07 | 76.77 | 14.83 | 88.33 | 93.81 | 99.25 | 70.51 |
| | | StreamMA | 62.08 | 75.83 | 58.34 | 78.12 | 15.74 | 89.50 | 94.78 | 99.17 | 71.70 |
| | Graph | Serial | 60.00 | 74.17 | 52.65 | 78.45 | 14.04 | 92.25 | 99.51 | 98.80 | 71.13 |
| | | StreamMA | 62.50 | 75.42 | 56.44 | 79.63 | 16.13 | 93.08 | 99.79 | 99.32 | 72.32 |
StreamMA rows shaded; numbers reproduce Tab. 1 of the paper.
At fixed agent count $A$, simply asking each agent to think in more, finer steps $S$ improves both speed and accuracy.
Fully composable with agent-count scaling.
GPT-5.4-medium, A=64: the S=auto baseline scores 68.2%; S=64 lifts this to 73.5% with a 26.9× speedup.
Three-agent chain on HMMT 2026 with majority voting over $N\!\in\!\{1,4,16\}$ replicas. Claude Opus 4.6 pricing.
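Majority voting over replicas can be sketched as follows. This is a generic illustration, not the paper's evaluation code: it assumes each replica returns its final answer as an already-normalised string, and it breaks ties in favour of the answer seen first. The function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among N replica outputs."""
    if not answers:
        raise ValueError("need at least one replica answer")
    counts = Counter(answers)
    # most_common orders equal counts by first appearance (assumed tie-break).
    return counts.most_common(1)[0][0]


# Example with N = 4 replicas of the same chain (answers are made up).
replica_answers = ["42", "17", "42", "42"]
print(majority_vote(replica_answers))
```

With N = 1 the vote reduces to the single replica's answer, so the same function covers every setting in $N\!\in\!\{1,4,16\}$.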
@article{yang2026streamma,
title={Streaming Communication in Multi-Agent Reasoning},
author={Yang, Zhen and Xu, Xiaogang and Wang, Wen and Chen, Cong and Xu, Xander and Chen, Ying-Cong},
journal={arXiv preprint arXiv:2606.05158},
year={2026}
}