StreamMA

Streaming Communication in Multi-Agent Reasoning

Zhen Yang¹ · Xiaogang Xu³ · Wen Wang³ · Cong Chen³ · Xander Xu²* · Ying-Cong Chen¹,⁴*

¹HKUST(GZ) · ²Alibaba Group · ³ZJU · ⁴HKUST

*Co-corresponding authors

+7.3 pp average accuracy gain: 8 benchmarks · Claude Opus 4.6 · peak +22.4 pp on HMMT 2026
26.9× wall-clock speedup: A=64, S=64 · 83% of theoretical bound
½ cost, Stream×4 beats Serial×16: $2.75 vs $5.46 · higher accuracy at half the price

🎯 Key Contributions

📡 Streaming protocol: step-level forwarding replaces waiting for full responses, giving lower latency and higher accuracy.
📐 Three closed-form theorems: effectiveness ordering, speedup upper bound, and cost ratio for Stream / Serial / Single.
🚀 Step-level scaling law: a new orthogonal dimension, where more steps per agent yield better accuracy and higher speedup.

Stream vs Serial — see the pipeline in action

Four agents on real HMMT 2026 runs (GPT-5.4-medium).

Random sample speedup: Graph 1.92× · Chain 1.84× · Tree 1.82×.

Abstract

Multi-agent reasoning systems adopt a generate-then-transfer paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency.

Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalise both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio.

Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026 with Claude Opus 4.6-high).

We further uncover a step-level scaling law: increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.

The counter-intuitive finding

When context arrives matters more than how much context arrives.

Serial — sees the whole upstream output

Agent² waits until Agent¹ finishes all steps, then reads the full chain — including the error-prone tail — and inherits its mistakes.

Agent¹: all steps at once → Agent²: ⏳ blocked, then reads all → ✗
Agent² answer: ✗ wrong

Stream — sees the reliable head first

Agent² starts reasoning after step 1; by the time the bad tail arrives, it has formed its own trajectory and the tail's impact is diluted.

Agent¹: step 1 forwarded → Agent²: all 4 steps received, but early momentum means errors can't derail the trajectory
Agent² answer: ✓ correct
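
The timing side of this contrast can be illustrated with a toy schedule. The sketch below is not the paper's implementation: it assumes every reasoning step takes one time unit, that forwarding a step is free, and that a downstream agent may start as soon as its upstream neighbour has emitted its first step. The function names are hypothetical.

```python
def serial_finish_time(num_agents: int, steps: int) -> int:
    """Serial: each agent waits for the full upstream output before starting."""
    t = 0
    for _ in range(num_agents):
        t += steps  # the agent runs all of its steps only after the previous agent ends
    return t


def stream_finish_time(num_agents: int, steps: int) -> int:
    """Stream: each agent starts once its upstream neighbour has emitted step 1."""
    start = 0
    finish = 0
    for _ in range(num_agents):
        finish = start + steps  # this agent still needs `steps` time units
        start += 1              # the next agent starts one step later (assumed)
    return finish


if __name__ == "__main__":
    print(serial_finish_time(4, 4))  # 16
    print(stream_finish_time(4, 4))  # 7
```

Under these idealised assumptions the streamed pipeline finishes after S + A − 1 time units instead of A · S, which is the same pipelining pattern that appears in the decode-dominated speedup bound.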

Three closed-form theorems

The first joint analysis of Stream, Serial and Single protocols.
One ordering for effectiveness, one upper bound for speed, one ratio for cost.

Six regimes; predicts when Stream / Serial / Single wins.

Depending on how $\bar{p}$, $p_{\mathrm{head}}$, $p_{\mathrm{tail}}$ compare to $p^*$, the sCorr ordering falls into six cases:

[Figure: six panels, one per regime (I.a, I.b, II.a, II.b, III.a, III.b), each marking the threshold $p^*$.]
I.a — Stream advantage (error accumulation). $\mathrm{sCorr}^{\mathrm{stream}} > \mathrm{sCorr}^{\mathrm{serial}} > \mathrm{sCorr}^{\mathrm{single}}$

Closed-form speedup upper bound; 26.9× measured at A=S=64 (83% of the theoretical limit).

$\displaystyle \mathrm{Speedup} \le \frac{A\bigl[(S+r_{po})\,r_{v_{dp}} + S\bigr]}{(S+A-1)(1 + \alpha\,r_{v_{dp}} + \beta\,r_{v_{dc}})}$

When decode ≫ prefill ($r_{v_{dp}}\!\to\!0$), the bound reduces to $\mathrm{Speedup} \le \tfrac{AS}{S+A-1}$.
For S=64, A=64 this gives 32.3× theoretical versus 26.9× measured (83%).
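
The quoted numbers can be checked with a few lines of arithmetic. The sketch below transcribes the bound as written above; the symbols $r_{po}$, $r_{v_{dp}}$, $r_{v_{dc}}$, $\alpha$ and $\beta$ are not defined on this page, so they are treated as opaque parameters defaulting to zero, and the function names are hypothetical.

```python
def speedup_bound(A, S, r_po=0.0, r_v_dp=0.0, r_v_dc=0.0, alpha=0.0, beta=0.0):
    """Upper bound on Stream-over-Serial speedup, transcribed from the formula above."""
    numerator = A * ((S + r_po) * r_v_dp + S)
    denominator = (S + A - 1) * (1 + alpha * r_v_dp + beta * r_v_dc)
    return numerator / denominator


def simplified_bound(A, S):
    """Decode-dominated limit: A*S / (S + A - 1)."""
    return A * S / (S + A - 1)


if __name__ == "__main__":
    bound = simplified_bound(64, 64)   # 4096 / 127
    print(round(bound, 2))             # 32.25
    print(round(26.9 / bound, 2))      # 0.83
```

With every ratio term left at zero, `speedup_bound` coincides with `simplified_bound`, so the 32.3× figure and the 83% attainment follow directly from A = S = 64.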

Exact cost formula; Stream saves ≈7.5% at ρ=1.

$\displaystyle \frac{C^{\mathrm{stream}}}{C^{\mathrm{serial}}} = \rho \cdot \frac{r_{c_{pd}}\,(\alpha + r_{c_{cp}}\,\beta) + 1}{r_{c_{pd}}\,(1 + r_{po}/S) + 1}$

Claude Opus 4.6 pricing: $5 / $25 / $0.5 per MTok (input / output / cache), A=S=4.
Bound = 0.925ρ with full KV-cache — saves ≈7.5% even at ρ=1.
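
For readers who want to plug in their own prices, the ratio above can be transcribed as a small function. This is a sketch under stated assumptions: the price ratios are read here as input/output ($r_{c_{pd}}$ = 5/25) and cache/input ($r_{c_{cp}}$ = 0.5/5), which is an interpretation of the subscripts rather than something this page states, and the values of $\alpha$, $\beta$ and $r_{po}$ behind the 0.925ρ figure are not given here, so the example uses placeholder values and does not attempt to reproduce that number.

```python
def cost_ratio(rho, r_c_pd, r_c_cp, alpha, beta, r_po, S):
    """C_stream / C_serial, transcribed term by term from the formula above."""
    numerator = r_c_pd * (alpha + r_c_cp * beta) + 1
    denominator = r_c_pd * (1 + r_po / S) + 1
    return rho * numerator / denominator


if __name__ == "__main__":
    r_c_pd = 5 / 25    # assumed reading: input price / output price
    r_c_cp = 0.5 / 5   # assumed reading: cache price / input price
    # Placeholder parameters (not from the paper): alpha=1, beta=0, r_po=0
    # make the fraction exactly 1, so the ratio collapses to rho.
    print(cost_ratio(1.0, r_c_pd, r_c_cp, alpha=1.0, beta=0.0, r_po=0.0, S=4))
```

The placeholder case is only a sanity check that the transcription behaves sensibly; substituting the paper's actual parameter values is left to the reader.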

Results · Part 1

Effectiveness — Stream is more accurate

Step-level perturbation — head/tail asymmetry

Same upstream output, two failure modes: corrupting the tail leaves Stream untouched; corrupting the head trips it.
The asymmetry is exactly what Theorem 1 predicts.

Tail-perturbed → Stream up to +24.0 pp · Head-perturbed → Stream down to −36.0 pp

Main results — eight benchmarks, two LLMs, three topologies

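| Model | Topo | Method | AIME25 | AIME26 | HMMT26 | GPQA-D | HLE | LCB-G | LCB-E | LCB-T | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 (high) | | Single | 67.50 | 60.00 | 48.11 | 83.67 | 18.60 | 90.25 | 77.94 | 84.31 | 66.30 |
| | Chain | Serial | 80.42 | 72.08 | 63.26 | 85.86 | 23.90 | 91.33 | 78.64 | 92.38 | 73.48 |
| | Chain | StreamMA | 92.50 | 89.58 | 85.61 | 87.37 | 26.97 | 91.50 | 84.41 | 95.63 | 81.70 |
| | Tree | Serial | 86.25 | 86.25 | 75.00 | 85.18 | 24.82 | 91.92 | 88.45 | 97.59 | 79.43 |
| | Tree | StreamMA | 93.34 | 87.92 | 82.20 | 85.86 | 25.07 | 94.00 | 94.57 | 99.55 | 82.81 |
| | Graph | Serial | 77.92 | 71.67 | 61.75 | 85.69 | 22.17 | 90.08 | 75.78 | 98.27 | 72.92 |
| | Graph | StreamMA | 95.83 | 87.92 | 82.58 | 86.53 | 27.68 | 92.17 | 95.27 | 98.72 | 83.34 |
| GPT-5.4 (medium) | | Single | 55.83 | 71.25 | 40.53 | 77.95 | 12.08 | 91.08 | 92.48 | 96.68 | 67.24 |
| | Chain | Serial | 60.00 | 70.42 | 54.55 | 75.08 | 14.66 | 90.08 | 97.43 | 99.02 | 70.16 |
| | Chain | StreamMA | 61.25 | 72.50 | 59.10 | 80.30 | 14.94 | 91.17 | 99.30 | 99.47 | 72.25 |
| | Tree | Serial | 59.17 | 75.83 | 56.07 | 76.77 | 14.83 | 88.33 | 93.81 | 99.25 | 70.51 |
| | Tree | StreamMA | 62.08 | 75.83 | 58.34 | 78.12 | 15.74 | 89.50 | 94.78 | 99.17 | 71.70 |
| | Graph | Serial | 60.00 | 74.17 | 52.65 | 78.45 | 14.04 | 92.25 | 99.51 | 98.80 | 71.13 |
| | Graph | StreamMA | 62.50 | 75.42 | 56.44 | 79.63 | 16.13 | 93.08 | 99.79 | 99.32 | 72.32 |

Numbers reproduce Tab. 1 of the paper.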
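
One way to arrive at the headline figures from this table is sketched below. It assumes the "+7.3 pp" average is the mean, over the three topologies, of StreamMA minus Serial on the Avg. column for Claude Opus 4.6 (high), and that the peak is the Chain gap on HMMT26; the paper may define these aggregates differently.

```python
# Avg. column for Claude Opus 4.6 (high), copied from the table above.
serial_avg = {"Chain": 73.48, "Tree": 79.43, "Graph": 72.92}
stream_avg = {"Chain": 81.70, "Tree": 82.81, "Graph": 83.34}

# Per-topology gain of StreamMA over Serial, in percentage points.
gains = {topo: stream_avg[topo] - serial_avg[topo] for topo in serial_avg}
mean_gain = sum(gains.values()) / len(gains)
print(round(mean_gain, 1))  # 7.3

# HMMT26, Chain topology: StreamMA 85.61 vs Serial 63.26.
peak = 85.61 - 63.26
print(f"{peak:.2f}")  # 22.35, quoted on this page as +22.4 pp
```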

Results · Part 2

Efficiency — Stream is faster & cheaper

Step-level scaling law — a new orthogonal dimension

At fixed agent count $A$, simply asking each agent to think in more, finer steps $S$ improves both speed and accuracy.
Fully composable with agent-count scaling.
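
The speed half of this claim is visible in the decode-dominated bound $\tfrac{AS}{S+A-1}$ from the speedup theorem. The sketch below sweeps $S$ at a fixed $A$; it covers only the theoretical ceiling, not measured speedups or accuracy, and the chosen values of $S$ are illustrative.

```python
def simplified_bound(A: int, S: int) -> float:
    """Decode-dominated speedup ceiling: A*S / (S + A - 1)."""
    return A * S / (S + A - 1)


A = 64
step_counts = (1, 4, 16, 64, 256)  # illustrative sweep, not the paper's grid
bounds = [simplified_bound(A, S) for S in step_counts]

for S, b in zip(step_counts, bounds):
    print(f"S={S:>3}  bound={b:.2f}")
```

At S = 1 the bound is exactly 1 (no pipelining is possible), and it grows monotonically towards A as S increases, which is consistent with finer steps yielding higher speedup.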

Speedup vs steps S (log-log)

Accuracy heatmap on HMMT 2026

GPT-5.4-medium, A=64: the S=auto baseline scores 68.2%; S=64 lifts accuracy to 73.5% with a 26.9× speedup.

Cost–accuracy Pareto — Stream strictly dominates

Three-agent chain on HMMT 2026 with majority voting over $N\!\in\!\{1,4,16\}$ replicas. Claude Opus 4.6 pricing.

Stream×4 ($2.75, 90.9%) beats Serial×16 ($5.46, 89.4%) — half the cost, higher accuracy. With KV-cache hits, the same 90.9% drops to $1.61.
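
The page does not spell out how replica answers are aggregated beyond "majority voting", so the sketch below assumes a plain plurality vote over final answers, with ties going to the answer seen first. The sample answers are made up for illustration; the dollar figures are the ones quoted above.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical final answers from N=4 replicas.
    print(majority_vote(["42", "17", "42", "42"]))  # 42

    # Cost comparison using the figures quoted above.
    stream_x4, serial_x16, stream_x4_cached = 2.75, 5.46, 1.61
    print(round(stream_x4 / serial_x16, 2))         # 0.5
    print(round(stream_x4_cached / serial_x16, 2))  # 0.29
```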

BibTeX

@article{yang2026streamma,
  title={Streaming Communication in Multi-Agent Reasoning},
  author={Yang, Zhen and Xu, Xiaogang and Wang, Wen and Chen, Cong and Xu, Xander and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2606.05158},
  year={2026}
}