Agent Vector Protocol: A Binary Protocol for Latent LLM Agent Communication
Multi-agent LLM pipelines have a serialization bottleneck. On a typical math problem, Agent A spends 15.6 seconds generating 387 tokens of chain-of-thought reasoning (Qwen 7B, A100). All that computation, the KV-cache built across every layer, gets discarded. Agent B receives the text, tokenizes it, and builds its own KV-cache from scratch. The 15.6 seconds of autoregressive decode is the cost; the 0.2 seconds of re-encoding is not the problem. The problem is that Agent A had to decode at all.
Agent Vector Protocol (AVP) eliminates the text serialization step. Instead of decoding to tokens and re-encoding, AVP transfers the KV-cache directly (same-model) or a projected hidden state (cross-model) between agents. It is transport-agnostic and works alongside orchestration frameworks like LangGraph, CrewAI, A2A, and MCP. AVP builds on LatentMAS and extends it with cross-model vocabulary-mediated projection. Open source: GitHub, PyPI.
What happens to the pipeline
End-to-end breakdown on a GSM8K math problem, Qwen 7B, A100:
```
TEXT PIPELINE                            LATENT PIPELINE
------------------------------           ------------------------------
1. Agent A: decode 387 tokens   15.6s    1. Agent A: 20 latent steps     0.9s
2. Agent B: prefill 387 ctx      0.2s    2. Agent B: prefill 0 ctx       0.0s
3. Agent B: prefill own prompt   0.2s    3. Agent B: prefill own prompt  0.2s
4. Agent B: decode answer        5.5s    4. Agent B: decode answer       5.5s
------------------------------           ------------------------------
Total                          ~21.5s    Total                          ~6.6s

Pipeline speedup: ~3.3x
```
Agent A’s contribution drops from 15.6s to 0.9s (~17x). But Agent B still decodes its own answer (~5.5s), which is irreducible in both modes. Amdahl’s Law gives you 2-3x on the pipeline, not 17x. The more verbose Agent A is, the bigger the win.
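The Amdahl-style bound can be checked with a quick calculation. The timings below are the GSM8K stage numbers from the breakdown above; nothing else is assumed:

```python
# Pipeline stages (seconds), from the GSM8K breakdown.
text_pipeline = {"A_decode": 15.6, "B_prefill_ctx": 0.2,
                 "B_prefill_prompt": 0.2, "B_decode": 5.5}
latent_pipeline = {"A_latent": 0.9, "B_prefill_ctx": 0.0,
                   "B_prefill_prompt": 0.2, "B_decode": 5.5}

text_total = sum(text_pipeline.values())      # ~21.5s
latent_total = sum(latent_pipeline.values())  # ~6.6s

# Agent A's stage alone speeds up ~17x, but Agent B's 5.5s decode
# is unchanged in both modes, so the end-to-end gain is only ~3.3x.
stage_speedup = text_pipeline["A_decode"] / latent_pipeline["A_latent"]
pipeline_speedup = text_total / latent_total

print(f"stage {stage_speedup:.1f}x, pipeline {pipeline_speedup:.1f}x")
```

The fixed 5.5s decode is exactly Amdahl's serial fraction: no matter how fast Agent A gets, the pipeline speedup saturates near 21.5 / 5.7 ≈ 3.8x.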
Benchmark results
All benchmarks: latent_steps=20, seed=42, temperature=0.7, NVIDIA A100. Full reproduction instructions in the benchmark docs.
Same-model (Qwen 2.5 7B Instruct)
| Benchmark | n | Direct | Latent (AVP) | Text Chain |
|---|---|---|---|---|
| HumanEval | 164 | 58.5% | 67.1% | 53.0% |
| GSM8K | 200 | 91.0% | 90.5% | 87.0% |
| DebugBench | 100 | 50.0% | 51.0% | 49.0% |
| MATH | 500 | 67.8% | 66.8% | 66.6% |
| HotpotQA | 200 | 51.5% | 52.5% | 50.5% |
Direct = single model, no pipeline. Latent (AVP) = Agent A runs latent steps (forward passes without text output), Agent B generates from that computation. Text Chain = Agent A generates text, Agent B reads it.
HumanEval latent hits 67.1% vs 53.0% for text chain. McNemar’s test on per-problem results: p=0.004 (n=164, single seed). The result holds on Llama 3B (54.3% vs 44.5%, single seed) and across 4 additional seeds on Qwen 7B at T=0.01: 70.0% +/- 0.3% (latent) vs 57.6% +/- 0.3% (text).
The likely mechanism: text-based chain-of-thought reviewers actively degrade correct code by second-guessing it. The text chain (53.0%) scores below the single-agent baseline (58.5%). Latent transfer avoids this because the solver receives computation, not editable text. One open question: latent-primed solutions are longer on average (109 vs 71 tokens), so the gain may partly come from output length rather than reasoning quality.
Everything else is neutral. GSM8K, MATH, DebugBench, HotpotQA show no statistically significant difference between latent and text (GSM8K p=0.121, all others p>0.8). The value on those benchmarks is efficiency: same accuracy, 2-3x faster pipeline (DebugBench: 7.6s vs 22.8s per problem, HotpotQA: 5.8x).
The efficiency gap widens with more agents. In a text pipeline, each agent reads all previous agents’ output, so token cost grows O(n^2). Latent transfer stays O(n) because each agent receives a fixed-size payload (KV-cache or projected hidden state), not an accumulating text history. A 4-agent GSM8K chain saves 73-78% of tokens.
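The growth rates are easy to see in a toy cost model. This is a sketch of the scaling argument only, not the benchmark's exact token accounting (which reports 73-78% savings for 4 agents); the per-agent output size and latent step count are taken from the GSM8K example above:

```python
def text_tokens(n_agents: int, t: int = 387) -> int:
    # Agent i reads i*t tokens of accumulated history, then emits t of its own:
    # total = t * n*(n+1)/2, i.e. O(n^2).
    return sum(i * t + t for i in range(n_agents))

def latent_tokens(n_agents: int, t: int = 387, steps: int = 20) -> int:
    # Intermediate agents run a fixed number of latent steps and hand off a
    # fixed-size payload; only the final agent decodes text. O(n).
    return steps * (n_agents - 1) + t

for n in (2, 4, 8):
    print(n, text_tokens(n), latent_tokens(n))
```

Doubling the chain length roughly doubles the latent cost but more than triples the text cost, which is why the gap keeps widening.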
Cross-model (experimental, ~6 KB wire)
Cross-model transfer uses vocabulary-mediated projection, referred to as “Rosetta” below. A single projected hidden state (~6 KB) replaces the full text handoff.
| Source -> Target | GSM8K (Rosetta / Text) | HumanEval (Rosetta / Text) |
|---|---|---|
| Qwen 7B -> Llama 3B | 77.0% / 86.5% | 47.0% / 57.9% |
| Llama 3B -> Qwen 7B | 90.0% / 82.0% | 79.3% / 61.6% |
Direction matters. When the stronger model is the solver (Llama 3B thinks, Qwen 7B solves), Rosetta beats text on both benchmarks. When the weaker model solves, text wins on math. These are single-seed, preliminary results. Cross-model accuracy is bounded by the target model’s capability.
How it works
Same-model KV-cache transfer
Agent A “thinks”: it runs latent forward passes where each step’s output feeds back as input, building KV-cache entries. No autoregressive text decode happens. Agent B receives that KV-cache and generates normally.
```python
from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Agent A thinks (builds KV-cache, no text output)
context = connector.think("Analyze this math problem: 24 * 17 + 3", steps=20)

# Agent B generates using Agent A's KV-cache
answer = connector.generate("Solve step by step: 24 * 17 + 3", context=context)
```
Think and generate support different prompts. Agent A can be a “researcher” and Agent B a “solver.” Same KV-cache format, zero conversion.
Cross-model via vocabulary-mediated projection
Qwen and Llama share about 85% of their BPE vocabulary. Tokens like return, function, +=, and most English words have identical byte-level representations in both tokenizers. AVP exploits this: project the source model’s last hidden state through its lm_head to get a softmax probability distribution over tokens, restrict to the shared vocabulary, renormalize, then take a weighted sum of the target model’s input embeddings. Zero learned parameters. The projection code is about 100 lines.
```python
from avp import HuggingFaceConnector

researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

context = researcher.think("Analyze: 24 * 17 + 3", steps=20)
answer = solver.generate(
    "Solve: 24 * 17 + 3",
    context=context, source=researcher, cross_model=True,
)
```
The wire payload is a single projected hidden state vector, about 6 KB. Cross-model is experimental and opt-in (cross_model=True).
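The projection itself can be sketched in a few lines of numpy. Toy dimensions and random weights stand in for the real `lm_head` and embedding matrices, and the shared-vocabulary index map is assumed to be precomputed offline by byte-level token matching; only the three-step recipe (softmax, restrict-and-renormalize, weighted embedding sum) is from the description above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Toy stand-ins (real models: hidden dims in the thousands, vocabs ~128k-150k).
src_hidden_dim, src_vocab = 16, 100
tgt_embed_dim, tgt_vocab = 12, 90
src_lm_head = rng.standard_normal((src_vocab, src_hidden_dim))
tgt_embeddings = rng.standard_normal((tgt_vocab, tgt_embed_dim))

# Hypothetical shared-vocab map: source token id -> target token id, built
# offline by matching byte-level token strings (~85% overlap for Qwen/Llama).
shared_src_ids = np.arange(80)
shared_tgt_ids = np.arange(80)

def project(src_hidden: np.ndarray) -> np.ndarray:
    # 1. Source hidden state -> probability distribution over source vocab.
    probs = softmax(src_lm_head @ src_hidden)
    # 2. Restrict to the shared vocabulary and renormalize.
    shared = probs[shared_src_ids]
    shared /= shared.sum()
    # 3. Weighted sum of target input embeddings = injectable target vector.
    return shared @ tgt_embeddings[shared_tgt_ids]

v = project(rng.standard_normal(src_hidden_dim))
print(v.shape)  # one target-space embedding vector, zero learned parameters
```

The output is a single vector in the target model's embedding space, which is why the wire payload stays around 6 KB regardless of how long the source model "thought".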
Where AVP fits in the LLM inference stack
AVP is not an agent framework or orchestration layer. It replaces the text serialization between agents:
```
WITHOUT AVP                          WITH AVP

Orchestration                        Orchestration
      │                                    │
      ▼                                    ▼
   Agent A                              Agent A
      │                                    │
      │ text (387 tokens, 15.6s)           │ computation (0.9s)
      ▼                                    ▼
   Agent B                              Agent B
      │                                    │
      ▼                                    ▼
Inference Engine                     Inference Engine
```
Transport-agnostic. The spec defines the binary format, handshake, and codec, not the transport. The reference implementation uses HTTP/2, but AVP messages can ride on A2A, MCP, gRPC, or anything that carries binary payloads.
Your framework never touches the KV-cache. It passes text prompts in, gets text answers back. AVP works as a sidecar; the latent transfer is invisible to the orchestration layer. See the framework integration guide for LangGraph/CrewAI examples.
How this differs from prefix caching
Prefix caching (vLLM, SGLang) reuses KV-cache for identical prompt prefixes across requests to the same model instance. AVP transfers computation (KV-cache for same-model, projected hidden states for cross-model) between different agents with different prompts. The 20 latent thinking steps have no text equivalent, so there is nothing to prefix-cache. In production you’d want both: prefix caching for shared system prompts within each agent, AVP for the inter-agent handoff.
The handshake auto-negotiates three modes: same model gets lossless KV-cache, different models with compatible tokenizers get vocabulary-mediated projection (~6 KB), and incompatible models fall back to JSON.
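That negotiation logic amounts to a small decision tree. A hypothetical sketch (names like `Mode` and `negotiate`, and the overlap threshold, are illustrative, not the SDK's API; the real handshake is defined in the protocol spec):

```python
from enum import Enum

class Mode(Enum):
    KV_CACHE = "kv_cache"      # same model: lossless KV-cache transfer
    PROJECTION = "projection"  # compatible tokenizers: ~6 KB projected vector
    JSON_TEXT = "json_text"    # fallback: plain text payload

def negotiate(src_model: str, tgt_model: str, vocab_overlap: float,
              min_overlap: float = 0.80) -> Mode:
    """Pick the richest transfer mode both ends support.

    `min_overlap` is an assumed threshold; the spec's actual criterion
    for 'compatible tokenizers' may differ.
    """
    if src_model == tgt_model:
        return Mode.KV_CACHE
    if vocab_overlap >= min_overlap:
        return Mode.PROJECTION
    return Mode.JSON_TEXT

print(negotiate("qwen2.5-7b", "qwen2.5-7b", 1.0))     # Mode.KV_CACHE
print(negotiate("qwen2.5-7b", "llama-3.2-3b", 0.85))  # Mode.PROJECTION
print(negotiate("qwen2.5-7b", "other-model", 0.40))   # Mode.JSON_TEXT
```

The point of auto-negotiation is that callers never choose a mode explicitly; the handshake degrades gracefully from lossless to lossy to plain text.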
Where it breaks and what it costs
Cross-model comprehension fails. HotpotQA cross-model gets 7.5% exact match (15/200). One hidden state vector can carry the “gist” of structured math or code reasoning, but not paragraph-level facts. We tested multi-embedding rollout, discrete token transfer, and multi-position injection. None beat single-vector. The bottleneck is inputs_embeds injection itself: models trained on discrete token embeddings don’t treat continuous injected vectors the same way for factual recall.
Latent steps have a ceiling. 10 and 20 produce equivalent accuracy. 40 degrades. 80 produces degenerate output. Noise accumulates in the recurrent loop.
Solver ceiling. Cross-model accuracy is bounded by the target model’s own capability. A 7B researcher doesn’t make a 3B solver smarter.
Self-hosted only. AVP requires direct KV-cache access. Not compatible with cloud APIs (OpenAI, Anthropic, Google), llama.cpp, or Ollama. Currently requires HuggingFace Transformers with GPU access. vLLM latent support is in progress.
Not production-hardened. The SDK has 476 tests and is on PyPI, but the full latent path runs on HuggingFace Transformers (no batching, no production serving). The path to production is through vLLM engine integration, which is under active development.
Cost: 20 latent steps = 0.9s fixed cost (7B model, A100, single request). Each step takes about 45ms, comparable to generating one text token. KV-cache is ~390 MB for a 7B model, less than 0.5% of an A100’s memory. Breakeven vs text: if Agent A would normally generate 22 or more tokens, latent is faster.
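The breakeven figure follows directly from the per-step timings above, as a back-of-envelope check:

```python
latent_steps = 20
step_ms = 45                             # one latent forward pass
latent_cost_ms = latent_steps * step_ms  # 900 ms fixed cost

# One text token costs about the same as one latent step on this setup:
# 15.6 s of decode for 387 tokens ≈ 40 ms/token.
token_ms = 15.6 * 1000 / 387

breakeven_tokens = latent_cost_ms / token_ms
print(f"{breakeven_tokens:.0f}")  # if Agent A would emit fewer tokens, text wins
```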
Prior work
LatentMAS (Guo et al., 2025) demonstrated same-model latent communication via hidden state transfer between agents. AVP packages this as a protocol (binary format, handshake, codec, session management) and adds cross-model vocabulary-mediated projection, which is new in the context of zero-training cross-model latent transfer.
Try it
```shell
pip install avp
```
A Colab notebook runs on a free T4 in ~8 minutes using Qwen 1.5B. At 1.5B scale the accuracy gains are minimal; the notebook demonstrates the mechanism, not the full results. For benchmark reproduction, use an A100.
- AVP Python SDK on GitHub – SDK, benchmarks, 476 tests
- Protocol spec – binary format, handshake, transport
- Full benchmark data with reproduction instructions
- Framework integration guide – LangGraph, CrewAI, custom
Apache 2.0. Contributions welcome – open an issue if something breaks on your workload.