Chain of Agents: Large Language Models Collaborating on Long-Context Tasks

1Penn State University, 2Google Cloud AI Research
*Last Authors

Chain-of-Agents is a training-free, task-agnostic, and highly interpretable framework for long-context tasks.

Abstract

Addressing the challenge of effectively processing long contexts has become a critical issue for Large Language Models (LLMs). Two common strategies have emerged: 1) reducing the input length, e.g., retrieving relevant chunks with Retrieval-Augmented Generation (RAG), and 2) expanding the context window limit of LLMs. However, both strategies have drawbacks: input reduction offers no guarantee of covering the portion that contains the needed information, while window extension struggles to focus on the information pertinent to solving the task. To mitigate these limitations, we propose Chain-of-Agents (CoA), a novel framework that harnesses multi-agent collaboration through natural language to enable information aggregation and context reasoning across various LLMs on long-context tasks. CoA consists of multiple worker agents that communicate sequentially, each handling a different segment of the text, followed by a manager agent that synthesizes their contributions into a coherent final output. CoA processes the entire input by interleaving reading and reasoning, and it mitigates long-context focus issues by assigning each agent a short context. We perform a comprehensive evaluation of CoA on a wide range of long-context tasks in question answering, summarization, and code completion, demonstrating significant improvements of up to 10% over strong baselines: RAG, Full-Context, and multi-agent LLMs.

Overall Structure of Chain-of-Agents


CoA consists of multiple worker agents that communicate sequentially, each handling a different segment of the text, followed by a manager agent that synthesizes their contributions into a coherent final output.
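For concreteness, this worker-manager pipeline can be sketched in a few lines of Python. This is a minimal sketch, assuming a hypothetical `llm(prompt)` completion function and character-based chunking; the prompt wording is illustrative, not the paper's exact templates.

```python
def chunk_text(text: str, chunk_size: int = 8000) -> list[str]:
    # Character-based for simplicity; a real implementation would chunk
    # by tokens so each segment fits a worker's context window.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chain_of_agents(source: str, query: str, llm) -> str:
    # Worker agents read the segments in order, each refining a
    # "communication unit" (CU) that carries evidence along the chain.
    cu = ""
    for chunk in chunk_text(source):
        cu = llm(
            f"Previous findings: {cu}\n"
            f"Text segment: {chunk}\n"
            f"Question: {query}\n"
            "Update the findings with any evidence relevant to the question."
        )
    # The manager agent never sees the raw source, only the final CU.
    return llm(
        f"Findings from reading the full document: {cu}\n"
        f"Question: {query}\n"
        "Answer the question using the findings."
    )
```

Because each agent sees only one short segment plus the accumulated findings, no single call has to attend over the full input.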

Comparison with RAG

Overall results of CoA. CoA significantly outperforms the Vanilla and RAG baselines with various backbone LLMs on all datasets. CoA can also be applied to non-query tasks such as summarization.



Main results.

CoA Improvement is More Obvious When RAG Fails to Retrieve the Gold Answer

Comparison on NarrativeQA. The x-axis and y-axis indicate RAG and CoA performance, respectively, and each point represents a bin of samples corresponding to a different retrieval quality. The number next to each point is the index at which the retriever ranks the chunk containing the gold answer (the fraction of samples in the bin is in brackets); a lower index means better retrieval quality. The size of each point indicates the improvement of CoA over RAG.

Comparison with RAG.

Multi-agent Collaboration in CoA Enables Complex Reasoning over Long Context

The figure displays a sample prediction on HotpotQA. To find the correct answer, RAG retrieves text chunks with high semantic similarity to the query. However, multi-hop reasoning is challenging for it because the critical first-hop answer often lacks semantic relevance to the query. In contrast, CoA operates differently: the first agent explores related topics without knowing the query's answer, aiding subsequent inference; the second agent, also unaware of the answer, broadens the topic scope by incorporating new information; and the third agent finally discovers the answer, synthesizing information from earlier agents with new evidence to complete the reasoning chain. This collaboration highlights CoA's ability to perform complex reasoning over long-context tasks.

A case study.
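The RAG baseline contrasted in this case study can be sketched as follows. `embed` and `llm` are hypothetical stand-ins for an embedding model and a completion function; this illustrates similarity-based retrieval in general, not the paper's implementation.

```python
import numpy as np


def rag_answer(chunks: list[str], query: str, embed, llm, k: int = 8) -> str:
    # Rank chunks by embedding similarity to the query and keep the top k.
    q = embed(query)
    scores = [float(np.dot(embed(c), q)) for c in chunks]
    top_k = sorted(range(len(chunks)), key=lambda i: -scores[i])[:k]
    # Restore document order before prompting the model.
    context = "\n\n".join(chunks[i] for i in sorted(top_k))
    return llm(f"Context: {context}\nQuestion: {query}\nAnswer:")
```

Because chunks are ranked purely by similarity to the query, first-hop evidence that shares little wording with the query tends to be dropped, which is exactly the failure mode described above; CoA sidesteps it by reading every chunk in sequence.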

Comparison with Long LLMs

Comparison with long-context LLMs on NarrativeQA and BookSum. CoA significantly outperforms Claude 3 with a 200k-token context limit. No Trun./Trun. indicates whether the source text in a sample is shorter/longer than 200k tokens, i.e., whether it fits in the vanilla (200k) baseline without truncation or must be truncated. Average is the mean over all samples.



Long LLM results.

CoA Improvement is More Obvious When Long-Context Models Meet Longer Inputs

As shown in the figure, CoA outperforms the vanilla baseline by a large margin across a wide range of source lengths.

Long LLM results.

CoA Mitigates “Lost-in-the-Middle” Phenomenon

To assess the “lost-in-the-middle” effect on the Vanilla and CoA models, we replicated the original study by randomly selecting 500 samples from its Natural Questions dataset to build a QA set. The figure shows the performance of CoA and the full-context baseline on Natural Questions; CoA mitigates the lost-in-the-middle issue. The x-axis is the index of the document containing the gold answer, where a smaller index means the gold answer appears closer to the start of the input.

CoA mitigates lost-in-the-middle.
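The position sweep behind this plot can be sketched as follows, assuming a hypothetical `answer_fn` that wraps either the full-context baseline or CoA; the substring-match scoring is a simplification of the paper's evaluation.

```python
def position_sweep(gold_doc, distractors, query, gold_answer, answer_fn):
    # Place the gold document at each index among the distractors and
    # record whether the model recovers the answer from that position.
    acc_by_index = {}
    for i in range(len(distractors) + 1):
        docs = distractors[:i] + [gold_doc] + distractors[i:]
        prediction = answer_fn("\n\n".join(docs), query)
        acc_by_index[i] = float(gold_answer.lower() in prediction.lower())
    return acc_by_index
```

A flat accuracy curve across indices indicates the model is robust to the gold document's position; a dip in the middle is the lost-in-the-middle pattern.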

Other Results and Analysis

BibTeX

@article{zhang2024chain,
  author    = {Zhang, Yusen and Sun, Ruoxi and Chen, Yanfei and Pfister, Tomas and Zhang, Rui and Ar{\i}k, Sercan {\"O}.},
  title     = {Chain of Agents: Large Language Models Collaborating on Long-Context Tasks},
  journal   = {arXiv preprint arXiv:2406.02818},
  year      = {2024},
}

Acknowledgment

We thank Jinsung Yoon and other colleagues on the Cloud AI Research team for providing helpful feedback on this paper.