CodeR: Issue Resolving with Multi-Agent and Task Graphs

Dong Chen1   Shaoxin Lin1*  Muhan Zeng1*  Daoguang Zan2*
Jian-Gang Wang1  Anton Cheshkov1  Jun Sun3  Hao Yu4  Guoliang Dong3  Artem Aliev1
Jie Wang1  Xiao Cheng1  Guangtai Liang1  Yuchi Ma1  Pan Bian1  Tao Xie4  Qianxiang Wang1
1Huawei Co., Ltd. 2Chinese Academy of Science 3Singapore Management University 4Peking University
*Equal contribution
Abstract

GitHub issue resolving has recently attracted significant attention from academia and industry. SWE-bench [1] was proposed to measure performance in resolving issues. In this work, we propose CodeR, which adopts a multi-agent framework and pre-defined task graphs to Repair & Resolve reported bugs and add new features within a code Repository. On SWE-bench lite, CodeR solves 28.33% of issues when submitting only once for each issue. We examine the performance impact of each design of CodeR and offer insights to advance this research direction (https://github.com/NL2Code/CodeR).

1 Introduction

The rapidly growing capability of Large Language Models (LLMs) is dramatically reshaping many industries [2, 3, 4]. The most recent release of GPT-4o [5] demonstrates a significant leap in multi-modal capabilities and artificial intelligence (AI)-human interaction, whilst maintaining the same level of text generation, reasoning, and code intelligence as GPT-4-Turbo [6]. Because LLMs can interact with humans and the world as humans do, this is considered a starting point for LLMs to take over tasks from humans or collaborate naturally with humans.

Issue resolving is one of the software engineering tasks to which LLMs have been applied, and it is particularly relevant in practice. SWE-bench [1] collects 2,294 real-world issues from 12 popular Python libraries. The LLMs are tasked to resolve the issues based on the given issue description and the whole repository. This task is extremely challenging because it requires deep reasoning about a huge amount of code and the task description is often incomplete. SWE-bench-lite [1] removes the issues with low-quality descriptions to make the task more addressable, and yet it remains highly non-trivial.

Since SWE-bench was released, multiple approaches have been proposed. SWE-Llama [1] adopts a pipeline with Retrieval-Augmented Generation (RAG) to generate the patch directly. Later, AutoCodeRover [7] added contextual code retrieval, driven by keywords in the issue description, to the pipeline. It iteratively collects code context using these keywords until the LLM has gathered enough information to generate a correct patch. Instead of explicit patch generation, SWE-agent [8] performs iterative edits in the repository. It then uses the “git diff” command to generate patches, which avoids patch format errors.

In the literature on applying LLMs for solving software engineering tasks, multiple agent-based approaches have shown their competitiveness. For instance, MetaGPT [9] uses the multi-agent approach to automate the software development process from scratch. AutoCodeRover [7] and SWE-agent [8] use the single-agent approach to address automatic GitHub issue resolving.

To the best of our knowledge, in issue resolving scenarios, the agent-based approaches primarily focus on a single agent. Moreover, previous works perform task decomposition on-the-go, with each subsequent step being determined by the preceding one. A multi-agent design has the advantage of better decoupling each role and leveraging contextual information. However, implementing a multi-agent framework in issue resolving presents challenges such as: (1) Free communication between agents may lead to a non-progressing loop without termination [10]. (2) Information passed from one agent to another may incur information loss [11]. (3) Complex plans are hard to follow when multiple agents are involved. We remark that these problems are not unlike those that arise when human developers collaborate. In this work, we develop a multi-agent design called CodeR that effectively addresses the above-mentioned problems.

CodeR adopts a multi-agent framework and a task graph data structure for issue resolving tasks. Our design is based on the following intuitions:

  • Fewer candidate actions, easier decisions. We introduce a set of diverse actions for different purposes. The number of actions is much larger than in a single-agent framework such as SWE-agent. To cope with this large number of actions, we reduce the complexity of deciding the next action by limiting each agent’s focus to a subtask and a subset of associated actions.

  • Look before you leap. We believe that planning at the beginning of the pipeline is better than deciding the next steps on-the-go. Moreover, a good plan should consist of small and manageable tasks that LLMs were trained to solve.

  • Bypassing instruction-following and memorization. The conventional plan generated by an LLM is in the form of plain text. It is usually placed in the prompt to guide the subsequent steps in an LLM-centered system. This requires the LLM to have a strong instruction-following ability and a “good” memory to execute the plan precisely and iteratively. For complex tasks, like issue resolving with complex tools, task plans in pure-text prompts are hard to follow. Therefore, we introduce a new data structure, the task graph, which ensures that all pre-designed plans are accurately followed and executed.

Our contributions are as follows:

  1. We propose CodeR, a multi-agent framework with task graphs for issue resolving. Inspired by how humans resolve issues in the real world, we design the roles and the actions. For plans, we design a graph data structure that can be parsed and strictly executed. It ensures the exact execution of the plan and at the same time provides an easy-to-plug interface for plan injection from humans.

  2. We leverage LLM-generated code for reproducing the issue and the tests in the repository (excluding the verification tests) to get code coverage information. Coverage information improves contextual retrieval based on the keywords in the issue text and is used together with BM25 for fault localization.

  3. We advance the state of the art on SWE-bench lite to 28.33% (85/300) with only one submission per issue.

2 Framework

As Figure 1 shows, our design contains five agents, which can collaboratively solve GitHub issues:

  • Manager: The manager is an agent that interacts with the user directly and is in charge of the whole issue-resolving task. It has two responsibilities: (1) selecting a plan according to the issue description, where the plan specifies the agents involved and how they should interact to finish the task; and (2) interpreting the execution summary of a plan. If the execution summary indicates that the issue has been solved, the manager summarizes the changes and submits a patch; if not, it comes up with a new plan or gives up.

  • Reproducer: The reproducer is an agent that is responsible for generating a test to reproduce the issue. If the issue description contains a complete test, the reproducer only needs to copy the test into a new test file “reproduce.py”, execute it, and compare the output. This is usually not the case for real-world issues, however, so the reproducer often needs to adjust or generate test cases. We generate test cases by extracting test inputs from issues and using LLMs to generate test sequences.

  • Fault Localizer: The fault localizer is an agent that identifies the code regions that could cause the issue. It is equipped with several fault localization tools in software engineering.

  • Editor: The editor is the agent that performs the actual code changes. It uses all information provided by the upstream agents and gathers contextual information with AutoCodeRover’s search [7]. Once enough information has been gathered, it performs iterative edits in the same way as SWE-agent [8].

  • Verifier: The verifier is an agent that runs the reproduced or integration tests to check whether the modifications have resolved the issue. (Integration tests refer to the built-in unit tests in the repository rather than the official issue tests of SWE-bench lite.)

Figure 1: Multi-Agent framework of CodeR with task graphs.

For actions, we reuse the actions that are defined by SWE-agent and AutoCodeRover, as Table 1 shows. In addition, we introduce new actions 0 and 18-21. Action 0 selects or generates feasible plans by analyzing the current issue. Action 18 retrieves the top-1 similar issue and its corresponding patch by description. Note that we prompt the agent to check whether the retrieved result is relevant to the current issue and analyze how its patch solves the retrieved issue. Action 19 performs the fault localization described in Section 3.2. Action 20 runs the reproducer-generated test and the integration tests. As in Aider [12], the integration tests do not contain the tests used to verify the correctness of the generated patches. Action 21 summarizes all actions performed and observations made by each agent for a sub-task. Action 22 provides basic Linux shell commands such as “cd”, “ls”, “grep”, and “cat”.

We assign a unique set of actions to each role, similar to how different roles in the real world possess distinct skills. For example, only the Manager has permission to use the “plan” and “submit” actions, while all roles are granted permission to use the “basic shell commands” action.
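The per-role permissions described above could be enforced with a simple lookup. The following is a minimal sketch, not CodeR's implementation: it encodes only the two rules stated in the text (the Manager alone may plan and submit, and every role may use basic shell commands). The names `ALLOWED_ACTIONS` and `is_allowed` are hypothetical, and the remaining entries of Table 1 are deliberately left out.

```python
# Hypothetical permission table; only the rules stated in the text are encoded.
ROLES = ("Manager", "Reproducer", "Fault Localizer", "Editor", "Verifier")

# Every role starts with the shared "basic shell command" action.
ALLOWED_ACTIONS = {role: {"basic shell command"} for role in ROLES}

# Only the Manager may select a plan or submit a patch.
ALLOWED_ACTIONS["Manager"] |= {"plan", "submit"}


def is_allowed(role: str, action: str) -> bool:
    """Return True if `role` may invoke `action`; unknown roles get nothing."""
    return action in ALLOWED_ACTIONS.get(role, set())
```

Restricting each agent to its own action set in this way is what keeps the number of candidate actions small at every decision step.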

Table 1: Actions selected and designed for each agent. Actions 1-10 are from SWE-agent and 11-17 are from AutoCodeRover. * indicates that actions 11-17 are the enhanced versions of AutoCodeRover’s original actions described in Section 3.2.
Actions Agent Roles
Manager Reproducer Fault Localizer Editor Verifier
0 plan ✓
1 open ✓ ✓
2 goto ✓ ✓
3 scroll down ✓ ✓
4 scroll up ✓ ✓
5 create ✓ ✓
6 edit ✓ ✓ ✓ ✓
7 submit ✓
8 search dir ✓ ✓ ✓
9 search file ✓ ✓ ✓
10 find file ✓ ✓ ✓
11 rover search file ✓ ✓ ✓
12 rover search class ✓ ✓ ✓
13 rover search class in file ✓ ✓ ✓
14 rover search method ✓ ✓ ✓
15 rover search method in file ✓ ✓ ✓
16 rover search code ✓ ✓ ✓
17 rover search code in file ✓ ✓ ✓
18 related issue retrieval ✓ ✓
19 fault localization ✓
20 test ✓
21 report ✓ ✓ ✓ ✓
22 basic shell command ✓ ✓ ✓ ✓ ✓

3 Methodology

Repository-level tasks usually require processing a huge amount of information and taking many steps before reaching their desired solutions. Existing works show that dividing a repository-level task into a set of connected sub-tasks and conquering them one by one can be effective. Parsel [13] and CodeS [14] focus on generating a large piece of code for complex algorithms and simple repositories. Both of them utilize inherent program structures like call graphs or file structures for task decomposition. Issue resolving is also a repository-level task but is closer to a modification task than a generation task. In addition to generating code, a repository-level modification task requires identifying the correct locations before generating the correct code, and it is infeasible to use the whole repository as input context. This introduces additional steps and complexity, which require a more powerful framework for planning.

3.1 Task Graphs for Planning

The description of GitHub issues is extremely diverse. Some issues only have one sentence in natural language (e.g. astropy__astropy-7008, https://github.com/astropy/astropy/pull/7008). Some may provide the test code, running results of the test code, and a possible solution (e.g. sympy__sympy-14774, https://github.com/sympy/sympy/pull/14774). Besides descriptions, the solutions of issues also vary. Some require changing only one or two lines, making the task similar to a line completion task with context (e.g. scikit-learn__scikit-learn-13779, https://github.com/scikit-learn/scikit-learn/pull/13779), while others necessitate changing multiple files, requiring a deep understanding of the code semantics within the repository.

For simple issues with clear descriptions, their solutions are obvious and can be figured out at first glance. But for complex ones with ambiguous or inaccurate descriptions, executing tests and searching through the code base or web could be beneficial for solving them. To cope with different approaches to solving an issue, we design a task graph that can easily add new plans. It can also be strictly followed by multi-agent systems.

Figure 2: Task graphs in JSON format.

Figure 2 shows a task graph plan in JSON format. It specifies a collection of plans in the top level with the name “Plan ID”. For each plan, “entry” specifies which agent to start with. “roles” specifies a list of agents that are involved in this plan. Each selected agent will be given a subtask specified in “task”. Once finished, all actions that the agent performed will be summarized and passed to its “downstream” according to the result of the current sub-task. Plan A in Figure 2 involves four agents: Reproducer, Fault Localizer, Editor, and Verifier. This plan starts with Reproducer as demonstrated in Figure 1.

This design of plans decouples agent design from task decomposition. When designing the agents, one need only focus on the high-level goal of a sub-task without considering the details of the diverse approaches. The diversity of approaches can be specified and adjusted in the “task” and “downstream” fields. In this way, plans can be easily added, deleted, and tuned without changing a single line of agent code.

Plans in Figure 2 will be parsed into a graph with an entry node specified by “entry”. When the plan starts to execute, the entry node is activated and the specified agent executes its sub-task iteratively using the ReAct framework [15]. Once it has finished its subtask, it activates one of its specified “downstream” nodes. Agents in the plan may be activated multiple times if there is a cycle in the plan. The plan finishes when the Manager is activated or when the budget is exceeded.
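The execution rule described above can be illustrated with a small sketch. This is not the CodeR implementation. The keys “entry”, “roles”, “task”, and “downstream” come from the text, but the exact nesting, the “name” key, the “success”/“failure” labels, the failure edges of the example plan, and the `max_steps` budget are all assumptions made for illustration. Agents are replaced by stubs rather than LLM-driven ReAct loops.

```python
from typing import Callable, Dict, List, Tuple

# An agent takes (task, upstream_report) and returns (succeeded, report).
Agent = Callable[[str, str], Tuple[bool, str]]

# Hypothetical plan in the spirit of Plan A; the layout is an assumption.
PLAN_A = {
    "entry": "Reproducer",
    "roles": [
        {"name": "Reproducer", "task": "Reproduce the issue.",
         "downstream": {"success": "Fault Localizer", "failure": "Fault Localizer"}},
        {"name": "Fault Localizer", "task": "Locate suspicious code.",
         "downstream": {"success": "Editor", "failure": "Editor"}},
        {"name": "Editor", "task": "Edit the repository.",
         "downstream": {"success": "Verifier", "failure": "Manager"}},
        {"name": "Verifier", "task": "Run the tests.",
         "downstream": {"success": "Manager", "failure": "Manager"}},
    ],
}


def run_plan(plan: dict, agents: Dict[str, Agent],
             max_steps: int = 20) -> Tuple[List[str], str]:
    """Walk the graph from the entry node until the Manager is activated
    or the step budget is spent. Returns the activation trace and last report."""
    nodes = {role["name"]: role for role in plan["roles"]}
    current, report, trace = plan["entry"], "", []
    while current != "Manager" and len(trace) < max_steps:
        node = nodes[current]
        succeeded, report = agents[current](node["task"], report)
        trace.append(current)
        # The sub-task result selects which downstream node is activated next.
        current = node["downstream"]["success" if succeeded else "failure"]
    return trace, report


def make_stub(name: str, succeeds: bool = True) -> Agent:
    """Stand-in for an LLM-driven agent; it only appends to the report."""
    def agent(task: str, upstream: str) -> Tuple[bool, str]:
        return succeeds, f"{upstream}[{name}: {task}]"
    return agent
```

Because the next node is chosen by the executor rather than by the LLM, the plan is followed exactly even when it contains a cycle, and the step budget guarantees termination.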

We have designed four plans, as Figure 3 shows. Plan A is shown in Figure 1 and is a standard flow to resolve an issue. It has no loop, for simplicity and robustness. Plan B tries to resolve the issue directly and is intended for simple issues. Plan C adds a loop that allows feedback from testing. This loop is also used by Aider [12] with tests that are not related to the issue (also called “integration tests”). Plan D takes a test-driven approach with a ground truth test for issues (such as the “fail-to-pass” and “pass-to-pass” tests in SWE-bench). In our experiments, we use only Plans A and B for cost savings and fast evaluation.

Figure 3: Plans in the form of structured graphs. They will be parsed into a graph when executed. The green and red arrows represent the reports passed to the next agent in cases of Success and Failure, respectively. The black arrows indicate the reports are passed to the next agent regardless of success or failure.

3.2 Fault Localization Specialized for Issue Resolving

We leverage fault localization techniques [16] to provide precise location information. A previous work [7] shows that the use of fault localization techniques leads to an increase in the efficacy of resolving GitHub issues.

We notice that the agent is allowed to run test suites but only the results are used while runtime information is not captured during the process. Test-based fault localization can provide precise location information based on runtime information and specifically, we use spectrum-based fault localization (SBFL) as the main fault localization method.

SBFL is a lightweight, test-based fault localization technique. Given a test suite that contains at least one failing test, SBFL collects statement coverage for the test suite. A suspiciousness score is then calculated from the coverage data, and all covered statements are ranked by their suspiciousness. The suspiciousness score can be calculated by different formulas such as Ochiai [17] and Tarantula [18]. These formulas share the same intuition: the fault location is likely to be covered by more failing tests and fewer passing tests.
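For concreteness, the standard Ochiai and Tarantula formulas can be written as below. This is a minimal sketch rather than the tool used in CodeR. The function names, the `(ef, ep)` coverage representation (number of failing and passing tests that cover a statement), and the statement labels in the usage example are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def ochiai(ef: int, ep: int, total_failed: int) -> float:
    """Ochiai: ef / sqrt(total_failed * (ef + ep)).
    ef/ep = number of failing/passing tests that cover the statement."""
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom > 0 else 0.0


def tarantula(ef: int, ep: int, total_failed: int, total_passed: int) -> float:
    """Tarantula: failing ratio divided by (failing ratio + passing ratio)."""
    fail_ratio = ef / total_failed if total_failed else 0.0
    pass_ratio = ep / total_passed if total_passed else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total > 0 else 0.0


def rank_statements(coverage: Dict[str, Tuple[int, int]],
                    total_failed: int) -> List[Tuple[str, float]]:
    """Rank covered statements by Ochiai suspiciousness, highest first."""
    scored = [(stmt, ochiai(ef, ep, total_failed))
              for stmt, (ef, ep) in coverage.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With one failing test, a statement covered only by that test scores 1.0 under Ochiai, while one that is also covered by three passing tests scores 0.5, which matches the intuition that heavily exercised code is less suspicious.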

One main limitation of SBFL and many other test-based fault localization techniques is the need for failing tests. In practice, a failing test is often not available at the time when the issue is raised. Since the Reproducer can create reproduced test cases, we select the failing tests and collect their coverage data. This coverage data is also used to guide “THE SEARCH ACTION”. Note that if Reproducer fails to generate any test script or its coverage data cannot be collected (e.g., test script uses system calls to invoke certain CLI), SBFL will not be used as no result can be produced by it.

Besides test information, issue descriptions can also be used to better localize the fault. A retrieval algorithm provides a simple yet effective way to combine text from an issue description with code from a repository. Jimenez et al. [1] also use the BM25 retrieval algorithm to provide file-level localization. Because the information sources of the retrieval algorithm and test-based fault localization (issue description text and test coverage, respectively) differ a lot, these methods can be combined to provide better fault localization results. A previous study [19] shows that combining multiple fault localization methods can achieve a better result than any standalone method. We use a simple linear combination here to calculate the final suspiciousness score from both methods.

$$\textit{Score}=\lambda\cdot\textit{Score}_{\textit{Ochiai}}+(1-\lambda)\cdot\textit{Score}_{\textit{BM25}} \tag{1}$$

$$\textit{Score}_{\textit{BM25}}(F_{i})=\frac{\textit{Relevance}_{\textit{BM25}}(F_{i})}{\sum_{F_{j}\in \textit{Files}}\textit{Relevance}_{\textit{BM25}}(F_{j})} \tag{2}$$

where $\textit{Score}_{\textit{Ochiai}}$ is the suspiciousness score from the Ochiai formula and $\textit{Relevance}_{\textit{BM25}}(F_{i})$ is the BM25 relevance score for file $F_{i}$.

To choose a proper value for the combination factor $\lambda$, we experiment on a small subset containing 10 issues that can be successfully reproduced. The result shows that almost all values strictly between 0 and 1 yield the same result, and all are better than taking $\lambda=1$ or $\lambda=0$. Different values of $\lambda$ give the same result because many locations tie with others with respect to a single metric. Statements that are covered by the same number of passing tests will have the same $\textit{Score}_{\textit{Ochiai}}$, and statements in the same file will have the same $\textit{Score}_{\textit{BM25}}$. Each metric can serve as a tiebreaker for the other, giving a better result than either standalone metric. We pick $\lambda=0.99$ as our final setup in subsection 4.2.
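The linear combination in Equations (1) and (2) and its tie-breaking effect can be sketched as follows. The function names and the file names in the usage note are hypothetical, and the BM25 relevance values are assumed to be computed elsewhere.

```python
from typing import Dict


def normalize_bm25(relevance: Dict[str, float]) -> Dict[str, float]:
    """Equation (2): divide each file's BM25 relevance by the sum over all files."""
    total = sum(relevance.values())
    return {f: (r / total if total > 0 else 0.0) for f, r in relevance.items()}


def combined_score(ochiai_score: float, bm25_score: float,
                   lam: float = 0.99) -> float:
    """Equation (1): lam * Score_Ochiai + (1 - lam) * Score_BM25."""
    return lam * ochiai_score + (1.0 - lam) * bm25_score
```

For example, if two statements in different files both have an Ochiai score of 1.0, and `normalize_bm25({"a.py": 3.0, "b.py": 1.0})` assigns their files 0.75 and 0.25, then `combined_score(1.0, 0.75)` exceeds `combined_score(1.0, 0.25)`. The BM25 term breaks the tie without overriding the coverage signal, which is why a wide range of $\lambda$ values produces the same ranking.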

We conducted an experiment on the issues that are:

  • Successfully reproduced by Reproducer. This means a runnable Python script is generated for reproducing the issue. 140 issues remain after this filtering.

  • Coverage data collected from the script is not empty. This means the reproduction script has covered at least one file in the project. 104 issues remain after this filtering.

The results for different values of $\lambda$ are listed in Table 2 and Table 3:

Table 2: Top-k precision for function-level fault localization. $\lambda=1$ means using SBFL only, and 0.4-0.999 means that every value in this range gives the same result. Golden locations of each issue are marked by the authors.
$\lambda$ top-1 top-3 top-5 top-10 top-all
0 12.27% 25.92% 34.04% 42.98% 69.23%
0.001 17.46% 31.21% 38.32% 44.50% 69.23%
0.01 17.46% 32.17% 39.28% 45.46% 69.23%
0.1 18.42% 30.25% 37.84% 45.07% 69.23%
0.2 17.46% 29.29% 38.32% 45.07% 69.23%
0.3 16.49% 28.33% 36.39% 43.15% 69.23%
0.4-0.999 16.49% 28.33% 35.91% 43.15% 69.23%
1 6.63% 14.11% 18.23% 24.95% 69.23%
Table 3: File-level fault localization.
$\lambda$ top-1 top-3 top-5 top-10 top-all
0 15.32% 32.67% 42.36% 54.07% 85.58%
0.001 23.49% 38.25% 46.85% 55.59% 85.58%
0.01 23.49% 39.21% 47.81% 56.55% 85.58%
0.1 23.49% 38.25% 46.37% 56.16% 85.58%
0.2 22.53% 38.25% 46.85% 56.16% 85.58%
0.3 20.60% 37.29% 45.89% 55.20% 85.58%
0.4-0.999 20.60% 36.33% 44.44% 54.24% 85.58%
1 8.12% 16.65% 21.19% 28.92% 85.58%

From the results, we can see that combining the BM25 score with SBFL greatly improves precision over SBFL alone ($\lambda=1$), by roughly 10 percentage points or more. We use method-level fault localization as it provides enough information for the agent to edit the file while keeping good precision. The way a prompt is constructed from the fault localization results is shown in Appendix Figure 11.

3.3 Prompt Engineering

CodeR includes five roles: manager, reproducer, fault localizer, editor, and verifier. To enable LLMs to play different roles, we set up system prompts and instance prompts for each agent role. The system prompt primarily describes the definition of role identity, role responsibilities, and corresponding actions. The instance prompt mainly includes the raw issue and important tips for resolving this issue. We have put system and instance prompts of five roles into Appendix Figure 4~13. We design these prompts inspired by SWE-agent [8]. When multiple agent roles communicate, they use the prompt template shown in Appendix Figure 14. Detailed prompt engineering designs for CodeR can be found at https://github.com/NL2Code/CodeR.

4 Experiments

4.1 Experimental Setup

Benchmarks

SWE-bench [1] is a benchmark that tests systems’ ability to solve GitHub issues automatically. The benchmark consists of 2,294 Issue-Pull Request (PR) pairs from 12 popular open-source Python repositories (e.g., flask, numpy, and matplotlib). SWE-bench’s evaluation is carried out through unit test verification, using post-PR behavior as the reference solution. SWE-bench lite [1] is a subset of SWE-bench, which is curated to make evaluation less costly and more accessible. SWE-bench lite comprises 300 instances that have been sampled to be more self-contained, with a focus on evaluating functional bug fixes. More details of SWE-bench lite can be seen at https://www.swebench.com/lite.html. In this work, we focus on SWE-bench lite for faster, easier, and more cost-effective evaluation.

Metrics

We evaluate the issue resolving task using the following metrics: Resolved (%), Average Requests, and Average Tokens/Cost. The Resolved (%) metric indicates the percentage of SWE-bench lite instances (300 in total) that are successfully resolved. Average Requests and Average Tokens/Cost represent the average number of API requests per issue, the average consumption of input and output tokens, and the corresponding cost.
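As a worked example of these metrics, the sketch below computes the resolved percentage and a per-issue average. The helper names are hypothetical; the instance count of 300 and the resolved counts used in the usage note are taken from this paper.

```python
from typing import Iterable


def resolved_rate(resolved: int, total: int = 300) -> float:
    """Resolved (%): share of instances resolved, rounded to two decimals."""
    return round(100.0 * resolved / total, 2)


def average(values: Iterable[float]) -> float:
    """Mean of per-issue measurements such as request counts or token usage."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

For instance, `resolved_rate(85)` evaluates to 28.33 and `resolved_rate(54)` to 18.0, matching the percentages reported for 85 and 54 resolved issues out of 300.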

CodeR’s Comparative Methods

Recently, several commercial products addressing issue resolving have been released, but their technical details have not been disclosed. The following describes their functionalities.

  • Devin (https://www.cognition.ai/blog/introducing-devin), from cognition.ai, is capable of planning and executing complex engineering tasks that require thousands of decisions. It can recall relevant context at every step, learn over time, and fix program bugs. Devin can operate common developer tools within a sandbox environment, including the shell, code editor, and browser. Additionally, Devin can actively collaborate with users, report progress in real-time, accept feedback, and assist with design choices as needed.

  • Amazon Q Developer Agent (https://aws.amazon.com/cn/q/developer), from Amazon, is a generative AI-powered coding assistant that can help you understand, build, extend, operate, and repair code.

  • OpenCSG StarShip (https://opencsg.com/product) is committed to providing a complete model/data management and application-building platform for large model application development teams. Based on it, they developed CodeGenAgent, which can resolve GitHub issues automatically.

  • Bytedance MarsCode Agent (https://www.marscode.com) is an AI coding assistant powered by GPT-4o, developed by ByteDance. Designed for multi-language support within IDE environments, it can reset repositories to undo previous modifications.

SWE-bench lite requires generating patches to resolve GitHub issues. One possible approach for LLMs is to generate the patch directly (explicit patch generation).

  • Retrieval-Based Approach [1] first retrieves the files that require editing and then adds the retrieved content to LLMs’ context. Finally, the LLMs generate the patch. In the experiments, LLMs used include GPT-3.5, GPT-4, Claude 2, Claude 3 Opus, and SWE-Llama [1].

  • AutoCodeRover [7] leverages advanced code search capabilities in software engineering to extend the model’s modeling context, thereby further improving the accuracy of patch generation.

Besides using LLMs to generate the patch directly to fix issues, another approach is to edit and modify the buggy code repository and then use “git diff” to automatically obtain the patch (implicit patch generation).

  • SWE-agent [8] is an automated software engineering system that utilizes LLMs as one agent to solve real-world software engineering tasks. It introduces a new concept of the agent-computer interface (ACI), which enables LLMs to effectively search, navigate, edit, and execute code commands in sandboxed computer environments.

  • Aider (https://aider.chat) is a command line tool that pairs with LLMs to edit code in your local git repository. Aider can directly edit the local source files and commit the changes with meaningful commit messages. Aider now works well with GPT-3.5, GPT-4o, Claude 3 Opus, and more.

4.1.1 Implementation Details

Hyper-Parameters of Inference

In our multi-agent framework, each role is considered a distinct agent with its own experimental settings, which include the model and the history window size. All roles are provided access to GPT4-preview-1106. The Manager role uses nucleus sampling during inference with the temperature set to 0 and top_p set to 0.95. It employs the full history with a file viewer window size of 100. The Reproducer role uses the same sampling settings, but only incorporates the last five histories. Both the Fault Localizer and Verifier roles follow the same settings as the Reproducer. Finally, the Editor role, while sharing the same nucleus sampling parameters, includes a demo in addition to the last five histories and a file viewer window size of 100. This setup reduces repetition and maintains the unique functionality of each role. In addition, we set the maximum cost to $8 per issue.
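The per-role settings above can be summarized in a small configuration sketch. It uses the five role names from Section 2 and the values stated in the text (temperature 0, top_p 0.95, full history for the Manager, last five histories for the other roles, a demo for the editing role). The `RoleConfig` class, its field names, and the `visible_history` helper are hypothetical and do not reflect CodeR's actual configuration format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RoleConfig:
    temperature: float = 0.0
    top_p: float = 0.95
    history_window: Optional[int] = None  # None keeps the full history
    use_demo: bool = False


ROLE_CONFIGS: Dict[str, RoleConfig] = {
    "Manager": RoleConfig(history_window=None),
    "Reproducer": RoleConfig(history_window=5),
    "Fault Localizer": RoleConfig(history_window=5),
    "Verifier": RoleConfig(history_window=5),
    "Editor": RoleConfig(history_window=5, use_demo=True),
}


def visible_history(history: List[str], config: RoleConfig) -> List[str]:
    """Return the part of the interaction history that is sent to the model."""
    if config.history_window is None:
        return list(history)
    start = max(0, len(history) - config.history_window)
    return history[start:]
```

Truncating the history to the most recent steps keeps prompts short for roles that take many actions, while the Manager keeps everything because it only sees a handful of summaries.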

Other Details

In practice, it is impossible to have a consistent evaluation environment for all currently proposed approaches. We make some adaptations to the evaluation environment released by AutoCodeRover [7] and use it as our evaluation environment. We reproduce all other approaches in our environment for fairness. However, the evaluation on the repositories “astropy” and “requests” still has some environmental problems remaining. In our inference environment, commands like “edit” occasionally trigger a “container crashed” error which interrupts the process. If this occurs, we restart from the beginning of the pipeline for this issue. We pre-construct a Docker image with the complete environment offline to avoid wasting time on real-time installation during inference. Additionally, we split SWE-bench lite across six processes for parallel inference to further accelerate this process. When the Fault Localizer runs the repository’s integration unit tests, it sometimes adds or modifies files within the repository, and we restore these files after the localization process.

4.2 Results

Table 4 shows the performance of CodeR and its comparative methods on SWE-bench lite. The results show that CodeR establishes a new record on SWE-bench lite, achieving the best performance to date compared with all other commercial products and methods. On SWE-bench lite, CodeR resolves 28.33% of issues in one attempt, addressing 85 of 300. In contrast, SWE-agent + GPT 4 and Aider solve 18.00% and 26.33%, respectively. This shows that CodeR’s meticulously designed roles and actions are highly effective.

We notice that directly enabling LLMs to generate patches (explicit patch generation) for issues is less effective than having LLMs edit the code repository (implicit patch generation). While CodeR achieves a 28.33% resolved rate, RAG + GPT 4 and AutoCodeRover only solve 2.67% and 19.00%, respectively. Furthermore, we observe that existing LLMs may struggle to generate applicable and high-quality patches, as a correct patch requires a strict format and is sensitive to line numbers, which LLMs cannot perfectly handle.

The results also show that CodeR sends more requests, resulting in more tokens and higher cost, though at an acceptable rate. This could be due to our fine-grained design of multiple roles and actions. The 10.33% absolute improvement over SWE-agent + GPT 4 (reported) suggests that pre-planning at the beginning of the pipeline is superior to deciding the next steps on-the-go. CodeR devises multiple plans in advance in the form of structured graphs, and all agent roles execute the pre-defined plan strictly according to the graphs. CodeR’s leading performance also validates the effectiveness of this idea. Pre-planning also has the clear advantage of bypassing the imperfect instruction-following and long-context memorization abilities of LLMs. Although CodeR has achieved impressive performance, we still believe that designing a more sophisticated plan will yield more significant improvements in the future.

We also conduct ablation studies on 50 issues of SWE-bench lite. The results in Table 5 show that removing the multi-agent design and task graph reduces CodeR’s resolved rate from 22% to 10%. This further indicates that our carefully designed roles, motivated by real-world company collaboration, are useful for issue resolving tasks. Additionally, removing the fault localization (FL) action lowers the resolved rate to 14% and raises the average cost from $3.09 to $3.19, which highlights the potential of combining LLMs with traditional software engineering strategies for addressing complex downstream tasks.
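
The percentages above follow directly from the resolved counts in the tables. The short check below recomputes them; the counts are copied from Table 4 and Table 5, while the helper `resolved_rate` and the variable names are introduced only for this illustration.

```python
def resolved_rate(resolved: int, total: int) -> float:
    """Resolved rate in percent, rounded to two decimals."""
    return round(100.0 * resolved / total, 2)


# Resolved counts from Table 5 (50-issue subset).
ablation_counts = {
    "CodeR": 11,
    "w/o Multi-Agent & Task Graph": 5,
    "w/o FL": 7,
}
rates = {name: resolved_rate(count, 50) for name, count in ablation_counts.items()}

# Drop in percentage points relative to the full system.
drops = {name: round(rates["CodeR"] - rate, 2) for name, rate in rates.items()}

# Full benchmark: 85 of 300 issues (Table 4).
full_rate = resolved_rate(85, 300)
```

On this subset, removing the multi-agent design and task graph costs 12 percentage points, and removing fault localization costs 8.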

Table 4: Results of CodeR and its comparative methods on SWE-bench lite (300 GitHub issues). Note that “reported” refers to the numbers from the SWE-bench Leaderboard (https://www.swebench.com), while “reproduced” refers to our results obtained in our unified evaluation environment using their open-sourced generated patches.
Methods | Resolved % (count) | Avg. Req. | Avg. Tokens/Cost
Commercial Products
Devin (random 25% subset of SWE-bench) | 13.86 (-) | - | -
Amazon Q Developer Agent (reported) | 20.33 (61) | - | -
Amazon Q Developer Agent (reproduced) | 17.00 (54) | - | -
OpenCSG CodeGenAgent (reported) | 23.67 (71) | - | -
OpenCSG CodeGenAgent (reproduced) | 20.67 (62) | - | -
Bytedance MarsCode Agent | 22.00 (66) | - | -
Explicit Patch Generation
RAG + GPT 3.5 | 0.33 (1) | - | -
RAG + SWE-Llama 13B | 1.00 (3) | - | -
RAG + SWE-Llama 7B | 1.33 (4) | - | -
RAG + GPT 4 | 2.67 (8) | - | -
RAG + Claude 2 | 3.00 (9) | - | -
RAG + Claude 3 Opus | 4.33 (13) | - | -
AutoCodeRover | 19.00 (57) | - | 112K/$1.30
Implicit Patch Generation
Aider (reported) | 26.33 (79) | - | -
Aider (reproduced) | 24.67 (74) | - | -
SWE-agent + Claude 3 Opus (reported) | 11.67 (35) | 17.10 | 221K/$3.41
SWE-agent + Claude 3 Opus (reproduced) | 9.67 (29) | 17.10 | 221K/$3.41
SWE-agent + GPT 4 (reported) | 18.00 (54) | 21.55 | 245K/$2.51
SWE-agent + GPT 4 (reproduced) | 16.67 (50) | 21.55 | 245K/$2.51
CodeR (reported) | 28.33 (85) | 30.39 | 299K/$3.09
CodeR (ours) | 27.33 (82) | 30.39 | 299K/$3.09
Table 5: Ablation studies on 50 issues. We randomly select 50 of the 300 issues in SWE-bench lite to conduct ablation studies for faster and more cost-effective experiments.
Methods | Resolved % (count) | Avg. Req. | Avg. Tokens/Cost
CodeR | 22.00 (11) | 30.40 | 295K/$3.09
    w/o Multi-Agent & Task Graph | 10.00 (5) | 18.46 | 200K/$2.05
    w/o FL | 14.00 (7) | 29.98 | 309K/$3.19

5 Related Works

Automatic Issue Resolving

GitHub issues can be resolved automatically using the following approaches. (1) Retrieval-Augmented Generation (RAG) [1] is a straightforward approach, which first retrieves the relevant code snippets from the repository and then prompts LLMs to generate a patch that fixes the reported issue. To enhance LLMs’ proficiency in generating program patches, SWE-Llama [1] was proposed, which fine-tunes the Llama [20, 21] model on well-crafted patch-generating instruction data. (2) Following this, SWE-agent [8] was proposed, which uses LLMs to interact with a computer to solve issues automatically. SWE-agent pre-defines a series of agent-computer interfaces (ACIs) to enable LLMs to interact more efficiently with the computer. (3) Additionally, AutoCodeRover [7] expands the context visible to LLMs by leveraging sophisticated code search tools from software engineering, achieving decent performance. (4) Another work [22] proposes a multi-agent pipeline of two successive steps. In the first step, three types of role agents (Repository Custodian, Manager, Developer) collaborate on the plan; the plan is represented as code and embedded into the main program for execution. In the second step, two types of role agents (Developer, Quality Assurance Engineer) participate in the coding process. In this paper, we propose CodeR, which defines fine-grained agent roles and corresponding actions and incorporates advanced software engineering tools.
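
The retrieve-then-prompt pattern of approach (1) can be sketched as follows. BM25 is used here as one common sparse retriever; the toy snippets, the issue text, the prompt wording, and the function names `tokenize` and `bm25_rank` are hypothetical and chosen only for illustration, not taken from any of the cited systems.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def bm25_rank(query: str, docs: Dict[str, str],
              k1: float = 1.5, b: float = 0.75) -> List[Tuple[str, float]]:
    """Rank documents against the query with Okapi BM25, best match first."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(toks) for toks in tokens.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    scores = []
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((name, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy repository and issue (hypothetical).
snippets = {
    "utils/dates.py": "def parse_date(text): return datetime.strptime(text, FORMAT)",
    "models/user.py": "class User: def full_name(self): return self.first + self.last",
    "views/report.py": "def render_report(rows): return template.render(rows=rows)",
}
issue = "parse_date raises ValueError on ISO date strings"

ranked = bm25_rank(issue, snippets)
best_file = ranked[0][0]
prompt = (
    f"Issue:\n{issue}\n\n"
    f"Relevant code ({best_file}):\n{snippets[best_file]}\n\n"
    "Write a patch that fixes the issue."
)
```

The retrieved snippet is placed in the prompt, and the model is then asked to produce the patch directly, which is what makes this an explicit patch generation approach.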

Test-based Automated Program Repair

Automated program repair (APR) has been an active topic in software engineering for years, and a majority of the work can be categorized as test-based automated program repair. Given a test suite, generated patches can be validated against the tests, which makes the result more trustworthy. However, a weak test suite allows test-passing patches to be incorrect, and a large search space makes it difficult to synthesize a correct patch. Therefore, various techniques have been proposed to guide the search process, including genetic programming [23], manually defined fix patterns [24], mined fix patterns [25, 26, 27], heuristics [28], learning from code or program synthesis [26, 29], and semantic analysis [30, 31]. These works focus on code content, trying to find a patch that satisfies all constraints (tests, compiler, heuristics, etc.), while ignoring the issue description itself, which may contain a lot of useful information. Apart from those approaches, many works adopt machine learning models to generate patches. SequenceR [32] proposes a sequence-to-sequence NMT model to generate the fixed code directly. CODIT [33] uses the same kind of model to predict the code edits for the faulty code. DLFix [34], CoCoNuT [35], and Cure [36] take the context of the faulty statement as input and encode it via a tree-based LSTM, a CNN, and GPT, respectively. Recoder [37] proposes a syntax-guided decoder to generate edits with placeholders via the provider/decider architecture. RewardRepair [38] uses an RL approach that integrates program compilation and test execution information. Tare [39] directly learns the typing rules to guide the generation. These works treat the APR problem as a neural translation task from the buggy code (with context) to the fixed code, and most of them adopt encoder-decoder models. Different from those approaches, CodeR proposes a multi-turn framework that collects necessary information on demand and generates the fixed code based on the information collected.
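
The weak-test-suite problem can be made concrete with a minimal generate-and-validate sketch. The buggy function, the candidate patches, and the two test suites below are invented for this example and do not come from any of the cited tools; candidates are modeled as plain Python callables rather than as real source edits.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output)


def buggy_abs(x: int) -> int:
    """Intended to compute |x|, but negative inputs are not negated."""
    return x


# Hypothetical candidate patches that some search strategy might propose.
candidates: Dict[str, Candidate] = {
    "return -x": lambda x: -x,
    "return 3 if x == -3 else x": lambda x: 3 if x == -3 else x,  # overfits one test
    "return -x if x < 0 else x": lambda x: -x if x < 0 else x,    # correct fix
}


def passes(candidate: Candidate, tests: List[TestCase]) -> bool:
    """A candidate is test-passing if it satisfies every test case."""
    return all(candidate(arg) == expected for arg, expected in tests)


def plausible_patches(tests: List[TestCase]) -> List[str]:
    """Return the names of all candidates that pass the given test suite."""
    return [name for name, fn in candidates.items() if passes(fn, tests)]


weak_suite = [(-3, 3)]
strong_suite = [(-3, 3), (4, 4), (-7, 7), (0, 0)]

bug_detected = not passes(buggy_abs, weak_suite)
weak_result = plausible_patches(weak_suite)
strong_result = plausible_patches(strong_suite)
```

With the single-test suite, all three candidates are accepted, including two incorrect ones; only the stronger suite isolates the correct fix.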

Artificial Intelligence (AI) Agents

The development of AI agents has made substantial strides, introducing many advanced methodologies to automate tasks. AutoGPT [40], AgentGPT [41], and MetaGPT [9] employ an assembly line paradigm, where diverse roles are assigned to various AI agents, efficiently decomposing complex tasks into simpler subtasks through collaborative work. Dify [42] and FastGPT [43] are LLM application development platforms that combine the concepts of Backend-as-a-Service and LLMOps to enable developers to quickly build production-grade generative AI applications. Using these platforms, even non-technical personnel can participate in the definition and data operations of AI applications. SWE-agent [8] enables LLMs to interact with the programming environment to automatically solve GitHub issues by pre-defining multiple ACIs. CodeR defines detailed and decoupled agent roles (e.g., reproducer, programmer, and tester) along with their corresponding fine-grained actions (e.g., reproducing, editing code, and testing code). Such an approach facilitates resolving complex issues through collaboration between agents.

6 Conclusion and Future Works

This paper proposes CodeR, which excels at resolving issues. It demonstrates the importance of providing plans that mimic human problem-solving procedures for issue resolving. CodeR relies on pre-specified task graphs, which convert the planning task into a simpler decision task for LLMs and guarantee that the chosen plan is executed exactly. With task graphs, advanced software engineering skills such as fault localization, mining similar issues, and web search can be added to a pre-defined graph by editing its JSON-format description, without any code changes. CodeR’s pre-defined plans encode experience provided by human experts, which we believe is one of the key factors in resolving issues. In the future, we will build a comprehensive set of plans to resolve more issues.

References

  • [1] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
  • [2] Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. Large language models meet nl2code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7443–7464, 2023.
  • [3] Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. A survey on language models for code. arXiv preprint arXiv:2311.07989, 2023.
  • [4] Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. A survey of large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372, 2023.
  • [5] OpenAI. Hello gpt-4o. 2024. https://openai.com/index/hello-gpt-4o.
  • [6] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [7] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. arXiv preprint arXiv:2404.05427, 2024.
  • [8] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024.
  • [9] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023.
  • [10] Ying Wen, Yaodong Yang, Rui Luo, and Jun Wang. Modelling bounded rationality in multi-agent interactions by generalized recursive reasoning. arXiv preprint arXiv:1901.09216, 2019.
  • [11] Hui Su, Xiaoyu Shen, Rongzhi Zhang, Fei Sun, Pengwei Hu, Cheng Niu, and Jie Zhou. Improving multi-turn dialogue modelling with utterance rewriter. arXiv preprint arXiv:1906.07004, 2019.
  • [12] Paul Gauthier. Aider, ai pair programming in your terminal. https://aider.chat, 2024.
  • [13] Eric Zelikman, Qian Huang, Gabriel Poesia, Noah Goodman, and Nick Haber. Parsel: Algorithmic reasoning with language models by composing decompositions. Advances in Neural Information Processing Systems, 36:31466–31523, 2023.
  • [14] Daoguang Zan, Ailun Yu, Wei Liu, Dong Chen, Bo Shen, Wei Li, Yafen Yao, Yongshun Gong, Xiaolin Chen, Bei Guan, et al. Codes: Natural language to code repository via multi-layer sketch. arXiv preprint arXiv:2403.16443, 2024.
  • [15] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  • [16] W Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. A survey on software fault localization. IEEE Transactions on Software Engineering, 42(8):707–740, 2016.
  • [17] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. On the accuracy of spectrum-based fault localization. In Testing: Academic and industrial conference practice and research techniques-MUTATION (TAICPART-MUTATION 2007), pages 89–98. IEEE, 2007.
  • [18] James A Jones, Mary Jean Harrold, and John Stasko. Visualization of test information to assist fault localization. In Proceedings of the 24th international conference on Software engineering, pages 467–477, 2002.
  • [19] Daming Zou, Jingjing Liang, Yingfei Xiong, Michael D Ernst, and Lu Zhang. An empirical study of fault localization families and their combinations. IEEE Transactions on Software Engineering, 47(2):332–347, 2019.
  • [20] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [21] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [22] Wei Tao, Yucheng Zhou, Wenqiang Zhang, and Yu Cheng. Magis: Llm-based multi-agent framework for github issue resolution. arXiv preprint arXiv:2403.17927, 2024.
  • [23] Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. Genprog: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54–72, 2011.
  • [24] Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. Tbar: Revisiting template-based automated program repair. In Proceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis, pages 31–42, 2019.
  • [25] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. Semfix: Program repair via semantic analysis. In 2013 35th International Conference on Software Engineering (ICSE), pages 772–781. IEEE, 2013.
  • [26] Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis, pages 298–309, 2018.
  • [27] Anil Koyuncu, Kui Liu, Tegawendé F Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon. Fixminer: Mining relevant fix patterns for automated program repair. Empirical Software Engineering, 25:1980–2024, 2020.
  • [28] Qi Xin and Steven P Reiss. Leveraging syntax-related code for automated program repair. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 660–670. IEEE, 2017.
  • [29] Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. Context-aware patch generation for better automated program repair. In Proceedings of the 40th international conference on software engineering, pages 1–11, 2018.
  • [30] Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. Sketchfix: a tool for automated program repair approach using lazy candidate generation. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 888–891, 2018.
  • [31] Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. Avatar: Fixing semantic bugs with fix patterns of static analysis violations. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 1–12. IEEE, 2019.
  • [32] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. An empirical investigation into learning bug-fixing patches in the wild via neural machine translation. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 832–837, 2018.
  • [33] Saikat Chakraborty, Miltiadis Allamanis, and Baishakhi Ray. Codit: Code editing with tree-based neural machine translation. arXiv preprint arXiv:1810.00314, 2018.
  • [34] Yi Li, Shaohua Wang, and Tien N Nguyen. Dlfix: Context-based code transformation learning for automated program repair. In Proceedings of the ACM/IEEE 42nd international conference on software engineering, pages 602–614, 2020.
  • [35] Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. Coconut: combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis, pages 101–114, 2020.
  • [36] Nan Jiang, Thibaud Lutellier, and Lin Tan. Cure: Code-aware neural machine translation for automatic program repair. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 1161–1173. IEEE, 2021.
  • [37] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. A syntax-guided edit decoder for neural program repair. In Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, pages 341–353, 2021.
  • [38] He Ye, Matias Martinez, and Martin Monperrus. Neural program repair with execution-based backpropagation. In Proceedings of the 44th international conference on software engineering, pages 1506–1518, 2022.
  • [39] Qihao Zhu, Zeyu Sun, Wenjie Zhang, Yingfei Xiong, and Lu Zhang. Tare: Type-aware neural program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1443–1455. IEEE, 2023.
  • [40] Hui Yang, Sifu Yue, and Yunzhong He. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.
  • [41] Assemble, configure, and deploy autonomous ai agents in your browser. GitHub, 2023. https://github.com/reworkd/AgentGPT.
  • [42] The innovation engine for generative ai applications. Dify.AI, 2024. https://dify.ai.
  • [43] Empower ai with your expertise. labring, 2024. https://fastgpt.run.
Figure 4: The system prompt of the ‘manager’ agent. {command_docs} is obtained by parsing YAML files, which includes the command’s signature, docstring, arguments, end_name, etc.
Figure 5: The instance prompt of the ‘manager’ agent. {plans} refers to all JSON-format plans in Figure 3.
Figure 6: The system prompt of the ‘reproducer’ agent. {command_docs} is obtained by parsing YAML files, which includes the command’s signature, docstring, arguments, end_name, etc.
Figure 7: The instance prompt of the ‘reproducer’ agent. {issue} is the issue that needs to be resolved.
Figure 8: The system prompt of the ‘fault localizer’ agent. {command_docs} is obtained by parsing YAML files, which includes the command’s signature, docstring, arguments, end_name, etc.
Figure 9: The instance prompt of the ‘fault localizer’ agent. {location} refers to the top 5 function-level localization results of both fault localization and BM25.
Figure 10: The system prompt of the ‘editor’ agent. {command_docs} is obtained by parsing YAML files, which includes the command’s signature, docstring, arguments, end_name, etc.
Figure 11: The instance prompt of the ‘editor’ agent. {issue} is the issue that needs to be resolved. {location} refers to the top 5 function-level localization results of both fault localization and BM25.
Figure 12: The system prompt of the ‘verifier’ agent. {command_docs} is obtained by parsing YAML files, which includes the command’s signature, docstring, arguments, end_name, etc.
Figure 13: The instance prompt of the ‘verifier’ agent. {issue} is the issue that needs to be resolved.
Figure 14: Prompt template for communication between multiple agents. {conclusion} refers to the summary report passed from the last agent, and {history conclusion} refers to the reports from all other agents in history.