Kevin Gilpin for AppMap

How AppMap Navie solved the SWE bench AI coding challenge

AppMap Navie is an AI coding assistant that you can use directly in your VSCode or JetBrains code editor.

SWE Bench is a benchmark from Princeton University that assesses AI language models and agents on their ability to solve real-world software engineering issues. It's made up of 2,294 issues from 12 popular Python repositories, along with their human-coded solutions and test cases. It is considered to be the most difficult of the well-known coding benchmarks.

AppMap Navie recently posted a score of 14.6% on SWE Bench, ahead of Amazon Q and 8 other tools. We were able to process the entire benchmark in under 4 hours, at the lowest recorded cost of operation: up to 95% less expensive than other solvers.

How did we do it? Read on for useful techniques that you can apply to your own AI programming.

Why basic solvers fail

The easiest way to use an LLM to solve a code issue is simply to send the issue description to the LLM along with all the code files and prompt the LLM to generate a patch file. This is basically what the first generation of SWE bench solvers attempted to do. However, the solution rate of this approach is very low (single-digit percentages). Why?

1) Wrong context: Most LLMs have a context limit which is too small to load the entire codebase. So, some guesses have to be made about which files to give the LLM. And when those guesses are wrong, the LLM fails to generate the right solution.

2) Failed patching: LLMs are not good at creating patch files. Most LLM-generated patch files will fail to apply.

3) Broken code: LLMs will generate code that is malformed. It won't even run cleanly, never mind pass the test cases.

4) Wrong code design: The LLM does not understand the project design and architecture. So, it tries to solve the problem in the wrong place; or it fixes one problem while creating another.

You can see some of these early solvers on the SWE bench leaderboard:

Early SWE Bench solvers

Generation 2 - Agentic architecture

The next generation of solvers adopted a more complex architecture, in an effort to solve the problems above.

Basically, the idea of an "agent" is to give the LLM a wider variety of tools, and then run a loop in which the LLM chooses a tool and examines the results of using it.

Tools do things like:

  • Search the code for a keyword
  • Inspect a file
  • Run a program and examine the console output
  • Edit a file

Agents do substantially better on the benchmark:

Agents

However, most of the "agentic" solutions only appear on the SWE Bench "Lite" leaderboard. Why is that?

1) 💸 Cost: Every tool an agent uses consumes tokens. Tokens cost money. Agentic loops use tokens over and over.
2) 🐢 Speed: By design, agentic solvers can take a lot of time to explore the problem space. They can backtrack and repeat things they've already done. They can get stuck.

AppMap Navie architecture - Semi-agentic

Agents have a higher pass rate than basic solvers, but they are slow and expensive. AppMap Navie takes an intermediate approach, in which the solver is provided with powerful capabilities:

  • Rich, but selective, code context to enable retrieval-augmented generation (RAG) architecture.
  • A powerful and reliable tool for making file edits.
  • Self-healing feedback for fixing code.

Navie architecture

Code context

It's inefficient and expensive to send huge amounts of code context to the LLM. Embeddings are slow and expensive to generate. But without the right code context, the LLM-generated solution will fail.

Navie uses a technique called "client-side RAG", in which the code is organized and searched locally. Client-side compute is fast, cheap, and much more efficient than sending huge token payloads to the LLM or building expensive embeddings.
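This post doesn't spell out how Navie's local search works, so the following is only a sketch of the general idea: rank files on the client by how many tokens they share with the issue text, and send only the top few to the LLM. The function names and the sample files are made up, and plain keyword overlap is a stand-in for whatever ranking a real tool uses.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Split text into lowercase identifier-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def rank_files(issue: str, files: Dict[str, str], top_k: int = 2) -> List[str]:
    """Return up to top_k paths whose contents overlap most with the issue."""
    query = set(tokenize(issue))
    scores = {}
    for path, source in files.items():
        counts = Counter(tokenize(source))
        # Score = number of occurrences of any issue token in the file.
        scores[path] = sum(counts[token] for token in query)
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [path for path in ranked if scores[path] > 0][:top_k]


# Invented sample data, for illustration only.
files = {
    "auth/login.py": "def login(user, password):\n    return check_password(user, password)\n",
    "auth/password.py": "def check_password(user, password):\n    return bool(password)\n",
    "billing/invoice.py": "def total(invoice):\n    return sum(invoice.lines)\n",
}
print(rank_files("login fails when password is empty", files))
```

Everything here runs locally with no embeddings and no LLM tokens spent; only the selected files would go into the prompt.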

Planning

With the right context selected, it's time to get right to code generation - right?

Wrong. Before code generation comes planning. Human developers don't dive into coding without some kind of plan. Building an understanding of the system architecture is an essential step. Humans can't skip it, and AI coders shouldn't either.

So, Navie performs an explicit planning step, in which the issue description is combined with the context to produce a detailed plan. The plan includes:

  • A restatement of the problem.
  • A high-level solution description.
  • A list of files to be modified.
  • For each file, a description (no code yet) of how that file will be changed.
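As a rough illustration of those four parts (this is not Navie's actual plan format), a plan could be held in a small data structure and rendered to text for the code-generation step. The `Plan` and `FileChange` classes and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileChange:
    path: str
    description: str  # prose only: no code is written at the planning stage


@dataclass
class Plan:
    problem: str   # restatement of the problem
    solution: str  # high-level solution description
    changes: List[FileChange] = field(default_factory=list)

    def files_to_modify(self) -> List[str]:
        return [change.path for change in self.changes]

    def to_prompt(self) -> str:
        """Render the plan as plain text for the code-generation step."""
        lines = [f"Problem: {self.problem}", f"Solution: {self.solution}", "Files:"]
        lines += [f"- {c.path}: {c.description}" for c in self.changes]
        return "\n".join(lines)


# Invented example values, for illustration only.
plan = Plan(
    problem="Empty passwords are accepted at login.",
    solution="Reject empty passwords before checking credentials.",
    changes=[FileChange("auth/login.py", "Validate that the password is non-empty.")],
)
print(plan.to_prompt())
```

Keeping the file list explicit means the later editing step knows exactly which files it is allowed to touch.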

Here's an example of a Navie-generated plan.

File editing

Now, with the plan in hand, the LLM is ready to change code files.

Navie doesn't ask the LLM to generate patch files, because that approach doesn't work reliably. Instead, the LLM generates a "search / replace" pair of code snippets. This works most of the time, and a simple retry loop handles most of the cases where it doesn't.
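Here is a minimal sketch of the search / replace idea with a retry loop. It is an illustration under assumptions, not Navie's code: `apply_edit`, `edit_with_retry`, and `EditError` are hypothetical names, and `propose` stands in for the LLM call that returns a (search, replace) pair given feedback from the previous failure.

```python
from typing import Callable, Tuple


class EditError(Exception):
    """Raised when a search snippet cannot be applied unambiguously."""


def apply_edit(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search`; fail if missing or ambiguous."""
    count = source.count(search)
    if count != 1:
        raise EditError(f"expected 1 match for search text, found {count}")
    return source.replace(search, replace, 1)


def edit_with_retry(
    source: str,
    propose: Callable[[str], Tuple[str, str]],
    max_attempts: int = 3,
) -> str:
    """Ask for a (search, replace) pair, feeding the last error back on failure."""
    feedback = ""
    for _ in range(max_attempts):
        search, replace = propose(feedback)
        try:
            return apply_edit(source, search, replace)
        except EditError as err:
            feedback = str(err)  # the next proposal can react to this message
    raise EditError(f"gave up after {max_attempts} attempts: {feedback}")


# Stand-in for the LLM: the first proposal gets the whitespace wrong.
source = "def area(w, h):\n    return w + h\n"
attempts = iter([("return w+h", "return w * h"), ("return w + h", "return w * h")])
print(edit_with_retry(source, lambda feedback: next(attempts)))
```

Unlike a patch file, a search / replace pair carries no line numbers or hunk headers that have to line up, so there is less for the LLM to get wrong.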

Here are the Navie-generated code changes that implement the Plan.

Lint repair

The LLM still might get something wrong. Common cases include:

  • Mistakes with indenting (particularly with Python).
  • Missing imports.

The Navie solver runs a linter, then feeds the linter errors back into the AI code editor. Most lint errors can be fixed this way.
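The feedback loop can be sketched as follows. The post doesn't name the linter Navie runs, so this example substitutes Python's built-in `compile()`, which only catches syntax and indentation errors (finding a missing import would need a real linter). `repair_lint` is a hypothetical name, and `fix` stands in for the LLM-backed editor.

```python
from typing import Callable, List, Tuple


def lint(source: str, filename: str = "<edited>") -> List[str]:
    """Stand-in linter: report syntax and indentation errors only."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return [f"{filename}:{err.lineno}: {err.msg}"]
    return []


def repair_lint(
    source: str,
    fix: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Feed lint errors to `fix` until the code is clean or rounds run out."""
    for _ in range(max_rounds):
        errors = lint(source)
        if not errors:
            break
        source = fix(source, errors)
    return source, lint(source)


# Stand-in for the LLM editor: returns a correctly indented version.
broken = "def f(x):\nreturn x + 1\n"
fixed, remaining = repair_lint(broken, lambda src, errs: "def f(x):\n    return x + 1\n")
print(remaining)
```

Returning the remaining errors alongside the source lets the caller decide whether to keep a solution that could not be fully repaired.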

An example of lint errors fixed by Navie.

Test repair

Still not done! If the solution generated by Navie breaks existing tests, it's probably not going to fix the issue properly. So the Navie solver runs the application test cases to catch and fix any incompatibilities that may have been introduced.
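One way to detect such breakage is to compare failing tests before and after the edit. The sketch below uses Python's standard `unittest` runner as a stand-in, since the post doesn't say which test runner the solver uses. The `slugify` functions and the test class are invented for the example, and the step that feeds failures back to the LLM is left out.

```python
import io
import unittest
from typing import Callable, List, Type


def make_tests(slugify: Callable[[str], str]) -> Type[unittest.TestCase]:
    """Bind a small (hypothetical) test suite to one version of the code."""

    class SlugTests(unittest.TestCase):
        def test_lowercases(self):
            self.assertEqual(slugify("Hello"), "hello")

        def test_replaces_spaces(self):
            self.assertEqual(slugify("a b"), "a-b")

    return SlugTests


def failing_tests(case: Type[unittest.TestCase]) -> List[str]:
    """Run one TestCase class quietly; return names of failed or errored tests."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return sorted(t.id().split(".")[-1] for t, _ in result.failures + result.errors)


def regressions(before: List[str], after: List[str]) -> List[str]:
    """Tests that fail after the edit but did not fail before it."""
    return [name for name in after if name not in before]


def base_slugify(s: str) -> str:
    return s.lower().replace(" ", "-")


def edited_slugify(s: str) -> str:
    return s.replace(" ", "-")  # the edit accidentally dropped lowercasing


baseline = failing_tests(make_tests(base_slugify))
broken = regressions(baseline, failing_tests(make_tests(edited_slugify)))
print(broken)
```

Comparing against a baseline matters because some tests in a real repository may already fail before any edit is made, and those should not be blamed on the solution.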

Now, it's ready

Now a patch file is created by simply diff-ing the AI-edited code with the Git base revision. This patch file is submitted to the SWE Bench harness for evaluation.
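A sketch of that last step, assuming nothing about the solver's internals: in a Git working tree the same result would typically come from `git diff`, but to keep the example self-contained it uses Python's standard `difflib` on two in-memory versions of a file. `make_patch` and the sample file are hypothetical.

```python
import difflib


def make_patch(path: str, base: str, edited: str) -> str:
    """Build a unified diff between the base and edited contents of one file."""
    diff = difflib.unified_diff(
        base.splitlines(keepends=True),
        edited.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Invented file contents, for illustration only.
base = "def area(w, h):\n    return w + h\n"
edited = "def area(w, h):\n    return w * h\n"
print(make_patch("calc.py", base, edited))
```

Because the patch is derived mechanically from the edited files, it is well-formed by construction, which sidesteps the "failed patching" problem described earlier.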

How is this so efficient?

The Navie solver runs for about 1/3 the cost of most other solvers, and it's 95% cheaper than some of the most intensive agentic solvers on the benchmark (of those that post their costs; many don't 🙁).

  • Efficient client-side RAG context saves $$ on embeddings and LLM tokens.

  • Lint repair and test repair fix solutions that might be almost, but not quite, correct.

  • A smaller "tool" suite and a linear approach to solving the problem prevent the LLM from wandering down dead ends or getting lost in pointless loops.

Dollar cost SWE bench

Try Navie yourself!

Navie is available today, with no wait list. Here's how to get Navie, or learn more about AppMap:

⬇️ Download AppMap Navie for VSCode and JetBrains: https://appmap.io/get-appmap
⭐ Star AppMap on GitHub: https://github.com/getappmap
📰 Follow on LinkedIn: https://www.linkedin.com/company/appmap
💬 Join AppMap Slack: https://appmap.io/slack
ℹ️ Read the AppMap docs: https://appmap.io/docs

Top comments (1)

Kevin Gilpin • Edited

Update!

AppMap Navie + GPT-4o is still the #1 Open and Verified solver on SWE Bench.