I recently built a coding agent using Llama 3.3 70B (see the demo on X). While my agent's 5% score on SWE Bench Lite isn't impressive compared to the leaders (48.33%), I'd like to share some of my learnings in case there's something here you didn't already know.
Why do this
If you take a look at the SWE Bench Lite leaderboard, 80-90% of the entries use closed models. I want to see how far we can push open source models, and I'm excited to see the point where people are using open source models for real coding-assistant use cases.
Key Learnings
1. Keep It Simple
Sticking to the simplest (sometimes brute-force) solutions gave me a lot more control and visibility into the agent's behavior. For example, I could quickly gather operational metrics using simple string searches over logs (example counts below, with a sketch of the script after them):
Syntax error: 46
File not found: 114
Total errors (post tool call): 122
Total tool calls: 670
list_files tool calls: 245
edit_file tool calls: 78
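The kind of script behind those counts is deliberately throwaway. Here's a minimal sketch of the idea; the log directory and the exact search strings are hypothetical placeholders, not the agent's real log format:

```python
# Count occurrences of interesting substrings across agent run logs.
# "logs/" and the pattern strings below are hypothetical placeholders.
from pathlib import Path

PATTERNS = {
    "Syntax error": 0,
    "File not found": 0,
    "tool_call": 0,
}

for log_file in Path("logs").glob("*.log"):
    text = log_file.read_text(errors="ignore")
    for pattern in PATTERNS:
        PATTERNS[pattern] += text.count(pattern)

for pattern, count in PATTERNS.items():
    print(f"{pattern}: {count}")
```

Not sophisticated, but it answers "what is the agent actually doing?" in seconds.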
For evaluation, SWE Bench Lite contains 12 different repositories, each repository has multiple versions, and each repository + version pair has a different set of startup commands, Python version, and dependencies. So, instead of running the full SWE Bench Lite suite from the start, I began with 1 repository and 1 version, and then slowly introduced more versions and repositories as I needed them.
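Subsetting the benchmark like this is straightforward with the Hugging Face datasets package. A minimal sketch, where the specific repo and version values are just illustrative choices:

```python
# Subset SWE-bench Lite to a single repository + version so the
# environment setup only has to work for one configuration.
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

# Example filter: start with one repo and one version, then widen later.
subset = dataset.filter(
    lambda row: row["repo"] == "django/django" and row["version"] == "3.0"
)

print(f"{len(subset)} instances selected out of {len(dataset)}")
for row in subset:
    print(row["instance_id"])
```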
2. The Power of Raw Prompts
Raw prompts seem intimidating at first, but they're pretty straightforward and give you way more power. For instance, none of the chat completion APIs I could find for Llama allow text content alongside tool calls. Looking under the hood, the standard Meta tool-call prompt explicitly states not to include any text alongside tool calls:
If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.
With raw prompts, not only could I get thinking alongside tool calls, I also got far fewer tool-call parsing errors, since I controlled the parsing logic and could accept tool calls that some implementations would otherwise have rejected.
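To give a flavour of what "raw prompt" means here: you assemble the token-level prompt yourself, send it to a plain completions endpoint, and parse the output however you like. Below is a rough sketch using the documented Llama 3 header tokens; the lenient regex and the tool-call format details are simplified, not the agent's exact implementation:

```python
import re

SYSTEM = "You are a coding agent. You may think out loud, then call a tool."
USER = "Fix the failing test in utils/dates.py"

# Llama 3 turn/header tokens, assembled by hand instead of via a chat API.
prompt = (
    "<|begin_of_text|>"
    f"<|start_header_id|>system<|end_header_id|>\n\n{SYSTEM}<|eot_id|>"
    f"<|start_header_id|>user<|end_header_id|>\n\n{USER}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Lenient parser: accept free-form thinking text around the bracketed
# call instead of rejecting the whole response.
TOOL_CALL_RE = re.compile(r"\[([a-zA-Z_]\w*)\((.*?)\)\]", re.DOTALL)

def parse_tool_calls(completion: str):
    """Return (thinking_text, [(tool_name, raw_args), ...])."""
    calls = TOOL_CALL_RE.findall(completion)
    thinking = TOOL_CALL_RE.sub("", completion).strip()
    return thinking, calls

# Example completion that mixes text with a tool call.
completion = (
    "The error mentions dates.py, let me look at it first.\n"
    '[view_file(path="utils/dates.py")]'
)
print(parse_tool_calls(completion))
```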
I suspect this is a big advantage open source models have over closed source models - you have access to the prompt formats. Ashwin, one of the core maintainers of Llama Stack, has also made this point. Meta posts its prompt formats, and you can look at llama-stack for a reference implementation.
3. Less Context Can Be More
More context isn't always better. Initially, I included the entire repository file tree in the context, thinking it would help navigation. This not only cost me $250 in evaluation runs but actually hurt performance. Excluding the file tree completely (sketched after the list below) led to:
- 2-3x performance boost in SWE bench pass rate
- 10x reduction in costs (from ~$30 to ~$3 for 50 instances)
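For context on what changed: the earlier version walked the repo and dumped every path into the system prompt, while the later version dropped that and let the agent discover files via list_files calls instead. A rough sketch of the difference; the prompt wording here is made up:

```python
import os

def build_system_prompt(repo_root: str, include_file_tree: bool) -> str:
    """Build the agent's system prompt, optionally with the full file tree."""
    prompt = "You are a coding agent working in the repository below.\n"
    if include_file_tree:
        # Old approach: dump every path up front. Large, expensive, and
        # in my runs it actually hurt the pass rate.
        paths = []
        for dirpath, _dirnames, filenames in os.walk(repo_root):
            for name in filenames:
                paths.append(os.path.relpath(os.path.join(dirpath, name), repo_root))
        prompt += "Repository file tree:\n" + "\n".join(sorted(paths)) + "\n"
    # New approach: no tree; the agent explores with list_files on demand.
    return prompt
```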
4. Tool Names Matter
Something as simple as renaming a tool from replace_in_file to edit_file decreased empty patch files by 22% (from 77% to 60.3%). The agent also complained less about not having the right tools to solve problems. I suspect there are a lot of opportunities for improvement here - finding tool names and interfaces that LLMs find easy to use.
Current Limitations
To be honest - at 5% on SWE Bench Lite, this agent isn't ready for production use. It's still far behind agents using Claude (in the 30-50% range), and it's still behind the top open source performers on SWE Bench Lite:
- IBM AI Agent SWE-1.0: 23.67%
- Moatless tree search (Llama 3.1 70B): 17.7%
- OpenHands CodeAct (Llama 3.1 405B): 14%
- OpenHands CodeAct (Llama 3.1 70B): 9%
But I'm optimistic - I see a lot of room for improvement in areas like:
- High rate of empty patch files (59%)
- Tool call errors
- Unproductive looping
So I'm keen to see how far I can push this agent.
Conclusion
So those were the key learnings from this project - I hope you took something away from them. The code is open source at github.com/aidando73/l2-llama.
Feel free to comment below or message me on x.com if you have any feedback or questions.