Abstract: This article presents findings from a four-month blind trial of LLM code completions in a professional development environment. By tracking over 94,000 code suggestions across three leading AI assistants, I found that GitHub Copilot delivers the highest quality suggestions (3.4% acceptance rate) while Supermaven provides the greatest volume of useful completions. Codeium performed significantly worse (0.5% acceptance rate) and was discontinued after seven days. This data-driven approach reveals meaningful differences in LLM performance for practical coding assistance.
Large language models (LLMs) have transformed the coding experience with capabilities ranging from auto-completion to chat interfaces and agentic file editing. In this post, I'll share results from a systematic evaluation of three popular AI code completion tools in a professional development environment.
Rather than relying on subjective impressions, I conducted a rigorous blind trial to determine which model genuinely performs best in day-to-day coding work. This methodology eliminates the confirmation bias that often affects tool selection ("I think this one feels better") and provides concrete data about which assistant actually delivers the most value.
To conduct a randomized blind trial of large language model code completions, I established four key requirements:
- Configure multiple LLMs for code completions in my development environment
- Display completion suggestions without revealing which LLM generated them
- Randomize the presentation order to eliminate selection bias
- Collect comprehensive usage data over an extended period
LLMs Under Evaluation
I selected three popular AI code assistants for this experiment:
- GitHub Copilot: Microsoft and GitHub's AI pair programmer, built on OpenAI technology, trained on public code repositories, and available as a paid subscription service ($10/month or $100/year).
- Codeium: A free AI coding assistant from Exafunction that promises high-quality completions with low latency.
- Supermaven: An AI code completion tool that claims to offer deeper contextual understanding and multi-line completions. It is available in both free and paid tiers; the free tier was used for this trial.
By comparing these different systems, I aimed to determine which would be most effective for my daily development workflow.
Implementation
In this section, I'll explain the technical approach used to create a fair testing environment for these AI code completions. If you're primarily interested in the findings rather than the methodology, feel free to skip directly to the results section.
I use Neovim daily; it is highly customizable and can be configured to meet my needs. The configuration samples shown use Lazy.nvim.
LLM Configurations
Codeium, GitHub Copilot, and Supermaven are configured as completion sources. Each plugin is enabled only when its corresponding environment variable is set, which makes it easy to toggle sources between sessions:
{
"Exafunction/codeium.nvim",
enabled = os.getenv("CODEIUM") == "true",
dependencies = {
"nvim-lua/plenary.nvim",
"hrsh7th/nvim-cmp",
},
config = true,
},
{
"zbirenbaum/copilot.lua",
enabled = os.getenv("COPILOT") == "true",
cmd = "Copilot",
event = "InsertEnter",
opts = {
suggestion = { enabled = false },
panel = { enabled = false },
}
},
{
"zbirenbaum/copilot-cmp",
enabled = os.getenv("COPILOT") == "true",
dependencies = { "copilot.lua", "lspkind.nvim" },
config = true,
init = function()
vim.api.nvim_set_hl(0, "CmpItemKindCopilot", { fg = "#6CC644" })
end,
},
{
"supermaven-inc/supermaven-nvim",
enabled = os.getenv("SUPERMAVEN") == "true",
opts = {
disable_inline_completion = true,
},
init = function ()
vim.api.nvim_set_hl(0, "CmpItemKindSupermaven", { fg = "#6CC644" })
end,
}
UI Configuration
Each LLM source is shown in the UI only as a symbol, and the same symbol is used for every LLM, so a suggestion's origin cannot be identified while coding:
cmp.setup({
-- only showing formatting option for brevity
formatting = {
fields = {'menu', 'abbr', 'kind'},
format = require('lspkind').cmp_format({
mode = "symbol",
maxwidth = 50,
show_labelDetails = true,
ellipsis_char = '...',
symbol_map = {
Supermaven = "",
Codeium = "",
Copilot = "",
},
})
},
})
Randomize the Order of Completion Suggestions
To eliminate position bias (the tendency to select the first suggestion), I implemented a system to randomize the order in which completion suggestions appear. Rather than defining a fixed order of completion sources in the setup function, I dynamically shuffle them each time a new buffer is loaded:
cmp.setup({
-- only showing sources option for brevity
- sources = cmp.config.sources({
- { name = 'nvim_lsp' },
- { name = 'luasnip' },
- { name = 'path' },
- }, {
- { name = 'nvim-cmp-ts-tag-close' },
- { name = "codeium" },
- { name = 'copilot'},
- { name = 'supermaven'},
- { name = 'nvim_lua' },
- { name = 'spell' },
- }, {
- { name = 'buffer' },
- }),
-
})
This approach ensures that no AI completion source consistently appears first in the suggestion list, which is crucial for unbiased evaluation:
-- Fisher-Yates shuffle algorithm to randomize suggestion order
-- This is critical for reducing selection bias - if suggestions always appeared
-- in the same order, I might unconsciously favor the first option
local function shuffle(t)
local n = #t
for i = n, 2, -1 do
local j = math.random(1, i)
t[i], t[j] = t[j], t[i]
end
return t;
end
vim.api.nvim_create_autocmd('BufReadPre', {
callback = function(t)
local sources = cmp.config.sources({
{ name = 'nvim_lsp' },
{ name = 'luasnip' },
{ name = 'path' },
}, shuffle({
{ name = 'nvim-cmp-ts-tag-close' },
{ name = "codeium" },
{ name = 'copilot'},
{ name = 'supermaven'},
{ name = 'nvim_lua' },
{ name = 'spell' },
}), {
{ name = 'buffer' },
});
cmp.setup.buffer { sources = sources }
end
})
Fig 1: Example of the auto-completion UI showing suggestions for this.objectUrl
Data Collection System
To objectively measure each LLM's performance, I built a comprehensive data collection system using SQLite. This system silently recorded every completion suggestion and user acceptance decision during my normal development activities:
-- Setup lsqlite3
-- Run `luarocks install lsqlite3` if needed
local status, sqlite3 = pcall(require, 'lsqlite3')
if status then
local db_path = vim.fn.stdpath('data') .. '/completion-log.sqlite3'
local db = sqlite3.open(db_path)
-- Create a log table if it doesn't exist
db:exec[[
CREATE TABLE IF NOT EXISTS cmp_log (
id INTEGER PRIMARY KEY,
timestamp TEXT,
event TEXT,
multiline INTEGER,
source_name TEXT,
filetype TEXT
)
]]
print("Logging completions to " .. db_path)
-- Prepare the insert statement
local stmt = db:prepare("INSERT INTO cmp_log (timestamp, event, multiline, source_name, filetype) VALUES (?, ?, ?, ?, ?)")
-- Close the database connection when Neovim exits
vim.api.nvim_create_autocmd("VimLeavePre", {
callback = function()
stmt:finalize()
db:close()
end,
})
-- Table of source names to log
local log_sources = {
supermaven = true,
codeium = true,
copilot = true,
cmp_ai = true
}
-- Function to log completion events
local function log_completion(event, item, source_name, filetype)
local multiline = item.textEdit and item.textEdit.newText and item.textEdit.newText:find("\n") and 1 or 0
if log_sources[source_name] then
stmt:reset()
stmt:bind_values(os.date('%Y-%m-%d %H:%M:%S'), event, multiline, source_name, filetype)
stmt:step()
end
end
-- Register event listeners
cmp.event:on('menu_opened', function(evt)
for _, entry in ipairs(evt.window.entries) do
local source_name = entry.source.name
log_completion("completion_suggested", entry:get_completion_item(), source_name, entry.context.filetype)
end
end)
cmp.event:on('confirm_done', function(evt)
log_completion("completion_used", evt.entry:get_completion_item(), evt.entry.source.name, evt.entry.context.filetype)
end)
else
print(sqlite3)
end
SQLite is used because it's lightweight, requires no server setup, and provides a self-contained database solution ideal for this type of data collection. Additionally, Datasette can be used to easily query, visualize, and publish the data for later analysis.
The database schema tracked five key data points for each completion event:
CREATE TABLE IF NOT EXISTS cmp_log (
id INTEGER PRIMARY KEY,
timestamp TEXT,
event TEXT,
multiline INTEGER,
source_name TEXT,
filetype TEXT
)
The timestamp is the date and time the event was logged and is useful for checking whether the effectiveness of the LLMs changes over time. The event is the name of the nvim-cmp event that was fired. The multiline field is a boolean value that is true if the completion spans multiple lines. The source_name is the name of the source that provided the completion, for example, Copilot. The filetype is the filetype of the buffer in which the completion was provided.
The system tracked two primary event types:
- completion_suggested: Recorded whenever a completion was offered
- completion_used: Recorded when I accepted a completion
This approach allowed me to calculate the critical acceptance rate metric (accepted completions ÷ suggested completions) that forms the foundation of this analysis.
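To illustrate how this metric can be derived, here is a minimal, standalone Lua sketch (not the exact analysis code used for this article) that computes per-source acceptance rates from the cmp_log table. It assumes the same lsqlite3 module and database path used by the logger above.

-- Standalone analysis sketch: acceptance rate per completion source.
-- Assumes the lsqlite3 module and the database path used by the logger above.
local sqlite3 = require('lsqlite3')
local db = sqlite3.open(vim.fn.stdpath('data') .. '/completion-log.sqlite3')
-- SQLite evaluates boolean expressions to 0 or 1, so SUM() counts matching rows
local query = [[
  SELECT source_name,
         SUM(event = 'completion_suggested') AS suggested,
         SUM(event = 'completion_used') AS used
  FROM cmp_log
  GROUP BY source_name
]]
for row in db:nrows(query) do
  local rate = row.suggested > 0 and (row.used * 100.0 / row.suggested) or 0
  print(string.format("%-12s %7d suggested %6d used %5.1f%% accepted",
    row.source_name, row.suggested, row.used, rate))
end
db:close()

This can be run inside Neovim (for example via :lua or nvim -l), or adapted into a plain Lua script by hard-coding the database path.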
Results
After collecting data for approximately four months across real-world development projects, clear patterns emerged in how these AI assistants performed in day-to-day coding scenarios.
The most immediate finding was Codeium's poor performance. It achieved only a 0.5% acceptance rate (meaning I accepted only 1 out of every 200 suggestions). This led me to discontinue it after seven days of evaluation.
Quantitative Results
Summary Statistics
LLM | Total Suggestions | Accepted Completions | Acceptance Rate | Avg Daily Suggestions | Avg Daily Acceptances |
---|---|---|---|---|---|
Supermaven | 60,508 | 1,171 | 1.9% | 293.3 | 5.7 |
Copilot | 27,395 | 921 | 3.4% | 133.5 | 4.5 |
Codeium | 7,091 | 32 | 0.5% | 745.4 | 3.4 |
These statistics reveal two distinct AI assistant strategies:
- Copilot: Offers fewer but higher quality suggestions (3.4% acceptance)
- Supermaven: Provides more aggressive suggestions with lower precision (1.9% acceptance)
Daily Used Count
Fig 2: Daily count of accepted completions by LLM. Note Supermaven's consistently higher daily usage (average: 5.7 completions) compared to Copilot (average: 4.5 completions).
Daily Suggested Count
Fig 3: Daily count of suggested completions by LLM. Supermaven offers significantly more suggestions (an average of 293 per day) than Copilot (134 per day).
Daily Acceptance Ratio
Fig 4: Daily acceptance ratio (accepted/suggested) by LLM. Copilot consistently maintains a higher acceptance ratio (around 3.4%), while Supermaven typically sees between 1.5% and 2% of its suggestions accepted.
Performance by Language
Analysis of completions by file type revealed interesting patterns (a sketch of the per-filetype query follows the table):
Language | Most Accurate LLM | Acceptance Rate |
---|---|---|
http | copilot | 8.3% |
zsh | copilot | 5.9% |
typescriptreact | copilot | 5.8% |
css | copilot | 5.6% |
handlebars | copilot | 5.0% |
jsonc | copilot | 5.0% |
lua | copilot | 4.4% |
make | copilot | 3.9% |
javascript.glimmer | copilot | 3.6% |
JavaScript/TypeScript | copilot | 2.8% |
conf | copilot | 2.4% |
yaml | supermaven | 1.4% |
git* | supermaven | 1.0% |
markdown | supermaven | 0.9% |
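For reference, a per-filetype breakdown like the table above can be produced with a similar query. The following is an illustrative, self-contained Lua sketch (not necessarily the exact query used to build this table), again assuming the lsqlite3 module and database path from the logging setup.

-- Sketch: acceptance rate per filetype and source.
local sqlite3 = require('lsqlite3')
local db = sqlite3.open(vim.fn.stdpath('data') .. '/completion-log.sqlite3')
local per_filetype = [[
  SELECT filetype, source_name,
         SUM(event = 'completion_suggested') AS suggested,
         SUM(event = 'completion_used') AS used
  FROM cmp_log
  GROUP BY filetype, source_name
  HAVING SUM(event = 'completion_suggested') > 0
  ORDER BY filetype, source_name
]]
for row in db:nrows(per_filetype) do
  print(string.format("%-22s %-11s %5.1f%%",
    row.filetype, row.source_name, row.used * 100.0 / row.suggested))
end
db:close()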
Conclusion
Out of Codeium, Copilot, and Supermaven, Codeium is the clear loser, with the lowest acceptance rate (0.5%).
Supermaven provided the highest volume of useful completions (1,171 accepted suggestions) but at the cost of efficiency: it generated 60,508 total suggestions, meaning 98.1% of its completions were rejected.
Copilot demonstrated the best balance of quality and quantity, with 921 accepted completions from just 27,395 suggestions for a 3.4% acceptance rate, more than twice that of Supermaven.
Limitations and Future Work
While this study provides valuable insights, several limitations should be acknowledged:
- Single-user perspective - Results primarily reflect my personal coding style, projects, and preferences
- Project-specific patterns - Performance likely varies across different codebases and domains
- Temporal factors - AI models are continuously improving; results may not reflect current capabilities
- Commercial focus - Only commercial LLMs were tested; promising open-source models weren't evaluated
- Data collection gaps - There were two notable gaps in data collection:
  - Eight days in July with no Copilot data (likely due to authentication expiration)
  - A period between October and December when I transitioned to a new development machine
Based on these findings and limitations, I've identified several opportunities for future work:
- Expanded model comparison: Integrate additional code completion tools, particularly open-source models via plugins like cmp-ai or minuet-ai.nvim.
- Methodological refinements: Switch from the menu_opened event to menu_closed for asynchronous completions to ensure all displayed suggestions are properly captured. This would provide a more accurate picture of which completions were actually seen and evaluated (a rough sketch follows this list).
- Multi-user study: Expand data collection to include other developers to better understand how these tools perform across different coding styles and domains.
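As a rough sketch of that refinement, the listener registration might look like the following. This assumes the menu_closed event exposes the same window.entries payload as menu_opened (worth verifying against the nvim-cmp source) and reuses the log_completion helper defined earlier.

-- Hypothetical change: log suggestions when the completion menu closes rather than
-- when it opens, so asynchronously arriving completions are still captured.
-- Assumes menu_closed passes the same { window = ... } payload as menu_opened.
cmp.event:on('menu_closed', function(evt)
  for _, entry in ipairs(evt.window.entries) do
    log_completion("completion_suggested", entry:get_completion_item(),
      entry.source.name, entry.context.filetype)
  end
end)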
Summary
This blind trial of LLM code completions revealed significant differences between the three tested models:
- Codeium: Discontinued after 7 days due to poor performance (0.5% acceptance rate)
- GitHub Copilot: Highest quality suggestions (3.4% acceptance rate)
- Supermaven: Highest volume of accepted suggestions but lower efficiency (1.9% acceptance rate)
The findings demonstrate that different LLMs have distinct strengths - Copilot excels at providing precise, focused suggestions that are more likely to be accepted, while Supermaven offers more aggressive suggestions covering a wider range of possibilities.
The quantitative approach revealed patterns that wouldn't be obvious from casual usage, such as the significant difference in acceptance rates between the tools (Copilot's 3.4% vs. Supermaven's 1.9%).
Based on these results, I've adopted a hybrid approach in my development workflow:
- Continue using Copilot for auto-completion.
- Trial the paid tier of Supermaven to see whether it improves the acceptance rate.
- Test new models to ensure I'm using the most effective tools.
This experiment demonstrates the value of data-driven decision making when selecting developer tools, and I look forward to expanding this methodology to evaluate additional LLM-based coding assistants in the future.