As a professional developer working daily with a massive codebase of millions of lines of code spread across more than 1,000 C# projects, I often find that locating the right pieces of code to modify is time-consuming. Recently I have been focused on the problem of code search, and I was particularly intrigued by the potential of GraphCodeBERT, as described in the research paper GraphCodeBERT: Pre-training Code Representations with Data Flow.
Encouraged by the promising results described in the paper, I decided to evaluate its capabilities. The pretrained model is available here, with a corresponding demo project hosted in the GitHub repository: GraphCodeBERT Demo.
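For anyone who wants to follow along, this is roughly how I load the checkpoint with Hugging Face transformers and turn a piece of text (a query or a code snippet) into a vector. Treat it as a minimal sketch rather than the demo's exact code: it ignores the data-flow input that GraphCodeBERT was pre-trained with and simply uses the [CLS] hidden state as the embedding.

```python
# Minimal sketch: load the pretrained checkpoint and embed a string.
# Note: this ignores GraphCodeBERT's data-flow input and uses the [CLS]
# hidden state as the embedding, which may differ from the demo project.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return a single vector for a natural language query or a code snippet."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token vector
```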
Diving Into Code Search
Initially, I went all in and vectorized the SeaGOAT repository, resulting in 193 Python function records stored in my Elasticsearch database. Using natural language queries, I attempted to find relevant functions by comparing their embeddings via cosine similarity. Unfortunately, largely the same functions kept coming back at the top for multiple, clearly distinct queries.
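To make the setup concrete, the pipeline boiled down to the following. The sketch below is simplified: the Elasticsearch indexing and query details are left out, the two records are placeholders for the 193 functions extracted from SeaGOAT, and it reuses the `embed` helper from the earlier snippet.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder records; in the real run these were the 193 functions from SeaGOAT,
# and the vectors were stored in Elasticsearch instead of a Python list.
function_records = [
    ("parse_args", "def parse_args(argv):\n    ..."),
    ("read_file", "def read_file(path):\n    ..."),
]

# Index step: embed every function once (uses the embed() helper defined above).
index = [(name, embed(code).numpy()) for name, code in function_records]

# Query step: embed the query and rank functions by cosine similarity.
query_vec = embed("read the contents of a file").numpy()
ranked = sorted(index, key=lambda rec: cosine_similarity(query_vec, rec[1]), reverse=True)
for name, vec in ranked[:5]:
    print(f"{name}: {cosine_similarity(query_vec, vec):.3f}")
```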
This led me to believe that the model likely requires fine-tuning for better performance. To test this hypothesis, I decided to take a simpler approach and use the demo project provided with the pretrained model.
Testing with a Controlled Dataset
The demo focuses on three Python functions:
1) download_and_save_image
```python
def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    with open(output_dir, 'wb') as f:
        f.write(r.content)
```
2) save_image_to_file
```python
def f(image, output_dir):
    with open(output_dir, 'wb') as f:
        f.write(image)
```
3) fetch_image
```python
def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    return
```
Modified Query Results
Below is a table of my findings from testing slightly modified queries against the three functions. Each cell shows the similarity score between the user query vector and the corresponding function vector.
User Query | Function 1 | Function 2 | Function 3 |
---|---|---|---|
Download an image and save the content in output_dir | 0.97 | 9.7e-05 | 0.03 |
Download and save an image | 0.56 | 0.0002 | 0.44 |
Retrieve and store an image | 0.004 | 7e-06 | 0.996 |
Get a photo and save it | 0.0001 | 4e-08 | 0.999 |
Save a file from URL | 0.975 | 6e-07 | 0.025 |
Process downloaded data and reshape it | 0.025 | 0.0002 | 0.975 |
Go to the moon and back as soon as possible | 0.642 | 0.006 | 0.353 |
Observations
From the table, it’s evident that the model correctly identifies the function only when the query is very specific and closely matches the original wording. When queries are slightly modified or synonyms are used, the results seem almost random. The same issue occurs with abstract queries or those unrelated to any function in the database.
Another suspicious detail is that function 2 (save_image_to_file) never scores above 0.006 for any query, while the scores in each row add up to roughly 1 and are split almost entirely between the other two functions. This raises the question of whether the model is capturing meaningful distinctions here, or whether the issue lies in how the embeddings or similarity scores are computed.
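If the demo indeed normalizes the raw matching scores across the candidates, for example with a softmax, that alone could explain the vanishing values. That is only my assumption about the demo code, but the sketch below shows the effect: a softmax exaggerates differences and pushes the weakest candidates toward zero even when the raw scores are not dramatically far apart.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical raw matching scores for one query against the three functions.
raw_scores = np.array([12.3, 3.1, 8.8])

print(softmax(raw_scores))  # -> roughly [0.97, 0.0001, 0.03]
# After normalization the weakest candidate is pushed to ~1e-4, even though its
# raw score is not zero. Raw cosine similarities, in contrast, stay in [-1, 1]
# and would not produce values like 4e-08 unless the vectors were almost
# exactly orthogonal.
```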
Concluding Thoughts
After experimenting with the demo version, I concluded that further exploration of this model for code search in larger repositories may not be worthwhile—at least not in its current form. It appears that code search based on natural language queries cannot yet be solved by a single AI model. Instead, a hybrid solution might be more effective, grouping classes or functions based on logical and business-related criteria and then searching these groups for code that addresses the specified problem.
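To make the hybrid idea slightly more concrete, here is one possible shape it could take. This is purely my own illustration, not an existing tool: the grouping rule and the `group_hint` argument are placeholders, and it reuses `embed`, `cosine_similarity`, and `index` from the earlier sketches.

```python
from collections import defaultdict

# Stage 1: group functions by some cheap, business-level criterion.
# The rule below (first token of the function name) is just a placeholder.
groups = defaultdict(list)
for name, vec in index:
    groups[name.split("_")[0]].append((name, vec))

# Stage 2: run the embedding-based ranking only inside the relevant group.
def hybrid_search(query: str, group_hint: str, top_k: int = 5):
    query_vec = embed(query).numpy()
    candidates = groups.get(group_hint, [])
    return sorted(
        candidates,
        key=lambda rec: cosine_similarity(query_vec, rec[1]),
        reverse=True,
    )[:top_k]

print([name for name, _ in hybrid_search("read the contents of a file", group_hint="read")])
```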
I plan to continue exploring this area. If you have any insights, suggestions, or experiences with code search models or techniques, please don’t hesitate to share them in the comments. Let’s discuss and learn together!