Introduction
I am experimenting with training an LLM on a codebase. My goal is to build a foundational model from which I can create more focused generative AIs for tasks such as code review or high-level documentation. I needed to start from a known-good source, even if it might be a little small for my ultimate goal, so that I could tell whether I was heading in the right direction. I began by setting up the training process for a BERT-style LLM; I chose BERT because I believe it is best suited to building an understanding of its source material.
Getting to work
Under the Microsoft username, there is a dataset called LCC_csharp. I started with it because the codebase I ultimately want to work with is also written in C#, but I quickly ran into a significant issue with the data.
// See the LICENSE file in the project root for more information. // // using System; using System.Runtime.InteropServices; using Windows.Foundation; #pragma warning disable 436 // Redefining types from Windows.Foundation namespace Windows.UI.Xaml.Media.Media3D
Where would you put the line breaks for the code snippet above?
The structured C# files had been compressed down to a single line. On its own that is not a huge problem, but it makes it hard to tell where a comment ends and the code itself begins. I wasn't going to sit there and untangle this by hand, so I decided to program my way out of the problem.
The Execution
My plan was to use the Roslyn analyzer, developed by one of the .NET teams at Microsoft, to do static analysis on the C# code. Roslyn has a concept known as Trivia: the parts of a file, such as whitespace, comments and preprocessor directives, that are not significant to compilation. Once the code is read into a syntax tree, it becomes a structured document, and different parts of the file can be queried and modified easily.
// Parse the raw file text into a syntax tree first
// ("fileText" here stands for the source string read from the dataset).
var tree = CSharpSyntaxTree.ParseText(fileText);
var commentTrivia = from t in tree.GetRoot().DescendantTrivia()
                    where t.IsKind(SyntaxKind.SingleLineCommentTrivia) ||
                          t.IsKind(SyntaxKind.MultiLineCommentTrivia) ||
                          t.IsKind(SyntaxKind.SingleLineDocumentationCommentTrivia)
                    select t;
The code block above finds the three main kinds of comment Trivia in the document. The next step is to remove those items and normalize the whitespace so the file takes on a more natural shape. I had intended to then save the document as is, line breaks and all, but I quickly found that this corrupted the Parquet format the data was originally stored in. After a lot of trial and error, I settled on removing all of the line breaks, effectively putting the code back on one line. This time, though, with the comments gone, it reads as one long line of code rather than a document where comments and code intermingle without any clear break between them.
using System;using System.Runtime.InteropServices;using Windows.Foundation;#pragma warning disable 436 // Redefining types from Windows.Foundationnamespace Windows.UI.Xaml.Media.Media3D
Not perfect, but a lot better than before.
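The whole cleanup step can be sketched roughly as follows. This is an illustrative reconstruction, not my exact script: the class and method names (`CommentStripper`, `Clean`) and the line-collapsing logic are my own placeholders, and it assumes the Microsoft.CodeAnalysis.CSharp NuGet package is referenced.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

static class CommentStripper
{
    // Removes comment trivia from a C# source string, then flattens the
    // result back onto a single line so it stays Parquet-friendly.
    public static string Clean(string source)
    {
        var tree = CSharpSyntaxTree.ParseText(source);
        var root = tree.GetRoot();

        var comments = root.DescendantTrivia()
            .Where(t => t.IsKind(SyntaxKind.SingleLineCommentTrivia)
                     || t.IsKind(SyntaxKind.MultiLineCommentTrivia)
                     || t.IsKind(SyntaxKind.SingleLineDocumentationCommentTrivia));

        // Replacing each comment with default trivia effectively deletes it.
        root = root.ReplaceTrivia(comments, (_, _) => default);

        // NormalizeWhitespace gives the tree a conventional layout again...
        var normalized = root.NormalizeWhitespace().ToFullString();

        // ...and collapsing the line breaks keeps each file on one row.
        var lines = normalized
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(l => l.Trim());
        return string.Join(" ", lines);
    }
}
```

Replacing trivia rather than rewriting the text by hand is the point of going through Roslyn: the syntax tree knows exactly where each comment starts and ends, even in a file that has been flattened to one line.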
Conclusion
Now the codebase can be tokenized from scratch and should contain only meaningful code. Comments are incredibly important for us humans to fully understand a piece of code, but I feel it's more important for a foundational LLM to be able to generate good code. From there I can reliably build Q&A, code-completion and documentation LLMs that fine-tune the base weights to be better at their individual tasks. Ideally I can then merge these into a Mixture of Experts model that is good at a variety of tasks, with every expert trained, or at least fine-tuned, on the specific codebase.