Davide Santangelo

Building a Homegrown LLM with Python: Training on Hacker News Data

Introduction

Large Language Models (LLMs) have transformed AI applications, from conversational agents to intelligent code assistants. While OpenAI’s GPT models are widely used, many developers want to understand how these models work and even train their own versions from scratch. In this in-depth guide, we will explore how to build a lightweight LLM using Python, train it on Hacker News data, optimize its performance, and deploy it for real-world usage.

What is a Language Model?

A Language Model (LM) is a type of artificial intelligence model that predicts the likelihood of a sequence of words in a given language. It learns patterns, grammar, and context from a large corpus of text data, enabling it to generate coherent and contextually relevant text. For example, given the phrase "I love to code in," an LM might predict "Python" as the next word, based on patterns observed in its training data.

LMs are used in various applications, including:

  • Text generation: Creating articles, stories, or dialogues.
  • Machine translation: Translating text from one language to another.
  • Text classification: Sentiment analysis or spam detection.
  • Autocomplete: Suggesting words or phrases in search engines or text editors.

In this article, we'll focus on building a simple autoregressive LM that predicts the next word in a sequence, trained specifically on Hacker News data.
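
To make next-word prediction concrete, here is a minimal sketch using the pre-trained gpt2 checkpoint from Hugging Face (the same model we fine-tune later); it prints the five tokens the model considers most likely to follow a prompt:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I love to code in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Turn the logits at the last position into a probability distribution over the vocabulary
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)).strip():>12} -> {prob.item():.3f}")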

What We’ll Cover

  • Understanding how LLMs work
  • Collecting and preprocessing Hacker News data
  • Tokenizing text and creating structured datasets
  • Training a transformer-based model using PyTorch and Hugging Face Transformers
  • Fine-tuning and optimizing for better performance
  • Evaluating model performance using loss metrics and perplexity
  • Deploying the model with FastAPI and making it accessible via an API
  • Improving the model with reinforcement learning and knowledge distillation
  • Scaling the model with distributed training
  • Reducing computational costs through quantization and pruning
  • Enhancing model security with adversarial training

Prerequisites

Before we start, install the necessary dependencies:

pip install torch transformers datasets tokenizers accelerate fastapi uvicorn matplotlib deepspeed bitsandbytes

We will use:

  • torch for deep learning computations
  • transformers for leveraging pre-built architectures like GPT-2
  • datasets for handling large text corpora efficiently
  • tokenizers for high-speed text processing
  • accelerate for device placement when loading large or quantized models
  • fastapi and uvicorn for deploying the model as an API
  • matplotlib for visualizing loss curves and performance metrics
  • deepspeed for optimizing large-scale model training
  • bitsandbytes for quantization to reduce memory footprint

Step 1: Understanding Large Language Models

Before diving into code, it's essential to grasp how LLMs function. At their core, these models are neural networks trained to predict the next word in a sequence given an input context. They use:

  • Tokenization to break text into numerical representations
  • Transformer architectures (such as GPT-2) with attention mechanisms to understand long-range dependencies
  • Self-supervised learning to train on vast amounts of unstructured text data (see the sketch after this list)
  • Fine-tuning to adapt to specific tasks, such as chatbots or code generation
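
That self-supervised objective is worth spelling out: the model's logits at position t are scored with cross-entropy against the token at position t+1, and the text itself is the only label. Here is a toy sketch with random tensors standing in for real model output:

import torch
import torch.nn.functional as F

# Toy illustration of next-token prediction as a supervised problem
vocab_size, seq_len = 50257, 8
token_ids = torch.randint(0, vocab_size, (1, seq_len))  # a "sentence" of token ids
logits = torch.randn(1, seq_len, vocab_size)             # stand-in for the model's output

shift_logits = logits[:, :-1, :]   # predictions made at positions 1..t-1
shift_labels = token_ids[:, 1:]    # the tokens that actually came next
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
print(loss)  # this shifted cross-entropy is what transformers computes when you pass labels=input_ids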

Step 2: Collecting Hacker News Data

We'll use the Hacker News API to collect stories for training. Note that only text posts (for example, Ask HN threads) carry a "text" field; a follow-up sketch after the code shows how to pull in comment text as well.

import requests
import json
import time

def fetch_hackernews_data(num_stories=1000):
    url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    story_ids = requests.get(url).json()[:num_stories]

    stories = []
    for story_id in story_ids:
        story_url = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
        response = requests.get(story_url).json()
        # Deleted items come back as null, and only text posts carry a "text" field
        if response and "text" in response:
            stories.append(response["text"])
        time.sleep(0.5)  # Prevent API rate limits

    return stories

hackernews_texts = fetch_hackernews_data()
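
Most front-page items are plain links without a "text" field, so the harvest above can be thin. To also collect comment text, you can walk each story's "kids" field. A minimal sketch (the helper name fetch_comments is mine; the endpoints are the same Firebase API used above):

def fetch_comments(story_id, max_comments=20):
    """Fetch up to max_comments top-level comments for a story."""
    story = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json").json()
    comments = []
    for comment_id in (story or {}).get("kids", [])[:max_comments]:
        comment = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{comment_id}.json").json()
        if comment and comment.get("text") and not comment.get("deleted"):
            comments.append(comment["text"])
        time.sleep(0.5)  # stay gentle with the API
    return comments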

Step 3: Preprocessing and Tokenization

Data cleaning ensures better training results: we strip HTML tags, drop non-text characters, and lowercase the text.

Tokenization is the process of breaking text into smaller units, called tokens, which can be words, subwords, or even characters. This step is what turns raw text into the numerical representation that deep learning models can process.

import re
from transformers import AutoTokenizer

def clean_text(text):
    text = re.sub(r"<[^>]+>", "", text)  # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9 .,!?]", "", text)  # Retain only valid characters
    return text.lower()

cleaned_texts = [clean_text(text) for text in hackernews_texts]
TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
TOKENIZER.pad_token = TOKENIZER.eos_token  # GPT-2 ships without a pad token, so reuse EOS for padding
tokenized_texts = [TOKENIZER(text, truncation=True, padding="max_length", max_length=512) for text in cleaned_texts]
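
To see what the tokenizer actually produces, a quick inspection helps (the exact subword splits depend on GPT-2's BPE vocabulary):

sample = "i love to code in python and read hacker news"
encoding = TOKENIZER(sample)
print(TOKENIZER.tokenize(sample))               # the subword pieces
print(encoding["input_ids"])                    # the integer ids the model consumes
print(TOKENIZER.decode(encoding["input_ids"]))  # round-trips back to the original text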

Step 4: Creating a Dataset and DataLoader

We format our text for training using PyTorch’s Dataset class.

import torch
from torch.utils.data import Dataset, DataLoader

class HNDataset(Dataset):
    def __init__(self, tokenized_texts):
        # Every sequence was padded/truncated to max_length, so the lists stack cleanly into 2-D tensors
        self.inputs = torch.tensor([t["input_ids"] for t in tokenized_texts])
        self.attention_masks = torch.tensor([t["attention_mask"] for t in tokenized_texts])

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return {"input_ids": self.inputs[idx], "attention_mask": self.attention_masks[idx]}

train_dataset = HNDataset(tokenized_texts)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
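
A quick sanity check on the shapes coming out of the DataLoader before training:

batch = next(iter(train_dataloader))
print(batch["input_ids"].shape)       # expected: torch.Size([8, 512]) with batch_size=8 and max_length=512
print(batch["attention_mask"].shape)  # same shape; 1 marks real tokens, 0 marks padding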

Step 5: Training the Transformer Model

We fine-tune a pre-trained GPT-2 model on our dataset.

from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in recent versions
from transformers import GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

EPOCHS = 3
for epoch in range(EPOCHS):
    model.train()
    total_loss = 0.0
    for batch in train_dataloader:
        inputs = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        # Mask padding positions out of the loss by setting their labels to -100
        labels = inputs.clone()
        labels[attention_mask == 0] = -100

        outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}: average loss = {total_loss / len(train_dataloader):.4f}")
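
The coverage list above mentions loss and perplexity; perplexity is simply exp(average cross-entropy loss), so it can be computed with one more pass over the data. A minimal sketch, reusing train_dataloader here for brevity (a held-out validation split would be the proper choice):

import math

model.eval()
total_loss, num_batches = 0.0, 0
with torch.no_grad():
    for batch in train_dataloader:
        inputs = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = inputs.clone()
        labels[attention_mask == 0] = -100
        outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels)
        total_loss += outputs.loss.item()
        num_batches += 1

avg_loss = total_loss / num_batches
print(f"average loss: {avg_loss:.4f}, perplexity: {math.exp(avg_loss):.2f}")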

Step 6: Model Optimization

Model optimization focuses on improving the efficiency of a trained model by reducing its memory usage and increasing its inference speed. Two key techniques used for optimization are:

Quantization

Quantization reduces the precision of model parameters (e.g., converting 32-bit floating point numbers to 8-bit integers). This helps decrease memory consumption and speeds up inference, especially on resource-limited devices.
In the code, we achieve this using BitsAndBytesConfig(load_in_8bit=True), which loads the GPT-2 model in an 8-bit format, reducing its size and computational requirements.

Pruning

Pruning removes unnecessary parameters from the model, reducing the number of computations required during inference. Pruning is not implemented in the quantization snippet below, but it can be done by eliminating less significant weights from the network; a short sketch after the snippet shows one way to do this with PyTorch's pruning utilities.

from transformers import BitsAndBytesConfig, GPT2LMHeadModel

# 8-bit loading typically requires a CUDA GPU with bitsandbytes installed.
# Note: .to(device) is not supported on 8-bit models; let device_map place the weights instead.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = GPT2LMHeadModel.from_pretrained("gpt2", quantization_config=bnb_config, device_map="auto")
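
For pruning, here is a minimal sketch using PyTorch's built-in pruning utilities, applied to a full-precision copy of the model rather than the 8-bit one. Conv1D is the internal layer type transformers uses for GPT-2's projections, and the 30% pruning ratio is just an illustrative choice:

import torch
import torch.nn.utils.prune as prune
from transformers import GPT2LMHeadModel
from transformers.pytorch_utils import Conv1D  # GPT-2's linear-like projection layers

fp_model = GPT2LMHeadModel.from_pretrained("gpt2")
for module in fp_model.modules():
    if isinstance(module, (torch.nn.Linear, Conv1D)):
        prune.l1_unstructured(module, name="weight", amount=0.3)  # zero the 30% smallest-magnitude weights
        prune.remove(module, "weight")                            # bake the zeros into the weight tensor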

Step 7: Deploying as an API

We use FastAPI to expose the model over HTTP. The snippet below assumes the fine-tuned model, TOKENIZER, and device from the previous steps live in the same app.py module, and adds the generate_text helper that the endpoint calls.

from fastapi import FastAPI

app = FastAPI()

def generate_text(prompt: str, max_new_tokens: int = 50) -> str:
    inputs = TOKENIZER(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, pad_token_id=TOKENIZER.eos_token_id)
    return TOKENIZER.decode(output_ids[0], skip_special_tokens=True)

@app.get("/generate")
def generate(prompt: str):
    return {"generated_text": generate_text(prompt)}

Run the API:

uvicorn app:app --reload
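
Once the server is running, the endpoint can be exercised from a few lines of Python (or any HTTP client):

import requests

response = requests.get("http://127.0.0.1:8000/generate", params={"prompt": "show hn i built a"})
print(response.json()["generated_text"])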

Conclusion

We have collected Hacker News data, fine-tuned a GPT-2-based language model on it, optimized the result with quantization, and deployed it behind an API. Future improvements could involve:

  • Training on a larger dataset
  • Optimizing hyperparameters
  • Implementing reinforcement learning with human feedback (RLHF)
  • Deploying in a production-grade environment
  • Enhancing security against adversarial attacks
