Davide Santangelo

Building a Homegrown LLM with Python: Training on Hacker News Data

Introduction

Large Language Models (LLMs) have transformed AI applications, from conversational agents to intelligent code assistants. While OpenAI’s GPT models are widely used, many developers want to understand how these models work and even train their own versions from scratch. In this in-depth guide, we will explore how to build a lightweight LLM using Python, train it on Hacker News data, optimize its performance, and deploy it for real-world usage.

What is a Language Model?

A Language Model (LM) is a type of artificial intelligence model that predicts the likelihood of a sequence of words in a given language. It learns patterns, grammar, and context from a large corpus of text data, enabling it to generate coherent and contextually relevant text. For example, given the phrase "I love to code in," an LM might predict "Python" as the next word, based on patterns observed in its training data.

LMs are used in various applications, including:

  • Text generation: Creating articles, stories, or dialogues.
  • Machine translation: Translating text from one language to another.
  • Text classification: Sentiment analysis or spam detection.
  • Autocomplete: Suggesting words or phrases in search engines or text editors.

In this article, we'll focus on building a simple autoregressive LM that predicts the next word in a sequence, trained specifically on Hacker News data.
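
To make next-word prediction concrete, here is a minimal sketch using the pre-trained gpt2 checkpoint from Hugging Face (the same model we fine-tune later); it prints the five tokens the model considers most likely to follow a prompt:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I love to code in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Turn the logits at the last position into a probability distribution over the vocabulary
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)).strip():>12} -> {prob.item():.3f}")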

What We’ll Cover

  • Understanding how LLMs work
  • Collecting and preprocessing Hacker News data
  • Tokenizing text and creating structured datasets
  • Training a transformer-based model using PyTorch and Hugging Face Transformers
  • Fine-tuning and optimizing for better performance
  • Evaluating model performance using loss metrics and perplexity
  • Deploying the model with FastAPI and making it accessible via an API
  • Improving the model with reinforcement learning and knowledge distillation
  • Scaling the model with distributed training
  • Reducing computational costs through quantization and pruning
  • Enhancing model security with adversarial training

Prerequisites

Before we start, install the necessary dependencies:

pip install torch transformers datasets tokenizers accelerate fastapi uvicorn matplotlib deepspeed bitsandbytes

We will use:

  • torch for deep learning computations
  • transformers for leveraging pre-built architectures like GPT-2
  • datasets for handling large text corpora efficiently
  • tokenizers for high-speed text processing
  • accelerate for device placement when loading large or quantized models
  • fastapi and uvicorn for deploying the model as an API
  • matplotlib for visualizing loss curves and performance metrics
  • deepspeed for optimizing large-scale model training
  • bitsandbytes for quantization to reduce memory footprint

Step 1: Understanding Large Language Models

Before diving into code, it's essential to grasp how LLMs function. At their core, these models are neural networks trained to predict the next word in a sequence given an input context. They use:

  • Tokenization to break text into numerical representations
  • Transformer architectures (such as GPT-2) with attention mechanisms to understand long-range dependencies
  • Self-supervised learning to train on vast amounts of unstructured text data (see the sketch after this list)
  • Fine-tuning to adapt to specific tasks, such as chatbots or code generation
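
That self-supervised objective is worth spelling out: the model's logits at position t are scored with cross-entropy against the token at position t+1, and the text itself is the only label. Here is a toy sketch with random tensors standing in for real model output:

import torch
import torch.nn.functional as F

# Toy illustration of next-token prediction as a supervised problem
vocab_size, seq_len = 50257, 8
token_ids = torch.randint(0, vocab_size, (1, seq_len))  # a "sentence" of token ids
logits = torch.randn(1, seq_len, vocab_size)             # stand-in for the model's output

shift_logits = logits[:, :-1, :]   # predictions made at positions 1..t-1
shift_labels = token_ids[:, 1:]    # the tokens that actually came next
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
print(loss)  # this shifted cross-entropy is what transformers computes when you pass labels=input_ids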

Step 2: Collecting Hacker News Data

We'll use the Hacker News API to collect stories for training. Note that only text posts (for example, Ask HN threads) carry a "text" field; a follow-up sketch after the code shows how to pull in comment text as well.

import requests
import json
import time

def fetch_hackernews_data(num_stories=1000):
    url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    story_ids = requests.get(url).json()[:num_stories]

    stories = []
    for story_id in story_ids:
        story_url = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
        response = requests.get(story_url).json()
        # Deleted items come back as null, and only text posts carry a "text" field
        if response and "text" in response:
            stories.append(response["text"])
        time.sleep(0.5)  # Prevent API rate limits

    return stories

hackernews_texts = fetch_hackernews_data()
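
Most front-page items are plain links without a "text" field, so the harvest above can be thin. To also collect comment text, you can walk each story's "kids" field. A minimal sketch (the helper name fetch_comments is mine; the endpoints are the same Firebase API used above):

def fetch_comments(story_id, max_comments=20):
    """Fetch up to max_comments top-level comments for a story."""
    story = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json").json()
    comments = []
    for comment_id in (story or {}).get("kids", [])[:max_comments]:
        comment = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{comment_id}.json").json()
        if comment and comment.get("text") and not comment.get("deleted"):
            comments.append(comment["text"])
        time.sleep(0.5)  # stay gentle with the API
    return comments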

Step 3: Preprocessing and Tokenization

Data cleaning ensures better training results: we strip HTML tags, drop non-text characters, and lowercase the text.

Tokenization is the process of breaking text into smaller units, called tokens, which can be words, subwords, or even characters. This step is what turns raw text into the numerical representation that deep learning models can process.

import re
from transformers import AutoTokenizer

def clean_text(text):
    text = re.sub(r"<[^>]+>", "", text)  # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9 .,!?]", "", text)  # Retain only valid characters
    return text.lower()

cleaned_texts = [clean_text(text) for text in hackernews_texts]
TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
TOKENIZER.pad_token = TOKENIZER.eos_token  # GPT-2 ships without a pad token, so reuse EOS for padding
tokenized_texts = [TOKENIZER(text, truncation=True, padding="max_length", max_length=512) for text in cleaned_texts]
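
To see what the tokenizer actually produces, a quick inspection helps (the exact subword splits depend on GPT-2's BPE vocabulary):

sample = "i love to code in python and read hacker news"
encoding = TOKENIZER(sample)
print(TOKENIZER.tokenize(sample))               # the subword pieces
print(encoding["input_ids"])                    # the integer ids the model consumes
print(TOKENIZER.decode(encoding["input_ids"]))  # round-trips back to the original text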

Step 4: Creating a Dataset and DataLoader

We format our text for training using PyTorch’s Dataset class.

import torch
from torch.utils.data import Dataset, DataLoader

class HNDataset(Dataset):
    def __init__(self, tokenized_texts):
        # Every sequence was padded/truncated to max_length, so the lists stack cleanly into 2-D tensors
        self.inputs = torch.tensor([t["input_ids"] for t in tokenized_texts])
        self.attention_masks = torch.tensor([t["attention_mask"] for t in tokenized_texts])

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return {"input_ids": self.inputs[idx], "attention_mask": self.attention_masks[idx]}

train_dataset = HNDataset(tokenized_texts)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
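
A quick sanity check on the shapes coming out of the DataLoader before training:

batch = next(iter(train_dataloader))
print(batch["input_ids"].shape)       # expected: torch.Size([8, 512]) with batch_size=8 and max_length=512
print(batch["attention_mask"].shape)  # same shape; 1 marks real tokens, 0 marks padding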

Step 5: Training the Transformer Model

We fine-tune a pre-trained GPT-2 model on our dataset.

from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in recent versions
from transformers import GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

EPOCHS = 3
for epoch in range(EPOCHS):
    model.train()
    total_loss = 0.0
    for batch in train_dataloader:
        inputs = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        # Mask padding positions out of the loss by setting their labels to -100
        labels = inputs.clone()
        labels[attention_mask == 0] = -100

        outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}: average loss = {total_loss / len(train_dataloader):.4f}")
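
The coverage list above mentions loss and perplexity; perplexity is simply exp(average cross-entropy loss), so it can be computed with one more pass over the data. A minimal sketch, reusing train_dataloader here for brevity (a held-out validation split would be the proper choice):

import math

model.eval()
total_loss, num_batches = 0.0, 0
with torch.no_grad():
    for batch in train_dataloader:
        inputs = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = inputs.clone()
        labels[attention_mask == 0] = -100
        outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels)
        total_loss += outputs.loss.item()
        num_batches += 1

avg_loss = total_loss / num_batches
print(f"average loss: {avg_loss:.4f}, perplexity: {math.exp(avg_loss):.2f}")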

Step 6: Model Optimization

Model optimization focuses on improving the efficiency of a trained model by reducing its memory usage and increasing its inference speed. Two key techniques used for optimization are:

Quantization

Quantization reduces the precision of model parameters (e.g., converting 32-bit floating point numbers to 8-bit integers). This helps decrease memory consumption and speeds up inference, especially on resource-limited devices.
In the code, we achieve this using BitsAndBytesConfig(load_in_8bit=True), which loads the GPT-2 model in an 8-bit format, reducing its size and computational requirements.

Pruning

Pruning removes unnecessary parameters from the model, reducing the number of computations required during inference. Pruning is not implemented in the quantization snippet below, but it can be done by eliminating less significant weights from the network; a short sketch after the snippet shows one way to do this with PyTorch's pruning utilities.

from transformers import BitsAndBytesConfig, GPT2LMHeadModel

# 8-bit loading typically requires a CUDA GPU with bitsandbytes installed.
# Note: .to(device) is not supported on 8-bit models; let device_map place the weights instead.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = GPT2LMHeadModel.from_pretrained("gpt2", quantization_config=bnb_config, device_map="auto")
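
For pruning, here is a minimal sketch using PyTorch's built-in pruning utilities, applied to a full-precision copy of the model rather than the 8-bit one. Conv1D is the internal layer type transformers uses for GPT-2's projections, and the 30% pruning ratio is just an illustrative choice:

import torch
import torch.nn.utils.prune as prune
from transformers import GPT2LMHeadModel
from transformers.pytorch_utils import Conv1D  # GPT-2's linear-like projection layers

fp_model = GPT2LMHeadModel.from_pretrained("gpt2")
for module in fp_model.modules():
    if isinstance(module, (torch.nn.Linear, Conv1D)):
        prune.l1_unstructured(module, name="weight", amount=0.3)  # zero the 30% smallest-magnitude weights
        prune.remove(module, "weight")                            # bake the zeros into the weight tensor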

Step 7: Deploying as an API

We use FastAPI to expose the model over HTTP. The snippet below assumes the fine-tuned model, TOKENIZER, and device from the previous steps live in the same app.py module, and adds the generate_text helper that the endpoint calls.

from fastapi import FastAPI

app = FastAPI()

def generate_text(prompt: str, max_new_tokens: int = 50) -> str:
    inputs = TOKENIZER(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, pad_token_id=TOKENIZER.eos_token_id)
    return TOKENIZER.decode(output_ids[0], skip_special_tokens=True)

@app.get("/generate")
def generate(prompt: str):
    return {"generated_text": generate_text(prompt)}

Run the API:

uvicorn app:app --reload
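
Once the server is running, the endpoint can be exercised from a few lines of Python (or any HTTP client):

import requests

response = requests.get("http://127.0.0.1:8000/generate", params={"prompt": "show hn i built a"})
print(response.json()["generated_text"])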

Conclusion

We have collected Hacker News data, fine-tuned a GPT-2-based language model on it, optimized the result with quantization, and deployed it behind an API. Future improvements could involve:

  • Training on a larger dataset
  • Optimizing hyperparameters
  • Implementing reinforcement learning with human feedback (RLHF)
  • Deploying in a production-grade environment
  • Enhancing security against adversarial attacks
