Shrijith Venkatramana

The Power of Quantization: Shrinking GPT2, Unleashing Speed

Imagine taking a powerful language model like GPT-2—capable of crafting stories, answering questions, and mimicking human text—and compressing it into a leaner, faster version without gutting its capabilities.

This is the promise of quantization: a technique that reduces the precision of a model’s calculations, trading marginal accuracy for dramatic efficiency gains.

Phase 0: The Technical Setup

    !pip install torch transformers accelerate bitsandbytes psutil

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    import torch
    import time
    import gc

    def get_memory_usage():
        return torch.cuda.memory_allocated() / 1e6 if torch.cuda.is_available() else 0


    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name = "gpt2"
    input_text = "Once upon a time"

Phase 1: The Baseline – Full Precision (FP32)

The experiment begins with GPT-2 in its natural state: 32-bit floating-point precision (FP32). This is the model’s “full power” mode—highly precise but resource-intensive.

  • Memory: Loading the FP32 model consumes 511 MB of GPU memory.
  • Speed: Generating a continuation of the prompt “Once upon a time” (up to 50 tokens in total, prompt included, via max_length=50) takes 1.76 seconds.
  • Post-Cleanup Footprint: Even after deleting the model, 458 MB of memory remains occupied.

FP32 works, but it’s bulky.
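
The 511 MB figure is roughly what simple arithmetic predicts. Here is a back-of-the-envelope sketch, assuming the `gpt2` checkpoint has about 124 million parameters (an approximate figure that is not stated in this post) and counting 1 MB as 10^6 bytes, as `get_memory_usage` does. The helper name `weight_size_mb` is made up for illustration.

```python
def weight_size_mb(num_params, bits_per_param):
    """Size of the raw weights in MB (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


GPT2_PARAMS = 124_000_000  # assumption: approximate parameter count of "gpt2"

for label, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_size_mb(GPT2_PARAMS, bits):.0f} MB")
# FP32: 496 MB
# INT8: 124 MB
# INT4: 62 MB
```

The measured footprints in this post sit above these ideal numbers. A plausible reason, offered as an assumption rather than a measurement, is that `torch.cuda.memory_allocated()` counts every live tensor, and a quantized load may keep some tensors at higher precision.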

    # Load tokenizer and base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"Pre-load memory: {get_memory_usage()} MB")

    # Full precision model
    model_fp32 = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    print(f"Post-load memory: {get_memory_usage()} MB")  # 511.15 MB

    # Inference measurement
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    start_time = time.time()
    output = model_fp32.generate(**inputs, max_length=50)
    inference_time = time.time() - start_time  # 1.76s

    # Cleanup protocol
    del model_fp32, inputs
    gc.collect()
    torch.cuda.empty_cache()

Phase 2: Trimming the Fat – 8-bit Quantization (INT8)

Enter 8-bit quantization, where weights are stored as 8-bit integers instead of floats. The transformation is immediate:

  • Memory: The INT8 model loads with just 187 MB, 63% smaller than FP32.
  • Speed: Inference accelerates to 1.38 seconds, a 22% improvement.
  • Post-Cleanup Footprint: Memory drops to 139 MB after deletion.

The model is lighter, faster, and still functional. A clear upgrade.
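
A single timed call, as used for the numbers in this post, can be skewed by warm-up effects and by asynchronous GPU execution. One possible way to get steadier timings is to take the median over several runs. The sketch below assumes that approach; `time_call` is a hypothetical helper and is not part of the original notebook.

```python
import time


def time_call(fn, warmup=1, repeats=5, sync=None):
    """Return the median wall-clock time of fn() over `repeats` timed runs.

    `sync` is an optional callable run before and after each timed call,
    for example torch.cuda.synchronize when timing GPU work.
    """
    def run_once():
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    for _ in range(warmup):          # untimed warm-up runs
        fn()
    timings = sorted(run_once() for _ in range(repeats))
    return timings[len(timings) // 2]  # median (upper median for even counts)


# Stand-in workload so the sketch runs anywhere
print(f"{time_call(lambda: sum(range(100_000))):.6f} s")
```

With the objects defined in this post, a call could look like `time_call(lambda: model_int8.generate(**inputs_int8, max_length=50), sync=torch.cuda.synchronize)`.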

    # 8-bit configuration
    quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

    print(f"Pre-load memory: {get_memory_usage()} MB")  # 9.18 MB
    model_int8 = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=quant_config_8bit
    )

    # Dynamic input handling
    inputs_int8 = tokenizer(input_text, return_tensors="pt").to(model_int8.device)
    start_time = time.time()
    output = model_int8.generate(**inputs_int8, max_length=50)
    inference_time = time.time() - start_time  # 1.38s

Phase 3: The Edge of Efficiency – 4-bit Quantization (INT4)

Now we push further. With 4-bit quantization, weights are compressed to near-minimal precision, and computations use 16-bit floats for stability.

  • Memory: The INT4 model weighs in at 149 MB, 71% lighter than FP32.
  • Speed: Inference time drops to 1.08 seconds, a 39% gain over FP32.
  • Post-Cleanup Footprint: Memory plummets to 58 MB—a fraction of the original.

This isn’t just optimization; it’s reinvention.

    # 4-bit configuration: weights stored in 4 bits, computation in FP16
    # (the 16-bit compute dtype follows the description above)
    quant_config_4bit = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16
    )

    print(f"Pre-load memory: {get_memory_usage()} MB")
    model_int4 = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config_4bit
    )

    # Dynamic input handling
    inputs_int4 = tokenizer(input_text, return_tensors="pt").to(model_int4.device)
    start_time = time.time()
    output = model_int4.generate(**inputs_int4, max_length=50)
    inference_time = time.time() - start_time  # 1.08s

The Trade-offs: Precision vs. Practicality

Quantization isn’t free. Reducing precision can subtly degrade model accuracy, but for many tasks—like casual text generation—the difference is imperceptible. What we gain far outweighs the cost:

  • Memory Efficiency: FP32: 511 MB → INT8: 187 MB → INT4: 149 MB.

Result: Models fit into tighter memory constraints, enabling deployment on consumer GPUs or edge devices.

  • Inference Speed: FP32: 1.76s → INT8: 1.38s → INT4: 1.08s.

Result: Faster responses for real-time applications, from chatbots to automated content generation.
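
The percentages quoted in this post follow directly from these measurements. A small sketch that recomputes them (the helper name `percent_reduction` is made up here):

```python
def percent_reduction(baseline, value):
    """Relative reduction of `value` compared with `baseline`, in percent."""
    return (baseline - value) / baseline * 100


memory_mb = {"FP32": 511, "INT8": 187, "INT4": 149}
time_s = {"FP32": 1.76, "INT8": 1.38, "INT4": 1.08}

for name in ("INT8", "INT4"):
    mem = percent_reduction(memory_mb["FP32"], memory_mb[name])
    spd = percent_reduction(time_s["FP32"], time_s[name])
    print(f"{name}: {mem:.0f}% less memory, {spd:.0f}% less time")
# INT8: 63% less memory, 22% less time
# INT4: 71% less memory, 39% less time
```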


How It Works: The Mechanics of Compression

At its core, quantization maps high-precision values (like 32-bit floats) to lower-precision formats (8- or 4-bit integers). For example:

  • FP32 uses 32 bits per number, capturing fine details but demanding heavy resources.
  • INT8/INT4 use fewer bits, approximating values with minimal loss.
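
To make the mapping concrete, here is a minimal sketch of absmax quantization in plain Python. It is a simplified illustration and not the scheme bitsandbytes implements; the function names `quantize_absmax` and `dequantize` and the sample weights are invented for this example.

```python
def quantize_absmax(values, bits=8):
    """Quantize floats to signed integer codes using one shared scale (absmax)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Divide by the scale, round to the nearest integer, clamp to the range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.07, 0.95, -0.61]     # made-up sample values
for bits in (8, 4):
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes: {codes}  worst error: {worst:.4f}")
```

Running it shows the rounding error growing as the bit width shrinks, which is the accuracy cost discussed in the trade-offs section.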

The bitsandbytes library handles this automatically, repacking weights and adjusting computations to maintain stability.
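
As a rough illustration of what repacking can mean for 4-bit storage: two 4-bit codes fit into a single byte. The sketch below shows the idea in plain Python. The helper names `pack_nibbles` and `unpack_nibbles` are hypothetical, and the memory layout bitsandbytes actually uses may differ.

```python
def pack_nibbles(codes):
    """Pack unsigned 4-bit codes (0..15) into bytes, two codes per byte."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("each code must fit in 4 bits (0..15)")
    padded = list(codes) + [0] * (len(codes) % 2)   # pad odd lengths with a zero
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed, count):
    """Reverse pack_nibbles, returning the first `count` codes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]


codes = [1, 15, 7, 0, 9]
packed = pack_nibbles(codes)
print(len(codes), "codes stored in", len(packed), "bytes")   # 5 codes stored in 3 bytes
assert unpack_nibbles(packed, len(codes)) == codes
```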


The Visual Proof

A side-by-side comparison seals the argument:

  • Memory Usage (Bar Chart): FP32 towers over INT8 and INT4, showcasing the stark reduction in resource demands.
  • Inference Time (Line Plot): The downward slope from FP32 to INT4 highlights the speed gains.

The takeaway? Quantization isn’t just a technical footnote—it’s a practical tool for democratizing AI.

    # Visualization setup
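    import matplotlib.pyplot as plt

    # Measurements collected in the three phases above
    quantization_types = ['FP32', 'INT8', 'INT4']
    memory_usages = [511.15, 187, 149]      # MB
    inference_times = [1.76, 1.38, 1.08]    # seconds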

    fig, ax1 = plt.subplots(figsize=(8, 6))
    bars = ax1.bar(quantization_types, memory_usages, color='blue', alpha=0.7)
    ax1.set_ylabel('Memory (MB)', color='blue')

    # Annotation logic
    for bar in bars:
        yval = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2, yval+30, 
                 f'{yval:.2f}', ha='center', va='bottom', 
                 color='blue', fontweight='bold')

    # Dual-axis formatting
    ax2 = ax1.twinx()
    ax2.plot(quantization_types, inference_times, color='red', 
             marker='o', linewidth=2)
    ax2.set_ylabel('Time (sec)', color='red')

    plt.title('Quantization Trade-offs')
    plt.show()

The Final Word

Through quantization, we’ve transformed GPT-2 from a resource-heavy behemoth into a nimble, efficient tool—proving that with the right techniques, even giants can learn to move lightly.

This implementation shows quantization's effect through concrete code and measurements. By changing just 10-15 lines of configuration to enable quantization, we achieved:

  • 71% reduction in memory footprint
  • 39% shorter inference time

If you're curious and want access to the full notebook for the experiment, head over to Google Colab.
