Tachi 0x

Posted on Sep 17

Build AI Text Classification in Rust with Rig

#rust #ai #llm #opensource

TL;DR: This guide walks you through building a text classification system in Rust using the Rig library. In very few lines of code, you'll create a system that performs sentiment analysis and classifies news articles by topic, leveraging OpenAI's GPT models for accurate text classification.

Introduction

Text classification is a fundamental task in natural language processing, involving the assignment of predefined categories to text documents. It's widely used in applications such as sentiment analysis, content categorization, and spam detection. Large Language Models (LLMs) have significantly improved the accuracy and flexibility of text classification tasks, but working with them can be complex.

Rig, an open-source Rust library, simplifies the development of LLM-powered applications, including text classification systems. In this guide, I'll walk you through the process of building a functional text classification system using Rig. We'll create a system capable of performing sentiment analysis and classifying news articles by topic, demonstrating Rig's application to real-world text classification tasks.

💡 Tip: New to Rig?

If you're not familiar with Rig or want a comprehensive introduction to its capabilities, check out our introductory blog post: Rig: A Rust Library for Building LLM-Powered Applications. It provides an overview of Rig's features and how it simplifies LLM application development in Rust.

💡 Tip: New to Rust?

This guide assumes some familiarity with Rust and a working development environment. If you're just starting out or need to set up your environment, check out these quick guides:

Introduction to Rust

Setting up Rust with VS Code

These resources will help you get up to speed quickly!

Setting Up the Project

Let's start by setting up our Rust project and installing the necessary dependencies.

Create a new Rust project:

cargo new text_classifier
cd text_classifier

Add the following dependencies to your Cargo.toml:

[dependencies]
rig-core = "0.0.6"
tokio = { version = "1.34.0", features = ["full"] }
anyhow = "1.0.75"
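serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
schemars = "0.8"
dotenv = "0.15.0"

These dependencies provide the core functionality we need:

rig-core: The main Rig library for LLM applications
tokio: Asynchronous runtime for Rust
anyhow: Flexible error handling
serde and serde_json: JSON serialization and deserialization
schemars: JSON Schema generation, needed for the JsonSchema derive used on the result types below (0.8 is assumed here; the version should be compatible with the one rig-core itself depends on)
dotenv: Loading environment variables from a file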

Before coding, set up your OpenAI API key:

export OPENAI_API_KEY=your_api_key_here
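
A missing or empty key is easier to diagnose when the program checks for it before building the client. Here is a minimal sketch using only the standard library; `check_api_key` and `api_key_from_env` are hypothetical helpers written for this guide, not part of Rig:

```rust
use std::env;

/// Hypothetical helper (not part of Rig): validates a key value that was
/// read from the environment.
fn check_api_key(value: Option<String>) -> Result<String, String> {
    match value {
        // A key is usable only if it contains something besides whitespace.
        Some(key) if !key.trim().is_empty() => Ok(key),
        Some(_) => Err("OPENAI_API_KEY is set but empty".to_string()),
        None => Err("OPENAI_API_KEY is not set".to_string()),
    }
}

/// Reads OPENAI_API_KEY from the process environment and validates it.
fn api_key_from_env() -> Result<String, String> {
    check_api_key(env::var("OPENAI_API_KEY").ok())
}
```

Calling `api_key_from_env()` at the top of `main` and printing the error lets the program exit with a readable message instead of failing later during the first request.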

Building the Text Classification System

We'll start with a simple sentiment analysis classifier. This will demonstrate the basics of using Rig for text classification.

First, let's define our data structures:

// Import necessary dependencies
use rig::providers::openai;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Define an enum to represent sentiment categories
#[derive(Debug, Deserialize, JsonSchema, Serialize)]
enum Sentiment {
    Positive,
    Negative,
    Neutral,
}

// Define a struct to hold the sentiment classification result
#[derive(Debug, Deserialize, JsonSchema, Serialize)]
struct SentimentClassification {
    sentiment: Sentiment,
    confidence: f32,
}

fn pretty_print_result(text: &str, result: &SentimentClassification) {
    println!("Text: \"{}\"", text);
    println!("Sentiment Analysis Result:");
    println!("  Sentiment: {:?}", result.sentiment);
    println!("  Confidence: {:.2}%", result.confidence * 100.0);
    println!();
}

Now, let's implement our sentiment classifier:

#[tokio::main]
async fn main() {
    // Initialize the OpenAI client
    let openai_client = openai::Client::from_env();

    // Create a sentiment classifier using Rig's Extractor
    let sentiment_classifier = openai_client
        .extractor::<SentimentClassification>("gpt-3.5-turbo")
        .preamble("
            You are a sentiment analysis AI. Classify the sentiment of the given text.
            Respond with Positive, Negative, or Neutral, along with a confidence score (0-1).
            Examples:
            Text: 'This movie was terrible. I hated every minute of it.'
            Result: Negative, 0.9
            Text: 'The weather today is okay, nothing special.'
            Result: Neutral, 0.7
            Text: 'I'm so excited about my upcoming vacation!'
            Result: Positive, 0.95
        ")
        .build();

    // Sample text to classify
    let text = "I absolutely loved the new restaurant. The food was amazing!";

    // Perform sentiment classification
    match sentiment_classifier.extract(text).await {
        Ok(result) => pretty_print_result(text, &result),
        Err(e) => eprintln!("Error classifying sentiment: {}", e),
    }
}

This code creates a sentiment classifier using OpenAI's GPT-3.5-turbo model. The Extractor is configured with a preamble that instructs the model to perform sentiment analysis and provides examples to guide its output. When we call extract with our input text, the model classifies the sentiment and returns a SentimentClassification struct.
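
If you want to keep the few-shot examples in data rather than inside one long string literal, the preamble can be assembled programmatically. The sketch below is one way to do that with the standard library; `Example` and `build_preamble` are hypothetical names introduced here, and the output format simply mirrors the preamble used above:

```rust
/// One few-shot example: input text, expected label, expected confidence.
struct Example<'a> {
    text: &'a str,
    label: &'a str,
    confidence: f32,
}

/// Hypothetical helper: renders the instructions followed by the examples
/// into a single preamble string.
fn build_preamble(instructions: &str, examples: &[Example<'_>]) -> String {
    let mut preamble = String::from(instructions.trim());
    preamble.push_str("\nExamples:\n");
    for ex in examples {
        // Same "Text / Result" layout as the hand-written preamble.
        preamble.push_str(&format!(
            "Text: '{}'\nResult: {}, {}\n",
            ex.text, ex.label, ex.confidence
        ));
    }
    preamble
}
```

The resulting string can then be passed to `.preamble(...)` in place of the literal, which makes it easier to add, remove, or test examples independently of the classifier setup.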

When you run this code, you should see output similar to this:

Text: "I absolutely loved the new restaurant. The food was amazing!"
Sentiment Analysis Result:
  Sentiment: Positive
  Confidence: 95.00%

The output shows that our sentiment classifier identified the positive sentiment in the input text, which matches its strongly positive language. Keep in mind that the confidence score (0.95) is the model's own estimate rather than a calibrated probability, and the exact value can vary between runs.
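
Because the confidence value comes straight from the model, it is worth checking before you act on it. The following sketch redefines the result types without the serde derives so it stands alone; `Checked`, `check_result`, and the `min_confidence` threshold are assumptions made for illustration, not part of Rig:

```rust
#[derive(Debug, PartialEq)]
enum Sentiment {
    Positive,
    Negative,
    Neutral,
}

#[derive(Debug, PartialEq)]
struct SentimentClassification {
    sentiment: Sentiment,
    confidence: f32,
}

/// Hypothetical outcome of validating a model response.
#[derive(Debug, PartialEq)]
enum Checked {
    Accepted(SentimentClassification),
    LowConfidence(SentimentClassification),
    Invalid(String),
}

/// Rejects confidence values outside 0-1 (including NaN) and flags results
/// that fall below the caller's threshold.
fn check_result(result: SentimentClassification, min_confidence: f32) -> Checked {
    if !result.confidence.is_finite() || !(0.0..=1.0).contains(&result.confidence) {
        return Checked::Invalid(format!(
            "confidence {} is outside 0-1",
            result.confidence
        ));
    }
    if result.confidence < min_confidence {
        Checked::LowConfidence(result)
    } else {
        Checked::Accepted(result)
    }
}
```

A caller might route `LowConfidence` results to manual review and log `Invalid` ones, rather than treating every response as equally trustworthy.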

Advanced Text Classification: News Article Classifier

Now that we've covered the basics, let's build a more complex system: a news article classifier that categorizes articles by topic and performs sentiment analysis.

First, let's define our structures:

// Import necessary dependencies
use rig::providers::openai;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Define enum for sentiment classification
#[derive(Debug, Deserialize, JsonSchema, Serialize)]
enum Sentiment {
    Positive,
    Negative,
    Neutral,
}

// Define a struct to hold the sentiment and its confidence score
#[derive(Debug, Deserialize, JsonSchema, Serialize)]
struct SentimentClassification {
    sentiment: Sentiment,
    confidence: f32,
}

// Define an enum to represent news article topics
#[derive(Debug, Deserialize, JsonSchema, Serialize)]
enum Topic {
    Politics,
    Technology,
    Sports,
    Entertainment,
    Other(String),
}

// Define a struct to hold the news article classification result
#[derive(Debug, Deserialize, JsonSchema, Serialize)]
struct NewsArticleClassification {
    topic: Topic,
    sentiment: SentimentClassification,
    summary: String,
}

fn pretty_print_result(article: &str, result: &NewsArticleClassification) {
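    // Print the first 100 characters (char-based, so short or non-ASCII input cannot panic)
    let preview: String = article.chars().take(100).collect();
    println!("Article: \"{}...\"", preview);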
    println!("Classification Result:");
    println!("  Topic: {:?}", result.topic);
    println!("  Sentiment: {:?}", result.sentiment.sentiment);
    println!("  Confidence: {:.2}%", result.sentiment.confidence * 100.0);
    println!("  Summary: {}", result.summary);
    println!();
}

Now, let's implement our news article classifier:

#[tokio::main]
async fn main() {
    // Initialize the OpenAI client
    let openai_client = openai::Client::from_env();

    // Create a news article classifier using Rig's Extractor
    let news_classifier = openai_client
        .extractor::<NewsArticleClassification>("gpt-4")
        .preamble("
            You are a news article classification AI. For the given news article:
            1. Classify the main topic (Politics, Technology, Sports, Entertainment, or Other).
            2. Analyze the overall sentiment (Positive, Negative, or Neutral) with a confidence score.
            3. Provide a brief summary of the article.
        ")
        .build();

    // Sample news article to classify
    let article = "
        After conducting the first-ever commercial spacewalk and traveling farther from Earth than anyone \
        in more than half a century, the astronauts of the Polaris Dawn mission returned to Earth safely \
        early Sunday.

        The SpaceX Crew Dragon capsule splashed down in the Gulf of Mexico, off the coast of Dry Tortugas, \
        Fla., shortly after 3:30 a.m., carrying Jared Isaacman, a billionaire entrepreneur, and his crew \
        of three private astronauts, according to a SpaceX livestream.

        The ambitious space mission, a collaboration between Mr. Isaacman and Elon Musk's SpaceX, spent \
        five days in orbit, achieved several milestones in private spaceflight and was further evidence \
        that space travel and spacewalks are no longer the exclusive domain of professional astronauts \
        working at government agencies like NASA.

        The Crew Dragon capsule launched on Tuesday, after delays because of a helium leak and bad weather. \
        On board were Mr. Isaacman, the mission commander and the founder of the payment services company \
        Shift4; Sarah Gillis and Anna Menon, SpaceX employees; and Scott Poteet, a retired U.S. Air Force \
        lieutenant colonel.

        Late on Tuesday, its orbit reached a high point of about 870 miles above the Earth's surface. That \
        beat the record distance for astronauts on a mission not headed to the moon, which the Gemini XI \
        mission set in 1966 at 853 miles high, and made Ms. Gillis and Ms. Menon the first women ever to \
        fly so far from Earth.

        On Thursday, Mr. Isaacman and Ms. Gillis became the first private astronauts to successfully complete \
        a spacewalk. The operation involved the crew letting all the air out of the spacecraft, because it \
        had no airlock, while the other two crew members wore spacesuits inside the airless capsule. Mr. \
        Isaacman moved outside and conducted mobility tests of his spacesuit for a few minutes before \
        re-entering the capsule. Ms Gillis then moved outside and performed the same tests.

        This was the first of three Polaris missions aimed at accelerating technological advances needed to \
        fulfill Mr. Musk's dream of sending people to Mars someday. A key goal of the mission was to further \
        the development of more advanced spacesuits that would be needed for SpaceX to try any future \
        off-world colonization.

        During a news conference before the launch, Mr. Isaacman mused that one day, someone might step onto \
        Mars wearing a version of the spacesuit that SpaceX had developed for this flight. Closer to Earth, \
        commercial spacewalks also present other possibilities, like technicians repairing private satellites \
        in orbit.

        During the spaceflight, the four astronauts conducted about 40 experiments, mostly about how \
        weightlessness and radiation affect the human body. They also tested laser communications between \
        the Crew Dragon and SpaceX's constellation of Starlink internet satellites.\
    ";

    // Perform news article classification
    match news_classifier.extract(article).await {
        Ok(result) => pretty_print_result(article, &result),
        Err(e) => eprintln!("Error classifying article: {}", e),
    }
}

When you run this code, you might see output like this:

Article: "
        After conducting the first-ever commercial spacewalk and traveling farther from Earth than ..."
Classification Result:
  Topic: Technology
  Sentiment: Positive
  Confidence: 90.00%
  Summary: The SpaceX Crew Dragon capsule carrying billionaire entrepreneur Jared Isaacman and his crew of three private astronauts returned successfully to Earth, after conducting the first-ever commercial spacewalk and setting a new distance record. The mission, a collaboration between Isaacman and SpaceX, is a part of three Polaris missions aimed to accelerate technological advances for space colonization. SpaceX also hopes to develop more advanced spacesuits necessary for future Mars missions.

This output shows that our news article classifier has successfully categorized the article as belonging to the Technology topic. The sentiment is classified as Positive with a relatively high confidence of 90%.

The classification makes sense given the article's content:

The topic is clearly technology-focused: the article discusses space exploration, SpaceX's Crew Dragon capsule, advancements in spacesuit technology, and the experiments conducted during the mission.
The positive sentiment is reflected in phrases like "first-ever commercial spacewalk," "achieved several milestones," and "accelerating technological advances," and in the overall optimistic tone about the mission's achievements and future possibilities.
The article does mention some challenges (like delays due to a helium leak and bad weather), but these are presented as minor setbacks in an otherwise successful mission. That may be why the sentiment remains positive while the confidence is below the maximum, although the model does not report its reasoning.

This example demonstrates how our classifier can handle complex, real-world news articles, extracting both the main topic and the overall sentiment accurately. It shows Rig's capability to process nuanced content and provide insightful classifications.
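
The preview line in the sample output begins with a newline and indentation because the article literal does. A small helper can tidy this, and counting characters rather than bytes keeps the cut safe for short or non-ASCII text. This is a sketch with a hypothetical `preview` function, not something Rig provides:

```rust
/// Hypothetical helper: collapses runs of whitespace and returns at most
/// `max_chars` characters, appending "..." only when the text was cut.
fn preview(text: &str, max_chars: usize) -> String {
    // split_whitespace drops leading, trailing, and repeated whitespace.
    let collapsed = text.split_whitespace().collect::<Vec<_>>().join(" ");
    if collapsed.chars().count() <= max_chars {
        collapsed
    } else {
        let mut cut: String = collapsed.chars().take(max_chars).collect();
        cut.push_str("...");
        cut
    }
}
```

With this in place, the article line could be printed as `println!("Article: \"{}\"", preview(article, 100));`.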

Best Practices and Common Pitfalls

As you work with Rig for text classification, keep these best practices in mind:

Prompting: Craft clear, specific prompts. Include examples in your preamble to guide the model's output.
Model Selection: Choose the appropriate model for your task. While GPT-4 is more capable, GPT-3.5-turbo may be sufficient for many classification tasks and is more cost-effective.
Error Handling: Always handle potential errors from API calls and unexpected model outputs.
Validation: Implement output validation to ensure the model's responses match your expected format.
Batching: Use batching for processing multiple texts to reduce API calls and improve efficiency.
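
To illustrate the error-handling point when processing many texts, the sketch below runs a classifier over a list and keeps successes and failures separate, so one bad response does not abort the whole run. It is synchronous and takes the classifier as a closure so it stays self-contained; in the real program each call would be an awaited `extract`. `BatchOutcome` and `classify_all` are hypothetical names, and this pattern does not by itself reduce the number of API calls:

```rust
/// Outcome of running a classifier over many texts.
struct BatchOutcome<T, E> {
    successes: Vec<(String, T)>,
    failures: Vec<(String, E)>,
}

/// Hypothetical helper: applies `classify` to every text and records each
/// result alongside the input that produced it.
fn classify_all<T, E, F>(texts: &[&str], mut classify: F) -> BatchOutcome<T, E>
where
    F: FnMut(&str) -> Result<T, E>,
{
    let mut outcome = BatchOutcome {
        successes: Vec::new(),
        failures: Vec::new(),
    };
    for &text in texts {
        match classify(text) {
            Ok(label) => outcome.successes.push((text.to_string(), label)),
            Err(err) => outcome.failures.push((text.to_string(), err)),
        }
    }
    outcome
}
```

After the run, the `failures` list gives you exactly which inputs to retry or inspect.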

Watch out for these common pitfalls:

Overreliance on the Model: Don't assume the model will always produce perfect classifications. Implement checks and balances in your system.
Ignoring Rate Limits: Be aware of and respect the rate limits of your LLM provider.
Neglecting Security: Always protect your API keys and sensitive data.
Lack of Monitoring: Implement proper logging and monitoring to catch issues early.
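
One common way to respect rate limits is to wait longer after each failed attempt. The sketch below computes a capped exponential backoff delay with the standard library only; `backoff_delay` and its parameters are assumptions for illustration, and the right base and cap depend on your provider's limits:

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// capped at `max`. Saturating arithmetic avoids overflow for large attempts.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}
```

A retry loop would sleep for `backoff_delay(attempt, base, max)` between attempts, for example with `tokio::time::sleep` in the async code above, and give up after a fixed number of tries.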

Conclusion and Next Steps

In this guide, we've built a text classification system using Rig, demonstrating its capability to simplify LLM-powered applications in Rust. By creating both a sentiment analyzer and a news article classifier, we've showcased Rig's flexibility and power in handling diverse text classification tasks.

To further explore Rig's capabilities:

Adapt the provided code examples to your specific use cases.
Dive deeper into Rig's features by exploring the documentation.
Experiment with different models and classification tasks.

For additional resources and community engagement:

Browse more examples in our gallery.
Contribute or report issues on GitHub.
Join discussions in our Discord community.

We're continually improving Rig based on user feedback. If you build a project with Rig, consider sharing your experience through our feedback form and get rewarded with $100. Your insights are valuable in shaping Rig's future development.

Happy coding!

Ad Astra,

Tachi
Co-Founder @ Playgrounds Analytics