Large Language Models (LLMs) are powerful AI systems that analyze and generate human-like text. However, they are not perfect and can sometimes produce incorrect or misleading responses—this phenomenon is known as hallucination. To mitigate these inaccuracies, we can use three key approaches:
- Prompt Engineering
- Retrieval-Augmented Generation (RAG)
- Fine-Tuning
1. Prompt Engineering: Maximizing the Context Window
One way to improve an LLM’s accuracy in analyzing blog comments is by structuring the prompt with relevant context. This helps the model understand sentiment better and generate more precise responses. However, every model has a token limit. For example:
- OpenAI’s GPT-4 supports up to 8,192 tokens (GPT-4o extends this to 128,000)
- GPT-3.5 Turbo has a 4,096-token limit
- Google’s Gemini 1.5 Flash supports 1 million tokens
- Google’s Gemini 1.5 Pro supports a massive 2 million tokens
A context window refers to the amount of information a model can process at once. With such high limits in Gemini models, you might wonder:
🤔 Is RAG dead?
Absolutely not! Despite larger context windows, RAG remains essential for handling vast, structured, and real-time external knowledge retrieval. I’ll discuss this in detail in an upcoming blog.
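To make this concrete, here is a minimal sketch of a context-rich prompt for classifying a blog comment’s sentiment. This is only an illustration, not the post’s implementation: the OpenAI Python SDK call and the model name are assumptions, and any chat-completion API would work the same way.
# Minimal prompt-engineering sketch (illustrative; assumes the OpenAI Python SDK
# and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

comment = "The blog is insightful and provides great value to the reader."

# Pack the relevant context (task, label set, audience) into the prompt itself.
prompt = (
    "You are a sentiment classifier for blog comments written by developers.\n"
    "Classify the comment below as exactly one of: Very good, Good, Bad.\n\n"
    f"Comment: {comment}\n"
    "Sentiment:"
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name, for illustration only
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)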
2. Retrieval-Augmented Generation (RAG): Enhancing Sentiment Accuracy
Retrieval-Augmented Generation (RAG) is a strategy that addresses both LLM hallucinations and out-of-date training data. RAG combines the strengths of information retrieval systems with the generative capabilities of LLMs: by dynamically retrieving relevant context from external data sources, it enables the model to generate accurate, contextually grounded responses when real-time or domain-specific knowledge is required, mitigating hallucinations and compensating for stale training data.
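As a minimal sketch of the pattern (not the post’s implementation): retrieve the documents most relevant to a query, prepend them to the prompt, and let the LLM answer grounded in that context. The embed, vector_store, and llm_generate names below are hypothetical placeholders for an embedding model, a vector database, and an LLM call.
# Minimal RAG sketch: retrieve context, then augment the prompt.
def rag_answer(question, vector_store, embed, llm_generate, k=3):
    # 1. Retrieve: find the k documents most similar to the question.
    query_vector = embed(question)
    retrieved_docs = vector_store.similarity_search(query_vector, k=k)

    # 2. Augment: ground the prompt in the retrieved text.
    context = "\n".join(doc.text for doc in retrieved_docs)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

    # 3. Generate: the LLM responds with current, domain-specific grounding.
    return llm_generate(prompt)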
3. Fine-Tuning: Customizing LLMs for Sentiment Analysis
For highly specialized sentiment analysis, fine-tuning an LLM on domain-specific comments is the best approach. Fine-tuning allows the model to learn specific sentiment patterns, improving accuracy in classifying user feedback. However, this comes with challenges:
- Computational Cost – Fine-tuning requires significant computing resources.
- Custom Datasets – You need to prepare and label domain-specific data for training.
Fine-Tuning a BERT Model for Blog Comment Sentiment Analysis
In this blog, we will explore how to fine-tune a BERT transformer model using the Hugging Face Transformers library. I’ll use a custom dataset of blog comments with sentiment labels:
- 2 – Very good
- 1 – Good
- 0 – Bad
Since these labels are ordinal (having a meaningful order), we use label encoding instead of one-hot encoding to convert them into numerical values for model training. Additionally, I’ll be using TensorFlow for fine-tuning, though PyTorch is another viable option.
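As a quick illustration of label encoding (equivalent to hand-assigning the integers used in the code below), scikit-learn’s LabelEncoder maps each category to an integer. Note that it orders classes alphabetically, which here happens to match the 0/1/2 scheme.
# Label encoding: map ordinal sentiment categories to integers.
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["Bad", "Good", "Very good", "Good"])
print(encoded)           # [0 1 2 1]
print(encoder.classes_)  # ['Bad' 'Good' 'Very good']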
Fine-Tuning Architecture
from sklearn.model_selection import train_test_split
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
import numpy as np
import pandas as pd
# Sample text corpus and labels
documents = [
"The blog is insightful and provides great value to the reader.",
"The content is useful but lacks depth in certain areas.",
"Sreeni's blog offers excellent advice and strategies for developers.",
"It was a bit too technical for someone new to the topic.",
"I appreciate how the blog explains complex concepts simply.",
"The blog is very helpful for beginners and experts alike.",
"The writing could use more examples to enhance understanding.",
"Fantastic explanations with clear and practical takeaways.",
"The blog could have gone deeper into some of the topics.",
"A must-read for anyone looking to improve their skills in AI.",
"The blog offers great insights but lacks some real-world examples.",
"The information provided is accurate but presented in a monotonous manner.",
"Excellent resource for understanding advanced topics.",
"The blog is a bit too brief on certain technical details.",
"Sreeni's examples in the blog are very effective and easy to follow.",
"The content was informative, but I expected more detailed case studies.",
"One of the best blogs I’ve read for learning new technologies.",
"The blog was a bit too long and repetitive at times.",
"Sreeni's writing style is clear and engaging throughout the blog.",
"The information is solid, but the delivery could be improved."
]
# Labels (2 for very good, 1 for good, 0 for bad)
labels = [2, 1, 2, 0, 2, 2, 1, 2, 1, 2, 1, 0, 2, 1, 2, 1, 2, 0, 2, 1]
dataset = pd.DataFrame({'text': documents, 'label': labels})
print(dataset.head())
# Load pre-trained BERT tokenizer and model for sequence classification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3) # 3 labels (0, 1, 2)
# Tokenize the text data
inputs = tokenizer(documents, padding=True, truncation=True, return_tensors='tf')
# Prepare the dataset for TensorFlow
input_ids = np.array(inputs['input_ids'])
attention_mask = np.array(inputs['attention_mask'])
labels = np.array(labels)
# Split the data into training and testing sets
train_input_ids, test_input_ids, train_attention_mask, test_attention_mask, train_labels, test_labels = train_test_split(
    input_ids, attention_mask, labels, test_size=0.2, random_state=42
)
# Convert data to TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': train_input_ids, 'attention_mask': train_attention_mask},
    train_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': test_input_ids, 'attention_mask': test_attention_mask},
    test_labels
))
# Batch and prefetch datasets
train_dataset = train_dataset.batch(4).prefetch(tf.data.experimental.AUTOTUNE)
test_dataset = test_dataset.batch(4).prefetch(tf.data.experimental.AUTOTUNE)
# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
# Train the model
model.fit(train_dataset, epochs=3, validation_data=test_dataset)
# Evaluate the model
results = model.evaluate(test_dataset)
print("Evaluation results:")
print(results)
# New text data to test
new_documents = [
"I found the blog to be a valuable resource.",
"It was informative but not as engaging as I expected."
]
# Tokenize the new data
new_inputs = tokenizer(new_documents, padding=True, truncation=True, return_tensors='tf')
# Prepare input tensors for new data
new_input_ids = np.array(new_inputs['input_ids'])
new_attention_mask = np.array(new_inputs['attention_mask'])
# Create a TensorFlow dataset for the new data (no labels, as it's for inference)
new_dataset = tf.data.Dataset.from_tensor_slices(
    {'input_ids': new_input_ids, 'attention_mask': new_attention_mask}
)
# Batch and prefetch the new data for prediction
new_dataset = new_dataset.batch(4).prefetch(tf.data.experimental.AUTOTUNE)
# Predict using the trained model
predictions = model.predict(new_dataset)
# Get the predicted class (the one with the highest probability)
predicted_labels = np.argmax(predictions.logits, axis=1)
# Print the predictions
for doc, label in zip(new_documents, predicted_labels):
    print(f"Document: {doc}")
    print(f"Predicted Label: {label} (0 for Bad, 1 for Good, 2 for Very Good)")
    print()
Understanding the Output
val_accuracy: 0.7500
This suggests an accuracy score of 75%, meaning the model correctly classified 75% of the test data.
1/1 [==============================] - 0s 86ms/step - loss: 0.5542 - accuracy: 0.7500
Note: we only trained for 3 epochs.
Loss (0.5542): Represents how well (or poorly) the model's predictions match the true labels. Lower values indicate better performance.
Accuracy (0.7500): Confirms that the model achieved 75% accuracy during evaluation.
1/1: Indicates that the test dataset had only one batch (likely a small dataset).
86ms/step: The time taken per evaluation step (86 milliseconds).
Evaluation results: [0.5542, 0.75]
The list represents [loss, accuracy], which means the final evaluation loss is 0.5542 and the accuracy is 75%.
1/1 [==============================] - 7s 7s/step
This line suggests that another process (possibly prediction or additional evaluation) took 7 seconds per step, which is significantly longer than the previous step (86ms). This could be due to batch size, model complexity, or hardware limitations.
Interpretation
A 75% accuracy is decent but could be improved by:
- Increasing training data or improving data quality
- Experimenting with different hyperparameters (see the sketch below)
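For instance, here is a hedged sketch of the kind of tweaks worth trying; the values below are illustrative starting points, not tuned results.
# Illustrative hyperparameter tweaks (values are assumptions, not tuned).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),  # try values in the 2e-5 to 5e-5 range
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
# More epochs can help on a small dataset, but watch the validation loss for overfitting.
model.fit(train_dataset, epochs=10, validation_data=test_dataset)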
Thanks
Sreeni Ramadorai