Raheel Siddiqui

Lowering Your Gemini API Bill: A Guide to Context Caching

Last week, I spent hours debugging why our RAG system was burning through our API budget like there was no tomorrow. Then I discovered Gemini's context caching feature - a game-changer for anyone working with large context windows.

Let me walk you through what this is, why it matters, and how to implement it in your projects.

The Problem: Repeated Context = Wasted Tokens

If you've built LLM applications, you've probably encountered this scenario: You have a large document, knowledge base, or system prompt that needs to be included with every user query.

The traditional approach looks like this:

  1. User asks a question
  2. Your code combines the full context + their question
  3. Send everything to the LLM
  4. Repeat for every single question

This means sending the same tokens over and over again (a quick sketch of this pattern follows the list below). For large contexts, this adds up quickly in terms of:

  • API costs (paying for the same tokens repeatedly)
  • Latency (transferring large amounts of data)
  • Processing time (model must process all tokens with each request)
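
To make that concrete, here's a minimal sketch of the traditional flow using the google-genai client (the knowledge base and questions are placeholders): the full context rides along with every single call.

import os
from google import genai

client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))

knowledge_base = "..."  # the same large document, loaded once but resent on every call

def ask(question):
    # Every request carries the full context plus the question,
    # so you pay for all of those tokens again and again.
    return client.models.generate_content(
        model="models/gemini-1.5-flash-001",
        contents=f"{knowledge_base}\n\nQuestion: {question}",
    )

for question in ["How do I reset my password?", "Can I integrate with Salesforce?"]:
    print(ask(question).text)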

Enter Context Caching

Gemini's context caching feature lets you upload content once, store it server-side, and reference it in subsequent requests. Think of it as creating a temporary knowledge base that the model can access without you needing to resend it.

import os
from google import genai
from google.genai import types

# Configure the client
client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))

# Large knowledge base or system instruction
knowledge_base = """
[Your large document, instructions, or context here - must be at least 32,768 tokens]
"""

# Create a cache (note the model version suffix is required)
cache = client.caches.create(
    model="models/gemini-1.5-pro-001",  # Must include version suffix
    config=types.CreateCachedContentConfig(
        display_name="my_knowledge_base",
        system_instruction="You are a helpful assistant that answers questions based on the provided knowledge base.",
        contents=[knowledge_base],
        ttl="3600s",  # 1 hour cache lifetime
    )
)

# Now you can query using just the user's question
response = client.models.generate_content(
    model="models/gemini-1.5-pro-001", 
    contents="Who was the founder of the company?",
    config=types.GenerateContentConfig(cached_content=cache.name)
)

print(response.text)
print(response.usage_metadata)

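Before moving on, check usage_metadata on the response: it tells you whether the cache was actually used. Continuing from the request above, these are the fields I look at (the same ones the support-bot example below reports):

usage = response.usage_metadata
print(f"Cached tokens:   {usage.cached_content_token_count}")  # tokens served from the cache
print(f"Prompt tokens:   {usage.prompt_token_count}")          # input tokens for this request
print(f"Response tokens: {usage.candidates_token_count}")

# If cached_content_token_count is zero or missing, the cache wasn't used -
# double-check that cached_content=cache.name made it into the config.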

When Context Caching Shines

From my experience building production systems, context caching works best for:

1. Document Q&A Systems

If you're building a system to answer questions about large documents (legal contracts, technical manuals, research papers), caching is perfect. Cache the document once, then let users ask multiple questions without resending it.

2. Complex RAG Systems

When implementing retrieval-augmented generation with extensive knowledge bases, you can cache frequently accessed chunks or entire document collections.

3. Video/Audio Analysis

If you're analyzing long media files, caching prevents repeatedly sending the same massive file with each analytical query (a sketch of this pattern follows the last point below).

4. Consistent System Instructions

For applications that use elaborate system prompts or few-shot examples, caching these instructions saves tokens on every request.
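
For the media case (point 3 above), the pattern I've used is to upload the file once through the Files API, wait for it to finish processing, and then cache the uploaded file. A rough sketch, assuming a local product_demo.mp4 and a google-genai version that accepts file= on files.upload:

import os
import time
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))

# Upload the media file once (hypothetical local file)
video_file = client.files.upload(file="product_demo.mp4")

# Wait until the file has finished processing before caching it
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = client.files.get(name=video_file.name)

# Cache the uploaded file so follow-up questions don't resend it
cache = client.caches.create(
    model="models/gemini-1.5-flash-001",
    config=types.CreateCachedContentConfig(
        display_name="demo_video_cache",
        contents=[video_file],
        ttl="3600s",
    ),
)

response = client.models.generate_content(
    model="models/gemini-1.5-flash-001",
    contents="Summarize the first five minutes of the video.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)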

Real-World Implementation Example

Here's a practical example from a customer support system I built recently:

import os
import time
from google import genai
from google.genai import types

class CachedKnowledgeBase:
    def __init__(self, api_key, model="models/gemini-1.5-flash-001", cache_hours=24):
        self.client = genai.Client(api_key=api_key)
        self.model = model
        self.cache = None
        self.ttl_seconds = int(cache_hours * 3600)
        self.cache_created = False

    def load_knowledge_base(self, kb_file_path, system_instruction=None):
        """Load and cache the knowledge base from a file"""
        # Read the knowledge base file
        with open(kb_file_path, 'r') as file:
            kb_content = file.read()

        # Set default system instruction if none provided
        if not system_instruction:
            system_instruction = """
            You are a customer support specialist. Answer customer questions 
            based ONLY on the information in the knowledge base. 
            If you don't know the answer, say so clearly rather than making something up.
            Always be polite, concise, and helpful.
            """

        # Create the cache
        try:
            self.cache = self.client.caches.create(
                model=self.model,
                config=types.CreateCachedContentConfig(
                    display_name=f"support_kb_{os.path.basename(kb_file_path)}",
                    system_instruction=system_instruction,
                    contents=[kb_content],
                    ttl=f"{self.ttl_seconds}s",
                )
            )
            self.cache_created = True
            print(f"Knowledge base cached successfully! (ID: {self.cache.name})")
            print(f"Cache will expire in {self.ttl_seconds/3600} hours")
            return True
        except Exception as e:
            print(f"Failed to cache knowledge base: {e}")
            return False

    def answer_question(self, question, temperature=0.2):
        """Answer a customer question using the cached knowledge base"""
        if not self.cache_created:
            raise Exception("Knowledge base not cached. Call load_knowledge_base first.")

        try:
            start_time = time.time()
            response = self.client.models.generate_content(
                model=self.model,
                contents=question,
                config=types.GenerateContentConfig(
                    cached_content=self.cache.name,
                    temperature=temperature
                )
            )
            end_time = time.time()

            # Extract token usage info
            usage = response.usage_metadata

            # Return the response and metadata
            return {
                "answer": response.text,
                "response_time": round(end_time - start_time, 2),
                "cached_tokens": usage.cached_content_token_count,
                "prompt_tokens": usage.prompt_token_count,
                "response_tokens": usage.candidates_token_count,
                "total_tokens": usage.total_token_count
            }
        except Exception as e:
            return {"error": str(e)}

    def extend_cache(self, additional_hours=24):
        """Extend the cache lifetime"""
        if not self.cache_created:
            return False

        new_ttl = int(additional_hours * 3600)
        try:
            self.client.caches.update(
                name=self.cache.name,
                config=types.UpdateCachedContentConfig(
                    ttl=f"{new_ttl}s"
                )
            )
            self.ttl_seconds = new_ttl
            print(f"Cache extended by {additional_hours} hours")
            return True
        except Exception as e:
            print(f"Failed to extend cache: {e}")
            return False

    def cleanup(self):
        """Delete the cache when no longer needed"""
        if self.cache_created:
            try:
                self.client.caches.delete(self.cache.name)
                print("Cache deleted successfully")
                self.cache_created = False
                return True
            except Exception as e:
                print(f"Failed to delete cache: {e}")
                return False

# Usage example
if __name__ == "__main__":
    support_bot = CachedKnowledgeBase(
        api_key=os.environ.get("GOOGLE_API_KEY"),
        model="models/gemini-1.5-flash-001",
        cache_hours=48
    )

    # Load the knowledge base
    support_bot.load_knowledge_base(
        "product_documentation.txt",
        system_instruction="""
        You are a technical support assistant for our cloud product.
        Answer customer questions precisely based on our documentation.
        Include specific steps when describing how to solve technical issues.
        If information isn't in the documentation, direct the customer to contact
        live support rather than guessing.
        """
    )

    # Example customer questions
    questions = [
        "How do I reset my password?",
        "What's the difference between Basic and Pro plans?",
        "Can I integrate with Salesforce?",
        "What are the system requirements?",
        "How do I set up two-factor authentication?"
    ]

    # Process all questions and track total token usage
    total_cached_tokens = 0
    total_prompt_tokens = 0
    total_response_tokens = 0

    for i, question in enumerate(questions):
        print(f"\nQuestion {i+1}: {question}")
        result = support_bot.answer_question(question)

        if "error" in result:
            print(f"Error: {result['error']}")
            continue

        print(f"Answer: {result['answer'][:150]}...")
        print(f"Response time: {result['response_time']}s")
        print(f"Tokens: {result['prompt_tokens']} prompt + {result['cached_tokens']} cached + {result['response_tokens']} response")

        total_cached_tokens = result['cached_tokens']  # Same for all queries
        total_prompt_tokens += result['prompt_tokens']
        total_response_tokens += result['response_tokens']

    # Calculate cost estimates
    # These are example rates - adjust based on current pricing
    cached_storage_cost = (total_cached_tokens / 1_000_000) * 48 * 1  # $1 per million tokens per hour
    standard_approach_cost = ((total_cached_tokens * len(questions)) / 1_000) * 0.0005
    cached_approach_cost = ((total_prompt_tokens + total_response_tokens) / 1_000) * 0.0005 + cached_storage_cost

    print("\n--- Cost Analysis ---")
    print(f"Standard approach (resending context): ${standard_approach_cost:.2f}")
    print(f"Using context caching: ${cached_approach_cost:.2f}")
    print(f"Savings: ${standard_approach_cost - cached_approach_cost:.2f} ({(1 - cached_approach_cost/standard_approach_cost) * 100:.1f}%)")

    # Clean up when done
    support_bot.cleanup()


Cost Considerations: When Is It Worth It?

Context caching isn't free - you're paying for storage time. Here's how the costs break down:

  1. Storage Cost: $1 per million tokens per hour
  2. Processing Cost: You still pay for processing the cached tokens, but at a reduced rate

Let's look at a real example from a project I worked on last month:

  • Knowledge base: 50,000 tokens
  • Cache duration: 24 hours
  • Average user makes 15 queries per day
  • Average query: 25 tokens
  • Average response: 200 tokens

Without caching:

  • Total tokens processed: 50,025 tokens × 15 queries = 750,375 tokens per day
  • Cost at $0.0005 per 1K tokens: $0.38 per day per user

With caching:

  • Storage cost: (50,000 tokens ÷ 1,000,000) × 24 hours × $1 = $1.20 for 24 hours
  • Processing cost: (25 + 200) tokens × 15 queries = 3,375 tokens, at $0.0005 per 1K tokens ≈ $0.0017
  • Total cost: $1.20 + $0.0017 ≈ $1.20 per day

In this scenario, caching doesn't make financial sense. But when serving 100+ users with the same knowledge base, the economics flip dramatically:

Without caching (100 users): $0.38 × 100 = $38 per day
With caching (100 users): $1.20 + ($0.0017 × 100) = $1.37 per day

That's a 96% cost reduction!
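
To sanity-check scenarios like this before committing, I keep a small back-of-the-envelope helper around. It uses the same illustrative rates as above ($1 per million cached tokens per hour for storage, $0.0005 per 1K tokens for processing), so treat its output as a rough estimate rather than current pricing:

def caching_break_even(context_tokens, queries_per_day, users,
                       query_tokens=25, response_tokens=200, cache_hours=24,
                       storage_rate_per_m_per_hour=1.0, token_rate_per_1k=0.0005):
    """Rough daily cost comparison using the example rates from this post."""
    # Without caching: the full context is resent with every query by every user
    per_query_tokens = context_tokens + query_tokens + response_tokens
    without = (per_query_tokens * queries_per_day * users / 1_000) * token_rate_per_1k

    # With caching: pay once for storage, then only the small per-query traffic
    storage = (context_tokens / 1_000_000) * cache_hours * storage_rate_per_m_per_hour
    traffic = ((query_tokens + response_tokens) * queries_per_day * users / 1_000) * token_rate_per_1k
    return without, storage + traffic

without, with_cache = caching_break_even(50_000, 15, 100)
print(f"Without caching: ${without:.2f}/day, with caching: ${with_cache:.2f}/day")
# Matches the 100-user scenario above: about $38/day vs. about $1.37/day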

Implementation Tips from the Trenches

After implementing this in several projects, here are my hard-earned tips:

  1. Version Requirement: Always include the version suffix (e.g., -001) when specifying the model.

  2. Minimum Token Requirement: The context must be at least 32,768 tokens. This is a current limitation that Google will hopefully reduce in the future.

  3. Cache Management: Implement cache lifecycle management. Here's a pattern I use:

import datetime
import os
from google import genai
from google.genai import types

class GeminiCacheManager:
    def __init__(self, api_key=None):
        self.client = genai.Client(api_key=api_key or os.environ.get("GOOGLE_API_KEY"))

    def list_all_caches(self):
        """List all active caches with metadata"""
        caches = []
        try:
            response = self.client.caches.list()
            for cache in response:
                # Calculate remaining time
                if hasattr(cache, 'expire_time') and cache.expire_time:
                    now = datetime.datetime.now(datetime.timezone.utc)
                    expire_time = cache.expire_time
                    remaining = expire_time - now
                    remaining_hours = remaining.total_seconds() / 3600
                else:
                    remaining_hours = "Unknown"

                caches.append({
                    "name": cache.name,
                    "display_name": cache.display_name if hasattr(cache, 'display_name') else "Unnamed",
                    "model": cache.model if hasattr(cache, 'model') else "Unknown",
                    "created": cache.create_time.isoformat() if hasattr(cache, 'create_time') else "Unknown",
                    "expires": cache.expire_time.isoformat() if hasattr(cache, 'expire_time') else "Unknown",
                    "remaining_hours": remaining_hours,
                })
            return caches
        except Exception as e:
            print(f"Error listing caches: {e}")
            return []

    def extend_all_caches(self, additional_hours=24):
        """Extend all active caches by the specified hours"""
        extended = 0
        failed = 0

        caches = self.list_all_caches()
        for cache in caches:
            try:
                # Calculate new expiration time
                new_expiry = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=additional_hours)

                self.client.caches.update(
                    name=cache["name"],
                    config=types.UpdateCachedContentConfig(
                        expire_time=new_expiry
                    )
                )
                extended += 1
                print(f"Extended cache '{cache['display_name']}' to expire at {new_expiry.isoformat()}")
            except Exception as e:
                failed += 1
                print(f"Failed to extend cache '{cache['display_name']}': {e}")

        return {"extended": extended, "failed": failed}

    def cleanup_expired_caches(self):
        """Delete caches that have expired or are about to expire (within 10 minutes)"""
        deleted = 0
        failed = 0

        caches = self.list_all_caches()

        for cache in caches:
            if cache["remaining_hours"] != "Unknown" and isinstance(cache["remaining_hours"], (int, float)):
                if cache["remaining_hours"] < (10/60):  # Less than 10 minutes remaining
                    try:
                        self.client.caches.delete(cache["name"])
                        deleted += 1
                        print(f"Deleted expired/soon-to-expire cache: {cache['display_name']}")
                    except Exception as e:
                        failed += 1
                        print(f"Failed to delete cache '{cache['display_name']}': {e}")

        return {"deleted": deleted, "failed": failed}

# Usage example
if __name__ == "__main__":
    manager = GeminiCacheManager()

    print("--- Current Caches ---")
    caches = manager.list_all_caches()
    for i, cache in enumerate(caches):
        print(f"{i+1}. {cache['display_name']} (Model: {cache['model']})")
        print(f"   Created: {cache['created']}")
        print(f"   Expires: {cache['expires']}")
        print(f"   Remaining: {cache['remaining_hours']} hours")
        print()

    # Example: Extend all caches by 12 more hours
    if len(caches) > 0:
        response = manager.extend_all_caches(additional_hours=12)
        print(f"Extended {response['extended']} caches, {response['failed']} failed")

    # Example: Clean up soon-to-expire caches
    response = manager.cleanup_expired_caches()
    print(f"Deleted {response['deleted']} expired caches, {response['failed']} failed")

  4. Model Selection: Both Gemini 1.5 Pro and Flash support context caching. From my testing, Flash works great for most use cases and costs less.

  5. Latency Expectations: Currently, context caching primarily reduces costs rather than latency. Don't expect dramatic performance improvements (yet).

When NOT to Use Context Caching

After burning through some unnecessary API costs, I learned when context caching isn't worth it:

  1. Small Contexts: If your context is under 32,768 tokens, you can't use caching (current limitation). A quick pre-check sketch follows this list.

  2. Single-Query Use Cases: If users typically ask just one question about a document, the storage cost outweighs the benefits.

  3. Rapidly Changing Data: If your reference data changes frequently, caching becomes inefficient.

  4. Very Low Query Volume: For applications with few users or infrequent queries, standard approaches may be more cost-effective.
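
For point 1, it's worth measuring the context before you even try to create a cache. A small sketch using the SDK's count_tokens call (product_documentation.txt is a placeholder; 32,768 is the minimum discussed above):

import os
from google import genai

client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))

MIN_CACHE_TOKENS = 32_768  # current minimum context size for caching

with open("product_documentation.txt", "r") as f:
    knowledge_base = f.read()

# count_tokens reports the size before you commit to paying for cache storage
token_count = client.models.count_tokens(
    model="models/gemini-1.5-flash-001",
    contents=knowledge_base,
).total_tokens

if token_count >= MIN_CACHE_TOKENS:
    print(f"{token_count} tokens - large enough to cache")
else:
    print(f"{token_count} tokens - below the minimum, send the context inline instead")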

The Future of Context Caching

I'm optimistic about where this is heading. As LLM applications mature, features like context caching will become essential infrastructure. I expect future improvements to include:

  • Support for smaller context sizes
  • Improved latency
  • More granular caching controls
  • Potential for persistent caching beyond current TTL limits

Conclusion

Context caching is one of those features that might seem minor but can dramatically impact your application's economics and architecture. For multi-user applications dealing with large contexts, it's a potential game-changer that can cut costs by 90%+ in the right scenarios.

Have you implemented context caching in your Gemini applications? I'd love to hear about your experiences and any creative uses you've found for this feature. Drop a comment below or reach out on LinkedIn.

