In my previous posts, we explored how LangChain simplifies AI application development and how to deploy Gemini-powered LangChain applications on GKE. Now, let's take a look at a slightly different approach: running your own instance of Gemma, Google's open large language model, directly within your GKE cluster and integrating it with LangChain.
Why choose Gemma on GKE?
While using an LLM endpoint like Gemini is convenient, running an open model like Gemma 2 on your GKE cluster can offer several advantages:
- Control: You have complete control over the model, its resources, and its scaling. This is particularly important for applications with strict performance or security requirements.
- Customization: You can fine-tune the model on your own datasets to optimize it for specific tasks or domains.
- Cost optimization: For high-volume usage, running your own instance can potentially be more cost-effective than using the API.
- Data locality: Keep your data and model within your controlled environment, which can be crucial for compliance and privacy.
- Experimentation: You can experiment with the latest research and techniques without being limited by the API's features.
Deploying Gemma on GKE
Deploying Gemma on GKE involves several steps, from setting up your GKE cluster to configuring LangChain to use your Gemma instance as its LLM.
Set up credentials
To use the Gemma 2 model, you first need a Hugging Face account. Create one if you don't already have one, then generate an access token with read permissions from your settings page. Make sure to note down the token value; we'll need it shortly.
Then, go to the model consent page and accept the terms and conditions for using the Gemma 2 model. Once that is done, we're ready to deploy our open model.
Set up your GKE Cluster
If you don't already have a GKE cluster, you can create one through the Google Cloud console or with the gcloud command-line tool. Make sure to choose a machine type with enough resources to run Gemma, such as the g2-standard family, which comes with an attached NVIDIA L4 GPU. To keep things simple, we'll create a GKE Autopilot cluster, which provisions nodes automatically based on the workload's requests.
gcloud container clusters create-auto langchain-cluster \
--project=PROJECT_ID \
--region=us-central1
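Once the cluster is ready, fetch its credentials so that the kubectl commands in the next steps target it:
gcloud container clusters get-credentials langchain-cluster \
    --region=us-central1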
Deploy a Gemma 2 instance
For this example, we'll deploy an instruction-tuned instance of Gemma 2 using a vLLM image. The following manifest describes a Deployment and corresponding Service for the gemma-2-2b-it model, along with a Secret holding your Hugging Face token. Replace HUGGINGFACE_TOKEN with the token you generated earlier.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2-2b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: model-garden
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250114_0916_RC00_maas
        resources:
          requests:
            cpu: 2
            memory: 34Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1 # One NVIDIA L4 GPU (selected via the nodeSelector below)
          limits:
            cpu: 2
            memory: 34Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1
        # Start the vLLM API server and serve the instruction-tuned Gemma 2 2B model
        args:
        - python
        - -m
        - vllm.entrypoints.api_server
        - --host=0.0.0.0
        - --port=8000
        - --model=google/gemma-2-2b-it
        - --tensor-parallel-size=1
        - --swap-space=16
        - --gpu-memory-utilization=0.95
        - --enable-chunked-prefill
        - --disable-log-stats
        env:
        - name: MODEL_ID
          value: google/gemma-2-2b-it
        - name: DEPLOY_SOURCE
          value: "UI_NATIVE_MODEL"
        # Hugging Face token used to download the model weights
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      # Shared-memory volume backed by RAM, used by vLLM
      - name: dshm
        emptyDir:
          medium: Memory
      # Schedule onto a node with an NVIDIA L4 GPU
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  hf_api_token: HUGGINGFACE_TOKEN # Replace with your Hugging Face token
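One note on the Secret at the end of this manifest: if you'd rather not keep your token in a YAML file, you can drop that block and create the secret directly with kubectl before applying the manifest. A minimal sketch:
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=HUGGINGFACE_TOKEN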
Save the manifest to a file called gemma-2-deployment.yaml, then deploy it to your cluster:
kubectl apply -f gemma-2-deployment.yaml
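vLLM needs a few minutes to pull the image, download the model weights, and start serving. You can keep an eye on the rollout with kubectl; the exact log output depends on the image version, so treat the test request below as a sketch of the vLLM /generate API rather than a guaranteed contract:
# Watch the Gemma pod until it is Running and Ready
kubectl get pods -l app=gemma-server --watch

# Tail the inference server logs to confirm the model has loaded
kubectl logs -f -l app=gemma-server

# Optional smoke test: port-forward the service, then send a prompt from another terminal
kubectl port-forward service/llm-service 8000:8000
curl -X POST -H "Content-Type: application/json" \
    -d '{"prompt": "Why is the sky blue?", "max_tokens": 64}' \
    http://localhost:8000/generate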
Deploying LangChain on GKE
Now that we have our GKE cluster and Gemma deployed, we need to create our LangChain application and deploy it. If you've followed my previous post, you'll notice that these steps are very similar. The main differences are that we're pointing LangChain at Gemma instead of Gemini, and that our LangChain application uses a custom LLM class to call our local instance of Gemma.
Containerize your LangChain application
First, we need to package our LangChain application into a Docker container. This involves writing the application itself, a Dockerfile that specifies its environment, and a requirements file listing its dependencies. Here is the Python application using LangChain and Gemma, which we'll save as app.py:
from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.prompts import ChatPromptTemplate
from typing import Any, Optional
from flask import Flask, request
import requests


class VLLMServerLLM(LLM):
    """Custom LangChain LLM that calls the vLLM server running in the cluster."""

    vllm_url: str
    model: Optional[str] = None
    temperature: float = 0.0
    max_tokens: int = 2048

    @property
    def _llm_type(self) -> str:
        return "vllm_server"

    def _call(
        self,
        prompt: str,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        headers = {"Content-Type": "application/json"}
        payload = {
            "prompt": prompt,
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
            **kwargs
        }
        if self.model:
            payload["model"] = self.model

        try:
            response = requests.post(self.vllm_url, headers=headers, json=payload, timeout=120)
            response.raise_for_status()

            # The server is expected to return generations under a "predictions" key
            json_response = response.json()
            if isinstance(json_response, dict) and "predictions" in json_response:
                text = json_response["predictions"][0]
            else:
                raise ValueError(f"Unexpected response format from vLLM server: {json_response}")

            return text
        except requests.exceptions.RequestException as e:
            raise ValueError(f"Error communicating with vLLM server: {e}")
        except (KeyError, TypeError) as e:
            raise ValueError(f"Error parsing vLLM server response: {e}. Response was: {json_response}")


# Point the custom LLM at the in-cluster Gemma service
llm = VLLMServerLLM(vllm_url="http://llm-service:8000/generate", temperature=0.7, max_tokens=512)

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that answers questions about a given topic.",
        ),
        ("human", "{input}"),
    ]
)

# Chain the prompt template with the Gemma-backed LLM
chain = prompt | llm


def create_app():
    app = Flask(__name__)

    @app.route("/ask", methods=['POST'])
    def talkToGemini():
        # Forward the user's question through the chain and return the completion
        user_input = request.json['input']
        response = chain.invoke({"input": user_input})
        return response

    return app


if __name__ == "__main__":
    app = create_app()
    app.run(host='0.0.0.0', port=80)
Then, create a Dockerfile to define how to assemble our image:
# Use an official Python runtime as a parent image
FROM python:3-slim
# Set the working directory in the container
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
# Run app.py when the container launches
CMD [ "python", "app.py" ]
For our dependencies, create the requirements.txt file containing LangChain, the Flask web framework, and the Requests library used by our custom LLM class:
langchain
flask
requests
Finally, build the container image and push it to Artifact Registry. Don't forget to replace PROJECT_ID with your Google Cloud project ID.
# Authenticate with Google Cloud
gcloud auth login
# Create the repository
gcloud artifacts repositories create images \
--repository-format=docker \
--location=us
# Configure Docker authentication for Artifact Registry
gcloud auth configure-docker us-docker.pkg.dev
# Build the image
docker build -t us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1 .
# Push the image
docker push us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1
Once the push completes, your container image is stored in your Artifact Registry repository.
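If you'd prefer not to build the image locally, Cloud Build can build and push it for you. A minimal sketch, assuming the Cloud Build API is enabled on your project:
gcloud builds submit \
    --tag us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1 .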
Deploy to GKE
Create a YAML file with your Kubernetes Deployment and Service manifests. Let's call it deployment.yaml, again replacing PROJECT_ID with your project ID.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langchain-deployment
spec:
  replicas: 3 # Scale as needed
  selector:
    matchLabels:
      app: langchain-app
  template:
    metadata:
      labels:
        app: langchain-app
    spec:
      containers:
      - name: langchain-container
        image: us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: langchain-service
spec:
  selector:
    app: langchain-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer # Exposes the service externally
Apply the manifest to your cluster:
# Get the context of your cluster
gcloud container clusters get-credentials langchain-cluster --region us-central1
# Deploy the manifest
kubectl apply -f deployment.yaml
This creates a deployment with three replicas of your LangChain application and exposes it externally through a load balancer. You can adjust the number of replicas based on your expected load.
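You can confirm that the rollout finished and that the load balancer has been provisioned before moving on:
# Wait until all replicas are available
kubectl rollout status deployment/langchain-deployment

# The EXTERNAL-IP column is populated once the load balancer is ready
kubectl get service langchain-service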
Interact with your deployed application
Once the service is deployed, you can get the external IP address of your application using:
export EXTERNAL_IP=`kubectl get service/langchain-service \
--output jsonpath='{.status.loadBalancer.ingress[0].ip}'`
You can now send requests to your LangChain application running on GKE. For example:
curl -X POST -H "Content-Type: application/json" \
-d '{"input": "Tell me a fun fact about hummingbirds"}' \
http://$EXTERNAL_IP/ask
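You can of course call the endpoint from code as well. Here's a small Python sketch that assumes the EXTERNAL_IP environment variable from the previous step is set:
import os

import requests

# External IP of the langchain-service load balancer (see previous step)
external_ip = os.environ["EXTERNAL_IP"]

response = requests.post(
    f"http://{external_ip}/ask",
    json={"input": "Tell me a fun fact about hummingbirds"},
    timeout=120,
)
response.raise_for_status()
print(response.text)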
Considerations and enhancements
- Scaling: You can scale your Gemma deployment independently of your LangChain application based on the load generated by the model; see the autoscaling sketch after this list.
- Monitoring: Use Cloud Monitoring and Cloud Logging to track the performance of both Gemma and your LangChain application. Look for error rates, latency, and resource utilization.
- Fine-tuning: Consider fine-tuning Gemma on your own dataset to improve its performance on your specific use case.
- Security: Implement appropriate security measures, such as network policies and authentication, to protect your Gemma instance.
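For the scaling point above, a HorizontalPodAutoscaler is a natural fit for the LangChain frontend. The sketch below is illustrative: the langchain-hpa name is arbitrary, and the CPU-based target assumes you add CPU requests to the langchain-container, which the HPA needs in order to compute utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langchain-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langchain-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # Scale out when average CPU crosses 70%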
Conclusion
Deploying Gemma on GKE and integrating it with LangChain provides a powerful and flexible way to build AI-powered applications. You gain fine-grained control over your model and infrastructure while still leveraging the developer-friendly features of LangChain. This approach allows you to tailor your setup to your specific needs, whether it's optimizing for performance, cost, or control.
Next steps:
- Explore the Gemma documentation for more details on the model and its capabilities.
- Check out the LangChain documentation for advanced use cases and integrations.
- Dive deeper into GKE documentation for running production workloads.
In the next post, we will take a look at how to streamline LangChain deployments using LangServe.