In my previous posts, we explored how LangChain simplifies AI application development and how to deploy Gemini-powered LangChain applications on GKE. Now, let's take a look at a slightly different approach: running your own instance of Gemma, Google's open large language model, directly within your GKE cluster and integrating it with LangChain.
Why choose Gemma on GKE?
While using an LLM endpoint like Gemini is convenient, running an open model like Gemma 2 on your GKE cluster can offer several advantages:
- Control: You have complete control over the model, its resources, and its scaling. This is particularly important for applications with strict performance or security requirements.
- Customization: You can fine-tune the model on your own datasets to optimize it for specific tasks or domains.
- Cost optimization: For high-volume usage, running your own instance can potentially be more cost-effective than using the API.
- Data locality: Keep your data and model within your controlled environment, which can be crucial for compliance and privacy.
- Experimentation: You can experiment with the latest research and techniques without being limited by the API's features.
Deploying Gemma on GKE
Deploying Gemma on GKE involves several steps, from setting up your GKE cluster to configuring LangChain to use your Gemma instance as its LLM.
Set up credentials
To use the Gemma 2 model, you first need a Hugging Face account. Create one if you don't already have one, then generate an access token with read permissions from your settings page. Make sure to note down the token value; we'll need it shortly.
Then, go to the model consent page and accept the terms and conditions for using the Gemma 2 model. Once that is done, we're ready to deploy our open model.
Set up your GKE Cluster
If you don't already have a GKE cluster, you can create one through the Google Cloud console or with the gcloud command-line tool. Make sure to choose a machine type with enough resources to run Gemma, such as the g2-standard family, which comes with an attached NVIDIA L4 GPU. To keep things simple, we'll create a GKE Autopilot cluster, which provisions nodes automatically based on the workload's requests.
gcloud container clusters create-auto langchain-cluster \
--project=PROJECT_ID \
--region=us-central1
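Once the cluster is ready, fetch its credentials so that the kubectl commands in the next steps target it:
gcloud container clusters get-credentials langchain-cluster \
    --region=us-central1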
Deploy a Gemma 2 instance
For this example, we'll deploy an instruction-tuned instance of Gemma 2 using a vLLM image. The following manifest describes a Deployment and corresponding Service for the gemma-2-2b-it model, along with a Secret holding your Hugging Face token. Replace HUGGINGFACE_TOKEN with the token you generated earlier.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2-2b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: model-garden
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250114_0916_RC00_maas
        resources:
          requests:
            cpu: 2
            memory: 34Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1 # One NVIDIA L4 GPU (selected via the nodeSelector below)
          limits:
            cpu: 2
            memory: 34Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1
        # Start the vLLM API server and serve the instruction-tuned Gemma 2 2B model
        args:
        - python
        - -m
        - vllm.entrypoints.api_server
        - --host=0.0.0.0
        - --port=8000
        - --model=google/gemma-2-2b-it
        - --tensor-parallel-size=1
        - --swap-space=16
        - --gpu-memory-utilization=0.95
        - --enable-chunked-prefill
        - --disable-log-stats
        env:
        - name: MODEL_ID
          value: google/gemma-2-2b-it
        - name: DEPLOY_SOURCE
          value: "UI_NATIVE_MODEL"
        # Hugging Face token used to download the model weights
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      # Shared-memory volume backed by RAM, used by vLLM
      - name: dshm
        emptyDir:
          medium: Memory
      # Schedule onto a node with an NVIDIA L4 GPU
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  hf_api_token: HUGGINGFACE_TOKEN # Replace with your Hugging Face token
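One note on the Secret at the end of this manifest: if you'd rather not keep your token in a YAML file, you can drop that block and create the secret directly with kubectl before applying the manifest. A minimal sketch:
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=HUGGINGFACE_TOKEN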
Save the manifest to a file called gemma-2-deployment.yaml, then deploy it to your cluster:
kubectl apply -f gemma-2-deployment.yaml
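vLLM needs a few minutes to pull the image, download the model weights, and start serving. You can keep an eye on the rollout with kubectl; the exact log output depends on the image version, so treat the test request below as a sketch of the vLLM /generate API rather than a guaranteed contract:
# Watch the Gemma pod until it is Running and Ready
kubectl get pods -l app=gemma-server --watch

# Tail the inference server logs to confirm the model has loaded
kubectl logs -f -l app=gemma-server

# Optional smoke test: port-forward the service, then send a prompt from another terminal
kubectl port-forward service/llm-service 8000:8000
curl -X POST -H "Content-Type: application/json" \
    -d '{"prompt": "Why is the sky blue?", "max_tokens": 64}' \
    http://localhost:8000/generate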
Deploying LangChain on GKE
Now that we have our GKE cluster and Gemma deployed, we need to create our LangChain application and deploy it. If you've followed my previous post, you'll notice that these steps are very similar. The main differences are that we're pointing LangChain at Gemma instead of Gemini, and that our LangChain application uses a custom LLM class to call our local instance of Gemma.
Containerize your LangChain application
First, we need to package our LangChain application into a Docker container. This involves writing the application itself, a Dockerfile that specifies its environment, and a requirements file listing its dependencies. Here is the Python application using LangChain and Gemma, which we'll save as app.py:
from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.prompts import ChatPromptTemplate
from typing import Any, Optional
from flask import Flask, request
import requests


class VLLMServerLLM(LLM):
    """Custom LangChain LLM that calls the vLLM server running in the cluster."""

    vllm_url: str
    model: Optional[str] = None
    temperature: float = 0.0
    max_tokens: int = 2048

    @property
    def _llm_type(self) -> str:
        return "vllm_server"

    def _call(
        self,
        prompt: str,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        headers = {"Content-Type": "application/json"}
        payload = {
            "prompt": prompt,
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
            **kwargs
        }
        if self.model:
            payload["model"] = self.model

        try:
            response = requests.post(self.vllm_url, headers=headers, json=payload, timeout=120)
            response.raise_for_status()

            # The server is expected to return generations under a "predictions" key
            json_response = response.json()
            if isinstance(json_response, dict) and "predictions" in json_response:
                text = json_response["predictions"][0]
            else:
                raise ValueError(f"Unexpected response format from vLLM server: {json_response}")

            return text
        except requests.exceptions.RequestException as e:
            raise ValueError(f"Error communicating with vLLM server: {e}")
        except (KeyError, TypeError) as e:
            raise ValueError(f"Error parsing vLLM server response: {e}. Response was: {json_response}")


# Point the custom LLM at the in-cluster Gemma service
llm = VLLMServerLLM(vllm_url="http://llm-service:8000/generate", temperature=0.7, max_tokens=512)

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that answers questions about a given topic.",
        ),
        ("human", "{input}"),
    ]
)

# Chain the prompt template with the Gemma-backed LLM
chain = prompt | llm


def create_app():
    app = Flask(__name__)

    @app.route("/ask", methods=['POST'])
    def talkToGemini():
        # Forward the user's question through the chain and return the completion
        user_input = request.json['input']
        response = chain.invoke({"input": user_input})
        return response

    return app


if __name__ == "__main__":
    app = create_app()
    app.run(host='0.0.0.0', port=80)
Then, create a Dockerfile to define how to assemble our image:
# Use an official Python runtime as a parent image
FROM python:3-slim
# Set the working directory in the container
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
# Run app.py when the container launches
CMD [ "python", "app.py" ]
For our dependencies, create the requirements.txt file containing LangChain, the Flask web framework, and the Requests library used by our custom LLM class:
langchain
flask
requests
Finally, build the container image and push it to Artifact Registry. Don't forget to replace PROJECT_ID with your Google Cloud project ID.
# Authenticate with Google Cloud
gcloud auth login
# Create the repository
gcloud artifacts repositories create images \
--repository-format=docker \
--location=us
# Configure Docker authentication for Artifact Registry
gcloud auth configure-docker us-docker.pkg.dev
# Build the image
docker build -t us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1 .
# Push the image
docker push us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1
Once the push completes, your container image is stored in your Artifact Registry repository.
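If you'd prefer not to build the image locally, Cloud Build can build and push it for you. A minimal sketch, assuming the Cloud Build API is enabled on your project:
gcloud builds submit \
    --tag us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1 .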
Deploy to GKE
Create a YAML file with your Kubernetes Deployment and Service manifests. Let's call it deployment.yaml, again replacing PROJECT_ID with your project ID.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langchain-deployment
spec:
  replicas: 3 # Scale as needed
  selector:
    matchLabels:
      app: langchain-app
  template:
    metadata:
      labels:
        app: langchain-app
    spec:
      containers:
      - name: langchain-container
        image: us-docker.pkg.dev/PROJECT_ID/images/my-langchain-app:v1
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: langchain-service
spec:
  selector:
    app: langchain-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer # Exposes the service externally
Apply the manifest to your cluster:
# Get the context of your cluster
gcloud container clusters get-credentials langchain-cluster --region us-central1
# Deploy the manifest
kubectl apply -f deployment.yaml
This creates a deployment with three replicas of your LangChain application and exposes it externally through a load balancer. You can adjust the number of replicas based on your expected load.
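You can confirm that the rollout finished and that the load balancer has been provisioned before moving on:
# Wait until all replicas are available
kubectl rollout status deployment/langchain-deployment

# The EXTERNAL-IP column is populated once the load balancer is ready
kubectl get service langchain-service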
Interact with your deployed application
Once the service is deployed, you can get the external IP address of your application using:
export EXTERNAL_IP=`kubectl get service/langchain-service \
--output jsonpath='{.status.loadBalancer.ingress[0].ip}'`
You can now send requests to your LangChain application running on GKE. For example:
curl -X POST -H "Content-Type: application/json" \
-d '{"input": "Tell me a fun fact about hummingbirds"}' \
http://$EXTERNAL_IP/ask
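You can of course call the endpoint from code as well. Here's a small Python sketch that assumes the EXTERNAL_IP environment variable from the previous step is set:
import os

import requests

# External IP of the langchain-service load balancer (see previous step)
external_ip = os.environ["EXTERNAL_IP"]

response = requests.post(
    f"http://{external_ip}/ask",
    json={"input": "Tell me a fun fact about hummingbirds"},
    timeout=120,
)
response.raise_for_status()
print(response.text)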
Considerations and enhancements
- Scaling: You can scale your Gemma deployment independently of your LangChain application based on the load generated by the model; see the autoscaling sketch after this list.
- Monitoring: Use Cloud Monitoring and Cloud Logging to track the performance of both Gemma and your LangChain application. Look for error rates, latency, and resource utilization.
- Fine-tuning: Consider fine-tuning Gemma on your own dataset to improve its performance on your specific use case.
- Security: Implement appropriate security measures, such as network policies and authentication, to protect your Gemma instance.
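For the scaling point above, a HorizontalPodAutoscaler is a natural fit for the LangChain frontend. The sketch below is illustrative: the langchain-hpa name is arbitrary, and the CPU-based target assumes you add CPU requests to the langchain-container, which the HPA needs in order to compute utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langchain-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langchain-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # Scale out when average CPU crosses 70%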
Conclusion
Deploying Gemma on GKE and integrating it with LangChain provides a powerful and flexible way to build AI-powered applications. You gain fine-grained control over your model and infrastructure while still leveraging the developer-friendly features of LangChain. This approach allows you to tailor your setup to your specific needs, whether it's optimizing for performance, cost, or control.
Next steps:
- Explore the Gemma documentation for more details on the model and its capabilities.
- Check out the LangChain documentation for advanced use cases and integrations.
- Dive deeper into GKE documentation for running production workloads.
In the next post, we will take a look at how to streamline LangChain deployments using LangServe.