Timur Galeev for AWS Community Builders

Posted on • Originally published at tgaleev.com

Building an AI-Optimized Platform on Amazon EKS with NVIDIA NIM and OpenAI Models

Introduction

The rise of artificial intelligence (AI) has brought about an unprecedented demand for infrastructure that can handle large-scale computations, support GPU acceleration, and provide scalable, flexible management of workloads. Kubernetes has emerged as a leading platform for orchestrating these workloads, and Amazon Elastic Kubernetes Service (EKS) extends Kubernetes’ capabilities by simplifying deployment and scaling in the cloud.

NVIDIA NIM (NVIDIA Inference Microservices) complements Kubernetes by optimizing GPU workloads, a critical need for serving large language models (LLMs), computer vision, and other computationally intensive AI tasks. Additionally, OpenAI models can be integrated into this ecosystem to unlock cutting-edge AI capabilities, such as text generation, image recognition, and decision-making systems.

This article provides an in-depth guide to building a complete AI platform using EKS, NVIDIA NIM, and OpenAI models, with Terraform automating the deployment. Whether you are an AI researcher or a business looking to adopt AI, this guide outlines how to build a robust and scalable platform. Complete code for this setup is available on GitHub https://github.com/timurgaleev/eks-nim-llm-openai.

Why Choose NVIDIA NIM and EKS for AI Workloads?

Challenges of AI Workloads

AI applications, especially those involving LLMs, have unique challenges:

GPU Resource Management: Training and inference rely on GPUs, which are scarce and expensive resources. Efficient allocation and monitoring are crucial.

Scalability: AI workloads often need to scale dynamically based on user demand or data processing requirements.

Storage for Large Datasets: AI models and datasets can require hundreds of gigabytes, necessitating persistent, shared, and scalable storage.

Observability: Monitoring system performance, especially GPU utilization and latency, is essential for optimizing workloads.

NVIDIA NIM: A Solution for GPU Workloads

NVIDIA NIM addresses these challenges by providing:

GPU Scheduling: Maximizes GPU usage across workloads.

Integration with Kubernetes: Leverages Kubernetes to manage pods, jobs, and resources efficiently.

AI Model Management: Simplifies deployment and scaling of AI models with Helm charts and Kubernetes CRDs (Custom Resource Definitions).

Support for Persistent Storage: Integrates with shared storage solutions like AWS EFS for storing datasets and models.

Amazon EKS: A Scalable Kubernetes Solution

Amazon EKS adds value by:

Managed Kubernetes: Reduces operational overhead by handling Kubernetes cluster setup, updates, and management.

Elastic Compute Integration: Dynamically provisions GPU-enabled instances, such as g4dn and p4d, to handle AI workloads. Ensure that your AWS account has sufficient quotas and availability for these instance types to avoid provisioning issues.

Built-in Security: Integrates with AWS IAM and VPC for secure access and network segmentation.

Together, NVIDIA NIM and Amazon EKS create a powerful platform for AI model training, inference, and experimentation.

Architecture Overview

The platform architecture integrates NVIDIA NIM and OpenAI models into an EKS cluster, combining compute, storage, and monitoring components.

Key Components

EKS Cluster: Manages Kubernetes workloads and scales GPU-enabled nodes.

Karpenter: Dynamically provisions and scales nodes (CPU and GPU) based on workload demands, optimizing resource utilization and cost.

GPU Node Groups: Nodes equipped with NVIDIA GPUs for ML and AI inference tasks.

NVIDIA NIM: Deploys GPU workloads, manages AI pipelines, and integrates with Kubernetes.

OpenAI Web UI: Provides a user-friendly interface for interacting with AI models.

Persistent Storage: AWS EFS supports shared storage for datasets and models.

Observability Tools: Prometheus and Grafana offer real-time monitoring of system metrics, including GPU utilization and pod performance.

Deployment Guide

This guide provides step-by-step instructions to deploy the architecture using Terraform. While the focus is on essential components like EKS, GPU workloads, and observability, we skip detailed VPC configuration to allow flexibility based on your specific requirements.

For a VPC example that fits this deployment, refer to the repository: https://github.com/timurgaleev/eks-nim-llm-openai.

Step 1: Provisioning the EKS Cluster

Provisioning an Amazon EKS cluster is the foundation for Kubernetes workloads. Below is the EKS cluster configuration, with key highlights covering scalability, system add-ons, and Karpenter integration.

EKS Cluster Configuration

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.15"

  cluster_name                   = local.name
  cluster_version                = var.eks_cluster_version
  cluster_endpoint_public_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = compact([
    for subnet_id, cidr_block in zipmap(module.vpc.private_subnets, module.vpc.private_subnets_cidr_blocks) :
    substr(cidr_block, 0, 4) == "100." ? subnet_id : null
  ])

  manage_aws_auth_configmap = true
  aws_auth_roles = [
    {
      rolearn  = module.eks_blueprints_addons.karpenter.node_iam_role_arn
      username = "system:node:{{EC2PrivateDNSName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes"
      ]
    }
  ]

  eks_managed_node_group_defaults = {
    iam_role_additional_policies = {
      AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    }
    ebs_optimized = true
    block_device_mappings = {
      xvda = {
        device_name = "/dev/xvda"
        ebs = {
          volume_size = 100
          volume_type = "gp3"
        }
      }
    }
  }

  eks_managed_node_groups = {
    core_node_group = {
      name            = "core-node-group"
      description     = "EKS Core node group for hosting system add-ons"
      subnet_ids      = compact([
        for subnet_id, cidr_block in zipmap(module.vpc.private_subnets, module.vpc.private_subnets_cidr_blocks) :
        substr(cidr_block, 0, 4) == "100." ? subnet_id : null
      ])
      ami_type        = "AL2_x86_64"
      instance_types  = ["m5.xlarge"]
      capacity_type   = "SPOT"
      desired_size    = 2
      min_size        = 2
      max_size        = 4
      labels = {
        WorkerType    = "SPOT"
        NodeGroupType = "core"
      }
      tags = merge(local.tags, { Name = "core-node-grp" })
    }
  }
}

Key Highlights

1. Networking:

  • Subnets are filtered to include only CIDR blocks starting with 100. to ensure specific subnet assignment for nodes.

2. IAM and Auth:

  • Integration with Karpenter is configured via the aws_auth_roles block, allowing nodes that Karpenter provisions to authenticate and join the cluster.

3. Managed Node Groups:

  • Core Node Group:

    • Optimized for system-level workloads.
    • Configured with m5.xlarge spot instances for cost efficiency.
    • Labels such as NodeGroupType: core and taints can be used to restrict workloads to this node group.

4. Storage:

  • Nodes are configured with gp3 root volumes (100 GiB) for system usage. Additional storage for workloads should be configured separately.

5. Scaling:

  • Use Karpenter for workload-based scaling instead of additional managed node groups. The eks_managed_node_groups block here is only for critical system workloads.
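The highlights above mention that labels and taints can keep general workloads off the core node group. As a sketch (not part of the repository's configuration), a taint could be added to the core_node_group block like this, with system add-ons carrying a matching toleration:

```hcl
# Hypothetical addition to the core_node_group block above: a taint that
# keeps general workloads off the system nodes unless they tolerate it.
# Add-on pods (CoreDNS, Karpenter, etc.) would need a matching toleration.
taints = {
  addons = {
    key    = "CriticalAddonsOnly"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}
```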

Step 2: Deploying NVIDIA NIM for AI Workloads

Deploying NVIDIA NIM (NVIDIA Inference Microservices) requires configuring persistent storage for large datasets and allocating GPU resources for optimal performance. Here's an expanded guide breaking down the essential steps.

1. Persistent Storage with AWS EFS

AI workloads often require storage that exceeds local node capacity. AWS EFS (Elastic File System) provides a shared and scalable storage solution across multiple pods. Below is the configuration for creating a Persistent Volume Claim (PVC) backed by EFS:

Code: Persistent Volume Claim (PVC)

resource "kubernetes_persistent_volume_claim_v1" "efs_pvc" {
  metadata {
    name      = "efs-storage"
    namespace = "nim"
  }
  spec {
    access_modes       = ["ReadWriteMany"] # Enables sharing storage across multiple pods.
    storage_class_name = "efs"             # Links the PVC to an EFS storage class.
    resources {
      requests = {
        storage = "200Gi" # Reserves 200 GiB of scalable storage.
      }
    }
  }
}

Key Points:

  • Access Mode: "ReadWriteMany" allows simultaneous access by multiple pods, critical for parallel workloads.

  • Storage Class: Must correspond to an EFS provisioner configured in the Kubernetes cluster.

  • Capacity: Start with 200 GiB and scale as per your dataset requirements.
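The "efs" storage class referenced by the PVC must exist in the cluster. A hypothetical StorageClass that would satisfy it, assuming the AWS EFS CSI driver is installed and using dynamic provisioning via EFS access points, might look like:

```yaml
# Hypothetical StorageClass backing the "efs" storage_class_name above.
# Assumes the AWS EFS CSI driver is installed; replace fileSystemId with
# the ID of your own EFS file system.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap          # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0  # placeholder - use your EFS file system ID
  directoryPerms: "700"
```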

2. Deploying NVIDIA NIM Helm Chart

After configuring storage, deploy NVIDIA NIM using Helm. The Helm chart simplifies GPU allocation and links the persistent storage to NIM-managed workloads.

Configure the NGC API Key

Before deploying NVIDIA NIM, you need to retrieve your NGC API Key from NVIDIA’s cloud platform and set it as an environment variable. This key enables secure authentication with NVIDIA’s container registry and services.

Steps to Retrieve the NGC API Key:

  1. Log in to your NGC account.

  2. Navigate to Setup > API Keys.

  3. Click Generate API Key if you don’t already have one.

  4. Copy the generated key to use in your deployment process.

Set the NGC API Key as an Environment Variable:

Run the following command in your terminal to make the key accessible to Terraform during deployment:

export TF_VAR_ngc_api_key=<replace-me>

Replace <replace-me> with your actual API key. This key will be passed to NVIDIA NIM to enable seamless model deployment.

Code: Helm Release for NVIDIA NIM

resource "helm_release" "nim_llm" {
  name      = "nim-llm"
  chart     = "./nim-llm"                # Points to the NIM Helm chart location.
  namespace = "nim"
  values = [
    templatefile("nim-llm-values.yaml", {
      model_id    = var.model_id            # Specifies the LLM model (e.g., GPT-like models).
      num_gpu     = var.num_gpu             # Allocates GPU resources for inference tasks.
      ngc_api_key = var.ngc_api_key
      pvc_name    = kubernetes_persistent_volume_claim_v1.efs_pvc.metadata[0].name
    })
  ]
}

Key Points:

  • model_id: The identifier of the model being deployed (e.g., a Llama- or Mistral-family model from the NGC catalog).

  • num_gpu: Configures GPU resources for inference tasks. The value should align with the instance type used in your cluster (e.g., g4dn.xlarge for one GPU).

  • pvc_name: Links the EFS-backed PVC to the workload for storing large datasets or models.
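The authoritative nim-llm-values.yaml lives in the repository; as a purely hypothetical illustration of how the variables passed through templatefile() might surface (the key names here are not the chart's actual schema), the rendered values could resemble:

```yaml
# Hypothetical shape of nim-llm-values.yaml after templating. Key names are
# illustrative only; consult the chart in the repository for the real schema.
model:
  name: ${model_id}           # interpolated by templatefile()
  ngcAPIKey: ${ngc_api_key}   # NGC key from TF_VAR_ngc_api_key
resources:
  limits:
    nvidia.com/gpu: ${num_gpu}  # GPU count per pod
persistence:
  existingClaim: ${pvc_name}    # the EFS-backed PVC from Step 2.1
```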

3. Configuration Highlights

Why Persistent Storage?

  • AI models and datasets are often larger than the node's local storage. Using EFS ensures:

    • Scalability: Adjust storage as required without downtime.
    • High Availability: Accessible across multiple Availability Zones.

GPU Allocation

  • NVIDIA NIM optimizes GPU usage for inference. Use the num_gpu variable to specify the number of GPUs for your workload, ensuring efficient resource utilization.

Summary

  1. Storage Configuration: Use AWS EFS with Kubernetes PVC for shared, scalable storage across pods.

  2. GPU Allocation: NVIDIA NIM enables efficient GPU resource management for AI inference tasks.

  3. Helm Chart Deployment: Leverage Helm for streamlined deployment, linking GPU resources and persistent storage.

Step 3: Adding OpenAI Web UI

The OpenAI Web UI provides an interface for users to interact with deployed AI models.

resource "helm_release" "openai_webui" {
  name       = "openai-webui"
  chart      = "open-webui"
  repository = "https://helm.openwebui.com/"
  namespace  = "openai-webui"
  values = [
    jsonencode({
      replicaCount = 1,
      image = {
        repository = "ghcr.io/open-webui/open-webui"
        tag        = "main"
      }
    })
  ]
}
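To have the Web UI talk to the NIM deployment rather than the hosted OpenAI API, the Open WebUI container's OPENAI_API_BASE_URL environment variable can be pointed at NIM's OpenAI-compatible endpoint. How the chart exposes extra environment variables varies by chart version, and the in-cluster service name and port below are assumptions about the NIM release, so treat this as a sketch:

```yaml
# Hypothetical values fragment: point Open WebUI at the NIM service's
# OpenAI-compatible /v1 endpoint. The service name "nim-llm.nim" and port
# 8000 are assumptions about the NIM Helm release; verify against your cluster.
extraEnvVars:
  - name: OPENAI_API_BASE_URL
    value: "http://nim-llm.nim.svc.cluster.local:8000/v1"
```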

Step 4: Observability with Prometheus, Grafana, and Custom Metrics

Prometheus and Grafana are essential tools for monitoring AI workloads. Prometheus collects resource metrics, including GPU-specific data, while Grafana visualizes these metrics through tailored dashboards. These tools help ensure that AI operations are running smoothly and efficiently.

To extend observability, the Prometheus Adapter is configured with custom rules for tracking AI-specific metrics. Key configurations include:

  • Tracking Active Requests: Using the num_requests_running metric, Prometheus monitors the number of ongoing requests, providing insights into workload intensity.

  • Inference Queue Monitoring: The nv_inference_queue_duration_us metric tracks NVIDIA inference queue times, converted into milliseconds for enhanced readability.

Sample Configuration for Prometheus Adapter:

prometheus:
  url: http://kube-prometheus-stack-prometheus.${prometheus_namespace}
  port: 9090
rules:
  default: false
  custom:
  - seriesQuery: '{__name__=~"num_requests_running"}'
    resources:
      template: <<.Resource>>
    name:
      matches: "num_requests_running"
      as: ""
    metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
  - seriesQuery: 'nv_inference_queue_duration_us{namespace!="", pod!=""}'
    resources:
      overrides:
        namespace:
          resource: "namespace"
        pod:
          resource: "pod"
    name:
      matches: "nv_inference_queue_duration_us"
      as: "nv_inference_queue_duration_ms"
    metricsQuery: 'avg(rate(nv_inference_queue_duration_us{<<.LabelMatchers>>}[1m])/1000) by (<<.GroupBy>>)'

These configurations enable Prometheus to expose meaningful custom metrics that are critical for scaling and optimizing AI workloads. By integrating these metrics into Grafana dashboards, users gain actionable insights into system performance and bottlenecks.
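Once the adapter exposes these metrics through the custom metrics API, they can drive autoscaling as well as dashboards. As a sketch, a HorizontalPodAutoscaler could scale the NIM pods on num_requests_running; the target Deployment name "nim-llm" and the threshold are assumptions, not values from the repository:

```yaml
# Hypothetical HPA consuming the custom metric exposed by the adapter above.
# The Deployment name "nim-llm" and the averageValue threshold are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm
  namespace: nim
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nim-llm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: num_requests_running
        target:
          type: AverageValue
          averageValue: "5"   # scale out when pods average >5 in-flight requests
```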

Step 5: Scaling and Optimization with Karpenter

In large-scale AI deployments, workload demands fluctuate significantly. Dynamic scaling is essential for managing these workloads effectively while minimizing costs. Karpenter, a Kubernetes-native cluster autoscaler, provides powerful mechanisms for optimizing resource utilization. It dynamically provisions nodes tailored to the specific demands of applications, including GPU-heavy AI workloads.

This section integrates Karpenter into the EKS Blueprint framework, highlighting its configuration for both CPU and GPU workloads. The full implementation and configurations are available in the repository: https://github.com/timurgaleev/eks-nim-llm-openai.

Deploying Karpenter with EKS Blueprints

Karpenter is added to the EKS cluster as a Blueprint add-on. Below is an example of the configuration block for enabling Karpenter, focusing on both CPU and GPU workload optimization:

module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.2"

  enable_karpenter                  = true
  karpenter_enable_spot_termination = true
  karpenter_node = {
    iam_role_additional_policies = {
      AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    }
  }
  karpenter = {
    chart_version = "0.37.0"
  }
}

This configuration enables Karpenter with support for Spot instance termination handling and assigns additional IAM policies for managing nodes.

Configuring Karpenter for CPU and GPU Workloads

For effective scaling, Karpenter relies on NodePool and EC2NodeClass configurations tailored to workload requirements. The following examples show how Karpenter dynamically provisions CPU and GPU nodes.

CPU Workloads

name: cpu-karpenter
clusterName: ${module.eks.cluster_name}
ec2NodeClass:
  karpenterRole: ${split("/", module.eks_blueprints_addons.karpenter.node_iam_role_arn)[1]}
  subnetSelectorTerms:
    id: ${module.vpc.private_subnets[2]}
  securityGroupSelectorTerms:
    tags:
      Name: ${module.eks.cluster_name}-node
  instanceStorePolicy: RAID0

nodePool:
  labels:
    - type: karpenter
    - NodeGroupType: cpu-karpenter
  requirements:
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["m5"]
    - key: "karpenter.k8s.aws/instance-size"
      operator: In
      values: ["xlarge", "2xlarge", "4xlarge"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 180s
    expireAfter: 720h
  weight: 100

GPU Workloads

name: gpu-workloads
clusterName: ${module.eks.cluster_name}
ec2NodeClass:
  karpenterRole: ${split("/", module.eks_blueprints_addons.karpenter.node_iam_role_arn)[1]}
  subnetSelectorTerms:
    id: ${module.vpc.private_subnets[1]}
  securityGroupSelectorTerms:
    tags:
      Name: ${module.eks.cluster_name}-node
  instanceStorePolicy: RAID0

nodePool:
  labels:
    - type: karpenter
    - NodeGroupType: gpu-workloads
  requirements:
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["g5", "p4", "p5"]  # GPU instances
    - key: "karpenter.k8s.aws/instance-size"
      operator: In
      values: ["2xlarge", "4xlarge", "8xlarge", "12xlarge"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 180s
    expireAfter: 720h
  weight: 100
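A workload lands on the GPU pool by selecting the labels defined in the nodePool above and requesting a GPU. As a sketch (assuming the NVIDIA device plugin is running on GPU nodes so that nvidia.com/gpu is an allocatable resource), a smoke-test pod might look like:

```yaml
# Sketch: targeting the gpu-workloads node pool defined above. The label key
# NodeGroupType matches the nodePool labels; nvidia.com/gpu assumes the
# NVIDIA device plugin is active on the provisioned GPU nodes.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  nodeSelector:
    NodeGroupType: gpu-workloads
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]       # prints GPU details if scheduling worked
      resources:
        limits:
          nvidia.com/gpu: "1"
  restartPolicy: Never
```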

Terraform Automation Scripts

To streamline the deployment and teardown of resources, the project includes two utility scripts: install.sh and cleanup.sh.

install.sh: Automates the deployment process. It initializes Terraform, applies modules sequentially (e.g., VPC and EKS), and ensures all resources are provisioned successfully. A final Terraform apply captures any remaining dependencies.

cleanup.sh: Safely destroys the deployed infrastructure. It handles dependencies like Kubernetes services, Load Balancers, and Security Groups, ensuring proper teardown order. Each module is destroyed sequentially, with a final pass to catch residual resources.

These scripts enhance operational efficiency and minimize errors during deployment and cleanup phases, making the workflow more robust and reproducible.

Key Features of Karpenter in AI Ecosystems

  1. Dynamic Node Provisioning: Automatically provisions CPU or GPU nodes based on real-time workload needs.

  2. Cost Optimization: Leverages Spot instances while ensuring reliable on-demand scaling for critical workloads.

  3. Enhanced Resource Utilization: Consolidates underutilized nodes and removes idle resources with disruption policies.

  4. Tailored Scaling Policies: Supports node pools for diverse workload types, such as inference tasks or data preprocessing.

Karpenter’s integration with GPU-optimized workloads ensures that demanding AI models benefit from high-performance compute nodes while maintaining cost efficiency.

Use Cases

1. AI Model Training

NVIDIA NIM’s GPU optimizations allow for efficient training of models like BERT or GPT, reducing runtime and costs.

2. Real-Time Inference

Deploy models for real-time applications such as fraud detection, image recognition, or natural language understanding.

3. Experimentation and Research

With the OpenAI Web UI, data scientists can quickly test and iterate on models.

Conclusion

This platform enables the scalable and efficient deployment of AI workloads by integrating NVIDIA NIM with Amazon EKS. Terraform automates the process, ensuring repeatable and reliable setups. With GPU optimization, persistent storage, and observability tools, the platform is well-suited for businesses and researchers alike.

By following this guide, you can build a scalable and efficient AI platform. For detailed code and further exploration, visit the GitHub repository https://github.com/timurgaleev/eks-nim-llm-openai.
