Keeping a service available across multiple regions matters for any production workload, and AI inference is no exception. Two recent additions to the Kubernetes ecosystem, Dynamic Resource Allocation (DRA) and the Inference Gateway, provide building blocks for managing AI inference workloads. This article examines an experimental setup that uses both on Google Cloud.
The experiment deploys a large language model (Gemma 3) across two Google Kubernetes Engine (GKE) clusters located in different regions. TPUs serve the model in each cluster, and a Multi-cluster Inference Gateway spreads traffic between them so that the service stays available and accelerator capacity is used well.
Key Components
The following tools and features were utilized in this experiment:
- Google Kubernetes Engine (GKE) Managed DRANET: A managed feature, built on Kubernetes Dynamic Resource Allocation (DRA), that lets Pods request and allocate networking resources on accelerator nodes. It supports both GPUs and TPUs.
- Multi-cluster GKE Inference Gateway: Balances AI/ML inference traffic across multiple clusters and fails over between them.
- Cloud Storage FUSE: Mounts a Cloud Storage bucket as a file system, so data, models, and logs can be read and written in place. Pods load model weights from the bucket instead of from the container image, which speeds up deployment.
- Virtual Private Cloud (VPC): Provides secure communication for internal load balancers and compute nodes.
- GKE Fleets: Groups separate regional clusters under unified management.
- TPU v6e: Google's custom AI accelerators, used here to serve the model.
Deployment Strategy
The objective was to deploy the Gemma 3 model on two GKE clusters, each using four TPU v6e chips, with the model weights stored in Cloud Storage. The GKE Inference Gateway was configured to route each request to the nearest region and to fail over to the other region if one becomes unavailable.
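The routing goal can be modelled in a few lines. This is only a conceptual sketch of "nearest healthy region wins", not how the Gateway is implemented; the region names and latency figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    region: str        # hypothetical region label
    latency_ms: float  # hypothetical client-to-region latency
    healthy: bool


def pick_backend(backends):
    """Return the healthy backend with the lowest latency, or None if all are down."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.latency_ms)


primary = Backend("region-a", 12.0, True)
secondary = Backend("region-b", 85.0, True)

# Normal operation: the closer region serves the request.
normal_choice = pick_backend([primary, secondary])

# Simulated outage: the primary is unhealthy, so traffic fails over.
failed_primary = Backend("region-a", 12.0, False)
failover_choice = pick_backend([failed_primary, secondary])

print(normal_choice.region, failover_choice.region)
```

The same two cases, normal operation and a primary outage, are the ones the failover test later in the article exercises against the real Gateway.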
Setting Up the Environment
To access TPUs across regions, the following steps were taken:
- Create a standard VPC with appropriate firewall rules and subnets.
- Create a proxy-only subnet in each region for the internal Application Load Balancer.
- Set up firewall rules to allow traffic and health checks.
- Reserve static internal IP addresses for the Gateway in both regions.
- Provision a Cloud Storage FUSE bucket and configure a dedicated IAM Service Account.
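Before creating the subnets, it can help to sanity-check the address plan. The sketch below uses Python's standard ipaddress module to confirm that the node subnets and proxy-only subnets in the two regions do not overlap, and that a reserved Gateway address lies inside its subnet. All subnet names and CIDR ranges are hypothetical; the article does not state the ranges that were used.

```python
import ipaddress
from itertools import combinations

# Hypothetical address plan: one node subnet and one proxy-only subnet per region.
PLAN = {
    "region-a-nodes": "10.0.0.0/20",
    "region-a-proxy-only": "10.0.16.0/23",
    "region-b-nodes": "10.1.0.0/20",
    "region-b-proxy-only": "10.1.16.0/23",
}


def find_overlaps(plan):
    """Return sorted pairs of subnet names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return sorted(
        (a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])
    )


def gateway_ip_in_subnet(ip, cidr):
    """Check that a reserved Gateway address falls inside its regional subnet."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


overlaps = find_overlaps(PLAN)
print("overlapping subnets:", overlaps)
print("gateway IP ok:", gateway_ip_in_subnet("10.0.0.10", PLAN["region-a-nodes"]))
```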
Creating GKE Clusters
Next, two GKE clusters were deployed:
- Enable the Gateway API and Cloud Storage FUSE CSI driver during cluster creation.
- Create dedicated TPU v6e node pools for both clusters.
- Activate managed DRANET on the TPU node pools with specific flags.
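As a rough illustration of the first two steps, the following sketch assembles gcloud commands as argument lists so they can be reviewed before being run. The cluster names, regions, zones, and project ID are hypothetical, and the machine type ct6e-standard-4t is an assumption chosen to match four TPU v6e chips per cluster. The flags that activate managed DRANET are not named in this article, so they are deliberately left out rather than guessed.

```python
import shlex


def cluster_create_cmd(name, region, project):
    """Build a cluster-create command with the Gateway API and the GCS FUSE CSI driver."""
    return [
        "gcloud", "container", "clusters", "create", name,
        f"--project={project}",
        f"--location={region}",
        "--gateway-api=standard",
        "--addons=GcsFuseCsiDriver",
        # The Cloud Storage FUSE CSI driver relies on Workload Identity.
        f"--workload-pool={project}.svc.id.goog",
    ]


def tpu_node_pool_cmd(name, cluster, region, zone, project):
    """Build a node-pool-create command for a single TPU v6e host (assumed machine type)."""
    return [
        "gcloud", "container", "node-pools", "create", name,
        f"--cluster={cluster}",
        f"--project={project}",
        f"--location={region}",
        f"--node-locations={zone}",
        "--machine-type=ct6e-standard-4t",  # assumption: 4 TPU v6e chips per host
        "--num-nodes=1",
    ]


# Hypothetical names and locations for the two clusters.
CLUSTERS = [
    ("inference-a", "region-a", "region-a-zone"),
    ("inference-b", "region-b", "region-b-zone"),
]

commands = []
for cluster, region, zone in CLUSTERS:
    commands.append(cluster_create_cmd(cluster, region, "example-project"))
    commands.append(tpu_node_pool_cmd("tpu-pool", cluster, region, zone, "example-project"))

for cmd in commands:
    print(shlex.join(cmd))
```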
Establishing Global Mesh
The clusters were registered to a unified GKE Fleet:
- Enable multi-cluster Service discovery and multi-cluster ingress on the fleet.
- Designate the cluster in the primary region as the configuration hub that holds the routing rules.
Deploying the AI Workload
A temporary Kubernetes Job was used to download the Gemma 3 model weights into the Cloud Storage bucket. A ResourceClaimTemplate was then defined to request devices from the managed DRANET device classes, so that each serving Pod gets its own claim generated from the template.
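A ResourceClaimTemplate for this step might look roughly like the manifest built below. This is a sketch under assumptions: it uses the resource.k8s.io/v1beta1 schema (newer API versions nest the device class differently), and the template name and device class name are placeholders, because the article does not list the actual managed DRANET device classes. The manifest is emitted as JSON, which kubectl accepts alongside YAML.

```python
import json


def resource_claim_template(name, device_class_name, request_name="net"):
    """Build a ResourceClaimTemplate manifest (resource.k8s.io/v1beta1 schema assumed)."""
    return {
        "apiVersion": "resource.k8s.io/v1beta1",
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {
            "spec": {
                "devices": {
                    "requests": [
                        {"name": request_name, "deviceClassName": device_class_name}
                    ]
                }
            }
        },
    }


def pod_claim_fragment(claim_name, template_name):
    """Pod spec fields that attach a claim generated from the template."""
    return {
        "resourceClaims": [
            {"name": claim_name, "resourceClaimTemplateName": template_name}
        ]
    }


# Placeholder names: substitute the device class that managed DRANET exposes.
template = resource_claim_template("tpu-net-claims", "example-dranet-device-class")
fragment = pod_claim_fragment("tpu-net", "tpu-net-claims")

print(json.dumps(template, indent=2))
```

Each container that needs the device would also list the claim name under resources.claims in its own spec.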
Configuring the Inference Gateway
The Multi-Cluster Inference Gateway was set up with necessary Custom Resource Definitions (CRDs) for routing:
- Deploy an AutoscalingMetric to monitor hardware utilization.
- Group AI deployments into a single InferencePool using Helm.
- Deploy the Cross-Region Gateway and configure HTTP routes for global traffic management.
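The routing step can be sketched as a Gateway API HTTPRoute that forwards traffic to the pool. The route, Gateway, and pool names are hypothetical. The backend group and kind are also assumptions: they default here to an InferencePool in the inference.networking.k8s.io group, but a multi-cluster setup may reference an imported pool resource instead, so both are left as parameters.

```python
import json


def http_route(name, gateway_name, pool_name,
               pool_group="inference.networking.k8s.io",  # assumption
               pool_kind="InferencePool",                 # assumption
               path_prefix="/"):
    """Build an HTTPRoute that sends matching requests to an inference pool."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name},
        "spec": {
            # parentRefs defaults to kind Gateway in the Gateway API group.
            "parentRefs": [{"name": gateway_name}],
            "rules": [
                {
                    "matches": [
                        {"path": {"type": "PathPrefix", "value": path_prefix}}
                    ],
                    "backendRefs": [
                        {"group": pool_group, "kind": pool_kind, "name": pool_name}
                    ],
                }
            ],
        },
    }


# Hypothetical resource names.
route = http_route("gemma-route", "cross-region-gateway", "gemma-pool")
print(json.dumps(route, indent=2))
```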
Testing Failover
To validate the architecture, a primary region outage was simulated. The Gateway successfully rerouted user requests to the secondary cluster, ensuring uninterrupted service availability.
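A failover check of this kind can be scripted as a probe loop that records which region answered each request. In the sketch below the real endpoint is replaced by a stand-in function, and it assumes that a response can be attributed to a serving region; how to do that is not described in the article. The numbers are illustrative only and are not results from the experiment.

```python
from collections import Counter


def run_probe(send, attempts):
    """Call send() repeatedly; tally which region answered and count failures."""
    served = Counter()
    failures = 0
    for i in range(attempts):
        try:
            served[send(i)] += 1
        except ConnectionError:
            failures += 1
    return served, failures


def fake_gateway(outage_start):
    """Stand-in for the real endpoint: the primary answers until the outage begins."""
    def send(i):
        return "region-a" if i < outage_start else "region-b"
    return send


served, failures = run_probe(fake_gateway(outage_start=6), attempts=10)
print(dict(served), "failures:", failures)
```

Against a real deployment, send would issue an inference request to the Gateway address, and a non-zero failure count during the switchover would show how much traffic was dropped before rerouting took effect.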
Conclusion
This experiment shows that GKE Managed DRANET and the Multi-cluster Inference Gateway can be combined to serve an AI inference workload from two regions and to keep it available when one region goes down. For those interested in building a similar setup, further resources and hands-on codelabs are available.