Leveraging Google Cloud for Multi-Region AI Workloads with TPUs

Ensuring service availability across multiple regions is crucial for modern workloads. Recent advancements in the Kubernetes ecosystem, specifically Dynamic Resource Allocation (DRA) and the Inference Gateway, provide powerful tools for managing AI inference workloads. This article examines an experimental setup utilizing these capabilities within Google Cloud.

The experiment focuses on deploying a large language model (Gemma 3) across two Google Kubernetes Engine (GKE) clusters located in different regions. Using TPUs and a Multi-cluster Inference Gateway, the setup aims to keep the service available during a regional outage and to make efficient use of accelerator resources.

Key Components

The following tools and features were utilized in this experiment:

Google Kubernetes Engine (GKE) Managed DRANET: A managed feature, built on Dynamic Resource Allocation, that lets Pods request and be allocated networking resources on both GPU and TPU nodes.
Multi-cluster GKE Inference Gateway: Load-balances AI/ML inference requests across multiple clusters and enables failover between them.
Cloud Storage FUSE: Mounts Cloud Storage buckets as a file system, so data, models, and logs can be read and written directly in Cloud Storage, which speeds up deployment.
Virtual Private Cloud (VPC): Provides secure communication for internal load balancers and compute nodes.
GKE Fleets: Groups separate regional clusters under unified management.
TPU v6e: Google's sixth-generation Tensor Processing Units, custom accelerators built for AI training and inference.

Deployment Strategy

The objective was to deploy the Gemma 3 model on two GKE clusters, each using four TPU v6e chips, with the model weights stored in Cloud Storage. The GKE Inference Gateway was configured to route each request to the nearest region and to fail over to the other region if one becomes unavailable.
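The routing intent described above (prefer the nearest region, fall back to the other when it is unhealthy) can be illustrated with a short Python sketch. This is not how the GKE Inference Gateway is implemented; the region names, latency figures, and the pick_region helper are hypothetical and only show the decision being made.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    name: str          # hypothetical region label
    latency_ms: float  # assumed client-to-region latency
    healthy: bool      # result of the most recent health check


def pick_region(regions: Sequence[Region]) -> Optional[Region]:
    """Return the lowest-latency healthy region, or None if all are down."""
    candidates = [r for r in regions if r.healthy]
    return min(candidates, key=lambda r: r.latency_ms, default=None)


regions = [
    Region("region-a", latency_ms=12.0, healthy=True),
    Region("region-b", latency_ms=85.0, healthy=True),
]
normal = pick_region(regions)

# Simulate an outage in the nearest region: traffic should move to region-b.
degraded = [Region("region-a", 12.0, healthy=False), regions[1]]
failover = pick_region(degraded)

print("normal:", normal.name, "| during outage:", failover.name)
```

In the real setup this decision is made by the load balancer, based on its own health checks and proximity data, rather than by client code.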

Setting Up the Environment

To access TPUs across regions, the following steps were taken:

Create a standard VPC with appropriate firewall rules and subnets.
Establish a proxy-only subnet for the regional internal Application Load Balancer.
Set up firewall rules to allow traffic and health checks.
Reserve static internal IP addresses for the Gateway in both regions.
Provision a Cloud Storage bucket to be mounted through Cloud Storage FUSE, and configure a dedicated IAM service account for access to it.
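Before creating the subnets listed above, it can help to confirm that the planned address ranges do not overlap, since subnet ranges within one VPC network must be distinct. The following is a minimal sketch using Python's standard ipaddress module. The subnet names and CIDR ranges are assumptions made for illustration; the article does not state the ranges that were actually used.

```python
import ipaddress
from itertools import combinations

# Hypothetical address plan: one node subnet and one proxy-only subnet per region.
SUBNETS = {
    "nodes-region-a": "10.0.0.0/20",
    "nodes-region-b": "10.0.16.0/20",
    "proxy-only-region-a": "10.129.0.0/23",
    "proxy-only-region-b": "10.130.0.0/23",
}


def find_overlaps(subnets):
    """Return every pair of subnet names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


overlaps = find_overlaps(SUBNETS)
print("overlapping pairs:", overlaps)
```

An empty list means the plan is free of conflicts; any pair that is returned has to be renumbered before the subnets are created.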

Creating GKE Clusters

Next, two GKE clusters were deployed:

Enable the Gateway API and Cloud Storage FUSE CSI driver during cluster creation.
Create dedicated TPU v6e node pools for both clusters.
Activate managed DRANET on the TPU node pools with specific flags.
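The first two steps can be sketched as a small Python helper that assembles the gcloud commands for each region. This is an illustration under assumptions, not the exact commands used in the experiment: the project, cluster, pool, region, and zone names are placeholders, the ct6e-standard-4t machine type is assumed because it provides four TPU v6e chips per host, and Workload Identity Federation is enabled on the assumption that the Cloud Storage FUSE CSI driver needs it. The flags that activate managed DRANET are not given in the source, so they are deliberately left out.

```python
import shlex

TPU_MACHINE_TYPE = "ct6e-standard-4t"  # assumed: TPU v6e host with four chips


def cluster_create_cmd(cluster, region, project):
    """Build a cluster-create command with the Gateway API and Cloud Storage FUSE enabled."""
    return [
        "gcloud", "container", "clusters", "create", cluster,
        f"--project={project}",
        f"--region={region}",
        "--gateway-api=standard",
        "--addons=GcsFuseCsiDriver",
        f"--workload-pool={project}.svc.id.goog",
    ]


def tpu_node_pool_cmd(pool, cluster, region, zone, project):
    """Build a TPU v6e node-pool command (managed DRANET flags intentionally omitted)."""
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--project={project}",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--node-locations={zone}",
        f"--machine-type={TPU_MACHINE_TYPE}",
        "--num-nodes=1",
    ]


# Placeholder values; replace them with real project, region, and zone names.
PROJECT = "PROJECT_ID"
for region, zone in [("REGION_A", "ZONE_A"), ("REGION_B", "ZONE_B")]:
    cluster = f"inference-{region.lower()}"
    print(shlex.join(cluster_create_cmd(cluster, region, PROJECT)))
    print(shlex.join(tpu_node_pool_cmd("tpu-v6e-pool", cluster, region, zone, PROJECT)))
```

The script only prints the commands, so they can be reviewed and adjusted before anything is created.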

Establishing Global Mesh

The clusters were registered to a unified GKE Fleet:

Enable Multi-Cluster Service Discovery and Ingress.
Designate the cluster in the primary region as the config cluster, the hub where Gateway and routing resources are applied.

Deploying the AI Workload

A temporary Kubernetes job was used to download the Gemma 3 model weights into the Cloud Storage bucket. A ResourceClaimTemplate was defined to request managed DRANET device classes.
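As an illustration of the second step, the sketch below builds a ResourceClaimTemplate manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The resource.k8s.io/v1 API version is an assumption about the cluster version, and the template name, request name, and device class name are hypothetical placeholders; the actual device class exposed by managed DRANET should be taken from the GKE documentation.

```python
import json


def resource_claim_template(name, device_class, request_name="net"):
    """Build a ResourceClaimTemplate that requests one device of the given class."""
    return {
        "apiVersion": "resource.k8s.io/v1",  # assumed API version
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {
            "spec": {
                "devices": {
                    "requests": [
                        {
                            "name": request_name,
                            # With no allocationMode set, the API defaults to
                            # exactly one device of this class.
                            "exactly": {"deviceClassName": device_class},
                        }
                    ]
                }
            }
        },
    }


# "example-device-class" is a placeholder, not a real DRANET device class name.
manifest = resource_claim_template("tpu-network-claim", "example-device-class")
print(json.dumps(manifest, indent=2))
```

A Pod would then reference the template by name under spec.resourceClaims, so that each Pod gets its own claim generated from it.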

Configuring the Inference Gateway

The Multi-Cluster Inference Gateway was set up with necessary Custom Resource Definitions (CRDs) for routing:

Deploy an AutoscalingMetric to monitor hardware utilization.
Group AI deployments into a single InferencePool using Helm.
Deploy the Cross-Region Gateway and configure HTTP routes for global traffic management.
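The HTTP route in the last step can be sketched in the same way. The helper below builds a Gateway API HTTPRoute that attaches to a Gateway and forwards a path prefix to a single backend. The route, Gateway, and backend names are hypothetical, and the backend group and kind are passed in as parameters because the source does not say which resource the route targets; in a multi-cluster setup it may be an imported form of the InferencePool rather than the pool itself.

```python
import json


def http_route(name, gateway, backend_group, backend_kind, backend_name, path_prefix="/"):
    """Build an HTTPRoute that sends a path prefix to one backend reference."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name},
        "spec": {
            "parentRefs": [{"name": gateway}],
            "rules": [
                {
                    "matches": [{"path": {"type": "PathPrefix", "value": path_prefix}}],
                    "backendRefs": [
                        {"group": backend_group, "kind": backend_kind, "name": backend_name}
                    ],
                }
            ],
        },
    }


# All names below are placeholders; the backend group and kind are assumptions.
route = http_route(
    name="gemma-route",
    gateway="cross-region-gateway",
    backend_group="inference.networking.k8s.io",
    backend_kind="InferencePool",
    backend_name="gemma-pool",
)
print(json.dumps(route, indent=2))
```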

Testing Failover

To validate the architecture, an outage of the primary region was simulated. The Gateway rerouted user requests to the secondary cluster, and the service remained available.
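One way to quantify such a test is to send probe requests throughout the simulated outage and summarize which region answered each one. The sketch below shows only the summarizing step. The summarize_probes helper and the sample data are hypothetical and synthetic; they are not measurements from the experiment.

```python
from collections import Counter


def summarize_probes(probes):
    """Summarize (ok, served_by) probe results recorded during a failover test."""
    served_by = Counter(region for ok, region in probes if ok)
    failed = sum(1 for ok, _ in probes if not ok)
    return {"total": len(probes), "failed": failed, "served_by": dict(served_by)}


# Synthetic example: three answers from the primary region, one failed request
# during the switch, then four answers from the secondary region.
sample = (
    [(True, "region-a")] * 3
    + [(False, None)]
    + [(True, "region-b")] * 4
)
summary = summarize_probes(sample)
print(summary)
```

A run in which the failed count stays at or near zero while the responses shift from one region to the other is the behavior the failover test is meant to confirm.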

Conclusion

This experiment shows that GKE Managed DRANET and the Multi-cluster Inference Gateway can be combined to serve an AI workload across regions with automatic failover. For those interested in implementing similar setups, further resources and hands-on codelabs are available.