Businesses are increasingly leveraging advanced infrastructure to build and serve AI models effectively. Google Cloud provides the flexibility to tailor AI infrastructure to meet specific workload requirements. A recent experiment involved deploying a model for inference on NVIDIA B200 GPUs on Google Kubernetes Engine (GKE) with GKE managed DRANET.
Understanding DRANET
Dynamic Resource Allocation (DRA) is a Kubernetes feature that lets Pods request and share resources. DRANET builds on DRA by making networking resources allocatable to Pods as well, including Remote Direct Memory Access (RDMA) interfaces, with support for both GPU and TPU workloads. This is particularly beneficial when driving high-end GPUs across nodes.
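As a sketch of what a DRA-style request looks like, the manifest below defines a ResourceClaimTemplate for RDMA interfaces. The `deviceClassName` shown is an assumption; the exact classes exposed by GKE managed DRANET vary by version and can be listed with `kubectl get deviceclasses`.

```yaml
# Illustrative only: a DRA ResourceClaimTemplate requesting RDMA NICs.
# The deviceClassName below is an assumed value, not a confirmed GKE name.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: rdma-interfaces
spec:
  spec:
    devices:
      requests:
      - name: rdma-nics
        deviceClassName: mrdma.google.com   # assumption; check your cluster's classes
        allocationMode: All                 # claim every matching NIC on the node
```

Pods then reference this template through `resourceClaims` in their spec, and DRANET handles attaching the interfaces at scheduling time.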
GPU RDMA VPC Configuration
The RDMA network is established as an isolated regional VPC with a RoCEv2 network profile, dedicated to GPU-to-GPU communication. This setup allows GPUs within VM families equipped with RDMA-capable NICs to communicate efficiently across multiple nodes, leveraging a low-latency, high-speed connection.
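In this experiment GKE managed DRANET created the RDMA VPC automatically, but a manual equivalent would look roughly like the following. The network name, region, IP range, and the exact profile string are placeholders; available profiles can be listed with `gcloud compute network-profiles list`.

```shell
# Sketch: an isolated VPC with a RoCEv2 network profile for GPU-to-GPU traffic.
gcloud compute networks create rdma-vpc \
    --network-profile=ZONE-vpc-roce \
    --subnet-mode=custom

gcloud compute networks subnets create rdma-subnet \
    --network=rdma-vpc \
    --region=REGION \
    --range=10.0.0.0/24
```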
Designing the Deployment
The objective was to deploy a large language model (LLM), DeepSeek, on a GKE cluster featuring A4 nodes, each equipped with eight NVIDIA B200 GPUs, and to expose it via a GKE Inference Gateway. While the Cluster Toolkit can facilitate AI Hypercomputer GKE cluster setups, this experiment specifically tested the dynamic networking capabilities of GKE managed DRANET.
The design incorporates the following components:
- VPC: Three VPCs in total: one created manually, plus one standard VPC and one RDMA VPC generated automatically by GKE managed DRANET.
- GKE: Utilized for workload deployment.
- GKE Inference Gateway: Used to internally expose the workload through a regional internal Application Load Balancer.
- A4 VMs: These support RoCEv2 with NVIDIA B200 GPUs.
Setting Up the Environment
The setup process began with creating a standard VPC, including firewall rules and a subnet in alignment with the reservation zone. A proxy-only subnet was also established for the internal regional application load balancer linked to the GKE Inference gateway.
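The environment setup described above can be sketched with the commands below. All names, ranges, and the region are placeholders, and the firewall rule should be adjusted to your own security policy.

```shell
# Sketch: standard VPC, workload subnet, and firewall rule.
gcloud compute networks create main-vpc --subnet-mode=custom

gcloud compute networks subnets create main-subnet \
    --network=main-vpc --region=REGION --range=10.10.0.0/24

gcloud compute firewall-rules create main-vpc-allow-internal \
    --network=main-vpc --allow=tcp,udp,icmp --source-ranges=10.10.0.0/24

# Proxy-only subnet required by the regional internal Application Load Balancer.
gcloud compute networks subnets create proxy-only-subnet \
    --network=main-vpc --region=REGION \
    --purpose=REGIONAL_MANAGED_PROXY --role=ACTIVE \
    --range=10.10.1.0/24
```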
Creating the GKE Cluster
A standard GKE cluster with a default node pool was created, followed by connecting to the cluster. A GPU node pool was then added using A4 VMs, with specific flags to enable high-performance networking.
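A rough sketch of the cluster and node pool creation follows. The cluster name, region, zone, reservation, and the precise DRANET-related flags are assumptions; consult the GKE managed DRANET documentation for the exact flags your GKE version expects.

```shell
# Sketch: cluster plus an A4 (B200) GPU node pool. Flags are illustrative.
gcloud container clusters create dranet-demo \
    --region=REGION \
    --network=main-vpc --subnetwork=main-subnet \
    --cluster-version=1.34.0-gke.1626000

gcloud container node-pools create a4-pool \
    --cluster=dranet-demo --region=REGION \
    --node-locations=ZONE \
    --machine-type=a4-highgpu-8g \
    --accelerator=type=nvidia-b200,count=8,gpu-driver-version=default \
    --reservation-affinity=specific --reservation=RESERVATION_NAME
```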
Deploying the Model
With the cluster and node pool ready, the next step involved deploying the DeepSeek model and serving it through the Inference gateway. Key actions included:
- Using the nodeSelector to schedule the workload onto the GPU nodes.
- Attaching the defined networking resources through resourceClaims.
A secret was created for authentication, followed by the deployment configuration for the model.
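A minimal sketch of how these pieces fit together in the workload manifest is shown below. The image, labels, claim and template names, and the secret name are placeholders, and the nodeSelector label value is an assumption.

```yaml
# Illustrative Deployment fragment: nodeSelector pins the Pod to the GPU pool,
# and resourceClaims attach the DRANET-managed networking resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-vllm
  template:
    metadata:
      labels:
        app: deepseek-vllm
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200   # assumed label value
      resourceClaims:
      - name: rdma-net                                  # placeholder claim name
        resourceClaimTemplateName: rdma-interfaces      # placeholder template name
      containers:
      - name: model-server
        image: MODEL_SERVER_IMAGE                       # placeholder image
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret                           # assumed secret name
              key: token
        resources:
          limits:
            nvidia.com/gpu: 8
          claims:
          - name: rdma-net
```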
Implementing the GKE Inference Gateway
The installation required Custom Resource Definitions (CRDs) in the GKE cluster. For GKE versions 1.34.0-gke.1626000 or later, only the alpha InferenceObjective CRD was necessary.
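The CRD installation amounts to applying the release manifests from the upstream Gateway API inference extension project; the version and asset path below are placeholders, not a confirmed URL.

```shell
# Sketch: install the inference extension CRDs (version/path are placeholders).
# On GKE 1.34.0-gke.1626000 or later, only the alpha InferenceObjective CRD
# needs to be applied.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/VERSION/manifests.yaml
```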
Creating the Inference Pool
Next, an inference pool was created using Helm, specifying the model server's match labels and enabling monitoring features.
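The Helm step can be sketched as follows. The release name and value keys are assumptions based on the inference extension's chart; the match-label value must equal the label on the model server Deployment.

```shell
# Sketch: create an InferencePool via Helm. Chart location and value names
# are assumptions; verify against the chart's own documentation.
helm install deepseek-pool \
    --set inferencePool.modelServers.matchLabels.app=deepseek-vllm \
    --set provider.name=gke \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```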
Establishing Gateway and Routing
Finally, a Gateway, HTTPRoute, and InferenceObjective were created to manage traffic and performance logic for the deployed model.
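The three objects above can be sketched as the manifests below. Names are placeholders; `gke-l7-rilb` is the GKE gateway class for a regional internal Application Load Balancer, and the InferencePool/InferenceObjective API groups and versions are assumptions that depend on the installed CRD release.

```yaml
# Illustrative manifests for the traffic path to the InferencePool.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: gke-l7-rilb
  listeners:
  - name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: deepseek-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.k8s.io   # may be inference.networking.x-k8s.io on older releases
      kind: InferencePool
      name: deepseek-pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2   # alpha API; version assumed
kind: InferenceObjective
metadata:
  name: deepseek-objective
spec:
  priority: 10          # assumed field; controls request priority
  poolRef:
    name: deepseek-pool
```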
After completing these steps, a test VM could be created in the main VPC to interact with the GKE Inference Gateway.
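From that test VM, a request against the gateway's internal address would look roughly like this; GATEWAY_IP and MODEL_NAME are placeholders, and the request shape assumes an OpenAI-compatible completions endpoint such as vLLM's.

```shell
# Sketch: query the model through the GKE Inference Gateway from inside the VPC.
curl http://GATEWAY_IP/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "MODEL_NAME", "prompt": "Hello", "max_tokens": 32}'
```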
Next Steps
For those interested in further exploring GKE managed DRANET and the GKE Inference Gateway, additional resources include documentation on Dynamic Resource Allocation and AI Hypercomputers.