Transforming Infrastructure: Google Cloud's Role in CNCF's llm-d Project

Google Cloud is committed to meeting the demanding needs of large foundation model builders and AI-native companies. As generative AI moves into critical production settings, these innovators need adaptable, efficient infrastructure to serve large models at scale.

We are excited to announce that llm-d has been recognized as a Cloud Native Computing Foundation (CNCF) Sandbox project. Google Cloud proudly joins Red Hat, IBM Research, CoreWeave, and NVIDIA in this initiative, promoting a vision of any model, any accelerator, any cloud.

This partnership highlights Google’s ongoing commitment to open-source innovation. With the guidance of the Linux Foundation, we aim to ensure that the future of distributed inference is based on open standards, allowing builders to deploy their models globally without vendor lock-in.

Enhancing Kubernetes for Modern Workloads

Kubernetes is the de facto standard for orchestration, but it was not originally designed for the demands of LLM inference. To adapt it for these workloads, we introduced the GKE Inference Gateway, which provides Kubernetes-native APIs that go beyond basic load balancing.

The gateway relies on the llm-d Endpoint Picker (EPP) for intelligent scheduling. Instead of distributing requests blindly, the EPP applies a multi-objective routing policy that weighs real-time cache hit rates against per-replica request load, steering each request to the backend best positioned to serve it.
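
To make that policy concrete, here is a minimal, illustrative sketch of a multi-objective scorer in the same spirit. It is not the actual Endpoint Picker code; the signal names, weights, and saturation threshold are assumptions chosen for readability.

```python
# Toy sketch of cache- and load-aware endpoint scoring.
# All names and weights are illustrative assumptions, not the EPP implementation.
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    kv_cache_hit_rate: float   # fraction of the prompt prefix already cached (0..1)
    queue_depth: int           # requests currently waiting on this replica
    max_queue_depth: int = 32  # assumed saturation point for normalization


def score(ep: Endpoint, cache_weight: float = 0.7, load_weight: float = 0.3) -> float:
    """Higher is better: prefer warm caches, penalize heavily loaded replicas."""
    load_penalty = min(ep.queue_depth / ep.max_queue_depth, 1.0)
    return cache_weight * ep.kv_cache_hit_rate - load_weight * load_penalty


def pick(endpoints: list[Endpoint]) -> Endpoint:
    return max(endpoints, key=score)


if __name__ == "__main__":
    pool = [
        Endpoint("vllm-0", kv_cache_hit_rate=0.85, queue_depth=12),
        Endpoint("vllm-1", kv_cache_hit_rate=0.10, queue_depth=2),
    ]
    print(pick(pool).name)  # vllm-0: the cache hit outweighs its extra load
```

Centralizing this scoring in the gateway means every replica's cache state and load feed into one routing decision, rather than each client guessing independently.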

Our Vertex AI team recently validated this architecture under unpredictable traffic. Time-to-First-Token (TTFT) latency for context-heavy coding tasks dropped by more than 35%, and P95 tail latency for chat workloads improved by 52%.

Robust Orchestration for AI Deployments

To support multi-node deployments, Google is advancing the Kubernetes LeaderWorkerSet (LWS) API. LWS schedules a leader pod and its worker pods as a single group, so compute- and memory-intensive serving workloads that are sharded across hosts can be rolled out, scaled, and recovered as one unit across large fleets of TPUs and GPUs.
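
As a rough sketch of what that looks like in practice, the snippet below creates a LeaderWorkerSet through the Kubernetes Python client. The field layout follows our reading of the leaderworkerset.x-k8s.io/v1 API; the name, image, group size, and TPU resource request are placeholder assumptions, not a recommended configuration.

```python
# Illustrative sketch: submitting a LeaderWorkerSet with the Kubernetes Python client.
# Field names follow the leaderworkerset.x-k8s.io/v1 API as we understand it;
# the image, sizes, and resource requests are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

lws = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "sharded-llm"},
    "spec": {
        "replicas": 2,  # two independent model replicas
        "leaderWorkerTemplate": {
            "size": 4,  # each replica group spans 1 leader + 3 worker pods
            "workerTemplate": {
                "spec": {
                    "containers": [
                        {
                            "name": "inference-worker",
                            "image": "example.com/vllm-server:latest",  # placeholder image
                            # a real TPU workload also needs matching GKE node selectors
                            "resources": {"limits": {"google.com/tpu": "4"}},
                        }
                    ]
                }
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="leaderworkerset.x-k8s.io",
    version="v1",
    namespace="default",
    plural="leaderworkersets",
    body=lws,
)
```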

We have also recently optimized vLLM for Cloud TPUs, delivering up to 5x higher throughput, so AI serving stays efficient and portable whether you run on Google Cloud TPUs or NVIDIA GPUs.
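
From the application's side, the entry point is the same across accelerators. A minimal sketch, assuming a vLLM installation built with TPU (or GPU) support and using an example model name:

```python
# Minimal vLLM offline-generation sketch; the model name is only an example.
# Hardware specifics (TPU vs. GPU) are handled by the installed vLLM backend.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any supported model
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain cache-aware routing in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```

Because the accelerator details sit below this API, the same serving code can move between TPU and GPU node pools without changes.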

Collaborating for Next-Generation Infrastructure

To create a leading AI infrastructure, collaboration between cloud-native orchestration and advanced research is essential. Our partnerships with the Linux Foundation, CNCF, and the PyTorch Foundation are pivotal in developing the next generation of infrastructure.

We are establishing proven, replicable blueprints to support high-performance AI as an open and accessible ecosystem, and we encourage foundation model builders, platform engineers, and researchers to engage with us in shaping the future of inference.

We look forward to advancing llm-d within the CNCF and scaling our efforts together.