Managing generative AI costs effectively without compromising on performance and availability is a critical concern for many organizations. The challenge lies in finding the optimal mix of tools and services that align with specific workload patterns.
This guide delves into Google Cloud's flexible generative AI infrastructure options, aimed at helping users discover the sweet spot between cost and performance. It begins with an overview of the foundational pay-as-you-go (PayGo) models and then discusses how to enhance this strategy with specialized options.
Understanding Pay-as-You-Go (PayGo) Options
Google Cloud's standard PayGo offerings serve as a robust starting point for various workloads. To maximize their potential, it is essential to understand the underlying mechanisms that influence performance and availability.
Dynamic Shared Quota (DSQ)
The PayGo environment operates on the Dynamic Shared Quota (DSQ) principle, which promotes fairness and efficiency by distributing available generative AI capacity among all customers.
- High-priority lane: Requests within a default Tokens Per Second (TPS) threshold receive higher priority and are backed by a 99.5% availability service level objective (SLO).
- Best-effort lane: Requests exceeding the TPS threshold are processed with lower priority when spare capacity is available, preventing immediate drops during traffic spikes.
This system ensures that sudden spikes from one customer do not adversely affect the baseline performance for others, providing reliable service for everyday needs.
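The two-lane behavior described above can be sketched as a simple classifier. The threshold value and lane labels below are illustrative assumptions for this sketch, not actual DSQ parameters:

```python
# Illustrative sketch of DSQ's two-lane prioritization. The threshold
# value and lane names are assumptions for demonstration only, not
# real DSQ parameters.

DEFAULT_TPS_THRESHOLD = 100  # hypothetical Tokens Per Second threshold

def classify_request(current_tps: float) -> str:
    """Assign a request to the high-priority or best-effort lane."""
    if current_tps <= DEFAULT_TPS_THRESHOLD:
        return "high-priority"   # served within the availability SLO
    return "best-effort"         # served only when spare capacity exists

print(classify_request(80))   # steady traffic within the threshold
print(classify_request(250))  # traffic spike above the threshold
```

In practice the classification happens server-side; the point is that only traffic above the threshold competes for spare capacity.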
Usage Tiers: Rewarding Investment
As generative AI usage increases, Google Cloud automatically categorizes organizations into Usage Tiers based on their rolling 30-day expenditure on eligible Vertex AI services. Higher tiers correlate with higher guaranteed Tokens Per Minute (TPM) limits.
| Model Family | Tier | Spend (30 days) | TPM |
|---|---|---|---|
| Pro Models | Tier 1 | $10 – $250 | 500,000 |
| Pro Models | Tier 2 | $250 – $2,000 | 1,000,000 |
| Pro Models | Tier 3 | > $2,000 | 2,000,000 |
| Flash / Flash-Lite Models | Tier 1 | $10 – $250 | 2,000,000 |
| Flash / Flash-Lite Models | Tier 2 | $250 – $2,000 | 4,000,000 |
| Flash / Flash-Lite Models | Tier 3 | > $2,000 | 10,000,000 |
Organizations should view their tier limit as a floor rather than a ceiling: traffic within the limit should see minimal to no 429 (resource exhausted) errors, while traffic above it can still be served on a best-effort basis.
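The tier table translates directly into a lookup. The boundaries and limits below mirror the table above; treat them as illustrative, since Google may update the published numbers:

```python
# Look up the guaranteed TPM floor from the Usage Tier table.
# Values mirror the table in this guide; they are illustrative and
# may change as Google updates published tier limits.

TIER_TPM = {
    "pro":   [(10, 500_000), (250, 1_000_000), (2_000, 2_000_000)],
    "flash": [(10, 2_000_000), (250, 4_000_000), (2_000, 10_000_000)],
}

def tpm_limit(model_family: str, spend_30d: float) -> int:
    """Return the guaranteed TPM for a rolling 30-day spend (0 below Tier 1)."""
    limit = 0
    for min_spend, tpm in TIER_TPM[model_family]:
        if spend_30d >= min_spend:
            limit = tpm
    return limit

print(tpm_limit("pro", 300))      # Tier 2 -> 1,000,000
print(tpm_limit("flash", 5_000))  # Tier 3 -> 10,000,000
```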
Priority PayGo: Flexibility for Spikes
For workloads that face unpredictable spikes, Priority PayGo offers a solution that combines the flexibility of PayGo with the high availability needed for important traffic. By paying a premium, specific API requests can be tagged for higher priority.
To utilize Priority PayGo, simply add a header to API calls; no sign-up is required. However, be mindful of ramp limits: rapid increases in priority traffic may cause requests to be downgraded to standard priority if capacity is constrained.
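Tagging a request is just a matter of attaching the header before sending the call. The header name below is a placeholder, not the real one; consult the official Vertex AI documentation for the exact header and accepted values:

```python
# Sketch of tagging an API request for Priority PayGo.
# "X-Vertex-AI-Priority" is a PLACEHOLDER header name invented for
# this example -- check the Vertex AI docs for the actual header.

def with_priority(headers: dict, priority: str = "high") -> dict:
    """Return a copy of request headers tagged for priority handling."""
    tagged = dict(headers)
    tagged["X-Vertex-AI-Priority"] = priority  # placeholder name
    return tagged

base = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}
print(with_priority(base))
```

Because the tag is per-request, an application can reserve the premium only for the calls that genuinely need it.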
Provisioned Throughput (PT) for Critical Workloads
For organizations with business-critical workloads requiring explicit availability guarantees, Provisioned Throughput (PT) is the recommended option. PT allows for reserving a specific amount of processing capacity for a fixed monthly cost, providing an availability SLA.
PT is ideal for:
- Large, predictable production workloads.
- Applications with strict performance requirements where throttling is not acceptable.
Monitoring and Control of PT Requests
Organizations can monitor their PT usage through Cloud Monitoring metrics, ensuring they receive the value they paid for. Key metrics include dedicated limits in Generative Scale Units (GSUs) and actual throughput usage.
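A basic check is to compare actual throughput against the reserved capacity. The per-GSU throughput figure below is an assumption for illustration; in practice, read both the dedicated limit and the usage from the Cloud Monitoring metrics:

```python
# Hypothetical utilization check for Provisioned Throughput. The
# per-GSU throughput figure is an assumption for this sketch; real
# values come from Cloud Monitoring's limit and usage metrics.

def pt_utilization(used_tokens_per_sec: float,
                   purchased_gsus: int,
                   tokens_per_sec_per_gsu: float) -> float:
    """Fraction of the reserved PT capacity currently in use."""
    capacity = purchased_gsus * tokens_per_sec_per_gsu
    return used_tokens_per_sec / capacity

# e.g. 5 GSUs at an assumed 1,000 tokens/sec each, using 3,500 tokens/sec
print(f"{pt_utilization(3_500, 5, 1_000):.0%}")  # 70%
```

Sustained utilization near 100% suggests buying more GSUs or spilling peaks to Priority PayGo; sustained low utilization suggests the commitment is oversized.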
Combining Options for Optimal Results
To effectively manage costs and performance, organizations can combine different options based on their workload characteristics:
- Provisioned Throughput: Covers predictable, mission-critical baseloads with an availability SLA.
- Priority PayGo: Handles predictable peaks above PT commitments or important variable traffic.
- Standard PayGo: Serves as a foundation for general, non-critical traffic within tier limits.
- Opportunistic Bursting: Utilizes best-effort bursting for non-critical jobs without impacting core user experience.
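The layering above can be expressed as a simple routing rule. The option names and the request traits used here are assumptions made up for this sketch:

```python
# Illustrative request router combining the serving options above.
# The traits and option names are assumptions for this sketch, not a
# real Vertex AI routing API.

def route(criticality: str, within_pt_commitment: bool,
          latency_sensitive: bool) -> str:
    """Pick a serving option for a request based on its traits."""
    if criticality == "mission-critical" and within_pt_commitment:
        return "provisioned-throughput"   # SLA-backed baseload
    if criticality == "mission-critical":
        return "priority-paygo"           # peaks above the PT commitment
    if latency_sensitive:
        return "standard-paygo"           # high-priority lane within tier
    return "best-effort-burst"            # opportunistic spare capacity

print(route("mission-critical", True, True))
print(route("batch-job", False, False))
```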
By understanding and leveraging these tools, organizations can optimize their generative AI strategies for a balance of performance, availability, and cost.
Batch API and Flex PayGo
For workloads that do not require immediate execution, the Batch API allows customers to bundle requests into a single file for asynchronous processing, typically yielding a 50% discount on standard token costs.
Flex PayGo offers a cost-effective alternative for non-critical workloads, providing a 50% discount compared to Standard PayGo. This option is suitable for tasks that can tolerate longer response times, such as offline analysis and data annotation.
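The 50% discount makes the trade-off easy to quantify. The standard per-million-token price below is a made-up placeholder; substitute real figures from the Vertex AI pricing page:

```python
# Back-of-the-envelope comparison using the ~50% discount mentioned
# above. The standard price is a PLACEHOLDER; use real figures from
# the Vertex AI pricing page.

STANDARD_PRICE_PER_M_TOKENS = 1.00  # hypothetical USD per 1M tokens
DISCOUNT = 0.50                     # Batch API / Flex PayGo discount

def monthly_cost(tokens_millions: float, discounted: bool) -> float:
    """Cost of a monthly token volume at standard or discounted rates."""
    rate = STANDARD_PRICE_PER_M_TOKENS * ((1 - DISCOUNT) if discounted else 1)
    return tokens_millions * rate

print(monthly_cost(500, discounted=False))  # standard PayGo
print(monthly_cost(500, discounted=True))   # Batch API / Flex PayGo
```

For latency-tolerant jobs such as offline analysis, the same volume costs half as much, which is why routing them away from Standard PayGo is usually the first optimization.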
Next Steps
- Explore Models in Vertex AI: Discover the range of Google’s first-party models and over 100 open-source models.
- Dive Deeper into Documentation: For the latest technical details and code samples, refer to the official Vertex AI documentation.
- Review Pricing Details: Access detailed breakdowns of token costs and pricing for Provisioned Throughput and Batch APIs.