Google's Approach to Large-Scale A/B Experimentation in Infrastructure

A/B experimentation is commonly associated with user interface tweaks, but at Google it also covers critical infrastructure changes. Optimizing components such as memory allocators and kernel schedulers can save substantial compute resources and reduce latency for millions of users. Because these components are foundational, experimenting on them is risky and calls for a robust methodology.

This article outlines the key principles of Google's A/B experimentation framework, honed over years of practice. It emphasizes four core pillars:

Application-level vs. machine-level experimentation
Maintaining a balanced setup
Ensuring binary hermeticity
Selecting the right performance metrics

Importance of Infrastructure Experiments

Infrastructure experiments target essential components such as operating systems, core libraries, and compilers. These experiments are crucial for unlocking performance gains that application-level work alone cannot reach. Key areas of focus include:

Core Libraries: Optimizations here can significantly impact all binaries across the fleet.
Compiler: Adjustments to compilation flags can yield performance gains without altering source code.
Kernel: Improvements in memory management and scheduling can enhance overall machine efficiency.
Cluster Management: Decisions made by systems like kube-scheduler affect performance and resource utilization.

Measuring Improvement Scale

While large optimizations are easy to quantify, many of Google's infrastructure improvements yield sub-1% gains. Although these changes may seem minor individually, they accumulate over time, leading to significant overall enhancements. A systematic framework is crucial for tracking and measuring these incremental changes.
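To see why sub-1% gains are worth tracking, it helps to work out how they compound. The sketch below uses made-up numbers (twenty independent changes, each removing 0.5% of the remaining cost); neither figure comes from Google's data, and the function name is hypothetical.

```python
def compounded_saving(per_change_saving: float, num_changes: int) -> float:
    """Fraction of the original cost removed after num_changes changes,
    each of which removes per_change_saving of the cost that remains."""
    remaining = (1.0 - per_change_saving) ** num_changes
    return 1.0 - remaining


if __name__ == "__main__":
    # Illustrative inputs only: 20 changes at 0.5% each.
    total = compounded_saving(0.005, 20)
    print(f"combined saving: {total:.2%}")
```

Under these assumed inputs the combined saving is roughly 9.5%, which is far easier to notice than any single step.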

Challenges of Application-Level Experimentation

Evaluating infrastructure changes through specific applications presents challenges:

Selection Bias: Some applications may not effectively represent the impact of certain changes.
Lack of Fleet Representation: A limited application set may not reflect overall fleet behavior.
Invisible System-Wide Benefits: Isolated measurements may miss broader impacts on co-located applications.
Technical Constraints: Some fundamental changes, such as kernel modifications, affect a whole machine and cannot be assessed accurately through application-level tests.

Adopting Machine-Level Experimentation

Google addresses these challenges by implementing changes at the machine level rather than at the application level. This allows for comprehensive measurement across all workloads on a machine, capturing effects that application-level tests might overlook.

Experiment Implementation

Typically, 1% of the fleet is designated for both experimental and control groups, with a gradual rollout following established best practices. Continuous data collection during this process allows for performance impact analysis and regression detection.
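One common way to choose machines deterministically is to hash each machine identifier into buckets. The sketch below illustrates that idea and is not Google's mechanism. It assumes the 1% figure applies to each group separately, and the names assign_arm, BUCKETS, and arm_buckets are hypothetical.

```python
import hashlib

BUCKETS = 10_000  # each bucket covers roughly 0.01% of the fleet


def assign_arm(machine_id: str, experiment: str, arm_buckets: int = 100) -> str:
    """Return 'experiment', 'control', or 'none' for a machine.

    With arm_buckets=100 out of 10,000, about 1% of machines land in each
    group (an assumed reading of the 1% figure).
    """
    if not 0 <= arm_buckets <= BUCKETS // 2:
        raise ValueError("each group may cover at most half of the buckets")
    # Salt the hash with the experiment name so that different experiments
    # select different machines.
    digest = hashlib.sha256(f"{experiment}:{machine_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    if bucket < arm_buckets:
        return "experiment"   # lowest buckets
    if bucket >= BUCKETS - arm_buckets:
        return "control"      # highest buckets
    return "none"
```

Because the experimental group grows from the lowest buckets and the control group from the highest, raising arm_buckets step by step during a gradual rollout keeps every already-enrolled machine in the same group.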

Ensuring a Balanced Setup

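The success of an experiment relies on the careful selection of machines. Both the experimental and control groups must represent the fleet accurately, with matching proportions of machine types. A mismatch skews the data and can invalidate the results: if one group holds a larger share of faster machines, for example, the hardware difference can be mistaken for an effect of the change.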
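A simple way to keep machine types balanced is stratified sampling: draw the same fraction of each machine type into each group. The following is a minimal sketch under that assumption, not a description of Google's tooling; the function name and the shape of the input are hypothetical.

```python
import random
from collections import defaultdict


def stratified_groups(machines: dict, fraction: float, seed: int = 0):
    """Split machines into (experiment, control) lists.

    machines maps machine_id -> machine_type. Each machine type contributes
    the same number of machines to both groups.
    """
    if not 0.0 <= fraction <= 0.5:
        raise ValueError("fraction must be between 0 and 0.5")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_type = defaultdict(list)
    for machine_id, machine_type in sorted(machines.items()):
        by_type[machine_type].append(machine_id)

    experiment, control = [], []
    for machine_type in sorted(by_type):
        ids = by_type[machine_type]
        rng.shuffle(ids)
        k = int(len(ids) * fraction)    # machines of this type per group
        experiment.extend(ids[:k])
        control.extend(ids[k:2 * k])    # disjoint from the experiment slice
    return experiment, control
```

Since both groups take an equal count from every machine type, their hardware mix matches each other and approximates the mix of the fleet.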

Maintaining Binary Hermeticity

When modifying library behavior, binaries must be recompiled to include experimental changes, which only activate on designated machines. A two-step rollout process is essential for ensuring reliable rollbacks and minimizing production disruptions.
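The idea of shipping a change inside the binary while activating it only on designated machines can be sketched as a guarded code path. Everything in this example is an assumption made for illustration: the MACHINE_EXPERIMENTS variable, the experiment name, and the toy allocation function are hypothetical and do not describe Google's actual mechanism.

```python
import os


def experiment_enabled(name: str, environ=os.environ) -> bool:
    """True only if the hypothetical MACHINE_EXPERIMENTS setting lists name."""
    raw = environ.get("MACHINE_EXPERIMENTS", "")
    enabled = {item.strip() for item in raw.split(",") if item.strip()}
    return name in enabled


def allocate(size: int, environ=os.environ) -> bytearray:
    """Toy library function that carries both the old and the new behavior."""
    if experiment_enabled("rounded_alloc", environ):
        # Experimental path: round the request up to a multiple of 64 bytes.
        size = (size + 63) // 64 * 64
    # Default path: behavior is unchanged when the experiment is off.
    return bytearray(size)
```

Under this reading, the binary containing the dormant code path is released first, and the machine-level setting is switched on afterwards. Rolling back then means clearing the setting rather than rebuilding and redeploying binaries.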

Key Performance Metrics

Infrastructure experiments focus on metrics that reflect application and machine performance rather than superficial engagement metrics. Important measures include:

Application Productivity: Application-defined metrics capture the useful work an application completes, for example the number of requests it serves.
Machine-Level Performance: Metrics like cache misses and memory bandwidth provide additional insights.
Reliability: Monitoring for abnormal terminations and timeouts ensures new changes do not introduce failures.
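As an illustration of how such measures might be rolled up per group, the sketch below computes work per CPU-second and abnormal exits per job. The record layout and field names are hypothetical, and hardware counters such as cache misses are left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobSample:
    arm: str             # "experiment" or "control"
    work_units: float    # application-defined unit of useful work
    cpu_seconds: float   # CPU time consumed while doing that work
    abnormal_exits: int  # crashes and timeouts seen in the sampling window


def summarize(samples):
    """Aggregate productivity and reliability per experiment group."""
    totals = defaultdict(lambda: {"work": 0.0, "cpu": 0.0, "exits": 0, "jobs": 0})
    for s in samples:
        t = totals[s.arm]
        t["work"] += s.work_units
        t["cpu"] += s.cpu_seconds
        t["exits"] += s.abnormal_exits
        t["jobs"] += 1
    return {
        arm: {
            # Ratio of totals, so large jobs weigh more than small ones.
            "work_per_cpu_second": t["work"] / t["cpu"] if t["cpu"] else float("nan"),
            "abnormal_exits_per_job": t["exits"] / t["jobs"],
        }
        for arm, t in totals.items()
    }
```

Dividing total work by total CPU time, rather than averaging per-job ratios, is a design choice here: it weights each job by the resources it actually consumed.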

Statistical Analysis Tools

Analyzing collected data is critical for understanding the impact of changes before full rollout. Advanced statistical tools match jobs in experimental and control groups, allowing for reliable performance comparisons across the fleet.
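A much simplified version of matching jobs across groups is a paired comparison: for every job that runs in both groups, compute the relative change in its productivity, then summarize the changes with a mean and an approximate 95% confidence interval. This sketch swaps in a plain normal approximation and is not the statistical tooling the article refers to; the function name and input shape are hypothetical.

```python
import math
import statistics


def paired_relative_change(experiment: dict, control: dict):
    """Return (mean, low, high) of the per-job relative change.

    Both inputs map job name -> productivity (higher is better). Only jobs
    present in both groups are compared.
    """
    shared = sorted(set(experiment) & set(control))
    if len(shared) < 2:
        raise ValueError("need at least two matched jobs")
    deltas = [experiment[job] / control[job] - 1.0 for job in shared]
    mean = statistics.mean(deltas)
    # Standard error of the mean; 1.96 is the normal-approximation multiplier
    # for a 95% interval and is only reasonable with many matched jobs.
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return mean, mean - 1.96 * sem, mean + 1.96 * sem
```

If the resulting interval excludes zero, the measured change is unlikely to be noise under these assumptions. With only a handful of matched jobs, a t-distribution multiplier would be more appropriate than 1.96.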

In conclusion, as cloud infrastructure grows, optimizing resource efficiency becomes essential. A rigorous A/B experimentation framework is vital for validating optimizations, but running one presents challenges that require careful management and analysis.