Buildkite Enhances Test Analytics at Scale Using Amazon MSK and Flink

Buildkite serves engineering teams at major companies like Slack, Reddit, and Airbnb by providing a robust CI/CD platform that manages complex build, test, and deployment processes. The platform processes over 50 billion requests per month, catering to various workloads, including routine code commits and AI model training.

Central to Buildkite's operations is the Test Engine, an analytics tool that helps teams optimize their test suites. It aggregates results from thousands of builds, identifies flaky tests, facilitates parallel test execution, and offers interactive analytics on test data. The system supports extensive metadata tagging, allowing for detailed insights across different dimensions.

The challenge is delivering these analytics in real time to many enterprise customers while handling very large data volumes. To address this, Buildkite adopted Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink, building a scalable, streaming-first analytics architecture.

Buildkite processes analytics from thousands of distributed pipelines, with peaks of 500,000 events per second and webhook payloads of up to 21 MB. The previous Rails and PostgreSQL setup could not keep up with this growth, prompting a re-architecture to a distributed streaming model.

That re-architected system included a stateful stream processor for pre-aggregations and several specialized data stores, including a key-value store for quick lookups and a relational database for pre-computed aggregates. Even so, it still struggled to provide the interactive analytics that enterprise customers required.
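As a rough illustration of what pre-aggregation in a stateful stream processor involves, the sketch below rolls raw test events up into per-test, per-window counters in plain Python. The event fields, the 60-second window, and the function name are assumptions made for illustration; none of them come from Buildkite's system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TestEvent:
    # Hypothetical event shape; the real schema is not described in the source.
    suite: str
    test: str
    passed: bool
    duration_ms: int
    timestamp_s: int


def pre_aggregate(events, window_s=60):
    """Roll events up into per-(suite, test, window_start) aggregates."""
    aggregates = defaultdict(lambda: {"runs": 0, "failures": 0, "total_ms": 0})
    for event in events:
        # Align each event to the start of its tumbling window.
        window_start = event.timestamp_s - (event.timestamp_s % window_s)
        agg = aggregates[(event.suite, event.test, window_start)]
        agg["runs"] += 1
        agg["failures"] += 0 if event.passed else 1
        agg["total_ms"] += event.duration_ms
    return dict(aggregates)
```

Aggregates of this kind are small enough to store in a relational database and fast to read, but they only answer the questions that were anticipated when the rollup keys were chosen. That trade-off is one plausible reason a pre-aggregated design can fall short for ad hoc, interactive queries.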

To overcome these limitations, Buildkite turned to Amazon MSK and Amazon Managed Service for Apache Flink. Amazon MSK serves as the ingestion layer, collecting test execution events from CI/CD pipelines. It handles between 5 MB/sec and 100 MB/sec of incoming data, accommodating the bursty nature of CI/CD workloads.

Amazon Managed Service for Apache Flink acts as the stateful processing engine, transforming raw event streams into enriched, queryable data. This setup supports real-time analytics, letting users query very large datasets interactively. The Flink jobs also detect flaky tests and enrich execution events with relevant metadata.
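To make the idea of stateful flaky test detection concrete, here is a minimal sketch in plain Python rather than Flink. It assumes one common definition of flakiness: a test that both passes and fails on the same commit. The class name, the method signature, and that definition are illustrative assumptions, not a description of Buildkite's detection logic.

```python
from collections import defaultdict


class FlakyTestDetector:
    """Keeps per-(test, commit) state, similar in spirit to keyed state in a stream processor."""

    def __init__(self):
        # Outcomes observed so far for each (test_id, commit_sha) key.
        self._outcomes = defaultdict(set)
        # Keys already reported, so each flaky pair is flagged only once.
        self._flagged = set()

    def process(self, test_id, commit_sha, passed):
        """Return True the first time a key has been seen both passing and failing."""
        key = (test_id, commit_sha)
        self._outcomes[key].add(passed)
        if len(self._outcomes[key]) == 2 and key not in self._flagged:
            self._flagged.add(key)
            return True
        return False
```

Because the detector reacts to each event as it arrives, a flaky test is flagged on the event that completes the pass/fail pair rather than in a later batch job. A production version would also need to expire old state, which this sketch omits.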

The transition to this streaming-first architecture has led to significant operational enhancements:

  • On-demand analytics: Customers can now perform complex queries across 70 billion records in seconds.
  • Real-time log streaming: Developers can diagnose failures instantly without waiting for builds to complete.
  • Proactive test intelligence: Flink detects flaky tests as they occur, allowing for immediate action.

This transformation illustrates a broader trend among enterprise SaaS companies: reliable streaming infrastructure is becoming a prerequisite for scaling operations. By using Amazon MSK and Amazon Managed Service for Apache Flink, Buildkite has not only improved its analytics capabilities but also reduced operational complexity and costs.

This editorial summary reflects AWS and other public reporting on "Buildkite Enhances Test Analytics at Scale Using Amazon MSK and Flink."

Reviewed by WTGuru editorial team.