Data engineers often face challenges in managing compatibility and optimizing performance across various analytics engines. To address these issues, Google Cloud has launched gcs-analytics-core, an open-source Java library aimed at centralizing and accelerating analytics optimizations for Google Cloud Storage (GCS).
This library lets users choose their preferred analytics engine without giving up read performance on GCS. It currently optimizes Apache Spark workloads on Apache Iceberg, with support for additional analytics engines planned by year’s end.
Understanding gcs-analytics-core
The gcs-analytics-core library acts as a centralized optimization layer between analytics engines (such as Apache Spark, Trino, and Apache Hive) and the GCS Java SDK. It intercepts read calls and applies its optimizations at that layer, so each engine gets the same behavior without engine-specific tuning.
For users of Apache Iceberg, the library integrates with the GCSFileIO implementation, substituting traditional sequential reads with parallelized methods to reduce latency and increase throughput.
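To make the difference between sequential and parallelized range reads concrete, here is a minimal Python sketch. It is not the library's Java implementation: `read_range` is a hypothetical stand-in for a single ranged GCS request, and the data lives in memory rather than in a bucket.

```python
from concurrent.futures import ThreadPoolExecutor


def read_range(blob: bytes, offset: int, length: int) -> bytes:
    """Hypothetical stand-in for one ranged read against an object store."""
    return blob[offset:offset + length]


def read_sequential(blob: bytes, ranges: list[tuple[int, int]]) -> list[bytes]:
    """Fetch each (offset, length) range one after another."""
    return [read_range(blob, off, ln) for off, ln in ranges]


def read_vectored(blob: bytes, ranges: list[tuple[int, int]],
                  max_workers: int = 4) -> list[bytes]:
    """Submit all ranges to a thread pool and collect results in request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_range, blob, off, ln) for off, ln in ranges]
        return [f.result() for f in futures]


blob = bytes(range(256)) * 4              # 1024 bytes of sample data
ranges = [(0, 16), (300, 8), (1000, 24)]  # (offset, length) pairs
chunks = read_vectored(blob, ranges)
print([len(c) for c in chunks])
```

With real network requests, the sequential version pays each round trip in turn, while the threaded version overlaps them. The in-memory stand-in only demonstrates that both return the same bytes in the same order.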
Key Optimizations
This library introduces several optimizations to minimize I/O time and execution duration:
- Vectored I/O (threaded): This feature enhances read performance by fetching multiple data ranges in parallel, significantly reducing the overhead associated with GCS calls.
- Smart Parquet Prefetching: When accessing Parquet data, the library prefetches footer data in a single chunk, avoiding multiple network calls that typically occur during metadata retrieval.
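The footer prefetching idea can be sketched against the Parquet file layout, where the file ends with the footer metadata, a 4-byte little-endian footer length, and the `PAR1` magic bytes. The Python below is an illustration only: `CountingStore`, `make_fake_parquet`, and the 64 KiB prefetch window are assumptions made for this example, not the library's actual classes or defaults.

```python
import struct

MAGIC = b"PAR1"


class CountingStore:
    """In-memory stand-in for an object store that counts ranged reads."""

    def __init__(self, data: bytes):
        self.data = data
        self.requests = 0

    def size(self) -> int:
        return len(self.data)

    def read(self, offset: int, length: int) -> bytes:
        self.requests += 1
        return self.data[offset:offset + length]


def make_fake_parquet(footer: bytes, body: bytes = b"x" * 1000) -> bytes:
    """Build bytes with Parquet's framing: magic, body, footer, length, magic."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def footer_naive(store: CountingStore) -> bytes:
    """Two reads: the 8-byte tail for the footer length, then the footer."""
    size = store.size()
    tail = store.read(size - 8, 8)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return store.read(size - 8 - footer_len, footer_len)


def footer_prefetched(store: CountingStore, prefetch_bytes: int = 64 * 1024) -> bytes:
    """One read of the file tail; the footer is sliced from it when it fits."""
    size = store.size()
    start = max(0, size - prefetch_bytes)
    tail = store.read(start, size - start)
    if tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    if footer_len + 8 <= len(tail):
        return tail[len(tail) - 8 - footer_len:len(tail) - 8]
    # Footer is larger than the prefetch window: fall back to a second read.
    return store.read(size - 8 - footer_len, footer_len)


data = make_fake_parquet(b"fake-thrift-metadata")
naive, prefetched = CountingStore(data), CountingStore(data)
footer_naive(naive)
footer_prefetched(prefetched)
print(naive.requests, prefetched.requests)
```

The trade-off in this sketch is that a larger window reads more bytes than strictly needed but avoids the second round trip whenever the footer fits inside it.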
Integration with Apache Iceberg
The gcs-analytics-core library has been integrated into Apache Iceberg, starting with version 1.11.0. Users can leverage these performance enhancements by ensuring their Iceberg catalog is set to use the native GCS FileIO.
This integration allows users to benefit from optimizations like Parquet footer prefetching and multi-threaded vectored reads without complex configurations.
Catalog Compatibility
The library is compatible with all Iceberg catalogs, including REST and Hive, allowing for consistent read improvements without necessitating changes to existing infrastructure setups.
Performance Benchmarks
To validate the library's effectiveness, benchmarks were run on an open-source Apache Spark cluster with an Iceberg catalog configured to use GCSFileIO. The benchmarks used the TPC-DS schema at several dataset sizes and compared GCSFileIO with the gcs-analytics-core optimizations against the standard GCSFileIO implementation.
| TPC-DS Schema Size | Scan Time Improvement | Execution Time Improvement |
|---|---|---|
| 1 GB | 71.51% | 32.61% |
| 10 GB | 48.48% | 18.94% |
| 100 GB | 40.98% | 10.95% |
| 1 TB | 35.86% | 3.38% |
| 10 TB | 18.40% | 1.58% |
The results show improvements at every dataset size, although the relative gain shrinks as the data grows: scan time improves by 71.51% at 1 GB but 18.40% at 10 TB, and end-to-end execution time improves by 32.61% at 1 GB but only 1.58% at 10 TB.
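For readers who prefer speedup multipliers to percentages, the conversion can be sketched as below. This assumes each improvement figure is a relative reduction in time, that is, (baseline - optimized) / baseline × 100; the article does not state the exact formula, so treat the derived factors as illustrative.

```python
def speedup_factor(improvement_pct: float) -> float:
    """Convert a percent reduction in time into a speedup multiplier.

    Assumes improvement_pct = (baseline - optimized) / baseline * 100.
    """
    if not 0 <= improvement_pct < 100:
        raise ValueError("improvement must be in [0, 100)")
    return 1.0 / (1.0 - improvement_pct / 100.0)


# (scan time improvement %, execution time improvement %) from the table above
results = {
    "1 GB": (71.51, 32.61),
    "10 GB": (48.48, 18.94),
    "100 GB": (40.98, 10.95),
    "1 TB": (35.86, 3.38),
    "10 TB": (18.40, 1.58),
}

for size, (scan, total) in results.items():
    print(f"{size:>6}: scan {speedup_factor(scan):.2f}x, "
          f"execution {speedup_factor(total):.2f}x")
```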
Getting Started
Before deploying Spark workloads, users should ensure the following configurations are in place:
- Use Apache Iceberg Spark runtime 1.11.0 or later and iceberg-gcp-bundle 1.11.0 or later.
- Configure the catalog to use GCSFileIO.
- Enable the gcs-analytics-core optimization flag.
- Activate vectored I/O for optimal read performance.
The gcs-analytics-core library is open-source, inviting developers to contribute and explore its source code. Detailed implementation and micro-benchmark configurations are available in the repository.
For further information, users can review the design document for in-depth architectural details.