Introducing gcs-analytics-core: Enhancing Apache Iceberg and Spark Performance on GCS

Data engineers often face challenges in managing compatibility and optimizing performance across various analytics engines. To address these issues, Google Cloud has launched gcs-analytics-core, an open-source Java library aimed at centralizing and accelerating analytics optimizations for Google Cloud Storage (GCS).

This library allows users to choose their preferred analytics engine while ensuring high performance on GCS. Currently, it optimizes Apache Spark workloads that read through Apache Iceberg, with support for additional analytics engines planned by year’s end.

Understanding gcs-analytics-core

The gcs-analytics-core library acts as a centralized optimization layer between analytics engines (such as Apache Spark, Trino, and Apache Hive) and the GCS Java SDK. It intercepts read calls and applies its optimizations at that layer, so every engine gets the same behavior without engine-specific tuning.

For users of Apache Iceberg, the library integrates with the GCSFileIO implementation, substituting traditional sequential reads with parallelized methods to reduce latency and increase throughput.
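
The difference between sequential and parallelized range reads can be illustrated with a minimal sketch. This is not the library's code (gcs-analytics-core is a Java library); it is a Python illustration of the general technique, and `read_range`, `read_sequential`, and `read_vectored` are hypothetical names in which an in-memory byte string stands in for a GCS object.

```python
from concurrent.futures import ThreadPoolExecutor


def read_range(blob: bytes, offset: int, length: int) -> bytes:
    """Stand-in for one ranged GET against an object store (hypothetical helper)."""
    return blob[offset:offset + length]


def read_sequential(blob, ranges):
    """Baseline: issue one range request after another."""
    return [read_range(blob, off, ln) for off, ln in ranges]


def read_vectored(blob, ranges, max_workers=4):
    """Issue all range requests concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_range, blob, off, ln) for off, ln in ranges]
        # Collecting results in submission order preserves the caller's ordering.
        return [f.result() for f in futures]


if __name__ == "__main__":
    data = bytes(range(256))
    wanted = [(0, 4), (100, 3), (250, 6)]
    # Both strategies return the same bytes; only the request scheduling differs.
    assert read_vectored(data, wanted) == read_sequential(data, wanted)
```

With real network requests, the sequential version pays each round trip one after another, while the threaded version overlaps them, which is where the latency reduction would come from.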

Key Optimizations

This library introduces several optimizations to minimize I/O time and execution duration:

Vectored I/O (threaded): This feature enhances read performance by fetching multiple data ranges in parallel, significantly reducing the overhead associated with GCS calls.
Smart Parquet Prefetching: When accessing Parquet data, the library prefetches footer data in a single chunk, avoiding multiple network calls that typically occur during metadata retrieval.
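
The footer prefetching idea can be sketched as follows. A Parquet file ends with the footer metadata, a 4-byte little-endian footer length, and the magic bytes `PAR1`, so a naive reader needs one request for the trailing 8 bytes and a second for the footer itself. The sketch below is an illustration under stated assumptions, not the library's implementation: `CountingStore` is a hypothetical in-memory stand-in for an object store, and the 1 MiB prefetch window is an assumed value.

```python
import struct

MAGIC = b"PAR1"
TAIL_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


class CountingStore:
    """Hypothetical in-memory object store that counts read requests."""

    def __init__(self, data: bytes):
        self.data = data
        self.requests = 0

    def size(self) -> int:
        return len(self.data)

    def read(self, offset: int, length: int) -> bytes:
        self.requests += 1
        return self.data[offset:offset + length]


def make_fake_file(footer: bytes, body: bytes = b"x" * 100) -> bytes:
    """Build bytes with the Parquet trailer layout (the body is not real Parquet)."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer_naive(store: CountingStore) -> bytes:
    """Two requests: first the 8-byte tail, then the footer it points to."""
    size = store.size()
    tail = store.read(size - TAIL_LEN, TAIL_LEN)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return store.read(size - TAIL_LEN - footer_len, footer_len)


def read_footer_prefetched(store: CountingStore, prefetch_bytes: int = 1024 * 1024) -> bytes:
    """One request when the footer fits in the prefetched tail chunk."""
    size = store.size()
    window = max(prefetch_bytes, TAIL_LEN)  # the chunk must at least cover the tail
    start = max(0, size - window)
    chunk = store.read(start, size - start)
    if chunk[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", chunk[-8:-4])
    footer_start = size - TAIL_LEN - footer_len
    if footer_start >= start:
        return chunk[footer_start - start:len(chunk) - TAIL_LEN]
    # Footer is larger than the window: fetch only the missing prefix.
    head = store.read(footer_start, start - footer_start)
    return head + chunk[:len(chunk) - TAIL_LEN]
```

In this sketch the prefetched reader issues a single request whenever the footer fits in the window and falls back to one extra request otherwise, so it never does worse than the naive two-request path.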

Integration with Apache Iceberg

The gcs-analytics-core library has been integrated into Apache Iceberg, starting with version 1.11.0. Users can leverage these performance enhancements by ensuring their Iceberg catalog is set to use the native GCS FileIO.

This integration allows users to benefit from optimizations like Parquet footer prefetching and multi-threaded vectored reads without complex configurations.

Catalog Compatibility

The library is compatible with all Iceberg catalogs, including REST and Hive, allowing for consistent read improvements without necessitating changes to existing infrastructure setups.

Performance Benchmarks

To validate the library's effectiveness, benchmarking was conducted using an open-source Apache Spark cluster with an Iceberg catalog configured to utilize GCSFileIO. The benchmarks employed the TPC-DS schema across various dataset sizes, comparing the optimizations of gcs-analytics-core against the standard GCSFileIO implementation.

TPC-DS Schema Size	Scan Time Improvement	Execution Time Improvement
1 GB	71.51%	32.61%
10 GB	48.48%	18.94%
100 GB	40.98%	10.95%
1 TB	35.86%	3.38%
10 TB	18.40%	1.58%

The results show improvements at every dataset size on TPC-DS's complex query patterns, although the relative gain shrinks as the data grows: scan time improves by 71.51% at 1 GB but 18.40% at 10 TB, and execution time by 32.61% at 1 GB but 1.58% at 10 TB.
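
To relate these percentages to speedup factors, the short sketch below assumes that "improvement" means the relative reduction in time, that is, (baseline - optimized) / baseline. The source does not state the formula, so this is an interpretation, and `speedup_factor` is a hypothetical helper.

```python
def speedup_factor(improvement_pct: float) -> float:
    """Convert a percentage reduction in time into a baseline/optimized ratio.

    Assumes improvement = (baseline - optimized) / baseline * 100.
    """
    if not 0 <= improvement_pct < 100:
        raise ValueError("improvement must be in [0, 100)")
    return 1.0 / (1.0 - improvement_pct / 100.0)


if __name__ == "__main__":
    # Scan time improvements taken from the benchmark table.
    for size, pct in [("1 GB", 71.51), ("10 TB", 18.40)]:
        print(f"{size}: {pct}% less scan time is about {speedup_factor(pct):.2f}x faster")
```

Under that assumption, the 1 GB scan result corresponds to roughly a 3.5x faster scan, while the 10 TB result corresponds to roughly 1.2x.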

Getting Started

Before deploying Spark workloads, users should ensure the following configurations are in place:

Use Apache Iceberg Spark runtime 1.11.0+ and the iceberg-gcp-bundle 1.11.0+.
Configure the catalog to use GCSFileIO.
Enable the gcs-analytics-core optimization flag.
Activate vectored I/O for optimal read performance.
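
As a rough sketch, the catalog-level steps above might translate into Spark properties like the ones built below. The catalog name `demo` is hypothetical. The `io-impl` key and the `org.apache.iceberg.gcp.gcs.GCSFileIO` class are standard Iceberg settings, but the key used here for the gcs-analytics-core flag is an assumption that should be checked against the Iceberg GCP documentation. The setting for the final I/O step is left out because this article does not name it, and catalog type, URI, and warehouse settings are omitted because they depend on the deployment.

```python
def iceberg_gcs_conf(catalog: str) -> dict:
    """Build Spark properties for an Iceberg catalog that reads through GCSFileIO."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Register the Iceberg Spark catalog under the given name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Use the native GCS FileIO instead of the default FileIO.
        f"{prefix}.io-impl": "org.apache.iceberg.gcp.gcs.GCSFileIO",
        # ASSUMPTION: property name for the gcs-analytics-core flag; verify before use.
        f"{prefix}.gcs.analytics-core.enabled": "true",
    }


def to_spark_submit_args(conf: dict) -> str:
    """Render the properties as spark-submit --conf arguments."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


if __name__ == "__main__":
    print(to_spark_submit_args(iceberg_gcs_conf("demo")))
```

The same dictionary could be applied to a `SparkSession` builder one key at a time; rendering it as `--conf` arguments simply keeps the sketch free of a Spark dependency.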

The gcs-analytics-core library is open-source, inviting developers to contribute and explore its source code. Detailed implementation and micro-benchmark configurations are available in the repository.

For further information, users can review the design document for in-depth architectural details.