Google Enhances Apache Iceberg Lakehouse with BigQuery Interoperability

During the Apache Iceberg Summit in San Francisco, Google announced a preview of read and write interoperability between BigQuery and various Iceberg-compatible engines like Trino and Spark. This new feature allows users to leverage enterprise-grade native storage for their lakehouses while maintaining the openness and flexibility that Iceberg is known for.

Importance of the Update: Apache Iceberg has become a preferred choice for data platform teams needing to support multiple compute engines accessing the same data for diverse workloads. However, many users have reported that achieving true openness often comes with compromises, particularly regarding cost and performance when compared to traditional enterprise storage solutions.

To address these challenges, Google has built a storage infrastructure that integrates real-time metadata and unified governance across Cloud Storage and its query engines. This infrastructure can now be used directly with Iceberg tables.

Interoperability Features

Previously, users had to choose between Iceberg tables in the Google-managed Iceberg REST catalog and Iceberg tables managed by BigQuery, which limited the features they could use across different ETL engines. With the new interoperability, users can create, update, and query Iceberg tables in the Google-managed, serverless Iceberg REST catalog from BigQuery or from other compatible engines such as Spark and Flink. Because reads and writes work in both directions, data teams can standardize on a single, open table type.
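As a rough illustration of what pointing an engine at an Iceberg REST catalog involves, the sketch below builds the Spark properties that Apache Iceberg's Spark integration uses to register a REST catalog. The catalog name, endpoint URI, and warehouse path are placeholders invented for this example, not values from the announcement, and authentication settings are omitted; the actual endpoint and credentials should come from the Google Cloud quickstart guides.

```python
from typing import Dict


def iceberg_rest_catalog_conf(catalog: str, uri: str, warehouse: str) -> Dict[str, str]:
    """Build Spark properties that register an Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Iceberg's SQL extensions (MERGE INTO, CALL procedures, and so on).
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        # Register the catalog and tell Iceberg to talk to a REST endpoint.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


conf = iceberg_rest_catalog_conf(
    catalog="lakehouse",                        # hypothetical catalog name
    uri="https://example.com/iceberg/rest",     # placeholder endpoint
    warehouse="gs://example-bucket/warehouse",  # placeholder bucket path
)

# Print the properties in the form accepted by spark-submit.
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

Once an engine is configured this way, tables are addressed as lakehouse.<namespace>.<table>, and any other engine registered against the same catalog sees the same table metadata.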

Additionally, the Iceberg REST Catalog now provides table-level access controls, ensuring consistent governance across all engines querying or modifying Iceberg tables.

Enhanced Performance and Management

Table Management Automation: Achieving optimal query performance on Iceberg tables can be complex. Users can delegate table maintenance tasks, such as compaction and garbage collection, to Google Cloud BigLake. This capability will be available in preview for Google-managed Iceberg REST catalog tables next month and is expected to improve BigQuery performance significantly.

Advanced Runtime for BigQuery: The advanced runtime for BigQuery introduces performance enhancements that automatically accelerate analytical workloads. This feature will be available in preview for Google-managed Iceberg REST catalog tables and aims to double query performance compared to self-managed approaches.

Lightning Engine for Spark: The Lightning Engine accelerates Apache Spark queries, with a claimed improvement of more than four times over open-source Spark.

Advanced Analytics Capabilities

Real-time Streaming: Users can utilize BigQuery’s Vortex streaming infrastructure for high-throughput ingestion with near-zero read latency. This feature is already available for BigQuery-managed Iceberg tables and will be in preview for Google-managed Iceberg REST catalog tables next month.

Data Replication with Datastream: Google Cloud now supports easy replication of data from various operational databases into managed Iceberg tables using Datastream integration.

Change Data Capture: The BigQuery Storage Write API streams changes from OLTP databases to Iceberg tables in real time, eliminating the need for complex ETL pipelines.
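To make the idea of change data capture concrete, the following is a minimal, self-contained sketch of how a stream of upsert and delete events converges on the current state of a keyed table. It does not call the BigQuery Storage Write API; the apply_changes helper, the id key column, and the event shape are assumptions made purely for illustration.

```python
from typing import Any, Dict, Iterable, Tuple

Row = Dict[str, Any]


def apply_changes(
    table: Dict[int, Row], changes: Iterable[Tuple[str, Row]]
) -> Dict[int, Row]:
    """Apply ordered change events to a table keyed by the 'id' column."""
    result = dict(table)  # shallow copy so the input snapshot is not mutated
    for change_type, row in changes:
        key = row["id"]
        if change_type == "UPSERT":
            result[key] = row        # insert a new row or overwrite the old one
        elif change_type == "DELETE":
            result.pop(key, None)    # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown change type: {change_type}")
    return result


# Hypothetical snapshot of an orders table and a batch of captured changes.
orders = {
    1: {"id": 1, "status": "new"},
    2: {"id": 2, "status": "new"},
}
changes = [
    ("UPSERT", {"id": 1, "status": "shipped"}),
    ("DELETE", {"id": 2}),
    ("UPSERT", {"id": 3, "status": "new"}),
]

current = apply_changes(orders, changes)
print(current)
```

The point of the sketch is that the destination table only needs a primary key and an ordered stream of changes; a managed service takes over the merge step that would otherwise be a hand-written ETL job.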

Next Steps for Users

With the introduction of bidirectional interoperability between BigQuery and other Iceberg-compatible engines, users can modernize their lakehouses without compromising on performance, governance, or advanced analytics. Those interested in exploring these new capabilities can refer to the available quickstart guides for implementation.