Streamlined Access to Amazon S3 with Lake Formation Permissions

Data scientists and machine learning engineers frequently require access to raw data files stored in Amazon S3 for various tasks, including training models and exploring datasets. Traditionally, when access to these files was controlled by AWS Lake Formation, managing permissions could become cumbersome, often requiring separate policies for S3 buckets and IAM roles. This approach not only increased operational overhead but also posed risks of permission drift.

Lake Formation now allows direct access to the S3 data file locations of tables under its governance. Previously, users could query these tables in the AWS Glue Data Catalog using spark.sql(). Now they can also read the underlying S3 files with spark.read.parquet() or spark.read.csv(), and write to those locations with the corresponding DataFrame writers, in Amazon EMR Spark jobs, Amazon SageMaker notebooks, and custom applications, all while adhering to Lake Formation permissions.
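The two access paths can be put side by side in a short PySpark sketch. The function takes an existing SparkSession; the table name and S3 path shown in the usage note are hypothetical placeholders, and the sketch assumes the table's files are stored as Parquet.

```python
def read_both_ways(spark, table_name, data_path):
    """Read governed data through the catalog and through its S3 files.

    `spark` is an existing SparkSession. `table_name` and `data_path`
    are supplied by the caller; nothing here is specific to one account.
    """
    # Catalog path: query the Glue Data Catalog table by name.
    by_table = spark.sql(f"SELECT * FROM {table_name}")
    # File path: read the Parquet files under the table's S3 location.
    by_files = spark.read.parquet(data_path)
    return by_table, by_files
```

A call might look like read_both_ways(spark, "finance.transactions", "s3://example-bucket/processed/transactions/"), where both arguments are illustrative names rather than values from this article.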

How It Works

This capability relies on the GetTemporaryDataLocationCredentials() API, which returns temporary credentials for registered S3 locations based on the user's Lake Formation permissions. It removes the need for separate S3 bucket policies for file-level access, so fine-grained control over data access stays within a single governance model.

Real-World Application

Consider a financial services company that uses Spark on Amazon EMR to process raw transaction records stored in S3. The job transforms these records, writes them to a different S3 location, and registers the processed data in Lake Formation. The ETL job reads the raw data using IAM permissions and relies on Lake Formation permissions to read and write the curated table.
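A minimal sketch of such an ETL step is shown below. It assumes the raw records are CSV files with a header row and uses de-duplication as a stand-in for the real transformation, neither of which is stated in the scenario. The paths are passed in by the caller.

```python
def run_etl(spark, raw_path, processed_path):
    """Read raw records, apply a placeholder transform, write curated Parquet.

    `spark` is an existing SparkSession. Which permission system authorizes
    each path is decided by the platform, not by this code.
    """
    # Raw zone: in the scenario above this read is authorized by IAM.
    raw = spark.read.option("header", "true").csv(raw_path)
    # Placeholder transformation (assumption): drop exact duplicate rows.
    curated = raw.dropDuplicates()
    # Curated zone: in the scenario above this write is authorized
    # by Lake Formation permissions on the registered location.
    curated.write.mode("append").parquet(processed_path)
    return curated
```

The job code is the same for both zones; only the source of the credentials behind each path differs.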

Setting Up the Environment

To utilize this feature, users should prepare their environment as follows:

Organize S3 bucket structure:

Raw data: s3://<bucket-name>/raw/transactions/dt=2024-03-21/
Processed data: s3://<bucket-name>/processed/transactions/
Spark scripts: s3://<bucket-name>/scripts/
Logs: s3://<bucket-name>/logs/
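The layout above can be captured in a small helper so that jobs build paths consistently. This is only a convenience sketch: the function names are made up for illustration, and the bucket name is supplied by the caller.

```python
from datetime import date


def bucket_layout(bucket: str) -> dict:
    """Return the four prefixes of the layout for a given bucket name."""
    base = f"s3://{bucket}"
    return {
        "raw": f"{base}/raw/transactions/",
        "processed": f"{base}/processed/transactions/",
        "scripts": f"{base}/scripts/",
        "logs": f"{base}/logs/",
    }


def raw_partition(bucket: str, day: date) -> str:
    """Return the raw-data prefix for one day, using the dt=YYYY-MM-DD scheme."""
    return f"{bucket_layout(bucket)['raw']}dt={day.isoformat()}/"
```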

Grant the necessary permissions to both the EMR runtime role and the Data-Analyst role.
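As one possible illustration of a grant, the sketch below builds the arguments for a table-level Lake Formation grant. The exact permissions this feature requires are not listed here, so SELECT is used purely as an example; the account ID, database, and table names are hypothetical.

```python
def build_table_grant(role_arn, database, table, permissions):
    """Build keyword arguments for a Lake Formation table-level grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


# Hypothetical account ID, database, and table; the permission list is an assumption.
analyst_grant = build_table_grant(
    "arn:aws:iam::111122223333:role/Data-Analyst",
    "finance",
    "transactions",
    ["SELECT"],
)
```

With boto3 installed and credentials configured, the dictionary could be passed on as boto3.client("lakeformation").grant_permissions(**analyst_grant). A similar grant would be built for the EMR runtime role.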

Using the Java Plugin

The AWS Lake Formation Credential Vending Plugin for the AWS SDK V2 for Java builds on this capability. The plugin checks Lake Formation permissions for each requested S3 location and provides temporary credentials when access is granted. If the location's permissions are not managed by Lake Formation, it falls back to checking S3 Access Grants and IAM permissions.
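The fallback order described above can be modeled conceptually as a chain of checks. The sketch below is written in Python for illustration and is not the plugin's Java API; every name in it is hypothetical, and each checker is assumed to return credentials when it covers a location and None when it does not.

```python
from typing import Callable, Optional, Sequence, Tuple

# A checker returns credentials for a location, or None if it does not cover it.
Checker = Callable[[str], Optional[dict]]


def resolve_credentials(
    location: str, checkers: Sequence[Tuple[str, Checker]]
) -> Tuple[str, dict]:
    """Try each checker in order and return the first credentials found."""
    for name, check in checkers:
        creds = check(location)
        if creds is not None:
            return name, creds
    raise PermissionError(f"no credential source grants access to {location}")


# Toy checkers standing in for the three sources, in the order described above.
def lake_formation(loc: str) -> Optional[dict]:
    if loc.startswith("s3://example-bucket/processed/"):
        return {"token": "lf-temp"}
    return None


def access_grants(loc: str) -> Optional[dict]:
    return None  # this toy source covers nothing


def iam(loc: str) -> Optional[dict]:
    if loc.startswith("s3://example-bucket/raw/"):
        return {"token": "iam-role"}
    return None


CHAIN = [
    ("lake_formation", lake_formation),
    ("s3_access_grants", access_grants),
    ("iam", iam),
]
```

In this toy setup, a processed-data path resolves through the Lake Formation checker, a raw-data path falls through to IAM, and any other path raises PermissionError.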

Conclusion

Direct S3 location access through Lake Formation simplifies data governance: data scientists can use both spark.sql() and direct S3 file access, and both paths are governed by the same permissions. This reduces the number of policies to maintain and makes compliance easier to demonstrate.

To explore this feature, launch an EMR 7.13 cluster. The capability lets teams work with governed data in S3 more quickly without compromising security.