Organizations running Apache Spark on Amazon EMR often use Kerberos authentication to secure connections to a centralized Apache Hive Metastore (HMS). With Amazon EMR on EKS, users can run Spark jobs on Kubernetes-based container orchestration, which improves resource utilization and shortens job startup times. However, because an HMS deployment supports only one authentication method at a time, Spark jobs on EMR on EKS must be configured for Kerberos in order to connect to an existing Kerberos-enabled HMS.
This article outlines the steps to implement Kerberos authentication for Spark jobs on Amazon EMR on EKS, so that they can run alongside Amazon EMR on EC2 workloads against the same HMS while maintaining a secure data environment.
Understanding the Architecture
Consider a data platform team that has been operating Spark jobs on Amazon EMR on EC2. Their architecture includes a Kerberos-enabled standalone HMS, with Microsoft Active Directory acting as the Key Distribution Center (KDC). As they explore Amazon EMR on EKS for new workloads, their existing HMS must continue to serve both EMR on EC2 and EMR on EKS, which means Kerberos authentication has to be configured on both platforms.
Key Components of the Solution
The architecture involves two Amazon Virtual Private Clouds (VPCs) connected via VPC peering, with distinct layers for identity management, compute, and metadata services:
- Identity and Authentication Layer: A self-managed Microsoft Active Directory Domain Controller serves as the KDC for Kerberos authentication, hosting service principals for both the HMS and Spark jobs.
- Data Platform Layer: An EKS cluster hosts both the HMS and the Spark jobs, with data stored in Amazon S3.
- Hive Metastore Service: Deployed in the hive-metastore namespace of the EKS cluster, the HMS operates independently and authenticates with the KDC using its own service principal.
- Apache Spark Execution Layer: Spark jobs are deployed using the Spark Operator on EKS, with pods configured to use Kerberos credentials through mounted ConfigMaps and Kubernetes secrets.
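To make the Spark execution layer more concrete, here is a minimal sketch of the Spark properties such a job typically needs in order to reach a Kerberized HMS. The property names are standard Spark and Hive settings, but the article does not show its actual configuration, so the function name, the principals, the ConfigMap and secret names, and the mount path are all illustrative assumptions. The keytab path also has to be readable wherever Spark performs the Kerberos login, which depends on how the job is submitted.

```python
def kerberos_spark_conf(principal, keytab_path, hms_uri, hms_principal,
                        krb5_configmap, keytab_secret, secret_mount_dir):
    """Build Spark properties for a job that talks to a Kerberized HMS.

    All argument values are supplied by the caller; nothing here is taken
    from the original article's deployment.
    """
    return {
        # Identity the Spark job logs in as, and the keytab that proves it.
        "spark.kerberos.principal": principal,
        "spark.kerberos.keytab": keytab_path,
        # ConfigMap holding krb5.conf, mounted into driver and executors.
        "spark.kubernetes.kerberos.krb5.configMapName": krb5_configmap,
        # Mount the Kubernetes secret with the keytab into both pod types.
        f"spark.kubernetes.driver.secrets.{keytab_secret}": secret_mount_dir,
        f"spark.kubernetes.executor.secrets.{keytab_secret}": secret_mount_dir,
        # Hive Metastore connection with SASL/Kerberos enabled.
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.uris": hms_uri,
        "spark.hadoop.hive.metastore.sasl.enabled": "true",
        "spark.hadoop.hive.metastore.kerberos.principal": hms_principal,
        "spark.hadoop.hadoop.security.authentication": "kerberos",
    }


# Hypothetical values for illustration only.
example_conf = kerberos_spark_conf(
    principal="spark-job@EXAMPLE.COM",
    keytab_path="/etc/security/keytabs/spark.keytab",
    hms_uri="thrift://hive-metastore.hive-metastore.svc:9083",
    hms_principal="hive/_HOST@EXAMPLE.COM",
    krb5_configmap="krb5-conf",
    keytab_secret="spark-keytab",
    secret_mount_dir="/etc/security/keytabs",
)
```

With the Spark Operator, a dictionary like this would map onto the sparkConf section of the application manifest; with spark-submit, each entry becomes a --conf flag.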
Implementation Steps
The implementation requires collaboration among three key stakeholders:
- Active Directory Administrator: Responsible for creating service accounts, setting up service principal names, generating keytab files, and configuring permissions.
- Data Platform Team: Manages configurations for Amazon EMR on EKS, retrieves keytabs from Secrets Manager, and sets up Kubernetes secrets and Helm charts.
- Data Engineering Operations: Submits jobs using the configured service account and monitors job execution.
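As an illustration of the Data Platform Team's step of turning a retrieved keytab into a Kubernetes secret, the following sketch builds a Secret manifest from raw keytab bytes. It is only a sketch under stated assumptions: the secret name, namespace, and data key are hypothetical, and the retrieval from Secrets Manager is shown as a comment because the article does not specify how the keytab is stored there.

```python
import base64
import json


def build_keytab_secret(name, namespace, keytab_bytes, key="spark.keytab"):
    """Return a Kubernetes Secret manifest (as a dict) wrapping a keytab.

    Kubernetes expects values under `data` to be base64-encoded strings.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {key: base64.b64encode(keytab_bytes).decode("ascii")},
    }


# Assumption: the keytab is stored as a binary secret. With boto3 it could
# be fetched roughly like this (secret id is hypothetical):
#   client = boto3.client("secretsmanager")
#   keytab_bytes = client.get_secret_value(SecretId="spark/keytab")["SecretBinary"]
placeholder_keytab = b"\x05\x02placeholder-not-a-real-keytab"

manifest = build_keytab_secret("spark-keytab", "spark-jobs", placeholder_keytab)
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON can be written to a file and applied with kubectl, which accepts JSON manifests as well as YAML. Avoid writing the manifest to shared or persistent storage, since base64 is an encoding and not encryption.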
Verification and Security Considerations
After deployment, verify Kerberos authentication by running a Spark job that connects to the Kerberized HMS. The job logs should confirm successful authentication and list the available databases and tables.
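A verification job of this kind can be very small. The sketch below is one possible shape, not the article's actual job: the function names and the application name are hypothetical, and it assumes the Kerberos and metastore settings are already supplied through the job's Spark configuration. If authentication against the HMS fails, the SHOW DATABASES query raises an error instead of returning rows.

```python
def list_metastore_databases(spark):
    """Return the database names visible through the Hive Metastore.

    `spark` is expected to be a SparkSession with Hive support enabled.
    The first column is read by position because its name differs
    between Spark versions.
    """
    rows = spark.sql("SHOW DATABASES").collect()
    return sorted(row[0] for row in rows)


def run_check():
    """Entry point for the verification job (requires PySpark at runtime)."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hms-kerberos-check")  # hypothetical application name
        .enableHiveSupport()
        .getOrCreate()
    )
    try:
        for name in list_metastore_databases(spark):
            print(f"database: {name}")
    finally:
        spark.stop()
```

A job script would call run_check() at the end; keeping the query in its own function makes it possible to exercise the logic without a cluster.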
For enhanced security, consider the following:
- Enable EKS envelope encryption for Kubernetes secrets containing keytabs.
- Enable TLS on the HiveServer2 and Hive Metastore Thrift endpoints to encrypt data in transit.
Conclusion
This guide provides a framework for implementing Kerberos authentication for Spark jobs on Amazon EMR on EKS, addressing the needs of organizations with existing Kerberos-enabled HMS deployments. The approach applies whether you are migrating from on-premises Hadoop or building a new cloud-native platform.
For further exploration, sample code is available in the AWS Samples GitHub repository, along with detailed verification steps for each deployment stage.