Achieving Cross-Region Resilience for Amazon OpenSearch Service

Cross-Region resilience for Amazon OpenSearch Service has traditionally relied on complex manual failover procedures that can cause downtime and data inconsistencies. The solution described here uses an active-active replication model instead: data stays synchronized across AWS Regions, and no replication relationship has to be reestablished during fail-back.

AWS provides two OpenSearch offerings: Amazon OpenSearch Service, a managed cluster-based service, and Amazon OpenSearch Serverless, a fully managed serverless option. This article implements cross-Region resiliency using Amazon OpenSearch Serverless as the primary example.

Solution Overview

This approach uses Amazon MSK Replicator for bidirectional data replication between Amazon MSK clusters in different Regions, while Amazon OpenSearch Ingestion (OSI) pipelines index the data into OpenSearch Serverless collections. The result is near real-time replication that supports active-active operation, with automatic loop prevention and consumer group offset synchronization.
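To make the loop-prevention idea concrete, here is a toy in-memory model. It is not how Amazon MSK Replicator is implemented; it only illustrates one common technique, in which each record remembers the Region where it was first produced and a replicator copies only records that originated in its source cluster. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    origin: str  # Region where the record was first produced


@dataclass
class Cluster:
    region: str
    log: list = field(default_factory=list)

    def produce(self, key, value):
        # Local producers stamp records with this cluster's Region.
        self.log.append(Record(key, value, origin=self.region))


def replicate(source, target, positions):
    """Copy not-yet-read records that originated in `source` to `target`.

    `positions` maps (source Region, target Region) to the next log index
    to read, standing in for a replicator's committed position.
    Returns the number of records copied.
    """
    direction = (source.region, target.region)
    start = positions.get(direction, 0)
    copied = 0
    for record in source.log[start:]:
        # Skip records that were themselves replicated into `source`;
        # copying them back is what would create an endless loop.
        if record.origin == source.region:
            target.log.append(record)
            copied += 1
    positions[direction] = len(source.log)
    return copied


east, west = Cluster("us-east-1"), Cluster("us-west-2")
positions = {}
east.produce("order-1", "created")
west.produce("order-2", "created")

# Run both directions repeatedly; the logs stop growing after the first pass.
for _ in range(3):
    replicate(east, west, positions)
    replicate(west, east, positions)

east_keys = sorted(r.key for r in east.log)
west_keys = sorted(r.key for r in west.log)
```

After the loop, both clusters hold the same two records, and further replication passes copy nothing.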

Architecture Design

The architecture follows a Regional-first approach: data sources write to the Amazon MSK cluster in their own Region. An AWS Lambda function acts as the producer, streaming data into the MSK cluster. OSI pipelines then consume this data and persist it in an OpenSearch Serverless collection. Amazon MSK Replicator replicates the topics in both directions, so both Regions converge on the same dataset.

Deployment Steps

To implement this solution, deploy an AWS CloudFormation template that sets up the necessary resources. The primary AWS Region is typically us-east-1, with a secondary Region such as us-west-2.
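As a rough sketch of scripting that deployment, the helper below builds one CloudFormation create_stack request per Region. It assumes the same template is deployed once in each Region and that it takes a parameter naming the peer Region; the stack name, the PeerRegion parameter, and the template URL are placeholders invented for illustration, so substitute whatever the actual template declares. The deploy function uses boto3 and is defined but not called here.

```python
def stack_request(region, peer_region, template_url,
                  stack_name="opensearch-cross-region"):
    """Build keyword arguments for CloudFormation create_stack in one Region."""
    return {
        "StackName": f"{stack_name}-{region}",
        "TemplateURL": template_url,
        "Parameters": [
            # Hypothetical parameter name; use the ones your template declares.
            {"ParameterKey": "PeerRegion", "ParameterValue": peer_region},
        ],
        # Needed when the template creates named IAM roles.
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }


def deploy(requests_by_region):
    """Create one stack per Region and return the resulting stack IDs."""
    import boto3  # imported lazily so stack_request works without the AWS SDK

    stack_ids = {}
    for region, kwargs in requests_by_region.items():
        client = boto3.client("cloudformation", region_name=region)
        stack_ids[region] = client.create_stack(**kwargs)["StackId"]
    return stack_ids


TEMPLATE_URL = "https://example.com/template.yaml"  # placeholder, not a real template
requests_by_region = {
    "us-east-1": stack_request("us-east-1", "us-west-2", TEMPLATE_URL),
    "us-west-2": stack_request("us-west-2", "us-east-1", TEMPLATE_URL),
}
# deploy(requests_by_region)  # uncomment to run against a real AWS account
```

Keeping request construction separate from the API call makes it easy to review exactly what will be deployed in each Region before anything is created.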

Configuration Details

The OSI pipeline configuration requires an IAM role with permissions for both Amazon MSK and OpenSearch Serverless, so that the pipeline can consume from the topic and write to the collection. Active-active replication needs two Amazon MSK Replicators, one for each direction of replication, along with cluster policies that allow the Replicators to connect to the MSK clusters.
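The shape of such a role policy can be sketched as follows. This is an illustrative starting point rather than the policy from the solution's template: the statement IDs and the example ARNs are placeholders, and the action list is a minimal guess at what a consumer needs. A real deployment may need further actions (for example, to look up bootstrap brokers), and the role must also be granted access in the collection's OpenSearch Serverless data access policy.

```python
import json


def pipeline_role_policy(cluster_arn, collection_arn):
    """Return an IAM policy document (as a dict) for a hypothetical pipeline role."""
    # MSK topic and consumer group ARNs reuse the cluster ARN's name/UUID suffix.
    topic_arns = cluster_arn.replace(":cluster/", ":topic/", 1) + "/*"
    group_arns = cluster_arn.replace(":cluster/", ":group/", 1) + "/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ConnectToCluster",
                "Effect": "Allow",
                "Action": ["kafka-cluster:Connect", "kafka-cluster:DescribeCluster"],
                "Resource": cluster_arn,
            },
            {
                "Sid": "ReadTopics",
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeTopic", "kafka-cluster:ReadData"],
                "Resource": topic_arns,
            },
            {
                "Sid": "ManageConsumerGroups",
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeGroup", "kafka-cluster:AlterGroup"],
                "Resource": group_arns,
            },
            {
                "Sid": "WriteToCollection",
                "Effect": "Allow",
                "Action": ["aoss:APIAccessAll", "aoss:BatchGetCollection"],
                "Resource": collection_arn,
            },
        ],
    }


# Placeholder ARNs for illustration only.
policy = pipeline_role_policy(
    "arn:aws:kafka:us-east-1:111122223333:cluster/demo-cluster/00000000-0000-0000-0000-000000000000-1",
    "arn:aws:aoss:us-east-1:111122223333:collection/examplecollectionid",
)
policy_json = json.dumps(policy, indent=2)
```

Generating the document from the cluster ARN keeps the topic and consumer group resources consistent with the cluster they belong to, which matters when the same role definition is stamped out in both Regions.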

Operational Benefits

When an application generates an event, it publishes a message to an Apache Kafka topic in the streaming cluster in its own Region. The message is stored durably, so the topic acts as a reliable buffer. The ingestion pipeline reads from the topic and indexes the data into OpenSearch Serverless, making it searchable in near real time. At the same time, Amazon MSK Replicator makes the same event stream available in the secondary Region.

Failover and Recovery

If one Region fails, applications can switch to the OpenSearch Serverless collection in the other Region, where data written before the failure remains accessible. When the impaired Region recovers, operations resume automatically, and any data written during the impairment is backfilled to the recovered Region.
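On the application side, switching Regions can be as simple as trying the endpoints in order of preference. The sketch below assumes each Region is represented by a callable that runs a search and raises ConnectionError when the Region is unreachable; the class, the fake backends, and the error type are illustrative choices, not part of any AWS SDK.

```python
class RegionalSearchClient:
    """Try each Region's search backend in order until one answers."""

    def __init__(self, backends):
        # backends: ordered list of (region_name, search_callable) pairs
        self.backends = list(backends)

    def search(self, query):
        errors = []
        for region, run_search in self.backends:
            try:
                return region, run_search(query)
            except ConnectionError as exc:
                # Remember the failure and fall through to the next Region.
                errors.append(f"{region}: {exc}")
        raise ConnectionError("all Regions failed: " + "; ".join(errors))


def impaired_backend(query):
    # Stand-in for a collection endpoint in a Region that is down.
    raise ConnectionError("endpoint unreachable")


def healthy_backend(query):
    # Stand-in for a real search call against the other Region's collection.
    return [{"query": query, "hit": "doc-1"}]


client = RegionalSearchClient([
    ("us-east-1", impaired_backend),
    ("us-west-2", healthy_backend),
])
served_by, hits = client.search("status:created")
```

Because both collections are fed from the same replicated event stream, the fallback Region can serve the query without any promotion step. A production client would typically add timeouts and health checks rather than probing the impaired Region on every request.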

Cost and Monitoring Considerations

Cross-Region data transfer may incur additional costs, and in an active-active setup the data produced in each Region is transferred to the other. For reliability, configure dead-letter queues (DLQs) for the OSI pipelines and monitor key metrics in Amazon CloudWatch, including replication latency and ingestion failures.
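A back-of-the-envelope estimate of the replicated volume helps size the transfer cost. The function below is plain arithmetic; the per-GB rate is a made-up placeholder, so look up current pricing for your Regions, and the event rate and size are example inputs. It counts only inter-Region transfer for data produced in one Region and ignores any other service charges.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month
BYTES_PER_GB = 10**9                   # decimal gigabytes


def monthly_replicated_gb(events_per_second, avg_event_bytes):
    """Volume produced in one Region that must be copied to the other."""
    return events_per_second * avg_event_bytes * SECONDS_PER_MONTH / BYTES_PER_GB


def monthly_transfer_cost(events_per_second, avg_event_bytes, rate_per_gb):
    """Estimated monthly transfer cost for one replication direction."""
    return monthly_replicated_gb(events_per_second, avg_event_bytes) * rate_per_gb


PLACEHOLDER_RATE_PER_GB = 0.02  # hypothetical rate, not a quoted AWS price

# Example: 500 events/s at 2,000 bytes each produced in one Region.
gb = monthly_replicated_gb(500, 2000)
cost = monthly_transfer_cost(500, 2000, PLACEHOLDER_RATE_PER_GB)
```

With these example inputs, one direction replicates about 2,592 GB per month. If both Regions produce data at similar rates, compute each direction separately and add the results.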

Conclusion

This solution demonstrates how to establish cross-Region resiliency for Amazon OpenSearch Serverless and managed clusters, enabling low-latency searches and high availability for distributed workloads. For detailed implementation guidance, refer to the disaster recovery section in the provided GitHub repository.