Deploying RAG-Powered AI Solutions with AWS Local Zones and Outposts

Organizations in regulated industries are increasingly adopting generative AI while grappling with the challenge of maintaining strict data residency. A viable solution involves deploying self-managed Small Language Models (SLMs) on-premises using AWS Outposts or in nearby AWS Local Zones.

SLMs can deliver accuracy similar to larger models for specific use cases, but their knowledge is static: it is fixed at training time. This limitation is particularly pronounced in SLMs because of their smaller parametric memory. To perform well in enterprise settings, SLMs need an architecture that supplies fresh, governed data at query time.

Retrieval-Augmented Generation (RAG) is the key architectural pattern that connects a model’s static knowledge with dynamic enterprise data. This article outlines a solution template for deploying an SLM augmented with RAG, which improves accuracy and also reduces total cost of ownership by keeping model size and inference latency low.

A practical application of this architecture is demonstrated through a chatbot designed to address technical queries about AWS Hybrid Edge products, specifically AWS Local Zones and AWS Outposts.

Architecture Overview

The chatbot solution is deployed on four EC2 instances, each serving a distinct role in the RAG pipeline:

g4dn or G7e (GPU)
m5.xlarge

The GPU instances use the Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2023); the database instance uses Amazon Linux 2023.

RAG improves model output by consulting an authoritative knowledge base before generating a response, which reduces inaccuracies and makes answers traceable to their sources. The RAG workflow runs as a structured seven-stage pipeline that keeps data within controlled environments.

Implementing RAG

Key steps for deploying the RAG environment include:

Creating vector embeddings for proprietary data and user queries using a suitable model, such as BAAI/bge-large-en-v1.5.
Using recursive character chunking to split documents into manageable sizes (600–800 tokens with 10–15% overlap) to maintain context.
Deploying a specialized database, like Milvus, for efficient storage and similarity searches.
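
The chunking step above can be sketched in plain Python. This is a minimal illustration, not the article's implementation: it approximates tokens with whitespace-separated words, and the separator order and the names split_recursive and chunk are assumptions made for this example. A production pipeline would count tokens with the embedding model's own tokenizer.

```python
from typing import List, Sequence

# Coarse-to-fine separators; this ordering is an assumption for the sketch.
SEPARATORS = ("\n\n", "\n", ". ", " ")


def count_tokens(text: str) -> int:
    """Rough token proxy: whitespace-separated words (an assumption)."""
    return len(text.split())


def split_recursive(text: str, max_tokens: int,
                    seps: Sequence[str] = SEPARATORS) -> List[str]:
    """Break text into pieces of at most max_tokens, trying coarse separators first."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not seps:
        # No separator left: fall back to a hard split on words.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    pieces: List[str] = []
    for part in text.split(seps[0]):
        pieces.extend(split_recursive(part, max_tokens, seps[1:]))
    return pieces


def chunk(text: str, max_tokens: int = 700, overlap_ratio: float = 0.12) -> List[str]:
    """Pack pieces into chunks of at most max_tokens words, repeating the tail
    of each chunk at the start of the next one to preserve context."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(max_tokens * overlap_ratio)
    # Pieces are capped at max_tokens - overlap so a carried-over tail plus
    # one piece never exceeds max_tokens.
    pieces = split_recursive(text, max_tokens - overlap)
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The defaults (700 words, 12% overlap) sit inside the 600–800 token and 10–15% ranges given above. Note that this sketch rejoins words with single spaces, so original line breaks are not preserved inside a chunk.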

Additionally, a reranking step can enhance retrieval quality by re-scoring initial results with a cross-encoder model, ensuring that only the most relevant information is processed by the SLM.
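
As a rough sketch of this two-stage retrieval, the example below first selects candidates by cosine similarity over embeddings and then re-scores them against the query text. The function names, the two-dimensional toy vectors, and the lexical_overlap scorer are all hypothetical: the scorer is only a stand-in for a cross-encoder model, which would be passed in through the same score_pair parameter.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             index: List[Tuple[str, Sequence[float]]],
             k: int) -> List[str]:
    """Stage 1: return the k passages whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def lexical_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder score: Jaccard overlap of lowercase words."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0


def rerank(query: str, passages: List[str],
           score_pair: Callable[[str, str], float],
           top_n: int) -> List[Tuple[str, float]]:
    """Stage 2: re-score each (query, passage) pair and keep the best top_n."""
    scored = [(p, score_pair(query, p)) for p in passages]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy index: (passage, embedding) pairs.
index = [
    ("AWS Outposts extends AWS infrastructure to on-premises sites", [0.9, 0.1]),
    ("AWS Local Zones place compute close to large population centers", [0.8, 0.3]),
    ("Milvus stores vectors for similarity search", [0.1, 0.9]),
]
candidates = retrieve([1.0, 0.2], index, k=2)
best = rerank("what are AWS Local Zones", candidates, lexical_overlap, top_n=1)
```

In this toy data the vector search ranks the Outposts passage first, while the re-scoring step promotes the Local Zones passage because it matches the query text more closely. That reordering is the effect a reranker is meant to provide.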

Performance Optimization

To optimize performance, it is crucial to filter out less relevant chunks before they reach the SLM. Passing fewer, more relevant chunks shortens the prompt, which reduces processing time, and keeps unrelated text from diluting the generated response.
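
One way to express this filtering is a relevance threshold combined with a token budget. The sketch below is illustrative only: select_context, min_score, and max_tokens are hypothetical names, the scores are assumed to come from an earlier retrieval or reranking step, and words are again used as a rough proxy for tokens.

```python
from typing import List, Tuple


def select_context(scored: List[Tuple[str, float]],
                   min_score: float,
                   max_tokens: int) -> List[str]:
    """Keep the highest-scoring chunks that clear min_score and fit the budget."""
    selected: List[str] = []
    used = 0
    for text, score in sorted(scored, key=lambda item: item[1], reverse=True):
        if score < min_score:
            break  # everything after this point scores lower still
        cost = len(text.split())  # word count as a rough token proxy
        if used + cost > max_tokens:
            continue  # skip chunks that would overflow the prompt budget
        selected.append(text)
        used += cost
    return selected
```

Suitable values for the threshold and the budget depend on the scorer and on the SLM's context window, so they are best tuned against a curated evaluation set rather than fixed in advance.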

Quality Assurance

Before deploying the RAG system in production, it is essential to establish quality gates, including a “Golden Dataset” of curated questions with known answers. Rerunning this dataset after every update to the model helps confirm that the change does not degrade performance.
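
A quality gate of this kind could look like the following sketch. The GoldenCase structure, the required-terms check, and the 0.9 threshold are assumptions chosen for illustration. Real evaluations often use richer metrics than substring matching, and answer_fn stands for whatever function calls the deployed RAG pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    question: str
    required_terms: List[str]  # terms a correct answer is expected to contain


def pass_rate(answer_fn: Callable[[str], str], cases: List[GoldenCase]) -> float:
    """Fraction of golden cases whose answer contains every required term."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = 0
    for case in cases:
        answer = answer_fn(case.question).lower()
        if all(term.lower() in answer for term in case.required_terms):
            passed += 1
    return passed / len(cases)


def quality_gate(answer_fn: Callable[[str], str],
                 cases: List[GoldenCase],
                 threshold: float = 0.9) -> bool:
    """Return True when the pipeline meets the required pass rate."""
    return pass_rate(answer_fn, cases) >= threshold


# Hypothetical golden dataset and a stub standing in for the real pipeline.
golden = [
    GoldenCase("Which service brings AWS infrastructure on-premises?", ["Outposts"]),
    GoldenCase("Which offering places AWS compute near large metro areas?", ["Local Zones"]),
]


def stub_answer(question: str) -> str:
    return "AWS Outposts" if "on-premises" in question else "AWS Local Zones"


gate_ok = quality_gate(stub_answer, golden)
```

Running the gate in a deployment pipeline and blocking the release when it returns False is one way to turn the Golden Dataset into an enforced check.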

Security Considerations

Implementing RAG solutions on AWS Local Zones and Outposts necessitates a robust security strategy to maintain data residency and compliance. Key security controls should include:

Hardening the code for production.
Following AWS documentation for data residency architecture.

This guide illustrates how organizations can leverage proprietary data in AI applications while ensuring compliance with regulatory requirements. By utilizing SLMs augmented with RAG, enterprises can achieve both security and enhanced AI capabilities.