Amazon OpenSearch Service has introduced a unified observability workspace that combines application monitoring, integration with Amazon Managed Service for Prometheus, and AI agent tracing. This enhancement allows users to query Prometheus metrics alongside logs and traces, facilitating a more streamlined troubleshooting process.
Two practical scenarios illustrate these capabilities using the OpenTelemetry sample app: a multi-agent travel planner experiencing slow responses and a checkout flow failing due to issues with a microservice.
Scenario 1: Multi-Agent Travel Planner
When users report slow responses from the travel planner, the new AI agent tracing feature makes it possible to trace the agent's full processing path and identify performance issues.
In the OpenSearch UI, users can access the Application Map to visualize the system's topology, including the travel agent and its sub-agents. Elevated latency and errors are visible on the travel agent node, prompting further investigation.
Understanding the Reasoning Chain
To delve deeper into the agent's performance, users can select Agent Traces and filter by service name and time range. This leads to a trace tree that organizes the agent's reasoning chain, showing the root agent span, LLM calls, and nested tool invocations.
In this case, a tool call within the weather agent failed, resulting in additional reasoning time before the agent returned a partial response, explaining the latency spikes.
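A trace tree like this can be modeled as nested spans, each with a name, duration, and status. The sketch below is illustrative only (span names and timings are hypothetical, and real systems would emit OpenTelemetry spans rather than build this structure by hand), but it shows the same shape the Agent Traces view renders: a root agent span containing LLM calls and nested tool invocations, with the failed weather tool call and the extra reasoning it triggered accounting for most of the latency.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Simplified span: real traces carry the same parent/child shape."""
    name: str
    duration_ms: float
    error: bool = False
    children: list = field(default_factory=list)

def print_trace(span: Span, depth: int = 0) -> None:
    """Render the reasoning chain as an indented trace tree."""
    status = "ERROR" if span.error else "OK"
    print(f"{'  ' * depth}{span.name}  {span.duration_ms:.0f} ms  [{status}]")
    for child in span.children:
        print_trace(child, depth + 1)

# Hypothetical trace mirroring the scenario: a failed tool call in the
# weather sub-agent forces extra reasoning before a partial response.
trace = Span("travel-agent", 9200, children=[
    Span("llm.plan_itinerary", 1800),
    Span("weather-agent", 6100, children=[
        Span("tool.get_forecast", 3000, error=True),  # failed tool call
        Span("llm.reason_about_failure", 2900),       # added reasoning time
    ]),
    Span("llm.compose_partial_response", 1200),
])

print_trace(trace)
```

Walking the tree this way makes the latency attribution concrete: the weather sub-agent's subtree dominates the root span's duration, and its error flag points at the failed tool call.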
Scenario 2: E-commerce Checkout Flow
During peak traffic, the checkout service experiences performance degradation. Users can navigate to APM Services to view health indicators for each instrumented service. The checkout service displays an increased error rate.
By selecting the affected service, users can access detailed metrics, including Rate, Errors, and Duration (RED) metrics. A spike in the fault rate and a doubling of p99 duration indicate when the issue began.
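The RED metrics the service page surfaces are straightforward to derive from raw request records. The sketch below uses synthetic data (the request tuples are invented for illustration) to show what each of the three signals measures: request rate over a window, the fraction of requests that failed, and a high-percentile latency such as p99.

```python
# Hypothetical per-request records for the checkout service:
# (timestamp_s, duration_ms, is_error) over a 60-second window.
requests = [(t, 40 + (t % 7) * 5, t % 10 == 0) for t in range(60)]

window_s = 60
rate = len(requests) / window_s                  # Rate: requests per second
errors = sum(1 for _, _, e in requests if e)
error_rate = errors / len(requests)              # Errors: fraction that failed
durations = sorted(d for _, d, _ in requests)
p99 = durations[int(0.99 * len(durations)) - 1]  # Duration: p99 latency

print(f"rate={rate:.2f} req/s  error_rate={error_rate:.1%}  p99={p99} ms")
```

A sudden jump in `error_rate` together with a higher `p99` is exactly the pattern described above: the fault spike and doubled tail latency mark the onset of the incident.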
Drilling Down to Root Cause
Users can examine the correlated spans for the affected time window, revealing multiple failed requests hitting the same endpoint. The trace waterfall shows that the checkout service's call to prepareOrder failed due to a product retrieval error, pinpointing the root cause.
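Reading a trace waterfall for root cause amounts to following the error downward: the culprit is the deepest failed span with no failed children, since errors above it merely propagated up. A minimal sketch of that walk, with hypothetical span names mirroring the checkout scenario (`checkout` calling `prepareOrder`, which fails on a product retrieval):

```python
# Flattened trace: each span records its error status and callees.
# Names are hypothetical, chosen to mirror the checkout scenario.
spans = {
    "checkout":     {"error": True, "children": ["prepareOrder"]},
    "prepareOrder": {"error": True, "children": ["getProduct"]},
    "getProduct":   {"error": True, "children": []},  # product retrieval error
}

def root_cause(name: str) -> str:
    """Descend through failed children until no deeper failure exists."""
    failed = [c for c in spans[name]["children"] if spans[c]["error"]]
    return root_cause(failed[0]) if failed else name

print(root_cause("checkout"))
```

Applied to the failure above, the walk skips past the checkout and `prepareOrder` errors and lands on the product retrieval span, which is what the trace waterfall pinpoints visually.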
Integration with Prometheus
To determine if issues stem from the application or underlying infrastructure, users can query Prometheus metrics directly within the OpenSearch UI. This integration allows for simultaneous access to logs, traces, and metrics without switching interfaces.
For instance, users can run PromQL queries to assess database read/write throughput or check LLM endpoint response times to identify the source of latency.
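Such queries are ordinary PromQL, so they can also be issued against the Prometheus HTTP API's `/api/v1/query` endpoint. The sketch below shows two example queries of the kind described (the metric names and endpoint URL are hypothetical placeholders, not metrics the sample app necessarily emits) and how they would be URL-encoded into an API request.

```python
from urllib.parse import urlencode

# Example PromQL queries; metric names are hypothetical placeholders.
queries = {
    "db_write_throughput":
        "rate(db_write_units_total[5m])",
    "llm_p99_latency":
        "histogram_quantile(0.99, "
        "rate(llm_request_duration_seconds_bucket[5m]))",
}

# The Prometheus HTTP API accepts instant queries at /api/v1/query.
base = "https://prometheus.example.com/api/v1/query"
for name, promql in queries.items():
    print(name, "->", f"{base}?{urlencode({'query': promql})}")
```

Comparing the database throughput curve with the LLM latency curve over the incident window is what lets users attribute the slowdown to infrastructure or to the application's upstream dependencies.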
Architectural Overview
The architecture maintains distinct storage for metrics and logs, with OpenSearch UI federating queries across both. This approach preserves the operational model of each backend while simplifying the troubleshooting workflow.
Getting Started
To use these features, users should log in to the OpenSearch UI observability workspace and ensure the Observability:apmEnabled feature toggle is enabled. The OpenSearch Observability Stack can also be explored locally, providing a fully configured environment for testing.
Conclusion
Amazon OpenSearch Service's unified observability workspace enhances the ability to monitor and troubleshoot applications effectively. By integrating AI agent tracing and Prometheus metrics, users can gain deeper insights into application performance and resolve issues more efficiently.