AWS DevOps Agent can autonomously diagnose a range of production incidents, including CrashLoopBackOff failures and ConfigMap deletions. It is limited, however, when the data it needs lives outside its native integrations, for example on a node's operating system or in third-party monitoring tools.
This article shows how to extend AWS DevOps Agent with a custom Model Context Protocol (MCP) server. The server gives the agent structured access to Amazon EKS worker node diagnostics, so it can carry root cause analysis down to the node level without manual intervention.
Getting Started
To implement this solution, ensure that you have the following:
- An AWS environment with an Amazon EKS cluster and SSM Agent running on the worker nodes.
- Basic knowledge of AWS services and Kubernetes.
The MCP standard lets AWS DevOps Agent connect to custom servers, which adds capabilities without changing the agent itself. The same process applies to any data source the agent cannot reach natively.
Steps to Build the MCP Server
The extensibility model consists of three key steps:
- Identify the Data Source: Determine which data the AWS DevOps Agent cannot access natively.
- Build the MCP Server: Create a server that provides safe, structured access to the identified data source.
- Connect to AWS DevOps Agent: Link the MCP server to the agent, enabling it to utilize the new tools in its investigations.
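To make the second step more concrete, here is a minimal, standard-library-only sketch of the contract an MCP server exposes: a tool list in which each tool has a name, a description, and a JSON Schema for its input, plus a call path that returns content items. It does not use an MCP SDK or implement the transport. The tool name get_node_kernel_messages, its parameter, and the placeholder result are assumptions made up for illustration.

```python
import json

# Registry of tools this sketch server would advertise to the agent.
TOOLS = {}


def tool(name, description, input_schema):
    """Register a function as a tool with a description and input schema."""
    def register(func):
        TOOLS[name] = {
            "description": description,
            "inputSchema": input_schema,
            "handler": func,
        }
        return func
    return register


@tool(
    name="get_node_kernel_messages",  # hypothetical tool name
    description="Return recent kernel messages for one worker node.",
    input_schema={
        "type": "object",
        "properties": {"instance_id": {"type": "string"}},
        "required": ["instance_id"],
    },
)
def get_node_kernel_messages(instance_id):
    # Placeholder: a real server would fetch this through a controlled
    # execution path rather than return an empty list.
    return {"instance_id": instance_id, "messages": []}


def list_tools():
    """Shape of a tools/list result: name, description, inputSchema."""
    return {
        "tools": [
            {
                "name": name,
                "description": entry["description"],
                "inputSchema": entry["inputSchema"],
            }
            for name, entry in TOOLS.items()
        ]
    }


def call_tool(name, arguments):
    """Shape of a tools/call result: content items plus an error flag."""
    if name not in TOOLS:
        return {
            "content": [{"type": "text", "text": f"Unknown tool: {name}"}],
            "isError": True,
        }
    result = TOOLS[name]["handler"](**arguments)
    return {
        "content": [{"type": "text", "text": json.dumps(result)}],
        "isError": False,
    }
```

In practice an MCP SDK would handle registration and transport; the point here is only that the agent discovers tools by schema and never sees how the server obtains the data.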
Design Principles
Three core principles guide the design of the MCP server:
- Return Structured Data: Provide findings with severity levels and stable IDs for easy filtering and correlation.
- Control Access: Avoid giving the agent shell access; instead, mediate interactions through a controlled execution model.
- Composable Tools: Design outputs from one tool to serve as inputs for others, creating a cohesive evidence chain.
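The first and third principles can be sketched with a small data model. This is one possible shape, not the format any particular server uses: the Finding class, its field names, the severity levels, and the ID scheme are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, asdict
from enum import Enum
import hashlib


class Severity(str, Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


# Ordering used when filtering by minimum severity.
SEVERITY_ORDER = [Severity.INFO, Severity.WARNING, Severity.CRITICAL]


@dataclass(frozen=True)
class Finding:
    node: str
    check: str
    severity: Severity
    message: str

    @property
    def finding_id(self) -> str:
        # Stable ID: derived only from node and check, so the same problem
        # keeps the same ID across runs even if the message text changes.
        digest = hashlib.sha256(f"{self.node}:{self.check}".encode()).hexdigest()
        return f"F-{digest[:12]}"

    def to_dict(self) -> dict:
        # JSON-friendly form that another tool can take as input.
        data = asdict(self)
        data["severity"] = self.severity.value
        data["finding_id"] = self.finding_id
        return data


def filter_by_severity(findings, minimum):
    """Keep findings at or above the given severity."""
    threshold = SEVERITY_ORDER.index(minimum)
    return [f for f in findings if SEVERITY_ORDER.index(f.severity) >= threshold]
```

Because the ID depends only on the node and the check, a later tool can accept a finding_id as input and correlate it with earlier results, which is what makes the outputs composable.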
Integration with EKS
AWS DevOps Agent integrates with Amazon EKS to monitor pod status, analyze container logs, and correlate cluster events. However, many production issues arise from the node operating system, where critical artifacts such as iptables rules and kernel messages reside. These elements are essential for diagnosing issues like DNS resolution failures and network policy enforcement problems.
Sample Implementation
The sample-eks-node-diagnostics-mcp repository illustrates this approach with an MCP server that offers structured access to node-level diagnostics. It runs those diagnostics through AWS Systems Manager (SSM) Automation, so the agent never needs direct shell access to a node.
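A minimal sketch of what mediated execution could look like follows. It is not the repository's code. The tool names, the document names in the allowlist, and the InstanceId parameter name are hypothetical; the ssm_client argument is assumed to be a boto3 SSM client, whose start_automation_execution call accepts DocumentName and Parameters.

```python
import re

# Hypothetical allowlist: tool name -> SSM Automation document name.
# The agent can only pick a key; it never supplies a command string.
ALLOWED_DIAGNOSTICS = {
    "collect_node_logs": "Example-CollectNodeLogs",
    "dump_iptables": "Example-DumpIptables",
}

# EC2 instance IDs are "i-" followed by 8 or 17 hex characters.
INSTANCE_ID_PATTERN = re.compile(r"i-(?:[0-9a-f]{8}|[0-9a-f]{17})")


def build_automation_request(tool_name: str, instance_id: str) -> dict:
    """Validate the inputs and build the Automation request arguments."""
    if tool_name not in ALLOWED_DIAGNOSTICS:
        raise ValueError(f"Diagnostic not allowed: {tool_name!r}")
    if not INSTANCE_ID_PATTERN.fullmatch(instance_id):
        raise ValueError(f"Invalid instance ID: {instance_id!r}")
    return {
        "DocumentName": ALLOWED_DIAGNOSTICS[tool_name],
        # Assumed parameter name; it depends on the document being run.
        "Parameters": {"InstanceId": [instance_id]},
    }


def run_diagnostic(ssm_client, tool_name: str, instance_id: str) -> str:
    """Start the allowlisted automation and return its execution ID."""
    request = build_automation_request(tool_name, instance_id)
    response = ssm_client.start_automation_execution(**request)
    return response["AutomationExecutionId"]
```

Validation happens before any AWS call is made, so a request for an unlisted diagnostic or a malformed instance ID fails inside the server instead of reaching a node.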
Investigation Workflow
When an incident occurs, AWS DevOps Agent can autonomously initiate an investigation. For example, if DNS resolution fails for running pods, the agent collects node logs and performs health checks. It then runs tasks in parallel to analyze iptables rules, search for network errors, and compare diagnostics with a known-good node.
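The parallel analysis and the known-good comparison can be illustrated with a short sketch. It assumes the diagnostics have already been collected into plain dictionaries; the keys iptables and kernel_log, the function names, and the error keywords are illustrative assumptions, not the agent's actual logic.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative keywords only; a real check would be more specific.
ERROR_KEYWORDS = ("table full", "unreachable", "timed out")


def diff_rules(suspect_rules, baseline_rules):
    """Compare iptables rules on the suspect node against a known-good node."""
    suspect, baseline = set(suspect_rules), set(baseline_rules)
    return {
        "missing_on_suspect": sorted(baseline - suspect),
        "extra_on_suspect": sorted(suspect - baseline),
    }


def find_network_errors(log_lines, keywords=ERROR_KEYWORDS):
    """Return log lines that contain any keyword (case-insensitive)."""
    return [
        line for line in log_lines
        if any(keyword in line.lower() for keyword in keywords)
    ]


def investigate(suspect, baseline):
    """Run both analyses in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        rules = pool.submit(diff_rules, suspect["iptables"], baseline["iptables"])
        errors = pool.submit(find_network_errors, suspect["kernel_log"])
        return {
            "rule_diff": rules.result(),
            "network_errors": errors.result(),
        }
```

A rule that exists on the healthy node but is missing on the suspect node is often a more direct lead than the raw rule dump, which is why the comparison returns differences instead of full listings.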
Conclusion
This pattern extends beyond EKS worker nodes: the same framework can give AWS DevOps Agent access to other data sources, including EC2 instances and network devices. By implementing a custom MCP server, organizations can close visibility gaps and shorten incident response times.