11. (Appendix) Designing & Deploying Scalable Agentic Systems#

TL;DR (for practitioners)#

If you’re in a hurry, here’s the high-level recipe:

  1. Define the agentic workflow as roles (planner, tools executor, critic, router) and tools (APIs, databases, search, etc.).

  2. Use Amazon Bedrock for:

    • Foundation models and Agents for Bedrock (single or multi-agent orchestrations). (AWS Documentation)

    • Knowledge bases, guardrails, and (increasingly) AgentCore for runtime, memory, and observability.

  3. Use SageMaker for classic MLOps:

    • Train/domain-adapt your models, embedding models, ranking models, safety filters.

    • Register them in SageMaker Model Registry, promote via CI/CD. (AWS Documentation)

  4. Orchestrate the system via:

    • Agents for Bedrock / AgentCore (built-in agent orchestration), plus

    • AWS Step Functions and Lambda / ECS / EKS for workflows around the agents. (AWS Documentation)

  5. Persist memory and knowledge using S3, DynamoDB, OpenSearch, and Bedrock Knowledge Bases.

  6. Instrument heavily with CloudWatch metrics, logs, and traces, plus AgentCore / CloudWatch Application Signals for agent-level observability. (Amazon Web Services, Inc.)

  7. Scale safely using:

    • Horizontal scaling of stateless components,

    • Caching and model-routing (small vs large models),

    • Guardrails, policy enforcement, and staged rollouts (canaries / blue–green).

The rest of this markdown walks through all of this in depth, but still at a no-code / architecture & MLOps level.

What Is an “Agentic System” (in Cloud Terms)?#

In production, an agentic system is not just “an LLM with tools”. It’s a distributed application where:

  • Agents = LLM-driven components with a role (e.g., planner, data retriever, reasoner, code executor, critic).

  • Tools = APIs and services the agent can call (databases, SaaS APIs, internal microservices).

  • Memory = Long-term and short-term context (user profile, task history, RAG documents, intermediate steps).

  • Orchestration = Control flow across agents and tools: planning, retries, timeouts, routing, escalation.

  • Guardrails & Governance = Policies for safety, cost, compliance, and auditability.

On AWS, this maps roughly to:

  • Amazon Bedrock for foundation models, agents, knowledge bases, guardrails. (Amazon Web Services, Inc.)

  • Agentic runtime & orchestration via:

    • Agents for Bedrock and Bedrock AgentCore for agent workflows, tooling, memory, and observability. (Amazon Web Services, Inc.)

    • AWS Step Functions and/or open-source frameworks (LangGraph, CrewAI, pydantic-ai, etc.) for higher-level workflows. (AWS Documentation)

  • SageMaker for training, evaluation, and lifecycle of supporting models (embeddings, rerankers, safety classifiers, etc.).

  • Cloud-native infra for tools: Lambda, ECS/EKS, SQS, EventBridge, DynamoDB, S3, OpenSearch.

Non-Functional Requirements for Agentic Systems#

Before picking services, lock down your NFRs:

  • Latency: Chat-like apps want sub-2s “first token” latency; back-office agents can tolerate more.

  • Throughput & concurrency: How many parallel sessions? Spikes? Global vs regional traffic?

  • Reliability & fault-tolerance: What if a tool is down? A model throttles? A region fails?

  • Cost constraints: Per-session / per-user budget, cost allocation per team/product.

  • Safety & compliance: PII, PCI, HIPAA/GDPR, data residency.

  • Governance: Who can change prompts, tools, models? How are changes reviewed and rolled out?

  • Auditability: Can you reconstruct what an agent did and why?

Everything else follows from this.

A Reference Architecture for Agentic AI on AWS#

Think in layers:

Experience & API Layer#

  • Channels: Web/mobile app, internal console, Slack/Teams bot.

  • Ingress:

    • Amazon API Gateway / Amazon CloudFront for public-facing APIs.

    • Auth via Cognito / SSO / IAM roles.

Orchestration & Agents Layer#

  • Amazon Bedrock Agents / AgentCore for:

    • Defining agents with tools, knowledge bases, and guardrails.

    • Multi-agent collaboration (planner agent + domain experts). (AWS Documentation)

  • AWS Step Functions to:

    • Wrap the agent calls in durable workflows (start → plan → act → verify → respond).

    • Coordinate multiple services (logging, billing, notifications). (Amazon Web Services, Inc.)

  • Optional:

    • Open-source frameworks (LangGraph, CrewAI, pydantic-ai) hosted on ECS/EKS or Lambda, integrated with Bedrock models and AgentCore.

Tools & Microservices Layer#

  • Business tools:

    • REST/GraphQL APIs on Lambda, ECS, or EKS.

    • Database-backed services on RDS, DynamoDB, etc.

  • Observability tools:

    • CloudWatch Logs and Metrics.

    • Incident management systems, ticketing APIs (Jira, ServiceNow).

All tools must be:

  • Stateless, with externalized state in DBs or queues,

  • Idempotent or safely retryable,

  • Strongly authenticated/authorized (IAM, VPC, private link).

Memory, Knowledge, and State#

  • Long-term memory & knowledge:

    • Amazon S3 as the data lake.

    • Bedrock Knowledge Bases for RAG over documents.

    • OpenSearch or vector DB (self-managed or via partner) for low-latency semantic search.

  • Short-term / session memory:

    • Amazon DynamoDB or ElastiCache (Redis) for session state, conversation history pointers.

    • AgentCore Memory for managed agent memory without custom infra. (Amazon Web Services, Inc.)

Model & Data Science Layer (MLOps)#

  • SageMaker:

    • Data processing pipelines,

    • Model training (embeddings, ranking, safety filters),

    • Evaluation jobs,

    • Model Registry for versioning & approvals. (AWS Documentation)

  • Integration with Bedrock:

    • Use Bedrock models for core LLM tasks; SageMaker models as tools (e.g., classifier endpoints).

Designing the Agentic Workflow#

Decompose the Use Case into Agent Roles#

For any use case (e.g., “cloud ops copilot”, “biopharma business expert”), identify roles:

  • Router / Intent classifier: Which agent or workflow should handle this?

  • Planner: Breaks the task into steps and selects tools/agents.

  • Domain agents: Specialized in support, billing, infra, legal, etc.

  • Critic / verifier: Checks outputs (hallucinations, safety, consistency).

  • Summarizer / presenter: Formats responses for end-users or APIs.

In Bedrock, each of these can be an agent definition, or you can use multi-agent collaboration within a single Bedrock Agents setup. (Amazon Web Services, Inc.)

Tool Design Principles#

Tools define how agents act on the world. Design them as:

  • Clear, narrow APIs: e.g., get_ticket_status, scale_service, list_failed_deployments.

  • Typed and validated inputs/outputs, documented in OpenAPI where possible.

  • Secure by default:

    • Use AgentCore Identity or scoped IAM roles for tools that access AWS or third-party services. (Amazon Web Services, Inc.)

    • Enforce least privilege and explicit allow-lists.

  • Observable: Log each call with correlation IDs and agent metadata.

Memory & Knowledge Strategy#

Avoid a single “magic memory” bucket.

  • Episodic memory: Per-conversation or per-session context, stored in DynamoDB/Redis, with TTLs.

  • Semantic memory (RAG):

    • Knowledge Bases in Bedrock backed by S3/OpenSearch.

    • Clear separation between public knowledge, team knowledge, and per-tenant private knowledge.

  • Working memory:

    • Intermediate steps, tool results, scratchpads; often ephemeral but logged for debugging.

Safety, Guardrails & Policy Enforcement#

Use multiple layers:

  • Bedrock Guardrails:

  • Custom safety models:

    • Trained on SageMaker (toxicity detectors, PII detectors, policy compliance).

  • Policy-aware tools:

    • Tools themselves should validate that requested action is allowed for given user/role.

  • Audit logging:

    • Every action and tool call traceable to user, agent, and policy decision.

MLOps Lifecycle for Agentic AI#

Agentic systems are multi-model, multi-prompt, multi-tool. Your MLOps must manage all three: models, prompts, and tools.

Data & Feature Pipelines#

  • Raw data in S3 (logs, chat transcripts, tool responses, business metrics).

  • Feature stores & embeddings:

    • Embedding pipelines in SageMaker for documents and user profiles.

    • Periodic jobs to refresh indexes in knowledge bases or vector stores.

  • Labelling & feedback loops:

    • Human feedback on agent interactions (was this helpful/safe?).

    • Weak labels from monitoring (tool mismatch, errors, escalations).

Model Training & Evaluation (SageMaker)#

Typical supporting models:

  • Embedding models for RAG.

  • Rerankers or recommendation models.

  • Safety filters (toxicity, PII).

  • Routing models (which agent/tool/model to use).

Use SageMaker Pipelines to:

  • Automate data preparation → training → evaluation → registration.

  • Store trained models and metrics in Model Registry, with stage tags like staging and production. (AWS Documentation)

Prompt & Agent Versioning#

Treat prompts and agent configs like code:

  • Version-control prompts, tool schemas, and agent graphs in Git.

  • Use environment-specific configs (dev / staging / prod).

  • Run automated tests:

    • Regression tests on curated prompt suites.

    • “Safety tests” against red-team datasets.

CI/CD for Agentic Systems#

Typical flow:

  1. Developer changes prompts/agents/tools in Git.

  2. CI:

    • Lints prompts/configs.

    • Runs synthetic tests using Bedrock in a dev environment.

  3. CD:

    • Deploys updated Bedrock agents or AgentCore configurations to staging.

    • Runs shadow traffic or A/B tests.

    • On approval, promotes to production.

AWS tools: CodeCommit/CodeBuild/CodePipeline or GitHub Actions + AWS CDK/CloudFormation.

Example End-to-End System: Cloud Operations Copilot#

Let’s ground this in a concrete (but code-free) example, inspired by AWS scenarios where agents triage CloudWatch Logs and mitigate incidents. (Amazon Web Services, Inc.)

Roles & Agents#

  • Triage Agent:

    • Reads error summaries from CloudWatch Logs Insights.

    • Classifies severity and suggests likely root cause.

  • Remediation Agent:

    • Proposes runbooks or direct actions (restart service, roll back deployment).

    • Interfaces with Systems Manager Automation / Change Manager.

  • Communicator Agent:

    • Drafts human-readable updates for Slack/Teams, incident tickets.

These can be multi-agent collaborators inside Bedrock (using Bedrock Agents multi-agent features), or orchestrated externally via Step Functions and AgentCore. (Amazon Web Services, Inc.)

Tools#

  • Log analysis tool:

    • Calls CloudWatch Logs Insights to query error patterns.

  • Deployment tool:

    • Calls CodeDeploy / ECS APIs to roll back tasks.

  • Runbook tool:

    • Looks up remediation steps stored in S3 or a knowledge base.

  • Notification tool:

    • Sends updates via SNS/Slack webhook.

Orchestration#

  • Step Functions orchestrates:

    1. Trigger from CloudWatch alarm or an operator.

    2. Invoke Triage Agent (Bedrock).

    3. Parallel branch:

      • Run more detailed diagnostics.

      • Notify on-call engineer.

    4. If low/medium severity, let Remediation Agent propose and possibly execute an automated runbook.

    5. Use Communicator Agent to update incident tickets and channels.

  • All of this is instrumented with CloudWatch metrics and traces, with AgentCore Observability and CloudWatch Application Signals capturing agent steps and tool calls. (Amazon Web Services, Inc.)

Scaling Strategies for Agentic Workloads#

Concurrency, Throttling & Backpressure#

  • Understand Bedrock limits for model calls and configure:

    • Rate limits in API Gateway.

    • Concurrency limits in Lambda or ECS services.

  • Use queues (SQS, EventBridge) for non-interactive work to smooth spikes.

  • For interactive chat:

    • Use streaming responses from Bedrock to deliver tokens early while longer tools run in the background.

Caching#

  • Prompt-level caching:

    • Cache “expensive” results (e.g., summarizing a static document) in DynamoDB or Redis.

  • Embedding caching:

    • Avoid re-computing embeddings for unchanged documents or frequent queries.

  • Tool result caching:

    • Cache stable API responses (configuration, catalogs).

Model & Agent Routing for Cost/Latency#

  • Tiered models in Bedrock (small, medium, large; different vendors). (Amazon Web Services, Inc.)

  • Strategies:

    • Use a smaller / cheaper model for lightweight tasks, and escalate to larger models only when needed.

    • Use a router model or heuristic to choose which model or agent is appropriate.

  • For multi-agent setups:

    • Avoid “agent explosion”: have a router that limits which agents get invoked per request.

Horizontal Scaling of Tools and Orchestrators#

  • Design stateless orchestrator components:

    • Lambdas can scale quickly for spiky workloads.

    • ECS/EKS services for more predictable high-volume flows.

  • Use auto scaling policies based on:

    • Queue depth,

    • Error rates,

    • Latency percentiles (p95, p99).

Multi-Region and Multi-Account#

  • For global users:

    • Deploy agents in multiple regions close to users.

    • Use Route 53 or CloudFront for routing.

  • For large organizations:

    • Multi-account structure with central governance:

      • Central Bedrock / CloudWatch observability account,

      • Application accounts hosting tools and workloads. (Medium)

Observability, Monitoring, and Evaluation#

Agentic systems are complex. You need deep observability:

Telemetry Pillars#

  • Metrics:

    • Latency per step (LLM calls, tools).

    • Success/failure rates, retries.

    • Cost metrics (tokens, Bedrock usage).

  • Logs:

    • Structured logs with correlation IDs.

    • Redacted inputs/outputs where necessary.

  • Traces:

    • End-to-end traces from user request → agents → tools → response.

AWS CloudWatch and AgentCore Observability now provide features specifically for generative AI and agents (Application Signals, Bedrock observability, multi-framework support). (Amazon Web Services, Inc.)

Agentic-Specific Metrics#

Track:

  • Tool utilization:

    • Frequency and latency of each tool.

    • Error codes and failure rates.

  • Agent behavior:

    • Number of steps per request.

    • Looping/oscillation detection (too many iterations).

    • Escalation rate (how often agents ask humans for help).

  • Quality & safety:

    • User satisfaction scores (thumbs up/down).

    • Safety incidents (flagged content, blocked tool calls).

    • Consistency between tool outputs and final responses.

Continuous Evaluation#

  • Maintain golden test sets:

    • Realistic scenarios with expected outputs or constraints.

  • Periodically run:

    • Offline evaluations (accuracy, helpfulness, safety).

    • Load tests to ensure capacity and SLO adherence.

  • Integrate evaluation with CI:

    • Block deployments if regressions in quality or safety exceed thresholds.

Governance, Risk, and Compliance#

Permissions & Identity#

  • Use IAM roles with least privilege for:

  • Separate roles for:

    • Model access,

    • Data access,

    • Write vs read operations.

Data Governance#

  • Encrypt data at rest (KMS) and in transit (TLS).

  • Tag and isolate sensitive datasets; ensure correct residency.

  • Implement data retention policies:

    • How long do you store prompts, tool calls, and transcripts?

    • Can users request deletion?

Change Management & Rollback#

  • Treat agents as deployable artifacts:

    • Use staging and production environments.

    • Implement canary deployments (route a small percentage of traffic to new versions).

    • Have a one-click rollback path.

Responsible AI#

  • Maintain model cards and agent cards describing:

    • Intended use,

    • Limitations,

    • Safety mitigations.

  • Periodically review:

    • Bias assessments (especially if agents make high-stakes decisions).

    • Misuse patterns and new threat models.

Implementation Checklist#

Here’s a pragmatic sequence for building a real-world agentic system on AWS:

  1. Scope & design

    • Define user journeys, agent roles, tools, success metrics.

  2. Choose AWS components

    • Bedrock models + Agents/AgentCore for core agent logic.

    • SageMaker for supporting models and MLOps.

    • Step Functions + Lambda/ECS for orchestration & tools.

  3. Prototype

    • Single-region, limited-traffic POC.

    • Simple logging, manual evaluation.

  4. Hardening

    • Add guardrails and safety filters.

    • Introduce proper observability and tracing.

    • Build CI/CD and basic canary deployment.

  5. Scale-out

    • Optimize for latency and cost (routing, caching).

    • Add multi-agent collaboration where beneficial.

    • Implement auto scaling.

  6. Continuous improvement

    • Close the feedback loop from telemetry & user feedback to training data.

    • Regularly refresh knowledge bases and retrain supporting models.

Common Pitfalls & Anti-Patterns#

  • “Single giant agent”: One agent doing everything with a massive prompt. Hard to debug, scale, and govern.

  • No explicit tool contracts: Letting the LLM “invent” tool usage instead of having strict schemas.

  • Unbounded loops: Agents that keep thinking & calling tools indefinitely—always add step and time limits.

  • Hidden state: Storing critical context only in prompts, not in explicit memory/state stores.

  • No observability: Debugging via ad hoc logs instead of structured metrics, logs, and traces.

  • Prompt sprawl: Unversioned prompts directly edited in consoles; use Git and environments instead.

Where SageMaker, Bedrock, and AgentCore Each Fit#

To summarize their roles in an agentic MLOps stack:

  • Amazon Bedrock

    • Foundation models (Claude, etc.).

    • Knowledge bases and guardrails.

    • Agents for Bedrock for integrated tool use and multi-agent workflows. (AWS Documentation)

  • Amazon Bedrock AgentCore (emerging platform)

    • Purpose-built runtime for agents.

    • Managed memory, identity, gateway, code interpreter, browser tool.

    • Deep observability for agent steps and interactions. (Amazon Web Services, Inc.)

  • Amazon SageMaker

    • Classic MLOps backbone:

      • Training pipelines,

      • Model Registry and approvals,

      • Batch/online endpoints for supporting models.

    • Ideal for domain-specific ML around your agents.

Together, they allow you to build production-grade, scalable, observable, and governable agentic systems that go far beyond “just calling an LLM”.