Building Multi-Agentic Systems with LangGraph — and Evaluating Them Like Adults
— #LangGraph · #Multi-Agent Systems · #AI Agents · #LangChain · #Human-in-the-Loop · #Agent Evaluation · #LLMOps
TL;DR: Multi‑agent ≠ magic. Start with a clear job-to-be-done, encode it as a stateful graph in LangGraph, keep agents small and tool-driven, wire in checkpoints + interrupts for control, and run repeatable evals (unit, component, end‑to‑end, and benchmark environments). Below you’ll find working patterns, code you can paste, and a pragmatic eval stack that won’t gaslight your on‑call charts.
1) Why multi‑agent now?
- Task decomposition: LLMs are better at a chain of narrow steps than one monolith.
- Control & debuggability: LangGraph models agents as a state machine; you get deterministic edges, replay, and time‑travel.
- Operational realities: You need memory, fault tolerance, human approval. LangGraph bakes these into the runtime via checkpointers and interrupts.
Blunt truth: If you can’t articulate the state transitions and success criteria, you don’t have an agentic system—you have vibes. Don’t ship vibes.
2) The core primitives you’ll use in LangGraph
- State: A typed dict (or pydantic model) your nodes read/write. Reducers aggregate values across nodes.
- Nodes: Pure functions: State -> Partial[State]. Keep them small and testable.
- Edges: Deterministic or conditional routing.
- Checkpointer: Persists state per thread; unlocks memory, time‑travel, and human‑in‑the‑loop.
- Interrupts: Pause execution to get human input or approval; resume with the answer.
- Prebuilt ReAct Agent: create_react_agent(model, tools, ...) gives you a solid tool‑calling loop with optional memory.
3) Common multi‑agent patterns (that actually work)
- Supervisor → Specialists: One router delegates to focused workers (retriever, coder, clerk).
- Peer‑to‑Peer Handoffs: Agents invoke handoff tools that route via a command primitive (no brittle string parsing).
- Hierarchical Subgraphs: A node expands into its own subgraph for complex subtasks.
- Human checkpoints: Approve tool calls that hit money, prod, or legal.
Minimal supervisor → specialists (Python)
from typing import Annotated
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent, InjectedState
from langgraph.graph import MessagesState
from langgraph.checkpoint.memory import InMemorySaver
# Tools for the Researcher
@tool
def search_knowledgebase(query: str) -> str:
    """Search internal KB and return a concise bullet list of findings."""
    # call your KB / vector store here
    return "- Policy X says Y\n- SLA is 24h\n- Link: /kb/123"

# Tools for the Clerk
@tool
def file_ticket(summary: str, priority: str = "low") -> str:
    """Create a ticket with the given summary and priority. Returns ticket id."""
    # call your ticketing API here
    return "TCK-92841"

# Agent models
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
researcher = create_react_agent(
    model=llm,
    tools=[search_knowledgebase],
)
clerk = create_react_agent(
    model=llm,
    tools=[file_ticket],
)
# Supervisor uses sub-agents as tools
@tool
def ask_researcher(state: Annotated[dict, InjectedState]) -> str:
    """Delegate to the Researcher agent with the current thread state."""
    res = researcher.invoke({"messages": state["messages"]})
    return res["messages"][-1].content

@tool
def ask_clerk(state: Annotated[dict, InjectedState]) -> str:
    """Delegate to the Clerk agent with the current thread state."""
    res = clerk.invoke({"messages": state["messages"]})
    return res["messages"][-1].content

supervisor = create_react_agent(
    model=llm,
    tools=[ask_researcher, ask_clerk],
    # add memory via a checkpointer so conversations persist per thread
    checkpointer=InMemorySaver(),
)
# Run it
thread_cfg = {"configurable": {"thread_id": "demo-1"}}
result = supervisor.invoke(
    {"messages": [{"role": "user", "content": "Find our refund policy and file an urgent ticket summarising it."}]},
    thread_cfg,
)
print(result["messages"][-1].content)
Why this works
- The supervisor only decides who should act next. Real work happens inside tool‑rich specialists.
- Passing InjectedState provides full conversational context to sub‑agents without manual plumbing.
- The checkpointer (here, in‑memory) gives you threads you can replay or fork later.
If you're using Pydantic AI instead of LangChain, the same delegation pattern works with different primitives — see production multi-agent orchestration with Pydantic AI.
4) Human‑in‑the‑loop (HITL) where it matters
Use dynamic interrupts to pause inside a node, or static interrupts to break before/after a node.
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import interrupt, Command
from langgraph.checkpoint.memory import InMemorySaver
class State(TypedDict):
    messages: list
    human_ok: bool | None

def risky_tool_call(state: State):
    # Ask for human approval with context; execution pauses here
    approve = interrupt("Approve external API call? yes/no")
    if str(approve).strip().lower() != "yes":
        return {"messages": [{"role": "system", "content": "Cancelled by human."}], "human_ok": False}
    # ...call the external API safely...
    return {"messages": [{"role": "system", "content": "API call done."}], "human_ok": True}

builder = StateGraph(State)
builder.add_node("risky", risky_tool_call)
builder.add_edge(START, "risky")
builder.add_edge("risky", END)
graph = builder.compile(checkpointer=InMemorySaver())
cfg = {"configurable": {"thread_id": "hitl-1"}}
# 1) Run until the interrupt fires; with a checkpointer, invoke returns
#    and the result carries an "__interrupt__" entry for your server/UI
result = graph.invoke({"messages": []}, cfg)
print(result["__interrupt__"])
# 2) Resume later with the human's answer (e.g., collected from your UI)
resumed = graph.invoke(Command(resume="yes"), cfg)
Where to place HITL
- Before irreversible side‑effects (payments, deletions, emails)
- When confidence/faithfulness < threshold
- On governance boundaries (legal, PII, KYC)
5) Turning graphs into systems: persistence, time‑travel, replay
- Threads: Isolate each conversation/run: {"configurable": {"thread_id": "..."}}
- Checkpoints at each super‑step: inspect state, replay, or fork.
- Time‑travel: Modify state and resume from a prior checkpoint to explore alternatives or fix a bad branch.
Practical tip: store thread IDs in your product session/user ids; you’ll thank yourself during incident review.
6) Evals that actually catch regressions (for multi‑agent)
Think three layers:
(A) Unit & component tests (fast, CI‑friendly)
- Node contracts: Deterministic nodes get ordinary pytest. LLM nodes get shims with fixtures + seeded prompts.
- Tool contracts: Validate schemas, required fields, and failure handling (e.g., mock 429s, timeouts).
(B) End‑to‑end task success
Measure on realistic tasks:
- Task Success Rate (TSR): binary pass/fail from a rubric
- Step Efficiency: steps per success, tool calls, handoffs
- Tool Correctness: arguments match ground truth; side‑effects verified
- Faithfulness/Groundedness: output traces back to sources (for RAG/analyst agents)
- Safety/Policy: jailbreaks, PII leakage
(C) Scenario & environment benchmarks
- Web or desktop task suites (web browsing, forms, files, multi‑app workflows)
- Safety benches (malicious instructions, risky APIs)
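Layer (A) can be as plain as pytest-style contract checks. The file_ticket below is a stand‑in validator, not the article’s real tool; point the tests at your implementation.

```python
# Contract test sketch for a ticket-filing tool (layer A). Stand-in validator;
# swap in your real tool and keep the assertions.
VALID_PRIORITIES = {"low", "medium", "high", "urgent"}

def file_ticket(summary: str, priority: str = "low") -> str:
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"invalid priority: {priority}")
    if not summary.strip():
        raise ValueError("summary must be non-empty")
    return "TCK-00001"

def test_rejects_bad_priority():
    try:
        file_ticket("refund policy", priority="whenever")
    except ValueError:
        return
    raise AssertionError("expected ValueError for bad priority")

def test_happy_path_returns_ticket_id():
    assert file_ticket("refund policy", "urgent").startswith("TCK-")
```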
7) Concrete eval stack that pairs well with LangGraph
Pick one from each row to start; expand later.
| Layer | Recommended tools | Why it’s useful |
|---|---|---|
| Unit/component | pytest + DeepEval metrics | Pytest ergonomics with LLM metrics (relevancy, faithfulness, custom rubrics). |
| Tracing & scores | Logfire / LangSmith / Opik / Phoenix | Centralize traces, datasets, model‑as‑judge, human annotations, dashboards. |
| RAG‑specific | Ragas (via Opik/Phoenix integrations) | Standard RAG metrics: answer relevancy, context precision/recall, faithfulness. |
| Benchmarks | WebArena / VisualWebArena, OSWorld, BrowserGym, AgentBench, Agent‑SafetyBench | Execution‑based eval in realistic environments + safety coverage. |
8) Example: end‑to‑end tests for the supervisor graph
Goal: verify the system files the right ticket summary and cites the refund policy.
# tests/test_end_to_end.py
from deepeval import assert_test
from deepeval.metrics import GEval, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Suppose your app exposes a function run_supervisor(messages) -> str
from app.agents import run_supervisor

# 1) Quality rubric using an LLM-as-judge (GEval)
correctness = GEval(
    name="task_success",
    criteria=(
        "Does the final message include an actionable ticket id AND a brief, correct refund policy summary?"
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

# 2) Grounding: ensure the answer uses provided context (e.g., KB snippet)
relevancy = AnswerRelevancyMetric(threshold=0.7)
kb = "Customers get a full refund within 30 days at no extra cost."

def test_files_ticket_with_policy():
    output = run_supervisor([{"role": "user", "content": "Find refund policy and file a ticket."}])
    case = LLMTestCase(
        input="Find refund policy and file a ticket.",
        actual_output=output,
        retrieval_context=[kb],
    )
    assert_test(case, [correctness, relevancy])
Add numeric counters from your graph (steps, tool calls, handoffs) as plain pytest asserts. Example: no more than 6 steps for a pass.
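The counter asserts can look like this. run_with_trace is a hypothetical wrapper that returns the final message plus the run’s counters; it is stubbed here so the test shape is visible.

```python
# run_with_trace is a hypothetical wrapper returning (output, counters);
# stubbed for illustration. Replace with counters from your graph/tracing.
def run_with_trace(messages):
    output = "Filed TCK-92841 with the refund policy summary."
    metrics = {"steps_total": 4, "tool_calls_total": 2, "handoffs_total": 1}
    return output, metrics

def test_step_budget():
    _, metrics = run_with_trace([{"role": "user", "content": "File a ticket."}])
    assert metrics["steps_total"] <= 6     # no more than 6 steps for a pass
    assert metrics["handoffs_total"] <= 2
```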
9) Measuring the graph itself (telemetry you should log)
Emit these per‑run, and wire alerts on the deltas:
- steps_total, tool_calls_total, handoffs_total
- invalid_tool_args_total, retries_total, timeouts_total
- tsr (task success rate), latency_p95, cost_total
- faithfulness_score, relevancy_score, toxicity_score
Once logged, you can: compare branches in CI, gate deploys on TSR ≥ threshold, auto‑rollback on regressions.
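The TSR gate is a few lines once the counters are logged. The thresholds below are illustrative defaults, not recommendations:

```python
# Deploy gate sketch: block shipping when TSR is too low or regresses too far.
def gate_deploy(baseline_tsr: float, candidate_tsr: float,
                min_tsr: float = 0.85, max_regression: float = 0.02) -> bool:
    """Return True only if the candidate branch may ship."""
    if candidate_tsr < min_tsr:
        return False                      # absolute quality floor
    if baseline_tsr - candidate_tsr > max_regression:
        return False                      # relative regression guard
    return True
```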
10) Production hardening checklist (no-nonsense edition)
- Deterministic edges before you scale: avoid “ask the LLM who to call” unless you can afford loops.
- Guard tools: strict JSON schemas, idempotency, and side‑effect fuses (rate limiters, circuit breakers).
- HITL on money/legal/PII paths; make it boring to approve.
- Time‑travel on by default: you will need to replay.
- Eval everything in CI and on shadow traffic; never ship a prompt change untested.
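A side‑effect fuse can be as small as this circuit breaker around a tool call; the thresholds and half‑open probe are simplified for illustration.

```python
# Circuit-breaker sketch for guarding a side-effecting tool. After enough
# consecutive failures, calls are refused until a cool-down elapses.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: tool temporarily disabled")
            # half-open: allow one probe call through
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```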
11) Appendix: utilities you’ll paste a lot
Add memory to a prebuilt ReAct agent
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import InMemorySaver
from langchain_openai import ChatOpenAI
agent = create_react_agent(
    model=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    tools=[...],
    checkpointer=InMemorySaver(),
)
Conditional routing with Command
from langgraph.types import Command
def router(state):
    if "refund" in state["messages"][-1]["content"].lower():
        return Command(goto="refund_agent")
    return Command(goto="general_agent")
Replay or fork from a checkpoint (time‑travel)
# Given thread_id + checkpoint_id, replay previous steps and branch
cfg = {"configurable": {"thread_id": "demo-1", "checkpoint_id": "abc-123"}}
result = supervisor.invoke(None, cfg)  # input=None resumes from that checkpoint
12) Putting it all together
- Start with a one‑page graph spec: states, nodes, edges, success rubric.
- Build minimal agents + tools; wire checkpoints and HITL.
- Stand up end‑to‑end evals with a rubric and a few golden tasks.
- Add tracing + dashboards; gate deploys on TSR.
- Only then scale the number of agents. Most wins come from better tools + routing, not more bots.
If it’s hard to explain your graph to a new engineer in 5 minutes, it’s too clever. Simplify.
For the next layer up — making multiple agent systems interoperate across teams — see MCP + A2A: The Real Stack for Interoperable AI Agents. And for a reliability-first approach to shipping agents to production, check out the production AI agent reliability playbook.
Happy shipping. And yes—measure twice, deploy once.