
Building Multi-Agentic Systems with LangGraph — and Evaluating Them Like Adults

 — #LangGraph · #Multi-Agent Systems · #AI Agents · #LangChain · #Human-in-the-Loop · #Agent Evaluation · #LLMOps

TL;DR: Multi‑agent ≠ magic. Start with a clear job-to-be-done, encode it as a stateful graph in LangGraph, keep agents small and tool-driven, wire in checkpoints + interrupts for control, and run repeatable evals (unit, component, end‑to‑end, and benchmark environments). Below you’ll find working patterns, code you can paste, and a pragmatic eval stack that won’t gaslight your on‑call charts.


1) Why multi‑agent now?

  • Task decomposition: LLMs are more reliable on a chain of narrow, well-scoped steps than on one monolithic prompt.
  • Control & debuggability: LangGraph models agents as a state machine; you get deterministic edges, replay, and time‑travel.
  • Operational realities: You need memory, fault tolerance, human approval. LangGraph bakes these into the runtime via checkpointers and interrupts.

Blunt truth: If you can’t articulate the state transitions and success criteria, you don’t have an agentic system — you have vibes. Don’t ship vibes.


2) The core primitives you’ll use in LangGraph

  • State: A typed dict (or pydantic model) your nodes read/write. Reducers aggregate values across nodes.
  • Nodes: Pure functions: State -> Partial[State]. Keep them small and testable.
  • Edges: Deterministic or conditional routing.
  • Checkpointer: Persists state per thread; unlocks memory, time‑travel, and human‑in‑the‑loop.
  • Interrupts: Pause execution to get human input or approval; resume with the answer.
  • Prebuilt ReAct Agent: create_react_agent(model, tools, ...) gives you a solid tool‑calling loop with optional memory.
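The state-plus-reducer idea is worth pinning down. Here is a plain-Python sketch of the merge semantics — not the LangGraph API itself, which does this internally each super-step — showing that plain keys are overwritten by a node's partial return while Annotated keys are combined via their reducer:

```python
import operator
from typing import Annotated, get_args, get_origin

# Toy state schema: plain keys overwrite, Annotated keys carry a reducer
SCHEMA = {
    "topic": str,                            # last write wins
    "notes": Annotated[list, operator.add],  # appended via the reducer
}

def apply_update(state: dict, update: dict) -> dict:
    """Merge one node's partial return into shared state, reducer-style."""
    merged = dict(state)
    for key, value in update.items():
        ann = SCHEMA[key]
        if get_origin(ann) is Annotated:
            _, reducer = get_args(ann)           # second arg is the reducer fn
            merged[key] = reducer(merged.get(key, []), value)
        else:
            merged[key] = value                  # no reducer: overwrite
    return merged

state = {"topic": "refunds", "notes": []}
state = apply_update(state, {"notes": ["Policy X says Y"]})                   # node 1
state = apply_update(state, {"notes": ["SLA is 24h"], "topic": "refunds v2"}) # node 2
print(state)  # {'topic': 'refunds v2', 'notes': ['Policy X says Y', 'SLA is 24h']}
```

This is why two nodes can both write `notes` in the same super-step without clobbering each other.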

3) Common multi‑agent patterns (that actually work)

  1. Supervisor → Specialists: One router delegates to focused workers (retriever, coder, clerk).
  2. Peer‑to‑Peer Handoffs: Agents invoke handoff tools that route via a command primitive (no brittle string parsing).
  3. Hierarchical Subgraphs: A node expands into its own subgraph for complex subtasks.
  4. Human checkpoints: Approve tool calls that hit money, prod, or legal.

Minimal supervisor → specialists (Python)

from typing import Annotated
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent, InjectedState
from langgraph.graph import MessagesState
from langgraph.checkpoint.memory import InMemorySaver

# Tools for the Researcher
@tool
def search_knowledgebase(query: str) -> str:
    """Search internal KB and return a concise bullet list of findings."""
    # call your KB / vector store here
    return "- Policy X says Y\n- SLA is 24h\n- Link: /kb/123"

# Tools for the Clerk
@tool
def file_ticket(summary: str, priority: str = "low") -> str:
    """Create a ticket with the given summary and priority. Returns ticket id."""
    # call your ticketing API here
    return "TCK-92841"

# Agent models
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

researcher = create_react_agent(
    model=llm,
    tools=[search_knowledgebase],
)

clerk = create_react_agent(
    model=llm,
    tools=[file_ticket],
)

# Supervisor uses sub-agents as tools
@tool
def ask_researcher(state: Annotated[dict, InjectedState]) -> str:
    """Delegate to the Researcher agent with the current thread state."""
    # InjectedState injects the graph state (a dict), hidden from the LLM's tool schema
    res = researcher.invoke({"messages": state["messages"]})
    return res["messages"][-1].content

@tool
def ask_clerk(state: Annotated[dict, InjectedState]) -> str:
    """Delegate to the Clerk agent with the current thread state."""
    res = clerk.invoke({"messages": state["messages"]})
    return res["messages"][-1].content

supervisor = create_react_agent(
    model=llm,
    tools=[ask_researcher, ask_clerk],
    # add memory via a checkpointer so conversations persist per thread
    checkpointer=InMemorySaver(),
)

# Run it
thread_cfg = {"configurable": {"thread_id": "demo-1"}}
result = supervisor.invoke(
    {"messages": [{"role": "user", "content": "Find our refund policy and file an urgent ticket summarising it."}]},
    thread_cfg,
)
print(result["messages"][-1].content)

Why this works

  • The supervisor only decides who should act next. Real work happens inside tool‑rich specialists.
  • Passing InjectedState provides full conversational context to sub‑agents without manual plumbing.
  • The checkpointer (here, in‑memory) gives you threads you can replay or fork later.

If you're using Pydantic AI instead of LangChain, the same delegation pattern works with different primitives — see production multi-agent orchestration with Pydantic AI.


4) Human‑in‑the‑loop (HITL) where it matters

Use dynamic interrupts to pause inside a node, or static interrupts to break before/after a node.

from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import interrupt, Command
from langgraph.checkpoint.memory import InMemorySaver

class State(TypedDict):
    messages: list
    human_ok: bool | None

def risky_tool_call(state: State):
    # Ask for human approval with context; execution pauses here
    approve = interrupt("Approve external API call? yes/no")
    if str(approve).strip().lower() != "yes":
        return {"messages": [{"role": "system", "content": "Cancelled by human."}], "human_ok": False}
    # ...call the external API safely...
    return {"messages": [{"role": "system", "content": "API call done."}], "human_ok": True}

builder = StateGraph(State)
builder.add_node("risky", risky_tool_call)
builder.add_edge(START, "risky")
builder.add_edge("risky", END)

graph = builder.compile(checkpointer=InMemorySaver())
cfg = {"configurable": {"thread_id": "hitl-1"}}

# 1) Run until the interrupt fires. With a checkpointer attached, invoke
#    returns normally and surfaces the pending prompt under "__interrupt__".
result = graph.invoke({"messages": []}, cfg)
print(result["__interrupt__"])  # show the approval question in your server/UI

# 2) Resume later with the human's answer via Command(resume=...); the value
#    becomes the return value of interrupt() inside the paused node.
resumed = graph.invoke(Command(resume="yes"), cfg)

Where to place HITL

  • Before irreversible side‑effects (payments, deletions, emails)
  • When confidence/faithfulness < threshold
  • On governance boundaries (legal, PII, KYC)
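These three placements collapse into a single gating predicate that decides when to fire an interrupt. A sketch — the tool names, threshold, and signals below are hypothetical; wire in your own:

```python
RISKY_TOOLS = {"send_payment", "delete_record", "send_email"}  # hypothetical names
FAITHFULNESS_FLOOR = 0.7                                       # tune per product

def needs_human(tool_name: str, faithfulness: float, touches_pii: bool) -> bool:
    """True when the run should pause for human approval."""
    if tool_name in RISKY_TOOLS:
        return True                        # irreversible side-effects
    if faithfulness < FAITHFULNESS_FLOOR:
        return True                        # low-confidence output
    return touches_pii                     # governance boundary

print(needs_human("search_knowledgebase", 0.92, False))  # False
print(needs_human("send_payment", 0.99, False))          # True
```

Call it right before the tool executes; if it returns True, hand the context to `interrupt(...)`.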

5) Turning graphs into systems: persistence, time‑travel, replay

  • Threads: Isolate each conversation/run: {"configurable": {"thread_id": "..."}}
  • Checkpoints at each super‑step: inspect state, replay, or fork.
  • Time‑travel: Modify state and resume from a prior checkpoint to explore alternatives or fix a bad branch.

Practical tip: derive thread IDs from your product session/user IDs (and log the mapping); you’ll thank yourself during incident review.
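One way to do that, as a sketch (the id scheme is an assumption, not a LangGraph requirement): make the thread id a deterministic function of user + session, so any incident ticket leads you straight back to the thread.

```python
import hashlib

def thread_id_for(user_id: str, session_id: str) -> str:
    """Deterministic, recoverable thread id: user prefix + short digest."""
    digest = hashlib.sha256(f"{user_id}:{session_id}".encode()).hexdigest()[:12]
    return f"{user_id}-{digest}"

# Same (user, session) always maps to the same thread, so you can re-derive
# it from an incident report without a lookup table.
cfg = {"configurable": {"thread_id": thread_id_for("u42", "sess-2025-01-07")}}
```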


6) Evals that actually catch regressions (for multi‑agent)

Think three layers:

(A) Unit & component tests (fast, CI‑friendly)

  • Node contracts: Deterministic nodes get ordinary pytest. LLM nodes get shims with fixtures + seeded prompts.
  • Tool contracts: Validate schemas, required fields, and failure handling (e.g., mock 429s, timeouts).
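For instance, a deterministic router node is just a function, so ordinary pytest covers it with no LLM in the loop (the node below is illustrative, not taken from the graph above):

```python
def route_by_intent(state: dict) -> str:
    """Toy deterministic router node: pure function of state, trivially testable."""
    text = state["messages"][-1]["content"].lower()
    return "refund_agent" if "refund" in text else "general_agent"

def test_routes_refunds():
    state = {"messages": [{"role": "user", "content": "I want a refund"}]}
    assert route_by_intent(state) == "refund_agent"

def test_defaults_to_general():
    state = {"messages": [{"role": "user", "content": "hello there"}]}
    assert route_by_intent(state) == "general_agent"
```

LLM nodes get the same shape, just with a recorded fixture standing in for the model call.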

(B) End‑to‑end task success

Measure on realistic tasks:

  • Task Success Rate (TSR): binary pass/fail from a rubric
  • Step Efficiency: steps per success, tool calls, handoffs
  • Tool Correctness: arguments match ground truth; side‑effects verified
  • Faithfulness/Groundedness: output traces back to sources (for RAG/analyst agents)
  • Safety/Policy: jailbreaks, PII leakage
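Computing the first two of these from raw run logs is a few lines; a sketch over hypothetical eval results:

```python
from statistics import mean

runs = [  # hypothetical per-task eval results
    {"passed": True,  "steps": 4, "tool_calls": 2},
    {"passed": True,  "steps": 6, "tool_calls": 3},
    {"passed": False, "steps": 9, "tool_calls": 5},
]

tsr = mean(r["passed"] for r in runs)                           # Task Success Rate
steps_per_success = mean(r["steps"] for r in runs if r["passed"])  # Step Efficiency

print(f"TSR={tsr:.2f}, steps/success={steps_per_success:.1f}")  # TSR=0.67, steps/success=5.0
```

Track both together: a prompt change that lifts TSR while doubling steps per success is not free.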

(C) Scenario & environment benchmarks

  • Web or desktop task suites (web browsing, forms, files, multi‑app workflows)
  • Safety benches (malicious instructions, risky APIs)

7) Concrete eval stack that pairs well with LangGraph

Pick one from each row to start; expand later.

  • Unit/component: pytest + DeepEval metrics. Pytest ergonomics with LLM metrics (relevancy, faithfulness, custom rubrics).
  • Tracing & scores: Logfire / LangSmith / Opik / Phoenix. Centralize traces, datasets, model‑as‑judge, human annotations, dashboards.
  • RAG‑specific: Ragas (via Opik/Phoenix integrations). Standard RAG metrics: answer relevancy, context precision/recall, faithfulness.
  • Benchmarks: WebArena / VisualWebArena, OSWorld, BrowserGym, AgentBench, Agent‑SafetyBench. Execution‑based eval in realistic environments plus safety coverage.

8) Example: end‑to‑end tests for the supervisor graph

Goal: verify the system files the right ticket summary and cites the refund policy.

# tests/test_end_to_end.py
import os
from deepeval import assert_test
from deepeval.metrics import GEval, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Suppose your app exposes a function run_supervisor(messages) -> str
from app.agents import run_supervisor

# 1) Quality rubric using an LLM-as-judge (GEval)
correctness = GEval(
    name="task_success",
    criteria=(
        "Does the final message include an actionable ticket id AND a brief, correct refund policy summary?"
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

# 2) Relevancy: ensure the final answer actually addresses the request
#    (for grounding against retrieval_context, add DeepEval's faithfulness metric)
relevancy = AnswerRelevancyMetric(threshold=0.7)

kb = "Customers get a full refund within 30 days at no extra cost."

def test_files_ticket_with_policy():
    output = run_supervisor([{"role": "user", "content": "Find refund policy and file a ticket."}])
    case = LLMTestCase(
        input="Find refund policy and file a ticket.",
        actual_output=output,
        retrieval_context=[kb],
    )
    assert_test(case, [correctness, relevancy])

Add numeric counters from your graph (steps, tool calls, handoffs) as plain pytest asserts. Example: no more than 6 steps for a pass.
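A hedged sketch of that budget assert (the field names are whatever your telemetry emits):

```python
def assert_budget(stats: dict, max_steps: int = 6, max_handoffs: int = 2) -> None:
    """Fail the test when the graph burned more steps/handoffs than budgeted."""
    assert stats["steps"] <= max_steps, f"too many steps: {stats['steps']}"
    assert stats["handoffs"] <= max_handoffs, f"too many handoffs: {stats['handoffs']}"

assert_budget({"steps": 5, "handoffs": 1})  # passes silently
```

Budgets like these catch "the agent still succeeds, but now takes 14 steps" regressions that rubric metrics miss.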


9) Measuring the graph itself (telemetry you should log)

Emit these per‑run, and wire alerts on the deltas:

  • steps_total, tool_calls_total, handoffs_total
  • invalid_tool_args_total, retries_total, timeouts_total
  • tsr (task success rate), latency_p95, cost_total
  • faithfulness_score, relevancy_score, toxicity_score

Once logged, you can: compare branches in CI, gate deploys on TSR ≥ threshold, auto‑rollback on regressions.
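A minimal deploy gate over those numbers might look like this (the thresholds and the 2-point noise band are assumptions to tune, not recommendations):

```python
def gate_deploy(baseline: dict, candidate: dict,
                tsr_floor: float = 0.85, max_cost_regress: float = 1.10) -> bool:
    """CI gate: block deploys on TSR regressions or cost blowups."""
    if candidate["tsr"] < tsr_floor:
        return False                                   # absolute quality floor
    if candidate["tsr"] < baseline["tsr"] - 0.02:
        return False                                   # allow a 2pt noise band
    return candidate["cost_total"] <= baseline["cost_total"] * max_cost_regress

print(gate_deploy({"tsr": 0.90, "cost_total": 100},
                  {"tsr": 0.88, "cost_total": 105}))   # True
```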


10) Production hardening checklist (no-nonsense edition)

  • Deterministic edges before you scale: avoid “ask the LLM who to call” unless you can afford loops.
  • Guard tools: strict JSON schemas, idempotency, and side‑effect fuses (rate limiters, circuit breakers).
  • HITL on money/legal/PII paths; make it boring to approve.
  • Time‑travel on by default: you will need to replay.
  • Eval everything in CI and on shadow traffic; never ship a prompt change untested.
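For the side-effect fuses item, a minimal circuit-breaker sketch (thresholds are illustrative; a production version also needs per-tool scoping and jittered retries):

```python
import time

class ToolFuse:
    """Tiny circuit breaker: trips after N failures, cools down, then retries."""
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.failures, self.tripped_at = 0, 0.0

    def allow(self) -> bool:
        """Is the guarded tool currently callable?"""
        if self.failures < self.max_failures:
            return True
        return (time.monotonic() - self.tripped_at) > self.cooldown_s

    def record(self, ok: bool) -> None:
        """Report a call outcome; any success resets the fuse."""
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            self.tripped_at = time.monotonic()
```

Wrap each external tool call in `if fuse.allow(): ...` so a flapping downstream API degrades one tool instead of looping the whole graph.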

11) Appendix: utilities you’ll paste a lot

Add memory to a prebuilt ReAct agent

from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import InMemorySaver
from langchain_openai import ChatOpenAI

agent = create_react_agent(
    model=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    tools=[...],
    checkpointer=InMemorySaver(),
)

Conditional routing with Command

from langgraph.types import Command

def router(state):
    # Returning Command from a node both routes and (optionally) updates state,
    # replacing a separate conditional edge.
    if "refund" in state["messages"][-1]["content"].lower():
        return Command(goto="refund_agent")
    return Command(goto="general_agent")

Replay or fork from a checkpoint (time‑travel)

# Given thread_id + checkpoint_id, pass None as input to replay up to that
# checkpoint; supply new input (or edit state first) to fork a fresh branch.
cfg = {"configurable": {"thread_id": "demo-1", "checkpoint_id": "abc-123"}}
result = supervisor.invoke(None, cfg)

12) Putting it all together

  1. Start with a one‑page graph spec: states, nodes, edges, success rubric.
  2. Build minimal agents + tools; wire checkpoints and HITL.
  3. Stand up end‑to‑end evals with a rubric and a few golden tasks.
  4. Add tracing + dashboards; gate deploys on TSR.
  5. Only then scale the number of agents. Most wins come from better tools + routing, not more bots.

If it’s hard to explain your graph to a new engineer in 5 minutes, it’s too clever. Simplify.


For the next layer up — making multiple agent systems interoperate across teams — see MCP + A2A: The Real Stack for Interoperable AI Agents. And for a reliability-first approach to shipping agents to production, check out the production AI agent reliability playbook.

Happy shipping. And yes—measure twice, deploy once.