
Building Multi-Agentic Systems with LangGraph — and Evaluating Them Like Adults

 — #LangGraph · #Multi-Agent Systems · #AI Agents · #LangChain · #Human-in-the-Loop · #Agent Evaluation · #LLMOps

TL;DR: Multi‑agent ≠ magic. Start with a clear job-to-be-done, encode it as a stateful graph in LangGraph, keep agents small and tool-driven, wire in checkpoints + interrupts for control, and run repeatable evals (unit, component, end‑to‑end, and benchmark environments). Below you’ll find working patterns, code you can paste, and a pragmatic eval stack that won’t gaslight your on‑call charts.


1) Why multi‑agent now?

  • Task decomposition: LLMs are more reliable on a chain of narrow, well-scoped steps than on one monolithic prompt.
  • Control & debuggability: LangGraph models agents as a state machine; you get deterministic edges, replay, and time‑travel.
  • Operational realities: You need memory, fault tolerance, human approval. LangGraph bakes these into the runtime via checkpointers and interrupts.

Blunt truth: If you can’t articulate the state transitions and success criteria, you don’t have an agentic system — you have vibes. Don’t ship vibes.


2) The core primitives you’ll use in LangGraph

  • State: A typed dict (or pydantic model) your nodes read/write. Reducers aggregate values across nodes.
  • Nodes: Pure functions: State -> Partial[State]. Keep them small and testable.
  • Edges: Deterministic or conditional routing.
  • Checkpointer: Persists state per thread; unlocks memory, time‑travel, and human‑in‑the‑loop.
  • Interrupts: Pause execution to get human input or approval; resume with the answer.
  • Prebuilt ReAct Agent: create_react_agent(model, tools, ...) gives you a solid tool‑calling loop with optional memory.
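The state-plus-reducer idea is worth pinning down. Here is a plain-Python sketch of the merge semantics — not the LangGraph API itself, which does this internally each super-step — showing that plain keys are overwritten by a node's partial return while Annotated keys are combined via their reducer:

```python
import operator
from typing import Annotated, get_args, get_origin

# Toy state schema: plain keys overwrite, Annotated keys carry a reducer
SCHEMA = {
    "topic": str,                            # last write wins
    "notes": Annotated[list, operator.add],  # appended via the reducer
}

def apply_update(state: dict, update: dict) -> dict:
    """Merge one node's partial return into shared state, reducer-style."""
    merged = dict(state)
    for key, value in update.items():
        ann = SCHEMA[key]
        if get_origin(ann) is Annotated:
            _, reducer = get_args(ann)           # second arg is the reducer fn
            merged[key] = reducer(merged.get(key, []), value)
        else:
            merged[key] = value                  # no reducer: overwrite
    return merged

state = {"topic": "refunds", "notes": []}
state = apply_update(state, {"notes": ["Policy X says Y"]})                   # node 1
state = apply_update(state, {"notes": ["SLA is 24h"], "topic": "refunds v2"}) # node 2
print(state)  # {'topic': 'refunds v2', 'notes': ['Policy X says Y', 'SLA is 24h']}
```

This is why two nodes can both write `notes` in the same super-step without clobbering each other.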

3) Common multi‑agent patterns (that actually work)

  1. Supervisor → Specialists: One router delegates to focused workers (retriever, coder, clerk).
  2. Peer‑to‑Peer Handoffs: Agents invoke handoff tools that route via a command primitive (no brittle string parsing).
  3. Hierarchical Subgraphs: A node expands into its own subgraph for complex subtasks.
  4. Human checkpoints: Approve tool calls that hit money, prod, or legal.

Minimal supervisor → specialists (Python)

from typing import Annotated
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent, InjectedState
from langgraph.graph import MessagesState
from langgraph.checkpoint.memory import InMemorySaver

# Tools for the Researcher
@tool
def search_knowledgebase(query: str) -> str:
    """Search internal KB and return a concise bullet list of findings."""
    # call your KB / vector store here
    return "- Policy X says Y\n- SLA is 24h\n- Link: /kb/123"

# Tools for the Clerk
@tool
def file_ticket(summary: str, priority: str = "low") -> str:
    """Create a ticket with the given summary and priority. Returns ticket id."""
    # call your ticketing API here
    return "TCK-92841"

# Agent models
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

researcher = create_react_agent(
    model=llm,
    tools=[search_knowledgebase],
)

clerk = create_react_agent(
    model=llm,
    tools=[file_ticket],
)

# Supervisor uses sub-agents as tools
@tool
def ask_researcher(state: Annotated[dict, InjectedState]) -> str:
    """Delegate to the Researcher agent with the current thread state."""
    # InjectedState injects the graph state (a dict), hidden from the LLM's tool schema
    res = researcher.invoke({"messages": state["messages"]})
    return res["messages"][-1].content

@tool
def ask_clerk(state: Annotated[dict, InjectedState]) -> str:
    """Delegate to the Clerk agent with the current thread state."""
    res = clerk.invoke({"messages": state["messages"]})
    return res["messages"][-1].content

supervisor = create_react_agent(
    model=llm,
    tools=[ask_researcher, ask_clerk],
    # add memory via a checkpointer so conversations persist per thread
    checkpointer=InMemorySaver(),
)

# Run it
thread_cfg = {"configurable": {"thread_id": "demo-1"}}
result = supervisor.invoke(
    {"messages": [{"role": "user", "content": "Find our refund policy and file an urgent ticket summarising it."}]},
    thread_cfg,
)
print(result["messages"][-1].content)

Why this works

  • The supervisor only decides who should act next. Real work happens inside tool‑rich specialists.
  • Passing InjectedState provides full conversational context to sub‑agents without manual plumbing.
  • The checkpointer (here, in‑memory) gives you threads you can replay or fork later.

If you're using Pydantic AI instead of LangChain, the same delegation pattern works with different primitives — see production multi-agent orchestration with Pydantic AI.


4) Human‑in‑the‑loop (HITL) where it matters

Use dynamic interrupts to pause inside a node, or static interrupts to break before/after a node.

from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import interrupt, Command
from langgraph.checkpoint.memory import InMemorySaver

class State(TypedDict):
    messages: list
    human_ok: bool | None

def risky_tool_call(state: State):
    # Ask for human approval with context; execution pauses here
    approve = interrupt("Approve external API call? yes/no")
    if str(approve).strip().lower() != "yes":
        return {"messages": [{"role": "system", "content": "Cancelled by human."}], "human_ok": False}
    # ...call the external API safely...
    return {"messages": [{"role": "system", "content": "API call done."}], "human_ok": True}

builder = StateGraph(State)
builder.add_node("risky", risky_tool_call)
builder.add_edge(START, "risky")
builder.add_edge("risky", END)

graph = builder.compile(checkpointer=InMemorySaver())
cfg = {"configurable": {"thread_id": "hitl-1"}}

# 1) Run until the interrupt fires. With a checkpointer attached, invoke
#    returns normally and surfaces the pending prompt under "__interrupt__".
result = graph.invoke({"messages": []}, cfg)
print(result["__interrupt__"])  # show the approval question in your server/UI

# 2) Resume later with the human's answer via Command(resume=...); the value
#    becomes the return value of interrupt() inside the paused node.
resumed = graph.invoke(Command(resume="yes"), cfg)

Where to place HITL

  • Before irreversible side‑effects (payments, deletions, emails)
  • When confidence/faithfulness < threshold
  • On governance boundaries (legal, PII, KYC)
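These three placements collapse into a single gating predicate that decides when to fire an interrupt. A sketch — the tool names, threshold, and signals below are hypothetical; wire in your own:

```python
RISKY_TOOLS = {"send_payment", "delete_record", "send_email"}  # hypothetical names
FAITHFULNESS_FLOOR = 0.7                                       # tune per product

def needs_human(tool_name: str, faithfulness: float, touches_pii: bool) -> bool:
    """True when the run should pause for human approval."""
    if tool_name in RISKY_TOOLS:
        return True                        # irreversible side-effects
    if faithfulness < FAITHFULNESS_FLOOR:
        return True                        # low-confidence output
    return touches_pii                     # governance boundary

print(needs_human("search_knowledgebase", 0.92, False))  # False
print(needs_human("send_payment", 0.99, False))          # True
```

Call it right before the tool executes; if it returns True, hand the context to `interrupt(...)`.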

5) Turning graphs into systems: persistence, time‑travel, replay

  • Threads: Isolate each conversation/run: {"configurable": {"thread_id": "..."}}
  • Checkpoints at each super‑step: inspect state, replay, or fork.
  • Time‑travel: Modify state and resume from a prior checkpoint to explore alternatives or fix a bad branch.

Practical tip: derive thread IDs from your product session/user IDs (and log the mapping); you’ll thank yourself during incident review.
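One way to do that, as a sketch (the id scheme is an assumption, not a LangGraph requirement): make the thread id a deterministic function of user + session, so any incident ticket leads you straight back to the thread.

```python
import hashlib

def thread_id_for(user_id: str, session_id: str) -> str:
    """Deterministic, recoverable thread id: user prefix + short digest."""
    digest = hashlib.sha256(f"{user_id}:{session_id}".encode()).hexdigest()[:12]
    return f"{user_id}-{digest}"

# Same (user, session) always maps to the same thread, so you can re-derive
# it from an incident report without a lookup table.
cfg = {"configurable": {"thread_id": thread_id_for("u42", "sess-2025-01-07")}}
```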


6) Evals that actually catch regressions (for multi‑agent)

Think three layers:

(A) Unit & component tests (fast, CI‑friendly)

  • Node contracts: Deterministic nodes get ordinary pytest. LLM nodes get shims with fixtures + seeded prompts.
  • Tool contracts: Validate schemas, required fields, and failure handling (e.g., mock 429s, timeouts).
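For instance, a deterministic router node is just a function, so ordinary pytest covers it with no LLM in the loop (the node below is illustrative, not taken from the graph above):

```python
def route_by_intent(state: dict) -> str:
    """Toy deterministic router node: pure function of state, trivially testable."""
    text = state["messages"][-1]["content"].lower()
    return "refund_agent" if "refund" in text else "general_agent"

def test_routes_refunds():
    state = {"messages": [{"role": "user", "content": "I want a refund"}]}
    assert route_by_intent(state) == "refund_agent"

def test_defaults_to_general():
    state = {"messages": [{"role": "user", "content": "hello there"}]}
    assert route_by_intent(state) == "general_agent"
```

LLM nodes get the same shape, just with a recorded fixture standing in for the model call.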

(B) End‑to‑end task success

Measure on realistic tasks:

  • Task Success Rate (TSR): binary pass/fail from a rubric
  • Step Efficiency: steps per success, tool calls, handoffs
  • Tool Correctness: arguments match ground truth; side‑effects verified
  • Faithfulness/Groundedness: output traces back to sources (for RAG/analyst agents)
  • Safety/Policy: jailbreaks, PII leakage
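Computing the first two of these from raw run logs is a few lines; a sketch over hypothetical eval results:

```python
from statistics import mean

runs = [  # hypothetical per-task eval results
    {"passed": True,  "steps": 4, "tool_calls": 2},
    {"passed": True,  "steps": 6, "tool_calls": 3},
    {"passed": False, "steps": 9, "tool_calls": 5},
]

tsr = mean(r["passed"] for r in runs)                           # Task Success Rate
steps_per_success = mean(r["steps"] for r in runs if r["passed"])  # Step Efficiency

print(f"TSR={tsr:.2f}, steps/success={steps_per_success:.1f}")  # TSR=0.67, steps/success=5.0
```

Track both together: a prompt change that lifts TSR while doubling steps per success is not free.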

(C) Scenario & environment benchmarks

  • Web or desktop task suites (web browsing, forms, files, multi‑app workflows)
  • Safety benches (malicious instructions, risky APIs)

7) Concrete eval stack that pairs well with LangGraph

Pick one from each row to start; expand later.

  • Unit/component: pytest + DeepEval metrics. Pytest ergonomics with LLM metrics (relevancy, faithfulness, custom rubrics).
  • Tracing & scores: Logfire / LangSmith / Opik / Phoenix. Centralize traces, datasets, model‑as‑judge, human annotations, dashboards.
  • RAG‑specific: Ragas (via Opik/Phoenix integrations). Standard RAG metrics: answer relevancy, context precision/recall, faithfulness.
  • Benchmarks: WebArena / VisualWebArena, OSWorld, BrowserGym, AgentBench, Agent‑SafetyBench. Execution‑based eval in realistic environments plus safety coverage.

8) Example: end‑to‑end tests for the supervisor graph

Goal: verify the system files the right ticket summary and cites the refund policy.

# tests/test_end_to_end.py
import os
from deepeval import assert_test
from deepeval.metrics import GEval, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Suppose your app exposes a function run_supervisor(messages) -> str
from app.agents import run_supervisor

# 1) Quality rubric using an LLM-as-judge (GEval)
correctness = GEval(
    name="task_success",
    criteria=(
        "Does the final message include an actionable ticket id AND a brief, correct refund policy summary?"
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

# 2) Relevancy: ensure the final answer actually addresses the request
#    (for grounding against retrieval_context, add DeepEval's faithfulness metric)
relevancy = AnswerRelevancyMetric(threshold=0.7)

kb = "Customers get a full refund within 30 days at no extra cost."

def test_files_ticket_with_policy():
    output = run_supervisor([{"role": "user", "content": "Find refund policy and file a ticket."}])
    case = LLMTestCase(
        input="Find refund policy and file a ticket.",
        actual_output=output,
        retrieval_context=[kb],
    )
    assert_test(case, [correctness, relevancy])

Add numeric counters from your graph (steps, tool calls, handoffs) as plain pytest asserts. Example: no more than 6 steps for a pass.
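A hedged sketch of that budget assert (the field names are whatever your telemetry emits):

```python
def assert_budget(stats: dict, max_steps: int = 6, max_handoffs: int = 2) -> None:
    """Fail the test when the graph burned more steps/handoffs than budgeted."""
    assert stats["steps"] <= max_steps, f"too many steps: {stats['steps']}"
    assert stats["handoffs"] <= max_handoffs, f"too many handoffs: {stats['handoffs']}"

assert_budget({"steps": 5, "handoffs": 1})  # passes silently
```

Budgets like these catch "the agent still succeeds, but now takes 14 steps" regressions that rubric metrics miss.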


9) Measuring the graph itself (telemetry you should log)

Emit these per‑run, and wire alerts on the deltas:

  • steps_total, tool_calls_total, handoffs_total
  • invalid_tool_args_total, retries_total, timeouts_total
  • tsr (task success rate), latency_p95, cost_total
  • faithfulness_score, relevancy_score, toxicity_score

Once logged, you can: compare branches in CI, gate deploys on TSR ≥ threshold, auto‑rollback on regressions.
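A minimal deploy gate over those numbers might look like this (the thresholds and the 2-point noise band are assumptions to tune, not recommendations):

```python
def gate_deploy(baseline: dict, candidate: dict,
                tsr_floor: float = 0.85, max_cost_regress: float = 1.10) -> bool:
    """CI gate: block deploys on TSR regressions or cost blowups."""
    if candidate["tsr"] < tsr_floor:
        return False                                   # absolute quality floor
    if candidate["tsr"] < baseline["tsr"] - 0.02:
        return False                                   # allow a 2pt noise band
    return candidate["cost_total"] <= baseline["cost_total"] * max_cost_regress

print(gate_deploy({"tsr": 0.90, "cost_total": 100},
                  {"tsr": 0.88, "cost_total": 105}))   # True
```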


10) Production hardening checklist (no-nonsense edition)

  • Deterministic edges before you scale: avoid “ask the LLM who to call” unless you can afford loops.
  • Guard tools: strict JSON schemas, idempotency, and side‑effect fuses (rate limiters, circuit breakers).
  • HITL on money/legal/PII paths; make it boring to approve.
  • Time‑travel on by default: you will need to replay.
  • Eval everything in CI and on shadow traffic; never ship a prompt change untested.
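For the side-effect fuses item, a minimal circuit-breaker sketch (thresholds are illustrative; a production version also needs per-tool scoping and jittered retries):

```python
import time

class ToolFuse:
    """Tiny circuit breaker: trips after N failures, cools down, then retries."""
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.failures, self.tripped_at = 0, 0.0

    def allow(self) -> bool:
        """Is the guarded tool currently callable?"""
        if self.failures < self.max_failures:
            return True
        return (time.monotonic() - self.tripped_at) > self.cooldown_s

    def record(self, ok: bool) -> None:
        """Report a call outcome; any success resets the fuse."""
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            self.tripped_at = time.monotonic()
```

Wrap each external tool call in `if fuse.allow(): ...` so a flapping downstream API degrades one tool instead of looping the whole graph.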

11) Appendix: utilities you’ll paste a lot

Add memory to a prebuilt ReAct agent

from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import InMemorySaver
from langchain_openai import ChatOpenAI

agent = create_react_agent(
    model=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    tools=[...],
    checkpointer=InMemorySaver(),
)

Conditional routing with Command

from langgraph.types import Command

def router(state):
    # Returning Command from a node both routes and (optionally) updates state,
    # replacing a separate conditional edge.
    if "refund" in state["messages"][-1]["content"].lower():
        return Command(goto="refund_agent")
    return Command(goto="general_agent")

Replay or fork from a checkpoint (time‑travel)

# Given thread_id + checkpoint_id, pass None as input to replay up to that
# checkpoint; supply new input (or edit state first) to fork a fresh branch.
cfg = {"configurable": {"thread_id": "demo-1", "checkpoint_id": "abc-123"}}
result = supervisor.invoke(None, cfg)

12) Putting it all together

  1. Start with a one‑page graph spec: states, nodes, edges, success rubric.
  2. Build minimal agents + tools; wire checkpoints and HITL.
  3. Stand up end‑to‑end evals with a rubric and a few golden tasks.
  4. Add tracing + dashboards; gate deploys on TSR.
  5. Only then scale the number of agents. Most wins come from better tools + routing, not more bots.

If it’s hard to explain your graph to a new engineer in 5 minutes, it’s too clever. Simplify.


For the next layer up — making multiple agent systems interoperate across teams — see MCP + A2A: The Real Stack for Interoperable AI Agents. And for a reliability-first approach to shipping agents to production, check out the production AI agent reliability playbook.

Happy shipping. And yes—measure twice, deploy once.