
MLflow 3 Has Cracked the LLM Observability Code

Tags: MLflow 3 Observability, LLM Monitoring Tools, GenAI Evaluation Platform, Open Source MLOps, Databricks MLflow Integration, OpenTelemetry, LLMOps

Most teams deploying LLM applications hit the same wall: something breaks in production, and debugging turns into archaeology. You are sifting through scattered logs trying to reconstruct what actually happened across retrievals, API calls, tool executions, and reasoning steps.

MLflow 3 (released June 2025, now at 3.4.0) solves this with a tracing-first approach to LLM observability. One-line setup for 20+ frameworks, full OpenTelemetry compatibility, custom evaluation scorers, and zero licensing cost. It is the open-source option that serious teams are adopting — 30+ million monthly downloads, AWS SageMaker support, and deep Databricks integration.

Here is what matters about it and how it compares to the alternatives.

How MLflow Tracing works

MLflow Tracing organizes execution data into Traces containing hierarchical Spans. A RAG application trace might have spans for: query preprocessing, vector database retrieval, document reranking, context assembly, LLM generation, and response formatting. Each span captures inputs, outputs, timestamps, parent-child relationships, and metadata like token counts.

The architecture is split for efficiency:

graph LR
    subgraph App["🤖 YOUR APP"]
        direction TB
        REQ["User Request"]
        S1["📥 Span: Preprocessing"]
        S2["🔍 Span: Retrieval"]
        S3["🧠 Span: LLM Call"]
        S4["📤 Span: Formatting"]
        REQ --> S1 --> S2 --> S3 --> S4
    end

    subgraph Storage["💾 MLflow STORAGE"]
        direction TB
        TI["TraceInfo<br/>Relational DB<br/>(fast queries)"]
        TD["TraceData<br/>Artifact Storage<br/>(S3 / Blob)"]
    end

    subgraph Export["📊 EXPORT"]
        direction TB
        UI["MLflow UI"]
        DD["Datadog"]
        NR["New Relic"]
        JG["Jaeger"]
    end

    App -->|"async logging"| TI
    App -->|"full spans"| TD
    TI --> UI
    TD --> UI
    TI -->|"OTel Export"| DD
    TI -->|"OTel Export"| NR
    TI -->|"OTel Export"| JG

    style REQ fill:#7E57C2,stroke:#4527A0,color:#fff
    style S1 fill:#AB47BC,stroke:#6A1B9A,color:#fff
    style S2 fill:#29B6F6,stroke:#0277BD,color:#fff
    style S3 fill:#FF7043,stroke:#D84315,color:#fff
    style S4 fill:#66BB6A,stroke:#2E7D32,color:#fff
    style TI fill:#FFA726,stroke:#E65100,color:#fff
    style TD fill:#FFB74D,stroke:#EF6C00,color:#fff
    style UI fill:#FFEE58,stroke:#F9A825,color:#333
    style DD fill:#26C6DA,stroke:#00838F,color:#fff
    style NR fill:#4FC3F7,stroke:#0288D1,color:#fff
    style JG fill:#00BCD4,stroke:#006064,color:#fff

  • TraceInfo: lightweight metadata (trace ID, state, duration) in a relational database for fast querying
  • TraceData: full execution details (complete span trees) in artifact storage like S3 or Azure Blob

This separation enables sub-second trace searches across millions of requests without database bloat.

Auto-tracing: the part that actually matters

When you call mlflow.openai.autolog(), MLflow patches the OpenAI SDK to intercept calls, create spans with proper metadata, and track token usage and costs — all transparently. This works for 20+ frameworks: OpenAI, Anthropic, Bedrock, Gemini, LangChain, LangGraph, LlamaIndex, DSPy, AutoGen, CrewAI, PydanticAI, smolagents, Instructor, LiteLLM, Ollama, and more.

For custom business logic, the @mlflow.trace decorator wraps functions to create spans automatically. Auto-traced and manual spans compose seamlessly — auto-traced spans nest correctly within manual spans, creating complete execution trees.

Production deployment

Three features matter for production:

  1. Async logging sends traces in background threads — no latency impact on request handling
  2. Configurable sampling (e.g., 10% of requests) controls throughput at scale
  3. mlflow-tracing package strips heavy dependencies for a 95% smaller footprint in serving environments

The OpenTelemetry foundation means traces export to Jaeger, Zipkin, Datadog, New Relic, or any OTel collector. Set OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and you are done. No lock-in.
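A sketch of the environment-driven setup. The endpoint URL is a placeholder, and MLFLOW_TRACE_SAMPLING_RATIO is my reading of MLflow's environment-variable reference, so verify the name against your version:

```python
import os

# Route spans to any OTLP-compatible collector (placeholder endpoint)
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "http://localhost:4317/v1/traces"

# Keep 10% of traces, matching the sampling example above
os.environ["MLFLOW_TRACE_SAMPLING_RATIO"] = "0.1"
```

Set these before the first traced call so the exporter picks them up at initialization.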

MLflow 3 vs Logfire vs LangFuse

Three tools dominate the LLM observability space in 2025. Each takes a different approach. I have worked with all three, and the choice depends on your stack and priorities.

|  | MLflow 3 | Pydantic Logfire | LangFuse |
|---|---|---|---|
| Philosophy | Unified ML + GenAI platform | Developer experience first | LLM lifecycle management |
| Cost | Completely free, self-host | SaaS (10M events/mo free) | Free self-host, paid cloud |
| OTel | Compliant, export anywhere | Built on OTel end-to-end | OTel compatible |
| Best for | Teams with ML + GenAI, budget constraints, data sovereignty | Python/Pydantic stacks, fast setup | Regulated industries, prompt management, A/B testing |
| Weakness | Steeper learning curve, functional UI | Closed-source backend, SaaS only | Slower performance in benchmarks |
| Evals | Built-in scorers, LLM-as-judge, Guidelines | Pair with external tools | Built-in evaluation suite |

When to pick each:

  • MLflow 3 if you need unified ML/GenAI, want zero SaaS cost, or need data sovereignty
  • Logfire if you are on Pydantic/FastAPI, want the fastest setup, and are comfortable with SaaS. I wrote a detailed Logfire vs LangSmith comparison that covers this further.
  • LangFuse if you need full open-source backend, robust prompt management, or are in a regulated industry

Custom evaluations with scorers

The evaluation framework is where MLflow 3 gets genuinely useful. The @scorer decorator turns any function into an evaluation metric:

from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def response_quality(outputs: str) -> Feedback:
    issues = []
    if len(outputs.split()) < 50:
        issues.append("Too short (minimum 50 words)")
    if "[source]" not in outputs:
        issues.append("Missing citations")

    if issues:
        return Feedback(value=False, rationale="; ".join(issues))
    return Feedback(value=True, rationale="Meets all criteria")

For non-technical stakeholders, Guidelines let you define evaluation criteria in plain English:

import mlflow
from mlflow.genai.scorers import Guidelines

tone = "The response must maintain courteous, respectful tone throughout."
clarity = "The response must use clear, concise language avoiding jargon."

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="clarity", guidelines=clarity)
    ]
)

Behind the scenes, Guidelines use an LLM-as-judge (default GPT-4o) to assess compliance. Product managers can write evaluation criteria without touching code.

The iteration trick worth knowing

Generate traces once with a placeholder scorer, store them, then test scorer variations against stored traces without re-running the expensive application:

import mlflow

# Generate traces once
initial_results = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_expensive_app,
    scorers=[lambda **kwargs: 1]  # Placeholder
)

# Store traces
traces = mlflow.search_traces(run_id=initial_results.run_id)

# Iterate rapidly on scorers — no predict_fn needed
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[my_refined_scorer]
)

This turns evaluation from a slow feedback loop into rapid experimentation. Discover a new quality metric three months after launch? Apply it to stored production traces without rerunning anything.

Production monitoring

Scorers defined during development deploy unchanged to production:

import mlflow
from mlflow.genai.scorers import Guidelines, Safety, ScorerSamplingConfig

scorers = [Safety(), custom_business_scorer, Guidelines(...)]

# Development
dev_results = mlflow.genai.evaluate(data=eval_dataset, scorers=scorers)

# Production — same scorers, with sampling
registered = [s.register(name=f"scorer_{i}") for i, s in enumerate(scorers)]
for scorer in registered:
    scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.1))

Databricks integration (for enterprise teams)

If you are already on Databricks, MLflow 3 integration is worth knowing about. Unity Catalog provides model governance with three-level namespaces (catalog.schema.model), fine-grained permissions, and full lineage tracking from raw data through deployment.

The LoggedModel entity (new in MLflow 3) links traces across environments, connects evaluation metrics across runs, and tracks code versions via Git commits. Deployment Jobs automate staged rollouts (1% → 10% → 100% traffic) with automatic rollback on error rate spikes.

For teams not on Databricks: none of this is required. Open-source MLflow works independently. The Databricks layer is enterprise acceleration, not a dependency.

What shipped in 3.1–3.4

MLflow 3 has been shipping fast since the June 2025 launch:

  • 3.4 (Sept 2025): OpenTelemetry Metrics Export, MCP Server for AI-assisted experiment management, custom Judges API (make_judge), evaluation datasets management
  • 3.3 (Aug 2025): Model Registry Webhooks for CI/CD, GenAI Evaluation Suite in open source, FastAPI replacing Flask as default server
  • 3.2 (Aug 2025): TypeScript tracing SDK, Semantic Kernel integration, Feedback Tracking APIs in open source, PII masking in traces

Framework coverage now includes 20+ auto-tracing integrations: OpenAI, Anthropic, Bedrock, Gemini, LangChain, LangGraph, LlamaIndex, DSPy, AutoGen, CrewAI, PydanticAI, smolagents, Strands, Semantic Kernel, and more.

When MLflow 3 fits — and when it does not

Good fit:

  • You have both traditional ML and GenAI workloads
  • Budget matters (zero licensing cost)
  • Data sovereignty or regulatory requirements
  • You want OpenTelemetry portability
  • You are on Databricks already

Not the best fit:

  • GenAI-only team wanting fastest possible setup → consider Logfire
  • Deep LangChain/LangGraph integration → consider LangSmith
  • Robust prompt management and A/B testing → consider LangFuse

The practical adoption path: start with mlflow.openai.autolog() locally, add custom scorers, deploy tracing to staging with sampling, extend to production with the lightweight mlflow-tracing SDK.


Observability is not optional for production LLM applications. MLflow 3 makes it accessible without vendor lock-in or SaaS costs. Whether it is the right tool for your team depends on your stack, but it is worth evaluating seriously.