MLflow 3 Has Cracked the LLM Observability Code
Tags: MLflow 3 Observability · LLM Monitoring Tools · GenAI Evaluation Platform · Open Source MLOps · Databricks MLflow Integration · OpenTelemetry · LLMOps
Most teams deploying LLM applications hit the same wall: something breaks in production, and debugging turns into archaeology. You are sifting through scattered logs trying to reconstruct what actually happened across retrievals, API calls, tool executions, and reasoning steps.
MLflow 3 (released June 2025, now at 3.4.0) solves this with a tracing-first approach to LLM observability. One-line setup for 20+ frameworks, full OpenTelemetry compatibility, custom evaluation scorers, and zero licensing cost. It is the open-source option that serious teams are adopting — 30+ million monthly downloads, AWS SageMaker support, and deep Databricks integration.
Here is what matters about it and how it compares to the alternatives.
How MLflow Tracing works
MLflow Tracing organizes execution data into Traces containing hierarchical Spans. A RAG application trace might have spans for: query preprocessing, vector database retrieval, document reranking, context assembly, LLM generation, and response formatting. Each span captures inputs, outputs, timestamps, parent-child relationships, and metadata like token counts.
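The hierarchy described above can be sketched as plain data. This is an illustrative model only; the `Span` class here is a stand-in for the idea, not MLflow's actual span type:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Toy stand-in for a trace span: inputs, outputs, parent link, children."""
    name: str
    inputs: dict
    outputs: dict
    parent: Optional["Span"] = None
    children: list = field(default_factory=list)

    def child(self, name, inputs, outputs):
        s = Span(name, inputs, outputs, parent=self)
        self.children.append(s)
        return s

# A RAG trace: one root span with pipeline stages nested underneath
root = Span("rag_request", {"query": "q"}, {})
retrieval = root.child("retrieval", {"query": "q"}, {"docs": 5})
generation = root.child("llm_generation", {"context": "..."}, {"tokens": 312})

assert [c.name for c in root.children] == ["retrieval", "llm_generation"]
```

Walking such a tree is what the MLflow UI does when it renders a trace: parent-child links reconstruct the full execution path.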
The architecture is split for efficiency:
```mermaid
graph LR
subgraph App["🤖 YOUR APP"]
direction TB
REQ["User Request"]
S1["📥 Span: Preprocessing"]
S2["🔍 Span: Retrieval"]
S3["🧠 Span: LLM Call"]
S4["📤 Span: Formatting"]
REQ --> S1 --> S2 --> S3 --> S4
end
subgraph Storage["💾 MLflow STORAGE"]
direction TB
TI["TraceInfo<br/>Relational DB<br/>(fast queries)"]
TD["TraceData<br/>Artifact Storage<br/>(S3 / Blob)"]
end
subgraph Export["📊 EXPORT"]
direction TB
UI["MLflow UI"]
DD["Datadog"]
NR["New Relic"]
JG["Jaeger"]
end
App -->|"async logging"| TI
App -->|"full spans"| TD
TI --> UI
TD --> UI
TI -->|"OTel Export"| DD
TI -->|"OTel Export"| NR
TI -->|"OTel Export"| JG
style REQ fill:#7E57C2,stroke:#4527A0,color:#fff
style S1 fill:#AB47BC,stroke:#6A1B9A,color:#fff
style S2 fill:#29B6F6,stroke:#0277BD,color:#fff
style S3 fill:#FF7043,stroke:#D84315,color:#fff
style S4 fill:#66BB6A,stroke:#2E7D32,color:#fff
style TI fill:#FFA726,stroke:#E65100,color:#fff
style TD fill:#FFB74D,stroke:#EF6C00,color:#fff
style UI fill:#FFEE58,stroke:#F9A825,color:#333
style DD fill:#26C6DA,stroke:#00838F,color:#fff
style NR fill:#4FC3F7,stroke:#0288D1,color:#fff
style JG fill:#00BCD4,stroke:#006064,color:#fff
```

- TraceInfo: lightweight metadata (trace ID, state, duration) in a relational database for fast querying
- TraceData: full execution details (complete span trees) in artifact storage like S3 or Azure Blob
This separation enables sub-second trace searches across millions of requests without database bloat.
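A minimal sketch of that split, with two in-memory dicts standing in for the relational database and the blob store (illustrative only, not MLflow internals):

```python
import json

trace_index = {}   # stands in for the relational DB (TraceInfo: fast metadata queries)
blob_store = {}    # stands in for artifact storage such as S3 (TraceData: full spans)

def log_trace(trace_id, state, duration_ms, spans):
    # Small metadata record goes to the index; the heavy span tree goes to blobs
    trace_index[trace_id] = {"state": state, "duration_ms": duration_ms}
    blob_store[f"traces/{trace_id}.json"] = json.dumps(spans)

log_trace("t1", "OK", 840, [{"name": "retrieval"}, {"name": "llm_call"}])
log_trace("t2", "ERROR", 120, [{"name": "retrieval"}])

# Filtering touches only the tiny index; span payloads load on demand
slow = [tid for tid, meta in trace_index.items() if meta["duration_ms"] > 500]
assert slow == ["t1"]
```

The design choice is the same one behind most trace backends: keep the queryable surface small so searches stay fast, and fetch the bulky payload only when a human opens a specific trace.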
Auto-tracing: the part that actually matters
When you call mlflow.openai.autolog(), MLflow patches the OpenAI SDK to intercept calls, create spans with proper metadata, and track token usage and costs — all transparently. This works for 20+ frameworks: OpenAI, Anthropic, Bedrock, Gemini, LangChain, LangGraph, LlamaIndex, DSPy, AutoGen, CrewAI, PydanticAI, smolagents, Instructor, LiteLLM, Ollama, and more.
For custom business logic, the @mlflow.trace decorator wraps functions to create spans automatically. Auto-traced and manual spans compose seamlessly — auto-traced spans nest correctly within manual spans, creating complete execution trees.
Production deployment
Three features matter for production:
- Async logging sends traces in background threads — no latency impact on request handling
- Configurable sampling (e.g., 10% of requests) controls throughput at scale
- mlflow-tracing package strips heavy dependencies for a 95% smaller footprint in serving environments
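The sampling bullet above can be sketched in a few lines. `should_record` is a hypothetical helper illustrating head sampling (decide at request start), not MLflow's actual sampler:

```python
import random

def should_record(sample_rate: float) -> bool:
    """Head sampling: per request, keep the trace with probability sample_rate."""
    return random.random() < sample_rate

# Boundary behavior is deterministic: rate 0 keeps nothing, rate 1 keeps everything
assert not any(should_record(0.0) for _ in range(1000))
assert all(should_record(1.0) for _ in range(1000))
```

At a 0.1 rate you keep roughly one trace in ten, which is usually enough to catch regressions while keeping storage and evaluation costs flat as traffic grows.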
The OpenTelemetry foundation means traces export to Jaeger, Zipkin, Datadog, New Relic, or any OTel collector. Set OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and you are done. No lock-in.
MLflow 3 vs Logfire vs LangFuse
Three tools dominate the LLM observability space in 2025. Each takes a different approach. I have worked with all three, and the choice depends on your stack and priorities.
| | MLflow 3 | Pydantic Logfire | LangFuse |
|---|---|---|---|
| Philosophy | Unified ML + GenAI platform | Developer experience first | LLM lifecycle management |
| Cost | Completely free, self-host | SaaS (10M events/mo free) | Free self-host, paid cloud |
| OTel | Compliant, export anywhere | Built on OTel end-to-end | OTel compatible |
| Best for | Teams with ML + GenAI, budget constraints, data sovereignty | Python/Pydantic stacks, fast setup | Regulated industries, prompt management, A/B testing |
| Weakness | Steeper learning curve, functional UI | Closed-source backend, SaaS only | Slower performance in benchmarks |
| Evals | Built-in scorers, LLM-as-judge, Guidelines | Pair with external tools | Built-in evaluation suite |
When to pick each:
- MLflow 3 if you need unified ML/GenAI, want zero SaaS cost, or need data sovereignty
- Logfire if you are on Pydantic/FastAPI, want the fastest setup, and are comfortable with SaaS. I wrote a detailed Logfire vs LangSmith comparison that covers this further.
- LangFuse if you need full open-source backend, robust prompt management, or are in a regulated industry
Custom evaluations with scorers
The evaluation framework is where MLflow 3 gets genuinely useful. The @scorer decorator turns any function into an evaluation metric:
```python
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def response_quality(outputs: str) -> Feedback:
    issues = []
    if len(outputs.split()) < 50:
        issues.append("Too short (minimum 50 words)")
    if "[source]" not in outputs:
        issues.append("Missing citations")
    if issues:
        return Feedback(value=False, rationale="; ".join(issues))
    return Feedback(value=True, rationale="Meets all criteria")
```

For non-technical stakeholders, Guidelines let you define evaluation criteria in plain English:
```python
import mlflow
from mlflow.genai.scorers import Guidelines

tone = "The response must maintain courteous, respectful tone throughout."
clarity = "The response must use clear, concise language avoiding jargon."

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="clarity", guidelines=clarity),
    ],
)
```

Behind the scenes, Guidelines uses an LLM-as-judge (GPT-4o by default) to assess compliance. Product managers can write evaluation criteria without touching code.
The iteration trick worth knowing
Generate traces once with a placeholder scorer, store them, then test scorer variations against stored traces without re-running the expensive application:
```python
import mlflow

# Generate traces once, running the expensive app with a placeholder scorer
initial_results = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_expensive_app,
    scorers=[lambda **kwargs: 1],  # Placeholder
)

# Store the resulting traces
traces = mlflow.search_traces(run_id=initial_results.run_id)

# Iterate rapidly on scorers — no predict_fn needed
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[my_refined_scorer],
)
```

This turns evaluation from a slow feedback loop into rapid experimentation. Discover a new quality metric three months after launch? Apply it to stored production traces without rerunning anything.
Production monitoring
Scorers defined during development deploy unchanged to production:
```python
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

scorers = [Safety(), custom_business_scorer, Guidelines(...)]

# Development
dev_results = mlflow.genai.evaluate(data=eval_dataset, scorers=scorers)

# Production — same scorers, registered and sampled at 10%
registered = [s.register(name=f"scorer_{i}") for i, s in enumerate(scorers)]
for scorer in registered:
    scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.1))
```

Databricks integration (for enterprise teams)
If you are already on Databricks, MLflow 3 integration is worth knowing about. Unity Catalog provides model governance with three-level namespaces (catalog.schema.model), fine-grained permissions, and full lineage tracking from raw data through deployment.
The LoggedModel entity (new in MLflow 3) links traces across environments, connects evaluation metrics across runs, and tracks code versions via Git commits. Deployment Jobs automate staged rollouts (1% → 10% → 100% traffic) with automatic rollback on error rate spikes.
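The staged-rollout behavior is easy to picture generically. `staged_rollout` and `error_rate_fn` below are hypothetical names for illustration; this is the canary pattern, not the Databricks Deployment Jobs API:

```python
def staged_rollout(stages, error_rate_fn, max_error_rate=0.05):
    """Generic canary logic: walk traffic through stages (percentages),
    rolling back as soon as the observed error rate spikes.
    Illustrative sketch only, not a Databricks API."""
    for pct in stages:
        observed = error_rate_fn(pct)   # error rate measured at this traffic level
        if observed > max_error_rate:
            return {"status": "rolled_back", "at_stage": pct, "error_rate": observed}
    return {"status": "deployed", "traffic": stages[-1]}

# Healthy model: errors stay low at every stage
assert staged_rollout([1, 10, 100], lambda pct: 0.01)["status"] == "deployed"
# Regression that only appears at 10% traffic triggers rollback there
assert staged_rollout([1, 10, 100],
                      lambda pct: 0.2 if pct >= 10 else 0.01)["at_stage"] == 10
```

The point of the small first stage is exactly the second case: problems that never showed in evaluation often surface only under a slice of real traffic, and 1% exposure limits the blast radius.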
For teams not on Databricks: none of this is required. Open-source MLflow works independently. The Databricks layer is enterprise acceleration, not a dependency.
What shipped in 3.1–3.4
MLflow 3 has been shipping fast since the June 2025 launch:
- 3.4 (Sept 2025): OpenTelemetry Metrics Export, MCP Server for AI-assisted experiment management, custom Judges API (make_judge), evaluation datasets management
- 3.3 (Aug 2025): Model Registry Webhooks for CI/CD, GenAI Evaluation Suite in open source, FastAPI replacing Flask as default server
- 3.2 (Aug 2025): TypeScript tracing SDK, Semantic Kernel integration, Feedback Tracking APIs in open source, PII masking in traces
Framework coverage now includes 20+ auto-tracing integrations: OpenAI, Anthropic, Bedrock, Gemini, LangChain, LangGraph, LlamaIndex, DSPy, AutoGen, CrewAI, PydanticAI, smolagents, Strands, Semantic Kernel, and more.
When MLflow 3 fits — and when it does not
Good fit:
- You have both traditional ML and GenAI workloads
- Budget matters (zero licensing cost)
- Data sovereignty or regulatory requirements
- You want OpenTelemetry portability
- You are on Databricks already
Not the best fit:
- GenAI-only team wanting fastest possible setup → consider Logfire
- Deep LangChain/LangGraph integration → consider LangSmith
- Robust prompt management and A/B testing → consider LangFuse
The practical adoption path: start with mlflow.openai.autolog() locally, add custom scorers, deploy tracing to staging with sampling, extend to production with the lightweight mlflow-tracing SDK.
Related reading
- Logfire vs LangSmith — the pragmatic playbook for choosing your observability stack
- Building multi-agent systems with LangGraph for the workflow layer that feeds into observability
- Azure OpenAI monitoring for Azure-specific operational metrics
- Production AI agent reliability playbook for the operational discipline layer
Observability is not optional for production LLM applications. MLflow 3 makes it accessible without vendor lock-in or SaaS costs. Whether it is the right tool for your team depends on your stack, but it is worth evaluating seriously.