Production Monitoring for Azure OpenAI: The Metrics That Actually Matter
— #Azure OpenAI#LLM Monitoring#KQL#Azure Monitor#Responsible AI#LLM Guardrails#MLOps
Deploying an LLM on Azure OpenAI is the easy part. Knowing whether it is actually working — safe, fast, cost-effective, and not hallucinating — is the hard part.
Traditional application monitoring does not cover it. You are not just tracking uptime and CPU. You are supervising a non-deterministic system that can generate toxic content, leak PII, or silently burn through your budget. That requires two layers of monitoring: operational health and responsible AI guardrails.
Here is what to track and how to set it up.
Part 1: Operational metrics
Your first priority is keeping the service available, fast, and affordable. Start by configuring Diagnostic settings on your Azure OpenAI resource to stream logs and metrics to a Log Analytics workspace. Without this, you have no observability.
API health and latency
You need to know if your API is responsive and whether errors are coming from your code (4xx) or from Azure (5xx). A sudden spike in 429s means you are hitting rate limits. A surge in 5xx errors means you need a support ticket.
Track these:
- Request volume — total calls over time for capacity planning
- Error rates — split by HTTP 4xx vs 5xx
- p95 and p99 latency — averages hide outliers that real users experience
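The 429 case also deserves handling on the client side, not just on a dashboard. A minimal sketch of exponential backoff with jitter — the function and exception names here are illustrative, not from any particular SDK:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error your client library raises."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn on rate-limit errors with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the 429 to the caller
            # 1s, 2s, 4s, ... plus jitter so clients don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

The jitter matters: without it, every client that hit the same rate limit retries at the same instant and triggers the next 429 together.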
KQL to get a 1-hour overview with per-model breakdown:
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| extend properties = todynamic(properties_s)
| extend modelDeploymentName = tostring(properties.modelDeploymentName)
| summarize count() by OperationName, bin(TimeGenerated, 5m), ResultSignature, modelDeploymentName
| render timechart
Token consumption and cost
Tokens are how you get billed. If you are not tracking them, you will be surprised by the invoice. The breakdown between prompt tokens and completion tokens tells you where the money goes — a summarization app has a very different cost profile than a code generation tool.
AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName in ("ProcessedPromptTokens", "GeneratedCompletionTokens")
| summarize Tokens = sum(Total) by MetricName, bin(TimeGenerated, 1h)
| render timechart
Watch for unexpected spikes. A prompt injection attack or a broken retry loop can burn thousands of tokens in minutes.
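Turning the two token counters into a spend figure is simple arithmetic. A sketch — the per-1K-token prices below are placeholders, not real rates; look up the current pricing for your model and region:

```python
# Placeholder per-1K-token prices -- illustrative only, check the current
# Azure OpenAI pricing page for your model and region.
PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}

def estimate_cost(prompt_tokens, completion_tokens, prices=PRICE_PER_1K):
    """Rough spend estimate from the two token counters tracked above."""
    return (prompt_tokens / 1000) * prices["prompt"] \
        + (completion_tokens / 1000) * prices["completion"]
```

Run this over the hourly sums from the query above and you can alert on spend, not just on raw token counts.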
Part 2: Responsible AI guardrails
Operational stability is half the problem. The other half is making sure the model behaves correctly. LLMs hallucinate, generate toxic content, and can leak sensitive data if inputs are not sanitized. You need automated guardrails, not just hope.
Input (prompt) analysis
Catch bad inputs before they reach the model. Azure's built-in Prompt Shields are a starting point, but they are not enough on their own.
What to scan for:
- PII in inputs — users paste emails, phone numbers, and support tickets into chatbots without thinking. Detect and redact before the model sees it.
- Prompt injection — attempts to override your system prompt ("Ignore all previous instructions and..."). These are real and increasingly sophisticated.
- Toxic or abusive inputs — protect the model from being used for harassment or abuse.
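The PII redaction step can be sketched in a few lines. The regexes below are deliberately simplistic stand-ins — a production system should use a dedicated PII detection service (for example, Azure AI Language PII detection) rather than hand-rolled patterns:

```python
import re

# Deliberately simple patterns -- illustrative only. Real PII detection
# needs a proper service, not regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before the model sees it."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders like `[EMAIL]` keep the prompt coherent for the model while ensuring the raw value never reaches it — or your logs.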
Output (response) analysis
This is your last line of defense. Every response the model generates should be evaluated before reaching the user.
What to check:
- Toxicity and harmful content — is the output offensive or unsafe?
- Relevance — did the model actually answer the question, or did it go off-topic?
- Hallucination — is the model inventing facts? This matters most in RAG systems where the model should be grounding its answers in retrieved documents.
RAG-specific metrics
If you are running a RAG application, basic relevance checks are not sufficient. You need to know if the model is using the retrieved context correctly.
- Faithfulness — does the answer contradict the source documents? High unfaithfulness means the model is ignoring the context you gave it.
- Contextual precision — of the retrieved documents, how many were actually relevant?
- Contextual recall — of all relevant documents that should have been retrieved, how many were? Poor recall means your retrieval pipeline is feeding the model incomplete information.
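Once you have relevance labels for each retrieved chunk, these two retrieval metrics reduce to simple set arithmetic — obtaining the labels (human annotation or an LLM judge) is the hard part:

```python
def contextual_precision(retrieved, relevant):
    """Fraction of retrieved documents that were actually relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def contextual_recall(retrieved, relevant):
    """Fraction of all relevant documents that the retriever surfaced."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)
```

For example, retrieving four documents of which two are in the relevant set gives a precision of 0.5; if the relevant set holds three documents, recall is 2/3 — the retriever missed one document the model needed.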
Implementing guardrails in code
Azure's content filtering is a start, but you will need programmatic guardrails for anything beyond basic toxicity. Libraries such as Guardrails AI and DeepEval, and platforms such as Langfuse, give you more control.
Here is a basic example using Guardrails AI to check for toxic output:
import guardrails as gd
from guardrails.hub import ToxicLanguage
# on_fail="fix" tells Guardrails to attempt a corrective re-prompt
guard = gd.Guard().use(ToxicLanguage, threshold=0.5, on_fail="fix")
try:
    validated_output = guard.parse(
        llm_output="This is some potentially problematic text.",
        metadata={"prompt": "User's original prompt"}
    )
    print(validated_output)
except Exception as e:
    print(f"Validation failed: {e}")
You can chain multiple validators together — toxicity, PII detection, JSON format validation, faithfulness checks — to build a multi-layered defense. For RAG-specific metrics like faithfulness and contextual precision, DeepEval and Ragas are solid choices.
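The chaining idea itself is library-agnostic: each validator either passes the text through or raises, and the response only reaches the user if the whole chain passes. A generic sketch of the pattern — not the Guardrails AI API, and the two validators are illustrative:

```python
class GuardrailViolation(Exception):
    """Raised when a response fails a validator and must be blocked."""

def check_max_length(text, limit=4000):
    # Illustrative validator: cap runaway responses.
    if len(text) > limit:
        raise GuardrailViolation("response too long")
    return text

def check_banned_terms(text, banned=("ssn", "password")):
    # Illustrative validator: crude sensitive-term blocklist.
    if any(term in text.lower() for term in banned):
        raise GuardrailViolation("banned term in output")
    return text

def run_guardrails(text, validators):
    """Apply validators in order; any raise blocks the response."""
    for validate in validators:
        text = validate(text)
    return text
```

Logging every `GuardrailViolation` gives you the guardrail failure rate mentioned in the alerting section below, which is itself a metric worth watching: a rising failure rate usually means either an attack or a regression in your prompts.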
Putting it together
The monitoring stack for a production Azure OpenAI deployment should look like this:
- Azure Monitor + Log Analytics for operational metrics (latency, errors, tokens)
- Azure Prompt Shields for basic input filtering
- Programmatic guardrails (Guardrails AI, DeepEval) for output validation
- RAG evaluation (Ragas, DeepEval) for retrieval quality
- Alerting on error rate spikes, token consumption anomalies, and guardrail failure rates
Set up dashboards for the KQL queries above. Create alerts on 5xx error rate > 1%, token spend > daily budget, and guardrail failure rate > 5%. Iterate from there.
Related reading
- Logfire vs LangSmith — for choosing an observability platform beyond Azure Monitor
- MLflow 3 and LLM observability — for framework-agnostic tracing and evaluation
- Production AI agent reliability playbook — for the reliability layer on top of monitoring
- Context engineering vs prompt engineering — for building better inputs to reduce the guardrail burden
Monitoring LLM applications is not the same as monitoring traditional APIs. The non-deterministic nature of these systems means you need both operational metrics and quality guardrails running continuously. Start with the KQL queries and basic guardrails above, then expand as your application matures.