Production Monitoring for Azure OpenAI: The Metrics That Actually Matter
— #Azure OpenAI#LLM Monitoring#KQL#Azure Monitor#Responsible AI#LLM Guardrails#MLOps
Deploying an LLM on Azure OpenAI is the easy part. Knowing whether it is actually working — safe, fast, cost-effective, and not hallucinating — is the hard part.
Traditional application monitoring does not cover it. You are not just tracking uptime and CPU. You are supervising a non-deterministic system that can generate toxic content, leak PII, or silently burn through your budget. That requires two layers of monitoring: operational health and responsible AI guardrails.
Here is what to track and how to set it up.
Part 1: Operational metrics
Your first priority is keeping the service available, fast, and affordable. Start by configuring Diagnostic settings on your Azure OpenAI resource to stream logs and metrics to a Log Analytics workspace. Without this, you have no observability.
API health and latency
You need to know if your API is responsive and whether errors are coming from your code (4xx) or from Azure (5xx). A sudden spike in 429s means you are hitting rate limits. A surge in 5xx errors means you need a support ticket.
Track these:
- Request volume — total calls over time for capacity planning
- Error rates — split by HTTP 4xx vs 5xx
- p95 and p99 latency — averages hide outliers that real users experience
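The 429 case also deserves handling on the client side, not just on a dashboard. A minimal sketch of exponential backoff with jitter — the function and exception names here are illustrative, not from any particular SDK:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error your client library raises."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn on rate-limit errors with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the 429 to the caller
            # 1s, 2s, 4s, ... plus jitter so clients don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

The jitter matters: without it, every client that hit the same rate limit retries at the same instant and triggers the next 429 together.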
KQL to get a 1-hour overview with per-model breakdown:
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| extend properties = todynamic(properties_s)
| extend modelDeploymentName = tostring(properties.modelDeploymentName)
| summarize count() by OperationName, bin(TimeGenerated, 5m), ResultSignature, modelDeploymentName
| render timechart
Token consumption and cost
Tokens are how you get billed. If you are not tracking them, you will be surprised by the invoice. The breakdown between prompt tokens and completion tokens tells you where the money goes — a summarization app has a very different cost profile than a code generation tool.
AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName in ("ProcessedPromptTokens", "GeneratedCompletionTokens")
| summarize Tokens = sum(Total) by MetricName, bin(TimeGenerated, 1h)
| render timechart
Watch for unexpected spikes. A prompt injection attack or a broken retry loop can burn thousands of tokens in minutes.
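Turning the two token counters into a spend figure is simple arithmetic. A sketch — the per-1K-token prices below are placeholders, not real rates; look up the current pricing for your model and region:

```python
# Placeholder per-1K-token prices -- illustrative only, check the current
# Azure OpenAI pricing page for your model and region.
PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}

def estimate_cost(prompt_tokens, completion_tokens, prices=PRICE_PER_1K):
    """Rough spend estimate from the two token counters tracked above."""
    return (prompt_tokens / 1000) * prices["prompt"] \
        + (completion_tokens / 1000) * prices["completion"]
```

Run this over the hourly sums from the query above and you can alert on spend, not just on raw token counts.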
Part 2: Responsible AI guardrails
Operational stability is half the problem. The other half is making sure the model behaves correctly. LLMs hallucinate, generate toxic content, and can leak sensitive data if inputs are not sanitized. You need automated guardrails, not just hope.
Input (prompt) analysis
Catch bad inputs before they reach the model. Azure's built-in Prompt Shields are a starting point, but they are not enough on their own.
What to scan for:
- PII in inputs — users paste emails, phone numbers, and support tickets into chatbots without thinking. Detect and redact before the model sees it.
- Prompt injection — attempts to override your system prompt ("Ignore all previous instructions and..."). These are real and increasingly sophisticated.
- Toxic or abusive inputs — protect the model from being used for harassment or abuse.
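The PII redaction step can be sketched in a few lines. The regexes below are deliberately simplistic stand-ins — a production system should use a dedicated PII detection service (for example, Azure AI Language PII detection) rather than hand-rolled patterns:

```python
import re

# Deliberately simple patterns -- illustrative only. Real PII detection
# needs a proper service, not regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before the model sees it."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders like `[EMAIL]` keep the prompt coherent for the model while ensuring the raw value never reaches it — or your logs.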
Output (response) analysis
This is your last line of defense. Every response the model generates should be evaluated before reaching the user.
What to check:
- Toxicity and harmful content — is the output offensive or unsafe?
- Relevance — did the model actually answer the question, or did it go off-topic?
- Hallucination — is the model inventing facts? This matters most in RAG systems where the model should be grounding its answers in retrieved documents.
RAG-specific metrics
If you are running a RAG application, basic relevance checks are not sufficient. You need to know if the model is using the retrieved context correctly.
- Faithfulness — does the answer contradict the source documents? High unfaithfulness means the model is ignoring the context you gave it.
- Contextual precision — of the retrieved documents, how many were actually relevant?
- Contextual recall — of all relevant documents that should have been retrieved, how many were? Poor recall means your retrieval pipeline is feeding the model incomplete information.
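Once you have relevance labels for each retrieved chunk, these two retrieval metrics reduce to simple set arithmetic — obtaining the labels (human annotation or an LLM judge) is the hard part:

```python
def contextual_precision(retrieved, relevant):
    """Fraction of retrieved documents that were actually relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def contextual_recall(retrieved, relevant):
    """Fraction of all relevant documents that the retriever surfaced."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)
```

For example, retrieving four documents of which two are in the relevant set gives a precision of 0.5; if the relevant set holds three documents, recall is 2/3 — the retriever missed one document the model needed.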
Implementing guardrails in code
Azure's content filtering is a start, but you will need programmatic guardrails for anything beyond basic toxicity. Libraries such as Guardrails AI and DeepEval, and platforms such as Langfuse, give you more control.
Here is a basic example using Guardrails AI to check for toxic output:
import guardrails as gd
from guardrails.hub import ToxicLanguage
# on_fail="fix" tells Guardrails to attempt a corrective re-prompt
guard = gd.Guard().use(ToxicLanguage, threshold=0.5, on_fail="fix")
try:
    validated_output = guard.parse(
        llm_output="This is some potentially problematic text.",
        metadata={"prompt": "User's original prompt"}
    )
    print(validated_output)
except Exception as e:
    print(f"Validation failed: {e}")
You can chain multiple validators together — toxicity, PII detection, JSON format validation, faithfulness checks — to build a multi-layered defense. For RAG-specific metrics like faithfulness and contextual precision, DeepEval and Ragas are solid choices.
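The chaining idea itself is library-agnostic: each validator either passes the text through or raises, and the response only reaches the user if the whole chain passes. A generic sketch of the pattern — not the Guardrails AI API, and the two validators are illustrative:

```python
class GuardrailViolation(Exception):
    """Raised when a response fails a validator and must be blocked."""

def check_max_length(text, limit=4000):
    # Illustrative validator: cap runaway responses.
    if len(text) > limit:
        raise GuardrailViolation("response too long")
    return text

def check_banned_terms(text, banned=("ssn", "password")):
    # Illustrative validator: crude sensitive-term blocklist.
    if any(term in text.lower() for term in banned):
        raise GuardrailViolation("banned term in output")
    return text

def run_guardrails(text, validators):
    """Apply validators in order; any raise blocks the response."""
    for validate in validators:
        text = validate(text)
    return text
```

Logging every `GuardrailViolation` gives you the guardrail failure rate mentioned in the alerting section below, which is itself a metric worth watching: a rising failure rate usually means either an attack or a regression in your prompts.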
Putting it together
The monitoring stack for a production Azure OpenAI deployment should look like this:
- Azure Monitor + Log Analytics for operational metrics (latency, errors, tokens)
- Azure Prompt Shields for basic input filtering
- Programmatic guardrails (Guardrails AI, DeepEval) for output validation
- RAG evaluation (Ragas, DeepEval) for retrieval quality
- Alerting on error rate spikes, token consumption anomalies, and guardrail failure rates
Set up dashboards for the KQL queries above. Create alerts on 5xx error rate > 1%, token spend > daily budget, and guardrail failure rate > 5%. Iterate from there.
Related reading
- Logfire vs LangSmith — for choosing an observability platform beyond Azure Monitor
- MLflow 3 and LLM observability — for framework-agnostic tracing and evaluation
- Production AI agent reliability playbook — for the reliability layer on top of monitoring
- Context engineering vs prompt engineering — for building better inputs to reduce the guardrail burden
Monitoring LLM applications is not the same as monitoring traditional APIs. The non-deterministic nature of these systems means you need both operational metrics and quality guardrails running continuously. Start with the KQL queries and basic guardrails above, then expand as your application matures.