Day 21: Observability for AI Agents - Monitoring, Logging, and Debugging Production Agents
Today's deep-dive: How do you know if your AI agent is working correctly? What metrics should you track? How do you debug when things go wrong?
The answer: Observability - the practice of inferring what your system is doing internally from the outputs it emits (logs, metrics, and traces). For AI agents this matters even more, because they make autonomous decisions that are not always deterministic.
Why Observability Matters for Agents
The Opaque Decision Problem
Unlike traditional software, where you can trace a fixed execution flow, AI agents make probabilistic decisions:
Traditional software:
User clicks "Submit"
→ Form validation runs
→ Database query executes
→ Result displayed
AI agent:
User requests "Organize my files"
→ Agent considers 23 possible actions
→ LLM scores each by relevance (0.72-0.94 range)
→ Agent decides to: 1) Read file list 2) Categorize by date 3) Move 17 files
→ Each step involves multiple model calls
The challenge: How do you debug when the agent's reasoning doesn't match expectations?
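One way to make the agent's reasoning inspectable after the fact is to record each step as a structured trace event. The following is a minimal sketch under assumptions: `DecisionTraceEvent`, `DecisionTrace`, and all field names are hypothetical and not part of any particular agent framework.

```typescript
// Hypothetical trace event: one record per step the agent takes.
interface DecisionTraceEvent {
  traceId: string;    // groups all steps that belong to one user request
  step: number;       // 1-based position of this step within the trace
  action: string;     // what the agent chose to do
  rationale: string;  // the model's stated reason, stored verbatim
  modelCalls: number; // how many model calls this step needed
  timestamp: string;  // ISO 8601
}

class DecisionTrace {
  private readonly traceId: string;
  private events: DecisionTraceEvent[] = [];

  constructor(traceId: string) {
    this.traceId = traceId;
  }

  // Append one step and return the stored event.
  record(action: string, rationale: string, modelCalls: number): DecisionTraceEvent {
    const event: DecisionTraceEvent = {
      traceId: this.traceId,
      step: this.events.length + 1,
      action,
      rationale,
      modelCalls,
      timestamp: new Date().toISOString(),
    };
    this.events.push(event);
    return event;
  }

  // One JSON object per line, so log tooling can filter by traceId.
  toJsonLines(): string {
    return this.events.map(e => JSON.stringify(e)).join("\n");
  }
}

const trace = new DecisionTrace("req-001");
trace.record("read_file_list", "Need the current files before categorizing", 1);
trace.record("categorize_by_date", "User asked to organize; dates are available", 2);
console.log(trace.toJsonLines());
```

With a trace like this, "why did the agent move those files?" becomes a query over recorded steps rather than a guess.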
Key Agent Observability Signals
1. Decision Quality Metrics
Confidence Scores
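Track the model's self-reported confidence for each decision. Self-reported scores are not guaranteed to be calibrated, so treat them as a relative signal and watch for shifts over time: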
interface AgentDecisionMetrics {
// When the decision was made (ISO 8601); getQualityTrend filters on this
timestamp: string;
// Decision confidence (0-1)
confidence: number;
// What alternatives were considered
alternatives: Array<{
action: string;
score: number;
reason: string;
}>;
// How long reasoning took
reasoningDurationMs: number;
// Context tokens processed
contextTokens: number;
}
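// Mean of a list of numbers (0 for an empty list); used by getQualityTrend
const avg = (xs: number[]): number =>
xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
// Track decision quality over time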
class DecisionQualityTracker {
private decisions: AgentDecisionMetrics[] = [];
recordDecision(decision: AgentDecisionMetrics): void {
this.decisions.push(decision);
// Alert on low confidence
if (decision.confidence < 0.6) {
this.alertLowConfidence(decision);
}
// Alert when any considered alternative scored above 0.9, which suggests a close call worth reviewing
if (decision.alternatives.some(a => a.score > 0.9)) {
this.alertHighAlternativeScore(decision);
}
}
// Calculate decision quality score over time
getQualityTrend(days: number): TrendAnalysis {
const recent = this.decisions.filter(
d => Date.parse(d.timestamp) > Date.now() - days * 24 * 60 * 60 * 1000
);
return {
avgConfidence: avg(recent.map(d => d.confidence)),
avgReasoningTime: avg(recent.map(d => d.reasoningDurationMs)),
trend: this.calculateTrend(recent.map(d => d.confidence)),
anomalies: this.detectAnomalies(recent)
};
}
}
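The tracker above calls `calculateTrend` without defining it. One possible implementation, sketched here as an assumption rather than the author's method, fits a least-squares line through the confidence values and classifies its slope. The `tolerance` dead band and the three labels are hypothetical choices.

```typescript
type TrendDirection = "improving" | "stable" | "degrading";

// Least-squares slope of the values against their index (0, 1, 2, ...).
function slope(values: number[]): number {
  const n = values.length;
  if (n < 2) return 0; // a single point has no trend
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < n; i++) {
    numerator += (i - meanX) * (values[i] - meanY);
    denominator += (i - meanX) ** 2;
  }
  return numerator / denominator;
}

// `tolerance` is an assumed dead band so that small noise reads as "stable".
function calculateTrend(values: number[], tolerance = 0.005): TrendDirection {
  const s = slope(values);
  if (s > tolerance) return "improving";
  if (s < -tolerance) return "degrading";
  return "stable";
}

// Confidence falling by about 0.035 per decision
console.log(calculateTrend([0.82, 0.80, 0.77, 0.71, 0.69])); // "degrading"
```

A slope is less sensitive to a single bad decision than comparing the first and last values, which is why it is a common first choice for trend detection.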
Latency Patterns
Monitor how long different tasks take:
interface TaskLatencyProfile {
taskType: string;
avgLatencyMs: number;
p50LatencyMs: number;
p95LatencyMs: number;
p99LatencyMs: number;
successRate: number;
retries: number;
}
class LatencyMonitor {
private taskLatencies = new Map<string, number[]>();
recordLatency(taskType: string, latencyMs: number): void {
const latencies = this.taskLatencies.get(taskType) || [];
// Check for a spike against prior samples only, before the new sample is stored,
// so an outlier cannot inflate the baseline it is compared with
const spike = this.isLatencySpike(taskType, latencyMs);
latencies.push(latencyMs);
// Keep rolling window of last 1000 samples
if (latencies.length > 1000) {
latencies.shift();
}
this.taskLatencies.set(taskType, latencies);
if (spike) {
this.alertLatencySpike(taskType, latencyMs);
}
}
private isLatencySpike(taskType: string, current: number): boolean {
const historical = this.taskLatencies.get(taskType) || [];
if (historical.length < 50) return false;
const avg = historical.reduce((a, b) => a + b, 0) / historical.length;
const std = Math.sqrt(
historical.reduce((a, b) => a + Math.pow(b - avg, 2), 0) / historical.length
);
return current > avg + 3 * std;
}
}
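`TaskLatencyProfile` lists p50, p95, and p99 fields, but the monitor above never computes them. A minimal sketch using the nearest-rank percentile method follows; `percentile`, `summarizeLatency`, and `LatencySummary` are hypothetical names, and other percentile definitions (for example, linear interpolation) would give slightly different values.

```typescript
interface LatencySummary {
  avgLatencyMs: number;
  p50LatencyMs: number;
  p95LatencyMs: number;
  p99LatencyMs: number;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort on a copy
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

function summarizeLatency(samples: number[]): LatencySummary {
  const total = samples.reduce((a, b) => a + b, 0);
  return {
    avgLatencyMs: total / samples.length,
    p50LatencyMs: percentile(samples, 50),
    p95LatencyMs: percentile(samples, 95),
    p99LatencyMs: percentile(samples, 99),
  };
}

console.log(summarizeLatency([120, 95, 3400, 110, 105, 130, 98, 101, 115, 125]));
```

Tail percentiles matter for agents because a single slow model call can dominate a multi-step task while barely moving the average.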
2. Context Understanding Signals
Token Usage Tracking
Monitor tokens consumed per interaction:
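// Running totals per session. This is a different shape from TokenUsage,
// which describes a single interaction (total, input, output, cost, breakdown).
interface SessionTotals {
totalTokens: number;
inputTokens: number;
outputTokens: number;
breakdown: Map<string, number>;
cost: number;
}
class TokenUsageTracker {
private sessionTokens = new Map<string, SessionTotals>();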
recordTokenUsage(sessionId: string, tokens: TokenUsage): void {
const session = this.sessionTokens.get(sessionId) || {
totalTokens: 0,
inputTokens: 0,
outputTokens: 0,
breakdown: new Map<string, number>(),
cost: 0
};
session.totalTokens += tokens.total;
session.inputTokens += tokens.input;
session.outputTokens += tokens.output;
session.cost += tokens.cost;
for (const [component, count] of tokens.breakdown) {
session.breakdown.set(
component,
(session.breakdown.get(component) || 0) + count
);
}
this.sessionTokens.set(sessionId, session);
// Alert on budget thresholds
if (session.cost > 10.00) {
this.alertBudgetWarning(sessionId, session.cost);
}
}
getTokenBreakdown(sessionId: string): ComponentBreakdown {
const session = this.sessionTokens.get(sessionId);
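// Unknown session: return an all-zero breakdown rather than an empty object
if (!session) return { planning: 0, toolCalls: 0, memoryAccess: 0, output: 0, total: 0 };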
return {
planning: session.breakdown.get('planning') || 0,
toolCalls: session.breakdown.get('tools') || 0,
memoryAccess: session.breakdown.get('memory') || 0,
output: session.breakdown.get('output') || 0,
total: session.totalTokens
};
}
}
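The tracker above assumes each `TokenUsage` record already carries a `cost`. If your provider only reports token counts, the cost can be derived from per-token prices. This is a sketch under assumptions: `ModelPricing`, `estimateCostUsd`, and `breakdownShares` are hypothetical names, and the prices in the usage example are placeholders, not real rates.

```typescript
// Assumed pricing shape: USD per one million tokens.
interface ModelPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  pricing: ModelPricing
): number {
  return (
    (inputTokens * pricing.inputPerMillionUsd +
      outputTokens * pricing.outputPerMillionUsd) /
    1_000_000
  );
}

// Fraction of a session's tokens spent in each component (values sum to 1).
function breakdownShares(breakdown: Map<string, number>): Record<string, number> {
  const entries = Array.from(breakdown.entries());
  const total = entries.reduce((sum, [, count]) => sum + count, 0);
  if (total === 0) return {};
  return Object.fromEntries(
    entries.map(([component, count]): [string, number] => [component, count / total])
  );
}

// Placeholder prices for illustration only
const pricing: ModelPricing = { inputPerMillionUsd: 2, outputPerMillionUsd: 8 };
console.log(estimateCostUsd(1_000_000, 500_000, pricing)); // 6
console.log(breakdownShares(new Map([["planning", 300], ["tools", 500], ["output", 200]])));
```

Shares are often more useful than raw counts on a dashboard: a planning share that creeps upward suggests the agent is spending more effort deciding than doing.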
Memory Access Patterns
Understand how the agent's memory system is being used:
interface MemoryAccessLog {
timestamp: string;
accessType: 'read' | 'write' | 'delete';
memoryType: 'episodic' | 'semantic' | 'procedural' | 'working';
relevanceScore?: number;
retrievedItems: string[];
decisionJustification?: string;
}
class MemoryObservability {
private accessLogs: MemoryAccessLog[] = [];
logAccess(log: MemoryAccessLog): void {
this.accessLogs.push(log);
// Detect unusual patterns
if (this.isMemoryStarvation(log)) {
this.alertMemoryStarvation(log);
}
if (this.isMemoryLeak(log)) {
this.alertMemoryLeak(log);
}
}
private isMemoryStarvation(log: MemoryAccessLog): boolean {
// High frequency reads with low relevance
const recentReads = this.accessLogs.filter(
a => a.accessType === 'read' &&
Date.parse(log.timestamp) - Date.parse(a.timestamp) < 60 * 1000
);
const lowRelevance = recentReads.filter(a =>
!a.relevanceScore || a.relevanceScore < 0.3
).length;
return lowRelevance > recentReads.length * 0.8;
}
}
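`logAccess` also calls `isMemoryLeak`, which is not shown above. What counts as a leak depends on the memory system; the sketch below assumes one plausible definition, where writes outpace deletes by more than a threshold within a time window. The names, the 10-minute window, and the threshold of 100 are all hypothetical.

```typescript
// Reduced form of MemoryAccessLog with only the fields this check needs.
interface MemoryAccess {
  timestamp: string; // ISO 8601
  accessType: "read" | "write" | "delete";
}

// Net growth = writes minus deletes within the window that ends at `now`.
function netMemoryGrowth(logs: MemoryAccess[], now: number, windowMs: number): number {
  let growth = 0;
  for (const log of logs) {
    const ageMs = now - Date.parse(log.timestamp);
    if (ageMs < 0 || ageMs > windowMs) continue; // outside the window
    if (log.accessType === "write") growth += 1;
    if (log.accessType === "delete") growth -= 1;
  }
  return growth;
}

// Assumed definition: memory is "leaking" when it grows by more than
// `maxNetGrowth` items inside the window without being pruned.
function isMemoryLeak(
  logs: MemoryAccess[],
  now: number,
  windowMs = 10 * 60 * 1000,
  maxNetGrowth = 100
): boolean {
  return netMemoryGrowth(logs, now, windowMs) > maxNetGrowth;
}
```

A check like this only sees what the access log records, so it reports unpruned growth in the agent's stores, not process-level memory leaks.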
3. Action Outcome Signals
Success Rate Tracking
Monitor whether agent actions achieve intended outcomes:
interface ActionOutcome {
// When the action finished (ISO 8601); detectActionLoop filters on this
timestamp: string;
actionType: string;
intent: string;
success: boolean;
actualOutcome: string;
expectedOutcome: string;
feedbackSignal?: string;
latencyMs: number;
}
class ActionOutcomeTracker {
private outcomes: ActionOutcome[] = [];
recordOutcome(outcome: ActionOutcome): void {
this.outcomes.push(outcome);
// Track success rate by action type
this.updateSuccessRate(outcome);
// Detect action failures
if (!outcome.success) {
this.analyzeFailure(outcome);
}
// Check for action loops (same action repeated with failures)
if (this.detectActionLoop(outcome)) {
this.alertActionLoop(outcome);
}
}
private detectActionLoop(outcome: ActionOutcome): boolean {
const recentFailures = this.outcomes.filter(o =>
o.actionType === outcome.actionType &&
!o.success &&
Date.parse(outcome.timestamp) - Date.parse(o.timestamp) < 300 * 1000
);
// Alert on 3 or more failures of the same action type within 5 minutes
// (counts failures in the window; they need not be consecutive)
return recentFailures.length >= 3;
}
getActionSuccessRates(): Record<string, number> {
const byType = new Map<string, { success: number; total: number }>();
for (const outcome of this.outcomes) {
const stats = byType.get(outcome.actionType) || { success: 0, total: 0 };
stats.total += 1;
if (outcome.success) stats.success += 1;
byType.set(outcome.actionType, stats);
}
// Success rate per action type as a percentage (0-100), not a 0-1 fraction
return Object.fromEntries(
Array.from(byType.entries()).map(([type, stats]) => [
type,
(stats.success / stats.total) * 100
])
);
}
}
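`detectActionLoop` counts failures inside a time window. If you want a strictly consecutive check instead, where any success of the same action type resets the count, a sketch could look like the following. `Outcome` and `failureStreak` are hypothetical names, and the history is assumed to be ordered oldest to newest.

```typescript
// Reduced form of ActionOutcome with only the fields this check needs.
interface Outcome {
  actionType: string;
  success: boolean;
}

// Length of the current run of failures for `actionType`, counted back
// from the most recent outcome. A success of that type ends the run;
// outcomes of other action types are skipped.
function failureStreak(outcomes: Outcome[], actionType: string): number {
  let streak = 0;
  for (let i = outcomes.length - 1; i >= 0; i--) {
    const outcome = outcomes[i];
    if (outcome.actionType !== actionType) continue;
    if (outcome.success) break;
    streak += 1;
  }
  return streak;
}

const history: Outcome[] = [
  { actionType: "move_file", success: false },
  { actionType: "move_file", success: true },
  { actionType: "move_file", success: false },
  { actionType: "read_dir", success: true },
  { actionType: "move_file", success: false },
  { actionType: "move_file", success: false },
];
console.log(failureStreak(history, "move_file")); // 3
```

The two checks answer different questions: the windowed count catches a flaky action, while the streak catches an agent that is stuck retrying the same thing.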
Building an Observability Dashboard
Dashboard Components
Real-Time Agent Health Panel:
interface AgentHealthMetrics {
active: boolean;
decisionsPerMinute: number;
avgDecisionConfidence: number;
currentTask: string | null;
memoryUsage: {
workingSet: number;
episodicCount: number;
semanticCount: number;
};
tokenBudgetRemaining: number;
tokenBudgetTotal: number;
}
class AgentHealthDashboard {
renderMetrics(metrics: AgentHealthMetrics): React.ReactNode {
return (
<div className="dashboard-panel">
<h3>Agent Health</h3>
<HealthStatus status={metrics.active ? 'active' : 'idle'} />
<ActivityGraph rate={metrics.decisionsPerMinute} />
<ConfidenceMeter value={metrics.avgDecisionConfidence} />
<TaskIndicator task={metrics.currentTask} />
<MemoryUsage {...metrics.memoryUsage} />
<TokenBudget
remaining={metrics.tokenBudgetRemaining}
total={metrics.tokenBudgetTotal}
/>
</div>
);
}
}
Historical Performance Analysis:
class HistoricalAnalysis {
async generateWeeklyReport(startDate: Date, endDate: Date): Promise<AgentReport> {
const dailyData = await this.collectDailyData(startDate, endDate);
return {
summary: {
totalInteractions: dailyData.reduce((sum, day) => sum + day.interactions, 0),
avgConfidence: this.calculateAvg(dailyData, 'confidence'),
avgLatency: this.calculateAvg(dailyData, 'latency'),
successRate: this.calculateSuccessRate(dailyData),
costTotal: dailyData.reduce((sum, day) => sum + day.cost, 0),
},
trends: {
confidence: this.analyzeTrend(dailyData.map(d => d.confidence)),
latency: this.analyzeTrend(dailyData.map(d => d.latency)),
usage: this.analyzeTrend(dailyData.map(d => d.interactions)),
},
insights: this.generateInsights(dailyData),
recommendations: this.generateRecommendations(dailyData),
};
}
private generateInsights(data: DailyMetrics[]): Insight[] {
const insights: Insight[] = [];
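// Confidence degradation: compares the first 7 days with the remaining days,
// so this only runs when the date range spans more than one week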
if (data.length > 7) {
const firstWeek = data.slice(0, 7);
const secondWeek = data.slice(7);
const firstAvg = this.calculateAvg(firstWeek, 'confidence');
const secondAvg = this.calculateAvg(secondWeek, 'confidence');
if (secondAvg < firstAvg - 0.1) {
insights.push({
type: 'warning',
message: 'Decision confidence has decreased over the past week',
severity: 'medium',
suggestion: 'Review recent interactions for patterns causing lower confidence'
});
}
}
// Days with high average decision latency (above 5000 ms)
const highLatencyDays = data.filter(d => d.latency > 5000);
if (highLatencyDays.length > 0) {
insights.push({
type: 'info',
message: `${highLatencyDays.length} days had high decision latency`,
severity: 'low',
suggestion: 'Consider optimizing frequently-used action paths'
});
}
return insights;
}
}
Alerting Strategies
Warning Conditions:
interface AlertConfig {
condition: string;
threshold: number | string;
windowMs: number;
message: string;
severity: 'info' | 'warning' | 'critical';
}
class AgentAlerts {
private alertConfigs: AlertConfig[] = [
{
condition: 'low_confidence',
threshold: 0.5,
windowMs: 3600000, // 1 hour
message: 'Agent making low-confidence decisions',
severity: 'warning'
},
{
condition: 'high_latency',
threshold: 10000, // 10 seconds
windowMs: 300000, // 5 minutes
message: 'Agent response time exceeding threshold',
severity: 'warning'
},
{
condition: 'failure_rate',
threshold: 0.2, // 20% failure rate
windowMs: 3600000,
message: 'Agent action failure rate above threshold',
severity: 'critical'
},
{
condition: 'token_budget',
threshold: '10%', // Below 10% remaining
windowMs: 86400000, // 24 hours
message: 'Token budget consumption rate high',
severity: 'warning'
},
];
configureAlerts(configs: AlertConfig[]): void {
this.alertConfigs = configs;
// Apply to monitoring system
this.setupMonitoring(configs);
}
async evaluateAlerts(metrics: RealTimeMetrics): Promise<Alert[] | null> {
const alerts: Alert[] = [];
for (const config of this.alertConfigs) {
if (await this.conditionMet(config, metrics)) {
alerts.push({
type: config.condition,
severity: config.severity,
message: config.message,
timestamp: new Date(),
thresholds: {
value: this.getMetric(config.condition, metrics),
threshold: config.threshold,
windowMs: config.windowMs
},
recommendedAction: this.getRecommendedAction(config.condition)
});
}
}
return alerts.length > 0 ? alerts : null;
}
private getRecommendedAction(condition: string): string {
const actions: Record<string, string> = {
'low_confidence': 'Review decision logs and context quality',
'high_latency': 'Check model performance and cache effectiveness',
'failure_rate': 'Audit recent action outcomes for patterns',
'token_budget': 'Review agent efficiency and consider optimization'
};
return actions[condition] || 'Investigate system metrics';
}
}
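`evaluateAlerts` relies on a `conditionMet` helper that is not shown, and the configs mix numeric thresholds with a percentage string (`'10%'`). One way to handle both is sketched below. It is an assumption, not the author's implementation: the direction of each comparison is a guess based on the alert messages, and `token_budget` is assumed to receive the fraction of budget remaining (0-1).

```typescript
type Threshold = number | string;

// Normalizes a threshold: 0.2 stays 0.2, "10%" becomes 0.1.
function parseThreshold(threshold: Threshold): number {
  if (typeof threshold === "number") return threshold;
  const match = /^(\d+(?:\.\d+)?)%$/.exec(threshold.trim());
  if (!match) throw new Error(`Unsupported threshold: ${threshold}`);
  return Number(match[1]) / 100;
}

// Assumed direction: these conditions fire when the value drops BELOW
// the threshold; every other condition fires when the value goes above it.
const breachWhenBelow = new Set(["low_confidence", "token_budget"]);

function conditionMet(condition: string, value: number, threshold: Threshold): boolean {
  const limit = parseThreshold(threshold);
  return breachWhenBelow.has(condition) ? value < limit : value > limit;
}

console.log(conditionMet("token_budget", 0.05, "10%")); // true: 5% remaining is below 10%
console.log(conditionMet("high_latency", 8000, 10000)); // false: 8 s is under the 10 s limit
```

Making the comparison direction explicit avoids a common alerting bug, where a "lower is worse" metric such as confidence is checked with the same `>` as latency.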
Debugging Common Agent Issues
Scenario 1: Agent Making Poor Decisions
Symptoms: Low confidence scores, incorrect action choices
Debug Steps:
async function debugPoorDecisions(sessionId: string): Promise<Diagnosis> {
const logs = await getAgentLogs(sessionId, {
filter: { confidence: { lt: 0.6 } }
});
const patterns = analyzeDecisionPatterns(logs);
return {
diagnosis: patterns.confidenceBreakdown < 0.5
? 'Context quality issues'
: patterns.actionAlignment < 0.6
? 'Intent-action misalignment'
: 'General uncertainty',
contributingFactors: [
...patterns.contextGaps,
...patterns.toolAvailabilityIssues,
...patterns.uncertaintyIndicators
],
recommendations: [
patterns.contextGaps.length > 0
? 'Improve context collection for this task type'
: patterns.toolAvailabilityIssues.length > 0
? 'Ensure required tools are available'
: 'Consider providing clearer intent specifications'
]
};
}
Scenario 2: Agent Not Completing Tasks
Symptoms: Tasks time out, actions don't produce expected results
Debug Steps:
async function debugIncompleteTasks(sessionId: string): Promise<Diagnosis> {
const actionHistory = await getActionHistory(sessionId);
const bottleneck = identifyBottleneck(actionHistory);
return {
diagnosis: bottleneck.type,
details: {
currentTask: bottleneck.task,
stuckAtStep: bottleneck.step,
repeatedAttempts: bottleneck.repeatedAttempts,
lastError: bottleneck.lastError
},
recommendations: [
'Review step configuration',
'Verify tool availability',
'Check context window limits',
'Consider task decomposition'
]
};
}
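`identifyBottleneck` is used above but not defined. A simple version, sketched here under assumptions, picks the task step with the most failed attempts. `ActionRecord` and `Bottleneck` are hypothetical shapes, and this sketch omits the `type` field that `debugIncompleteTasks` reads, since classifying the kind of bottleneck depends on your error taxonomy.

```typescript
interface ActionRecord {
  task: string;
  step: string;
  success: boolean;
  error?: string;
}

interface Bottleneck {
  task: string;
  step: string;
  repeatedAttempts: number; // number of failed attempts at this step
  lastError: string | null; // most recent error message seen at this step
}

// Returns the step with the most failed attempts, or null if nothing failed.
// History is assumed to be ordered oldest to newest.
function identifyBottleneck(history: ActionRecord[]): Bottleneck | null {
  const byStep = new Map<string, Bottleneck>();
  for (const record of history) {
    if (record.success) continue;
    const key = `${record.task}::${record.step}`;
    const entry = byStep.get(key) ?? {
      task: record.task,
      step: record.step,
      repeatedAttempts: 0,
      lastError: null,
    };
    entry.repeatedAttempts += 1;
    entry.lastError = record.error ?? entry.lastError;
    byStep.set(key, entry);
  }

  let worst: Bottleneck | null = null;
  for (const entry of Array.from(byStep.values())) {
    if (worst === null || entry.repeatedAttempts > worst.repeatedAttempts) {
      worst = entry;
    }
  }
  return worst;
}
```

Returning `null` when nothing failed lets the caller distinguish "no bottleneck found" from "stuck at a known step", which matters for timeouts caused by slow but successful steps.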
Practical Observability Setup
Implementation Checklist
Before Deploying:
- Configure confidence score alerting
- Set up token budget monitoring
- Define success metrics for each action type
- Create baseline latency expectations
- Establish memory usage thresholds
- Set up integration with external logging (Datadog, New Relic, etc.)
Daily Operations:
- Review daily summary dashboard
- Check for anomalies in decision patterns
- Monitor token budget consumption
- Review failed actions and their causes
Weekly Review:
- Analyze weekly performance trends
- Review confidence score distributions
- Update baseline metrics based on learned patterns
- Identify improvement opportunities
Summary
Key Observability Principles for AI Agents:
- Track decisions, not just outcomes - Understand why agents choose actions
- Monitor confidence levels - Low confidence signals uncertainty
- Profile latency patterns - Understand where time is spent
- Track token usage - Cost management requires visibility
- Measure memory effectiveness - Is the agent's knowledge being used well?
- Set up automated alerting - Don't wait for users to report problems
Next Step: Day 22 will cover practical patterns for using AI agents effectively in daily workflows - the consumer-facing companion to this technical deep-dive.