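Observability Design: Logging, Metrics, Tracing, and Alerting
1. Core Concepts
Agent observability is the ability to understand the internal state and behavior of an Agent system, and it is what moves an Agent product from a "black box" to a transparent one. Unlike traditional software, an Agent system is non-deterministic (LLM output cannot be predicted exactly), which makes observability the foundation for safety, debugging, optimization, and trust.
1.1 The Three Pillars of Agent Observability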
┌──────────────────────────────────────────────────────────────┐
│                     Agent Observability                      │
│                                                              │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │    Logging     │  │    Metrics     │  │    Tracing     │  │
│  │                │  │                │  │                │  │
│  │ Execution logs │  │ Latency dist.  │  │ Request paths  │  │
│  │ Decision trail │  │ Success rate   │  │ Dependencies   │  │
│  │ Error records  │  │ Token usage    │  │ Cross-service  │  │
│  │ LLM calls      │  │ Cost stats     │  │ Bottlenecks    │  │
│  │                │  │                │  │                │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
│          │                   │                   │           │
│          └───────────────────┼───────────────────┘           │
│                              ▼                               │
│                      ┌────────────────┐                      │
│                      │   Alerting &   │                      │
│                      │   Debugging    │                      │
│                      └────────────────┘                      │
└──────────────────────────────────────────────────────────────┘
1.2 Why Is Agent Observability Harder?
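| Traditional software | AI Agent system |
|---|---|
| Deterministic logic, reproducible | Non-deterministic LLM output, hard to reproduce |
| Errors can be located precisely | Errors may be semantic: output that looks plausible but is wrong |
| State changes are traceable | Reasoning chains are complex; token-level decisions cannot be traced |
| Latency is predictable | LLM inference latency varies widely (seconds to minutes) |
| Cost can be estimated up front | Token consumption varies with the input and is hard to estimate precisely |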
2. Logging
2.1 Log Layers
Layered log architecture:
┌──────────────────────────────────────────────┐
│ Layer 1: Execution log                       │
│ (who, when, what was done, with what result) │
├──────────────────────────────────────────────┤
│ Layer 2: Decision trail                      │
│ (why the Agent made this decision)           │
├──────────────────────────────────────────────┤
│ Layer 3: LLM call log                        │
│ (prompt, completion, token usage)            │
├──────────────────────────────────────────────┤
│ Layer 4: System log                          │
│ (infrastructure, network, error stacks)      │
└──────────────────────────────────────────────┘
2.2 Execution Log
python
from datetime import datetime, timezone


# Agent execution log recorder
class AgentLogger:
    """Structured execution log for an Agent."""

    def __init__(self, log_client):
        # log_client: any client exposing an async emit(stream, entry) method
        self.client = log_client

    async def log_execution(
        self,
        session_id: str,
        turn_number: int,
        phase: str,  # perceive | reason | act | feedback
        input_data: dict,
        output_data: dict,
        duration_ms: float,
        status: str,  # success | error | timeout | interrupted
    ):
        log_entry = {
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "session_id": session_id,
            "turn_number": turn_number,
            "phase": phase,
            "duration_ms": duration_ms,
            "status": status,
            # Input: only the content length is logged, not the content itself
            "input": {
                "type": input_data.get("type"),
                "content_length": len(str(input_data.get("content", ""))),
                "tool_calls": input_data.get("tool_calls", []),
            },
            # Output
            "output": {
                "type": output_data.get("type"),
                "content_length": len(str(output_data.get("content", ""))),
                "tool_results": output_data.get("tool_results", []),
            },
            # Metadata
            "metadata": {
                "model": input_data.get("model"),
                "temperature": input_data.get("temperature"),
                "tokens_used": output_data.get("token_count", 0),
                "error": output_data.get("error"),
            },
        }
        await self.client.emit("agent_execution", log_entry)
2.3 Decision Trail
The decision trail records the Agent's reasoning at every step and is the key input for debugging and audits.
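As a rough sketch of how such a trail could be captured in code, the following uses two dataclasses whose field names mirror the YAML example in this section. `DecisionTurn`, `DecisionTrail`, and `record` are hypothetical names chosen for illustration, not part of any library, and the `option`/`rejected_because` keys are one possible shape for a rejected alternative.

```python
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class DecisionTurn:
    """One step of the agent loop, with the reasoning behind it."""
    turn: int
    phase: str  # perceive | reason | act | feedback
    action: str
    reasoning_chain: str = ""
    alternatives_considered: list[dict[str, str]] = field(default_factory=list)
    result: dict[str, Any] = field(default_factory=dict)


@dataclass
class DecisionTrail:
    """Ordered record of every turn in a session."""
    session_id: str
    task: str
    turns: list[DecisionTurn] = field(default_factory=list)

    def record(self, phase: str, action: str, **details: Any) -> DecisionTurn:
        # Turn numbers are assigned sequentially, starting at 1.
        entry = DecisionTurn(turn=len(self.turns) + 1, phase=phase, action=action, **details)
        self.turns.append(entry)
        return entry

    def to_dict(self) -> dict[str, Any]:
        # asdict() recurses into nested dataclasses, so the result can be
        # serialized to JSON or YAML directly.
        return {"decision_trail": asdict(self)}


trail = DecisionTrail(session_id="sess_abc123", task="Write a market analysis report")
trail.record("perceive", "Parse the user request")
trail.record(
    "reason",
    "Draft a search plan",
    reasoning_chain="Search industry reports first, then individual vendors",
    alternatives_considered=[
        {"option": "Generate the report directly", "rejected_because": "needs supporting data"}
    ],
)
```

Keeping the rejected alternatives next to the chosen action lets an auditor see what the Agent considered as well as what it did.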
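yaml
decision_trail:
  session_id: "sess_abc123"
  task: "Write a market analysis report for the user"
  turns:
    - turn: 1
      phase: perceive
      action: "Parse the user request"
      input: "Write a market analysis report focused on AI chips"
      output:
        parsed_intent: "market analysis"
        domain: "AI chips"
        scope: "global market"
    - turn: 2
      phase: reason
      action: "Draft a search plan"
      reasoning_chain: |
        1. The user needs an AI chip market analysis
        2. It should cover market size, major players, and technology trends
        3. Plan: search industry reports first, then individual vendors
      alternatives_considered:
        - option: "Search individual vendors first"
          rejected_because: "lacks a market-wide view"
        - option: "Generate the report directly"
          rejected_because: "needs supporting data"
    - turn: 3
      phase: act
      action: "Call web_search"
      tool: web_search
      parameters:
        query: "AI chip market size 2026"
        sources: [gartner, idc, mckinsey]
      result:
        success: true
        pages_found: 15
    - turn: 4
      phase: feedback
      action: "Evaluate the search results"
      evaluation: "Results are sufficient and cover market data and the competitive landscape"
      decision: "Continue by searching for vendor-specific data"
Configuration example (YAML):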
yaml
logging:
  execution_log:
    enabled: true
    storage:
      type: elasticsearch
      index: agent-execution-logs
      retention: 90d  # keep for 90 days
    structured_fields:
      - session_id
      - turn_number
      - phase
      - agent_id
      - model
      - duration_ms
      - token_count
      - status
      - error_type
  decision_trail:
    enabled: true
    storage:
      type: postgresql
      table: decision_trails
      retention: 30d
    capture_level: full  # full | summary | minimal
    # full: complete reasoning chain and rejected alternatives
    # summary: decision outcomes only
    # minimal: key turning points only
  pii_redaction:
    enabled: true
    # Single-quoted YAML strings keep backslashes literal; a literal ' is written as ''
    patterns:
      - email: ['[\w\.-]+@[\w\.-]+']
      - phone: ['\d{11,}']
      - api_key: ['(?i)(api[_-]?key|secret).{0,5}["'']?([^"''&\s]{8,})']
    redaction_token: "***"
3. Metrics
3.1 Core Metrics
python
from collections import defaultdict


class AgentMetrics:
    """Collector for core Agent metrics."""

    def __init__(self, metrics_client):
        # metrics_client: any client exposing histogram() and increment()
        self.client = metrics_client
        self.counters = defaultdict(int)
        self.histograms = defaultdict(list)

    def record_execution(self, phase: str, duration_ms: float, status: str):
        # Latency metric (histogram)
        self.client.histogram(
            "agent.phase.duration",
            duration_ms,
            tags={"phase": phase, "status": status},
        )
        # Count metric
        self.client.increment(
            "agent.execution.count",
            tags={"phase": phase, "status": status},
        )

    def record_llm_call(self, model: str, tokens: int, cost: float, duration_ms: float):
        self.client.histogram("llm.call.duration", duration_ms, tags={"model": model})
        self.client.histogram("llm.token.total", tokens, tags={"model": model})
        # Accumulate spend: the counter is incremented by the cost of this call
        self.client.increment("llm.cost", cost, tags={"model": model})
        self.client.increment("llm.call.count", tags={"model": model})

    def record_tool_call(self, tool: str, duration_ms: float, status: str):
        self.client.histogram("tool.duration", duration_ms, tags={"tool": tool})
        self.client.increment("tool.call.count", tags={"tool": tool, "status": status})
3.2 Core Metric Definitions
yaml
metrics:
  # Latency metrics
  latency:
    - name: agent.session.duration
      type: histogram
      description: "Total duration of one session"
      unit: milliseconds
      tags: [agent_id, session_type]
    - name: agent.turn.duration
      type: histogram
      description: "Duration of one loop turn (perceive + reason + act)"
      unit: milliseconds
      tags: [agent_id, turn_number]
      percentiles: [50, 90, 95, 99]
    - name: agent.phase.duration
      type: histogram
      description: "Duration distribution per phase"
      unit: milliseconds
      tags: [agent_id, phase]
    - name: llm.first_token_latency
      type: histogram
      description: "LLM time to first token (TTFT)"
      unit: milliseconds
      tags: [model, provider]
  # Success-rate metrics
  # Rates are stored as ratios (0.0-1.0), matching alert thresholds such as < 0.8
  success_rate:
    - name: agent.session.completion_rate
      type: gauge
      description: "Session completion rate (successful / total)"
      unit: ratio
      tags: [agent_id]
    - name: agent.turn.success_rate
      type: gauge
      description: "Per-turn success rate"
      unit: ratio
      tags: [agent_id, phase]
    - name: tool.call.success_rate
      type: gauge
      description: "Tool call success rate"
      unit: ratio
      tags: [tool_name]
    - name: llm.call.success_rate
      type: gauge
      description: "LLM call success rate"
      unit: ratio
      tags: [model]
  # Cost metrics
  cost:
    - name: agent.session.cost
      type: histogram
      description: "LLM cost per session"
      unit: usd
      tags: [agent_id]
    - name: agent.turn.cost
      type: histogram
      description: "LLM cost per turn"
      unit: usd
      tags: [agent_id]
    - name: llm.token.count
      type: histogram
      description: "Token consumption"
      unit: tokens
      tags: [model, type]  # type: prompt | completion | total
    - name: tool.call.cost
      type: histogram
      description: "Tool call cost (API fees)"
      unit: usd
      tags: [tool_name]
  # Business metrics
  business:
    - name: agent.tasks_per_user
      type: gauge
      description: "Average number of tasks per user"
      unit: count
    - name: agent.avg_turns_per_task
      type: histogram
      description: "Average number of turns per task"
      unit: turns
    - name: agent.user_satisfaction
      type: gauge
      description: "User satisfaction score"
      unit: score(1-5)
      tags: [agent_id]
3.3 Metrics Dashboard
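The dashboard panels in this section plot percentile series such as `agent.turn.duration.p99`. As an illustration of what a percentile summarizes, here is a minimal nearest-rank sketch over raw samples. The `percentile` helper and the sample durations are made up for illustration, and metric backends such as Prometheus typically estimate quantiles from histogram buckets instead of sorting raw values.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Ten hypothetical turn durations in milliseconds, including one slow outlier.
turn_durations_ms = [820, 910, 1040, 1180, 1250, 1400, 1730, 2100, 3900, 12500]
summary = {f"p{p}": percentile(turn_durations_ms, p) for p in (50, 90, 95, 99)}
print(summary)
```

With only ten samples, p95 and p99 both land on the single slowest turn. This is one reason tail percentiles are usually computed over a time window with many samples.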
yaml
dashboard:
  name: "Agent Operations Dashboard"
  panels:
    - title: "Real-time latency"
      type: time_series
      metrics:
        - agent.turn.duration.p99
        - agent.phase.duration.p95
      refresh: 10s
    - title: "Success rate trend"
      type: time_series
      metrics:
        - agent.turn.success_rate
        - tool.call.success_rate
      refresh: 30s
    - title: "Cost"
      type: bar_chart
      metrics:
        - agent.session.cost.avg
        - agent.session.cost.p95
      group_by: agent_id
      refresh: 60s
    - title: "Token usage distribution"
      type: pie_chart
      metrics:
        - llm.token.count
      group_by: model
      refresh: 60s
    - title: "Top failure reasons"
      type: table
      metrics:
        - agent.turn.status
      group_by: error_type
      limit: 10
      refresh: 60s
4. Tracing
4.1 Request Tracing
python
class AgentTracer:
    """Distributed tracing for an Agent.

    Assumes an OpenTelemetry-style tracer: start_as_current_span() is a
    context manager that ends the span on exit and parents nested spans
    automatically, so no explicit end() calls or parent contexts are needed.
    Turn and _session_turns() are assumed to be defined elsewhere.
    """

    def __init__(self, tracer):
        self.tracer = tracer

    async def trace_session(self, session_id: str, user_id: str):
        """Trace an entire Agent session."""
        with self.tracer.start_as_current_span("agent.session") as span:
            span.set_attribute("session_id", session_id)
            span.set_attribute("user_id", user_id)
            async for turn in self._session_turns(session_id):
                await self._trace_turn(turn)

    async def _trace_turn(self, turn: "Turn"):
        """Trace a single loop turn."""
        with self.tracer.start_as_current_span("agent.turn") as span:
            span.set_attribute("turn_number", turn.number)
            # Per-phase child spans
            with self.tracer.start_as_current_span("agent.perceive") as perceive_span:
                perceive_span.set_attribute("input_type", turn.input_type)
                # ... perceive phase
            with self.tracer.start_as_current_span("agent.reason") as reason_span:
                reason_span.set_attribute("model", turn.model)
                reason_span.set_attribute("tokens", turn.tokens_used)
                # ... reason phase
                # Child span for the LLM call
                with self.tracer.start_as_current_span("llm.call") as llm_span:
                    llm_span.set_attribute("model", turn.model)
                    llm_span.set_attribute("prompt_tokens", turn.prompt_tokens)
                    llm_span.set_attribute("completion_tokens", turn.completion_tokens)
                    llm_span.set_attribute("cost", turn.cost)
                    # ... LLM call
            with self.tracer.start_as_current_span("agent.act") as act_span:
                act_span.set_attribute("tool", turn.tool_name)
                act_span.set_attribute("tool_duration", turn.tool_duration)
                # ... act phase
                with self.tracer.start_as_current_span("tool.call") as tool_span:
                    tool_span.set_attribute("tool_name", turn.tool_name)
                    tool_span.set_attribute("status", turn.tool_status)
4.2 Dependency Analysis
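The `dependency_graph` settings in this section capture caller-to-callee edges such as agent → llm. As a rough sketch of how such edges could be derived from finished spans, assume each span records its own id, its parent id, and a component label. `SpanRecord` and its fields are hypothetical and are not an OpenTelemetry type.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    span_id: str
    parent_id: Optional[str]
    component: str  # e.g. "agent", "llm", "tool", "external_api"


def dependency_edges(spans: list[SpanRecord]) -> Counter:
    """Count caller -> callee edges between different components."""
    by_id = {s.span_id: s for s in spans}
    edges: Counter = Counter()
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id else None
        # Same-component parent/child pairs are skipped here, so detecting
        # agent -> agent calls would need an extra identifier such as an agent name.
        if parent is not None and parent.component != span.component:
            edges[(parent.component, span.component)] += 1
    return edges


spans = [
    SpanRecord("s1", None, "agent"),
    SpanRecord("s2", "s1", "llm"),
    SpanRecord("s3", "s1", "tool"),
    SpanRecord("s4", "s3", "external_api"),
    SpanRecord("s5", "s1", "llm"),
]
edges = dependency_edges(spans)
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee}: {count}")
```

Edge counts like these are the raw material for the bottleneck and failure-propagation analyses listed in the configuration.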
yaml
tracing:
  distributed_tracing:
    enabled: true
    exporter: otlp  # OpenTelemetry Protocol
    endpoint: http://otel-collector:4318
    service_name: agent-platform
    sampling:
      strategy: tail_based
      # Tail-based sampling: the keep/drop decision is made after the trace
      # completes, which rules on the outcome (such as status) require.
      # Head-based sampling decides at the start, before the status is known.
      rules:
        - name: sample_all_errors
          condition: status == "error"
          sample_rate: 1.0  # keep 100% of failed requests
        - name: sample_high_value
          condition: user_tier == "premium"
          sample_rate: 1.0
        - name: sample_normal
          condition: always
          sample_rate: 0.1  # keep 10% of normal requests
  dependency_graph:
    enabled: true
    storage: neo4j
    captures:
      - agent → llm  # which LLMs an Agent calls
      - agent → tool  # which tools an Agent calls
      - agent → agent  # calls between Agents
      - tool → external_api  # which external APIs a tool calls
    analysis:
      - critical_path  # critical path analysis
      - bottleneck_detection  # bottleneck detection
      - failure_propagation  # failure propagation analysis
5. Alerting and Debugging
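The alert rules in 5.1 each pair a threshold `condition` with a `duration`. One plausible reading, similar to the `for` clause in Prometheus alerting rules, is that a rule fires only after its condition has held continuously for that long. The sketch below implements that reading; `AlertRule`, its fields, and the sample readings are illustrative assumptions and not the API of any alerting system.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AlertRule:
    name: str
    breached: Callable[[float], bool]  # threshold check on one metric value
    for_seconds: float  # how long the breach must persist
    _breach_started: Optional[float] = field(default=None, init=False, repr=False)

    def evaluate(self, value: float, now: float) -> bool:
        """Return True once the condition has held continuously for `for_seconds`."""
        if not self.breached(value):
            self._breach_started = None  # condition cleared, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.for_seconds


# success_rate_drop: success rate below 0.8 for 10 minutes
rule = AlertRule("success_rate_drop", lambda v: v < 0.8, for_seconds=600)
# (timestamp in seconds, observed success rate)
readings = [(0, 0.95), (60, 0.75), (360, 0.70), (420, 0.85), (480, 0.60), (1080, 0.55)]
fired = [rule.evaluate(value, now=t) for t, value in readings]
print(fired)
```

The brief recovery at t=420 resets the timer, so the rule fires only at t=1080, ten minutes after the second breach began. Requiring persistence in this way keeps short dips from paging anyone.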
5.1 Alert Rules
yaml
alerting:
  enabled: true
  notification_channels:
    - type: slack
      webhook: https://hooks.slack.com/...
      channel: "#agent-alerts"
    - type: email
      recipients: [ops-team@company.com]
    - type: pagerduty
      service_key: "..."
      urgency: critical
  rules:
    # Performance alerts
    - name: high_latency
      description: "Session latency too high"
      condition: agent.session.duration.p95 > 30000  # 30 seconds
      duration: 5m
      severity: warning
      notification: slack
    - name: success_rate_drop
      description: "Sharp drop in success rate"
      condition: agent.turn.success_rate < 0.8  # below 80%
      duration: 10m
      severity: critical
      notification: [slack, pagerduty]
    - name: llm_error_spike
      description: "Spike in LLM error rate"
      condition: llm.call.success_rate < 0.9
      duration: 5m
      severity: critical
      notification: [slack, pagerduty]
    # Cost alerts
    - name: cost_spike
      description: "Abnormal cost surge"
      condition: agent.session.cost.p95 > 5.0  # $5
      duration: 1m
      severity: warning
      notification: slack
    - name: token_usage_anomaly
      description: "Abnormal token consumption"
      condition: llm.token.count.p99 > 50000
      duration: 5m
      severity: warning
      notification: slack
    # Business alerts
    - name: stuck_agent
      description: "Agent stuck in a loop"
      condition: agent.turn.count_per_session > 20
      duration: 1m
      severity: warning
      notification: slack
    - name: tool_failure
      description: "Critical tool failing persistently"
      condition: tool.call.success_rate{critical_tools} < 0.95
      duration: 5m
      severity: critical
      notification: [slack, pagerduty]
5.2 Debugging Tools
python
import asyncio


class AgentDebugger:
    """Agent debugger.

    The logger and tracer are assumed to expose the query methods used
    below (get_session_logs, get_session_trace, get_llm_call).
    """

    def __init__(self, logger, tracer):
        self.logger = logger
        self.tracer = tracer

    async def replay_session(self, session_id: str, step_by_step: bool = False):
        """Replay a session for debugging."""
        # 1. Load the session logs
        logs = await self.logger.get_session_logs(session_id)
        # 2. Rebuild the execution trace
        trace = await self.tracer.get_session_trace(session_id)
        if step_by_step:
            for turn in logs:
                print(f"\n=== Turn {turn.turn_number}: {turn.phase} ===")
                print(f"Input: {turn.input}")
                print(f"Reasoning chain: {turn.reasoning_chain}")
                print(f"Output: {turn.output}")
                # Wait for Enter in a worker thread so the event loop is not blocked
                await asyncio.to_thread(input, "Press Enter to continue...")
        else:
            # Generate the full report
            return self._generate_debug_report(logs, trace)

    async def debug_llm_call(self, session_id: str, turn_number: int):
        """Inspect a specific LLM call."""
        call_log = await self.logger.get_llm_call(session_id, turn_number)
        return {
            "prompt": call_log.prompt,
            "completion": call_log.completion,
            "token_usage": call_log.token_usage,
            "model": call_log.model,
            "temperature": call_log.temperature,
            "latency_ms": call_log.latency_ms,
            "cost": call_log.cost,
            "alternatives": call_log.alternatives,  # present if n > 1 was configured
        }
Configuration file:
yaml
debugging:
  session_replay:
    enabled: true
    step_by_step: true
    speed_control: [0.5x, 1x, 2x, 5x]
    pause_at: [error, approval, tool_call]  # pause at these points
  llm_debug:
    prompt_viewer: true
    diff_mode: true  # compare outputs across prompt versions
    alternative_sampling:
      enabled: true
      n: 3  # sample the same prompt 3 times to observe the output distribution
  tool_debug:
    mock_mode: true  # simulate tool calls without executing them
    response_editing: true  # allow manual edits to tool return values
5.3 Observability Configuration Overview
yaml
observability:
  version: "1.0"
  logging:
    provider: elk  # elasticsearch + logstash + kibana
    level: info  # debug | info | warn | error
    structured: true
    retention: 90d
  metrics:
    provider: prometheus + grafana
    collection_interval: 15s
    retention: 30d
  tracing:
    provider: opentelemetry + jaeger
    sampling: 0.1  # 10% sampling
    storage: elasticsearch
  alerting:
    provider: alertmanager + pagerduty
    on_call_schedule: "US/Eastern business hours"
  dashboard:
    url: https://grafana.company.com/d/agent-observability
    refresh: auto  # refresh automatically
  debug:
    session_replay: true
    llm_prompt_viewer: true
    tool_mock: true
6. Real-World Product Examples
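| Product | Observability approach | Highlights |
|---|---|---|
| LangSmith (LangChain) | Full-stack observability | Agent execution tracing, prompt version management, dataset annotation, online evaluation |
| Langfuse | Open-source observability | LLM call tracing, cost analysis, user feedback collection, human-in-the-loop annotation |
| OpenAI Dashboard | Metrics + Logging | API usage statistics, latency monitoring, cost dashboard, per-model usage analysis |
| Weights & Biases (W&B) Prompts | LLM call tracing | Prompt experiment management, model comparison, reproducibility tracking |
| Dify | Built-in observability | Workflow execution logs, node-level tracing, token usage statistics |
| Datadog LLM Observability | Enterprise full-stack | APM integration, LLM call tracing, guardrails monitoring, cost optimization suggestions |
| Arize AI | ML observability | LLM hallucination detection, embedding drift monitoring, retrieval quality evaluation |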