Observability

Argus emits three kinds of observability signals: metrics, logs, and traces. Each uses an industry-standard format, so it plugs into your existing Prometheus / Grafana / ELK / Loki / Datadog / OTel stack.

Signal overview

Signal	Endpoint / source	Format	Purpose
Metrics	`GET /metrics`	Prometheus exposition	System + business metrics
Logs (structured)	stdout / journald	JSON Lines (`--enable-json-logging`)	Behavior / errors / debug
Audit log	metastore `audit_log` table + `/v1/auditLogs`	Structured records	Audit trail (kept separate from regular logs)
Traces	OTLP (roadmap)	OpenTelemetry	Cross-RPC tracing
Health	`GET /healthz` / `/readyz`	200 / 5xx	k8s probes / LB
pprof	`/debug/pprof/*` (`--debug`)	pprof	Performance profiling

Metrics

Prometheus exposition format:

bash

curl -s http://localhost:8080/metrics | head -30

System metrics (runtime)

Metric	Meaning
`go_goroutines`	Number of goroutines
`go_memstats_*`	Go memory statistics
`process_cpu_seconds_total`	Cumulative CPU time
`process_resident_memory_bytes`	RSS (resident set size)
`process_open_fds`	Open file descriptors
`http_request_duration_seconds`	HTTP latency histogram
`http_requests_total`	Cumulative HTTP request count
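
For ad-hoc checks outside Prometheus, the exposition text can be read with a few lines of code. What follows is a minimal Python sketch, not Argus tooling: `parse_exposition` and the `EXAMPLE` payload are hypothetical, and the parser deliberately ignores timestamps, escaped quotes in label values, and other corners of the format.

```python
import re

# Matches `name{labels} value` or `name value`.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP / TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # ignore anything this simple pattern cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output; the label names on a real server may differ.
EXAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
http_requests_total{method="GET",status="200"} 1000
http_requests_total{method="GET",status="500"} 5
"""

if __name__ == "__main__":
    for name, labels, value in parse_exposition(EXAMPLE):
        print(name, labels, value)
```

For anything beyond a quick look, a maintained Prometheus client library is the safer choice, since it handles the full format.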

Business metrics (Argus-specific)

See Release Monitoring, which covers metrics for tasks, plan checks, rollouts, issue approvals, and more.

Recommended Grafana panels

Panel	Source
RPS / latency	`http_requests_total` / `http_request_duration_seconds`
5xx rate	`http_requests_total{status=~"5.."}`
Go heap / GC pause	`go_memstats_heap_inuse_bytes` / `go_gc_duration_seconds`
Task failure rate	`argus_task_run_total{state="failed"}`
Average approval duration	`argus_issue_approval_duration_seconds`
Plan check duration	`argus_plan_check_run_duration_seconds`
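
The 5xx rate panel is a ratio of counter increases, not of raw counter values. A dashboard would normally compute it in the query layer. The Python sketch below only illustrates the arithmetic between two scrapes. It assumes the `status` label used in the table, and `five_xx_ratio` is a hypothetical helper name.

```python
def five_xx_ratio(prev, curr):
    """Share of 5xx responses between two scrapes.

    prev / curr map a status code string to the cumulative
    http_requests_total value seen at each scrape.
    """
    total = 0.0
    errors = 0.0
    for status, value in curr.items():
        delta = value - prev.get(status, 0.0)
        if delta < 0:
            # The counter went backwards, so the process restarted:
            # treat the current value as the increase since the reset.
            delta = value
        total += delta
        if status.startswith("5"):
            errors += delta
    return errors / total if total else 0.0


if __name__ == "__main__":
    earlier = {"200": 1000.0, "500": 5.0}
    later = {"200": 1990.0, "500": 15.0}
    print(f"{five_xx_ratio(earlier, later):.2%}")
```

Working from deltas is what keeps the panel responsive: a ratio of lifetime totals would barely move during a short error burst.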

Logs

Enabling JSON logs

bash

./argus --enable-json-logging

Format:

json

{
  "time":"2026-05-24T08:42:11.123456Z",
  "level":"INFO",
  "msg":"task run completed",
  "source":{"function":"runner.taskrun.runTask","file":"runner/taskrun/taskrun.go","line":142},
  "task":"projects/play/rollouts/100/tasks/3",
  "task_run":"projects/play/rollouts/100/tasks/3/taskRuns/2",
  "duration_ms":3214,
  "rows_affected":1
}

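Structured fields (common)

Field	Meaning
`time`	UTC timestamp, ns precision
`level`	`DEBUG` / `INFO` / `WARN` / `ERROR`
`msg`	Human-readable message
`source`	Code location (function, file, line)
`correlation_id`	Shared by all log entries from the same HTTP / RPC request
`actor`	Who triggered the action (when a user context exists)
`task` / `plan` / `issue` / `rollout` / `instance`	Target resource
`error`	Error detail (including stack)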

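Log shipping recommendations

When shipping to ELK / Loki / Datadog:

Don't parse with regex; read the JSON fields directly
Index at least: time, level, correlation_id, actor
Don't store audit logs together with regular logs (audit records live in the metastore and are shipped separately)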
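
To show what reading the JSON fields directly looks like, here is a small Python sketch that groups JSON Lines records by `correlation_id`. The helper names and the `SAMPLE` lines are hypothetical. Only the field names come from the table of structured fields.

```python
import json
from collections import defaultdict


def iter_records(lines):
    """Yield one dict per valid JSON line; skip blanks and non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a plain-text line from another process
        if isinstance(record, dict):
            yield record


def group_by_correlation_id(lines):
    """Map each correlation_id to its records, in input order."""
    groups = defaultdict(list)
    for record in iter_records(lines):
        cid = record.get("correlation_id")
        if cid is not None:
            groups[cid].append(record)
    return dict(groups)


# Hypothetical log lines; the values are made up for illustration.
SAMPLE = [
    '{"time":"2026-05-24T08:42:10Z","level":"INFO","msg":"task run started","correlation_id":"req-1"}',
    '{"time":"2026-05-24T08:42:11Z","level":"ERROR","msg":"task run failed","correlation_id":"req-1"}',
    'plain text line that is not JSON',
    '{"time":"2026-05-24T08:42:12Z","level":"INFO","msg":"plan check completed","correlation_id":"req-2"}',
]

if __name__ == "__main__":
    for cid, records in group_by_correlation_id(SAMPLE).items():
        print(cid, [r["msg"] for r in records])
```

Because every field is addressed by key, adding a new field to the log output does not break the reader, which is the main argument against regex parsing.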

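Traces (roadmap)

OpenTelemetry integration is on the roadmap. Until then, these alternatives are available:

HTTP correlation_id header → stitch requests together across services manually
pprof CPU profile → find hot spots
--debug log → inspect the request lifecycle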
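
Manual stitching amounts to forwarding the same ID on every downstream call. The Python sketch below shows one way a calling service might do that. The header name `X-Correlation-ID` is a placeholder assumption, not a documented Argus header, and `outgoing_headers` is a hypothetical helper.

```python
import uuid

# Assumption: placeholder header name. Replace it with the header
# your deployment actually uses for correlation IDs.
CORRELATION_HEADER = "X-Correlation-ID"


def outgoing_headers(incoming_headers, header_name=CORRELATION_HEADER):
    """Build headers for a downstream call.

    Reuses the caller's correlation ID when one is present (header
    names are compared case-insensitively); otherwise mints a new ID.
    """
    wanted = header_name.lower()
    for name, value in incoming_headers.items():
        if name.lower() == wanted and value:
            return {header_name: value}
    return {header_name: uuid.uuid4().hex}


if __name__ == "__main__":
    print(outgoing_headers({"x-correlation-id": "abc123"}))
    print(outgoing_headers({}))
```

Logging the same ID on both sides of the call is what lets a later search join the two services' log entries.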

Once the OTLP endpoint ships, it will support:

Automatic spans for HTTP / RPC
DB driver query spans (including execution against target DBs)
Cross-replica propagation

Health probes

bash

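# Liveness: is the server still running?
curl -fs http://localhost:8080/healthz
# Expected: 200, body = {"status":"ok"}

# Readiness: can the server accept traffic (includes the metastore connection)?
curl -fs http://localhost:8080/readyz
# Expected: 200; returns 503 if the metastore is unreachable

Example K8s configuration (already included in private deployments):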

yaml

readinessProbe:
  httpGet: { path: /readyz, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 5
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 10
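
Outside Kubernetes, for example in a deploy script, the same readiness check can be polled by hand. This is a minimal Python sketch under that assumption. `http_status` and `wait_until_ready` are hypothetical helpers, and the URL simply mirrors the curl examples.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx / 5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def wait_until_ready(check, attempts=10, delay=1.0, sleep=time.sleep):
    """Call check() until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if check() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_ready(lambda: http_status("http://localhost:8080/readyz"))
    print("ready" if ready else "not ready")
```

Passing the check in as a callable keeps the retry loop independent of the network, so it can be exercised without a running server.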

pprof (debugging)

Exposed only when --debug is enabled:

bash

# 30-second CPU profile
curl -o /tmp/cpu.pprof 'http://localhost:8080/debug/pprof/profile?seconds=30'
go tool pprof -http :9000 /tmp/cpu.pprof

# Heap
curl -o /tmp/heap.pprof http://localhost:8080/debug/pprof/heap

# Goroutine dump
curl -o /tmp/goroutine.txt 'http://localhost:8080/debug/pprof/goroutine?debug=2'

Don't leave --debug enabled long-term in production. pprof endpoints exposed beyond the reverse proxy leak resource details; treat them as attack surface.

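Observability checklist

With all of the following in place, Argus is production-ready:

[ ] Prometheus scrapes /metrics, retention ≥ 30d
[ ] JSON logs → ELK / Loki / Datadog, retention ≥ 90d
[ ] Audit log retained separately for at least the compliance-mandated period (typically 1 year)
[ ] Grafana dashboards (system + business)
[ ] Alerts wired up (5xx, metastore lag, freeze-window violations, audit log writes stopping, …)
[ ] On-call process / duty rota
[ ] Second-line operations SOP has been rehearsed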

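Alert baseline

See Release Monitoring (health alerts) for details.

The four core alerts (required):

Alert	Condition	Severity
Service unhealthy	`/healthz` returns 5xx for 1 consecutive minute	critical
Audit log writes stopped	No new entry for 5 minutes	critical
5xx spike	5xx rate > 1% over 5 minutes	warning
Metastore connection failure	`/readyz` returns 5xx for 30 seconds	critical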
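
Two of these conditions are "failing continuously for N seconds" checks, and one is a staleness check. The Python sketch below shows how such conditions could be evaluated from recorded probe results. It is an illustration only, with hypothetical function names, and it is not how Argus or any particular alerting system implements them.

```python
def failing_for(samples, duration_s):
    """True if the trailing run of 5xx probe results spans at least duration_s.

    samples: time-ordered (unix_seconds, http_status) tuples.
    """
    if not samples:
        return False
    run_start = None
    for ts, status in reversed(samples):
        if 500 <= status <= 599:
            run_start = ts  # keep extending the run backwards in time
        else:
            break  # a healthy sample ends the trailing run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= duration_s


def audit_log_stalled(last_entry_ts, now_ts, max_gap_s=300):
    """True if no audit entry has been written in the last max_gap_s seconds."""
    return now_ts - last_entry_ts >= max_gap_s


if __name__ == "__main__":
    readyz = [(0, 200), (10, 503), (20, 503), (40, 503)]
    print("metastore alert:", failing_for(readyz, 30))
    print("service alert:", failing_for(readyz, 60))
    print("audit alert:", audit_log_stalled(last_entry_ts=100, now_ts=450))
```

Measuring the run from its first failing sample, not counting samples, keeps the check correct when probe intervals are uneven.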
