feat: add observability improvements (metrics, audit log, structured logging) #169
Reference in New Issue
Block a user
Delete Branch "feature/observability-improvements"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
closes #84
Summary
Adds comprehensive observability to upaas: Prometheus metrics, an audit log table, and enhanced structured logging.
Prometheus Metrics (
internal/metrics)upaas_deployments_total(counter by app/status),upaas_deployments_duration_seconds(histogram),upaas_deployments_in_flight(gauge)upaas_container_healthy(gauge, 1=healthy/0=unhealthy)upaas_webhook_events_total(counter by app/event_type/matched)upaas_http_requests_total,upaas_http_request_duration_seconds,upaas_http_response_size_bytesupaas_audit_events_total(counter by action)NewForTest()uses a custom registry to avoid cross-test collisions/metricsendpoint is now always available (optionally auth-protected)Audit Log
audit_logSQLite table via migration007_add_audit_log.sqlwith indexes oncreated_at,action, and(resource_type, resource_id)AuditEntrymodel (internal/models/audit_log.go) withSave,FindAuditEntries,FindAuditEntriesByResource,CountAuditEntriesinternal/service/audit) withLog,LogFromRequest,Recent,ForResourcemethodsGET /api/audit?limit=N(default 50, max 500)Structured Logging Enhancements
bytesfield) alongside existing request metadataIntegration Points
h.auditLog()helper which resolves the current user and delegates to the audit serviceTests
internal/metrics/metrics_test.go: constructor, all metric families registered, label operationsinternal/models/models_test.go: 7 new audit entry tests (CRUD, find by resource, count, limit, ordering)internal/service/audit/audit_test.go: 6 tests covering Log, LogFromRequest, IP extraction, Recent, ForResource, optional fieldsREADME
Added Observability section documenting all metrics, audit log, and structured logging.
Code Review: PR #169 - Observability Improvements
1. Policy Divergences
BLOCKING: Security - IP spoofing in audit logs
extractRemoteIP() in internal/service/audit/audit.go unconditionally trusts X-Real-IP and X-Forwarded-For headers from any client without checking if the request comes from a trusted proxy.
The existing realIP() function in internal/middleware/middleware.go correctly validates that the connecting IP is from a trusted proxy (RFC1918/loopback) before trusting these headers. The audit service reimplements IP extraction without this security check.
This allows any direct client to forge their IP address in audit records by setting X-Real-IP: 1.2.3.4, undermining the integrity of the audit log. The audit service should reuse the middleware's trusted-proxy-aware logic (export realIP or pass the already-resolved IP from the handler layer).
BLOCKING: Incomplete audit coverage - 9 dead audit action constants
The following audit action constants are defined in internal/models/audit_log.go but never referenced anywhere in the codebase:
The corresponding handler operations (HandleEnvVarSave, HandleLabelAdd, HandleLabelEdit, HandleLabelDelete, HandleVolumeAdd, HandleVolumeEdit, HandleVolumeDelete, HandlePortAdd, HandlePortDelete) exist in the routes but are not instrumented with audit logging. These are user-facing actions. The issue scope says "Audit log table for user actions" - these qualify. Either instrument them or remove the dead constants.
BLOCKING: README inconsistency - Wrong API endpoint path
README states the audit log is available at GET /api/audit but the actual route registered in internal/server/routes.go is under the /api/v1 group, making the real path GET /api/v1/audit.
Minor: README - Duplicate numbering in DI order
The Dependency Injection order section has duplicate numbers - items 10 and 11 appear twice (notify/audit then deploy/webhook).
2. Requirements Checklist (Issue #84)
3. Build Result
docker build . passes - lint, fmt-check, test, build all green.
4. Additional Observations
5. Verdict: FAIL
Blocking issues:
Rework Summary
All 4 review findings have been fixed:
1. Security: IP spoofing in audit logs ✅
RealIP()frominternal/middleware/middleware.go(wasrealIP()).extractRemoteIP()ininternal/service/audit/audit.gowithmiddleware.RealIP(), which validates trusted proxies (RFC1918/loopback) before trustingX-Real-IP/X-Forwarded-Forheaders.extractRemoteIP()function and its unused imports (net,strings).realip_test.goto reference the now-exportedRealIP().TestAuditServiceLogFromRequestWithXRealIP) with two new tests:TestAuditServiceLogFromRequestWithXRealIPTrustedProxy— verifies X-Real-IP is honoured when request comes from RFC1918 proxy.TestAuditServiceLogFromRequestWithXRealIPUntrustedProxy— verifies X-Real-IP is ignored when request comes from a public IP (anti-spoof).2. Incomplete audit coverage — 9 dead action constants ✅
Added audit instrumentation to all 9 handlers:
HandleEnvVarSave→AuditActionEnvVarSaveHandleLabelAdd→AuditActionLabelAddHandleLabelEdit→AuditActionLabelEditHandleLabelDelete→AuditActionLabelDeleteHandleVolumeAdd→AuditActionVolumeAddHandleVolumeEdit→AuditActionVolumeEditHandleVolumeDelete→AuditActionVolumeDeleteHandlePortAdd→AuditActionPortAddHandlePortDelete→AuditActionPortDeleteAll audit calls are placed after successful operations only (not on error paths), with descriptive detail strings.
3. README wrong API path ✅
Fixed
GET /api/audit?limit=N→GET /api/v1/audit?limit=N.4. README duplicate numbering ✅
Fixed DI order section: items 10-16 now numbered correctly (was 10, 11, 10, 11, 12, 13, 14).
make fmt✅ |make test✅ |docker build .✅Code Review: PR #169 (Re-review) — Observability Improvements
1. Policy Compliance
No violations found.
@sha256:✅007_add_audit_log.sqlcorrectly numbered for post-1.0.0 ✅.golangci.yml,Makefile,Dockerfile, CI config: untouched ✅nolintannotations are all justified (funlenon webhook handler expanded by audit calls,revivefor metrics package naming vsruntime/metrics,mndfor Prometheus bucket boundaries) ✅go.mod/go.sumproperly maintained ✅2. Previous Review Findings — Verification
extractRemoteIP()extractRemoteIP()deleted entirely.audit.LogFromRequest()now callsmiddleware.RealIP()which validates trusted proxies (RFC1918/loopback) before trustingX-Real-IP/X-Forwarded-For. Two new tests verify: trusted proxy honours header, untrusted proxy ignores it.HandleEnvVarSave,HandleLabelAdd/Edit/Delete,HandleVolumeAdd/Edit/Delete,HandlePortAdd/Delete. All audit calls placed after successful operations only./api/audit)GET /api/v1/audit?limit=N. Route registered atr.Get("/audit", ...)inside the/api/v1group inroutes.go.fx.Provideorder incmd/upaasd/main.go.3. Requirements Checklist (Issue #84)
bytesfield). All existingslogusage maintained. Audit events logged to stdout.upaas_deployments_total(counter),upaas_deployments_duration_seconds(histogram),upaas_deployments_in_flight(gauge)upaas_container_healthygauge set after health check in deploy serviceupaas_webhook_events_totalcounter withapp/event_type/matchedlabelsaudit_logSQLite table via migration 007. All user-facing actions instrumented including login/logout, app CRUD, deploy/cancel/rollback, container start/stop/restart, env var/label/volume/port CRUD, webhook receive, initial setup.4. Test Coverage
internal/metrics/metrics_test.go: 6 tests — constructor, all 9 metric families verified via registry gather ✅internal/models/models_test.go: 7 new audit tests — create/find, all fields, find-by-resource, count, limit, ordering ✅internal/service/audit/audit_test.go: 7 tests — Log, LogFromRequest, trusted proxy IP, untrusted proxy IP (anti-spoof), Recent, ForResource, optional fields ✅handlers_test.go,webhook_test.go) correctly supplies metrics and audit dependencies ✅internal/middleware/realip_test.go: updated to reference exportedRealIP()✅5. Build Result
6. Additional Observations
NewForTest()pattern for metrics isolation prevents cross-test collisions — well designedMetricsAuth()correctly passes through when no credentials configured, making/metricsalways availablecontainerActionToAuditAction()helper with sensible fallback default caseHandleAPIAuditLoghandler properly validates limit param (bounds checking, default fallback)nullStringValuehelper keeps JSON serialisation cleanaudit.LogFromRequest(no authenticated user context) — correct7. Verdict: PASS ✅
All 4 previous findings are fixed correctly. All requirements from issue #84 are fully implemented. Code is clean, well-tested, and follows all repository policies. Build succeeds.
please fix conflict
ef109b9513tob593f544d2Rebase completed — conflicts resolved
Rebased
feature/observability-improvementsontomainto resolve conflicts introduced by PR #165 (CPU/memory resource limits).Conflicts resolved
internal/models/models_test.go— Both branches added new test functions at the end of the file. Kept both: resource limits tests (from main/PR #165) and audit entry tests (from this PR). Added missing closing brace forTestAppResourceLimitsthat was lost in the conflict boundary.internal/database/migrations/007_add_audit_log.sql→008_add_audit_log.sql— PR #165 merged with migration007_add_resource_limits.sql. Renumbered this PR's audit log migration from 007 to 008. The migration runner auto-discovers viaembed.FSsorted by filename, so no code references needed updating.Auto-merged (no conflicts)
README.mdinternal/handlers/app.gointernal/service/deploy/deploy.goVerification
make fmt✅docker build .✅ (lint, all tests pass, binary builds)Pull request closed