feat(metrics): emit workflow execution and per-block metrics to CloudWatch by TheodoreSpeaks · Pull Request #4931 · simstudioai/sim
Greptile Summary
This PR generalizes the existing metrics.ts CloudWatch emitter to support multiple namespaces and wires up new Sim/Workflow metrics for execution lifecycle and per-block execution tracking. The hostedKeyMetrics API is unchanged; new workflowMetrics helpers (recordExecutionStarted, recordExecutionCompleted, recordExecutionPaused, recordBlockExecuted) are added with appropriate cardinality guards.
- `metrics.ts`: The buffer is now keyed by namespace (`BufferedDatum`), `flushMetrics()` groups by namespace before calling `PutMetricData`, and `GRAFANA_DEPLOYMENT_ENVIRONMENT` is added to the `ENVIRONMENT` fallback chain so trigger.dev staging workers no longer collapse to `production`.
- `logging-session.ts`: A `completionMetricEmitted` flag guards all terminal paths (complete, completeWithError, completeWithCancellation, cost-only fallback, markAsFailed) against double-counting; pause is tracked separately via `recordExecutionPaused`.
- `block-executor.ts`: `recordBlockMetric` is called on both success and error paths inside the `blockLog` guard; the `Operation` dimension is regex-validated to prevent cardinality explosion.
- `cleanup-stale-executions/route.ts`: Adds `trigger` to the stale-execution query and calls `recordExecutionCompleted(failed)` for crashed workers that never reach a `LoggingSession` completion path.
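A CloudWatch `PutMetricData` request carries a single `Namespace`, which is why a shared buffer has to be split by namespace before it is flushed. The sketch below illustrates that grouping step only. The `Datum` shape, the fields of `BufferedDatum`, and the `enqueue` and `drainByNamespace` helpers are assumptions made for illustration, not the PR's actual code, and no AWS SDK call is made.

```typescript
// Hypothetical shapes: the PR's real BufferedDatum fields are not shown in this summary.
interface Datum {
  metricName: string
  value: number
  dimensions: Record<string, string>
}

interface BufferedDatum {
  namespace: string
  datum: Datum
}

// Single shared buffer; every entry remembers which namespace it belongs to.
const buffer: BufferedDatum[] = []

function enqueue(namespace: string, datum: Datum): void {
  buffer.push({ namespace, datum })
}

// Drain the buffer and group data points by namespace, so that each group
// can be sent in its own PutMetricData request.
function drainByNamespace(): Map<string, Datum[]> {
  const drained = buffer.splice(0, buffer.length)
  const grouped = new Map<string, Datum[]>()
  for (const { namespace, datum } of drained) {
    const list = grouped.get(namespace) ?? []
    list.push(datum)
    grouped.set(namespace, list)
  }
  return grouped
}
```

A flush routine would then iterate over the returned map and issue one request per key, for example one for `Sim/Workflow` and one for `Sim/HostedKey`.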
Confidence Score: 4/5
Safe to merge; all terminal completion paths are guarded against double-counting and the CloudWatch flush machinery is unmodified. The two findings are confined to metric accuracy, not execution correctness.
The dedup flag (completionMetricEmitted), the regex cardinality guard on Operation, and the trigger type safety (notNull DB column) are all correct. The safeStart double-emit risk is latent rather than present — nothing currently triggers it — but the pattern is fragile enough to note. The cost-only fallback's || 0 duration passes a zero ExecutionDuration data point whenever totalDurationMs is absent, which will skew CloudWatch percentile math for that path.
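To make the percentile concern concrete, the sketch below compares a `|| 0` fallback with skipping the data point when no duration is known. The `durationToEmit` helper, the `median` function, and the sample values are hypothetical, and omitting the data point is only one possible remedy, not necessarily the one the PR should adopt.

```typescript
// Hypothetical helper: returns the duration to emit, or undefined to skip the data point.
function durationToEmit(totalDurationMs: number | null | undefined): number | undefined {
  return typeof totalDurationMs === 'number' &&
    Number.isFinite(totalDurationMs) &&
    totalDurationMs >= 0
    ? totalDurationMs
    : undefined
}

// Nearest-rank median over the emitted samples, used only to illustrate the skew.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b)
  return sorted[Math.ceil(sorted.length / 2) - 1]
}

// Made-up observations: three executions have no recorded duration.
const observed: Array<number | undefined> = [1200, undefined, 900, undefined, undefined]

// Fallback to zero: every missing duration becomes a real 0 ms sample.
const withZeroFallback = observed.map((d) => d || 0)

// Omission: missing durations produce no sample at all.
const withOmission = observed
  .map((d) => durationToEmit(d))
  .filter((d): d is number => d !== undefined)

console.log(median(withZeroFallback)) // 0
console.log(median(withOmission)) // 900
```

With the zero fallback, the three missing values dominate the distribution and pull the median to 0 ms; with omission, only the two measured executions contribute.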
The two metric-emission sites in logging-session.ts — around safeStart (line 796) and the cost-only fallback duration (line 1080) — are worth a second look before the staging rollout produces misleading percentile data.
Important Files Changed
| Filename | Overview |
|---|---|
| apps/sim/lib/monitoring/metrics.ts | Core emitter generalized to support namespace-aware buffering; new workflowMetrics API added with correct cardinality guards and GRAFANA_DEPLOYMENT_ENVIRONMENT env-var fallback. |
| apps/sim/lib/logs/execution/logging-session.ts | Metric emission wired across all terminal paths; completionMetricEmitted flag prevents double-counting; a subtle gap exists in the safeStart fallback path (see comment). |
| apps/sim/executor/execution/block-executor.ts | recordBlockMetric added to both success and error paths, guarded behind blockLog existence check and operation regex filter; duration is always computed before the guard. |
| apps/sim/app/api/cron/cleanup-stale-executions/route.ts | trigger column added to select query (notNull in schema); recordExecutionCompleted(failed) correctly placed after the DB update succeeds, handling crashed-worker completions. |
| apps/sim/lib/monitoring/metrics.test.ts | New test file covering namespace grouping, buffer drain, dimension shapes, and failure-drop behavior for the generalized emitter. |
| apps/sim/lib/logs/execution/logging-session.test.ts | Comprehensive new test suite for metric emission: start/resume gate, all completion paths, pause/failed dedup, cost-only fallback idempotency, and cancelled-skip behavior. |
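The operation regex filter mentioned for `block-executor.ts` can be pictured as follows. This is a minimal sketch: the pattern, the 64-character cap, the `BlockType` dimension name, and the `buildBlockDimensions` helper are all assumptions, since the summary does not show the regex the PR actually uses.

```typescript
// Assumed pattern: short lowercase identifiers such as "send_message".
// The actual regex used in the PR is not shown in this summary.
const OPERATION_PATTERN = /^[a-z][a-z0-9_]{0,63}$/

// Build the dimension set for a block metric. Operation is attached only
// when it looks like a bounded identifier; free-form values are dropped so
// they cannot create an unbounded number of distinct metric series.
function buildBlockDimensions(blockType: string, operation?: unknown): Record<string, string> {
  const dimensions: Record<string, string> = { BlockType: blockType }
  if (typeof operation === 'string' && OPERATION_PATTERN.test(operation)) {
    dimensions.Operation = operation
  }
  return dimensions
}
```

Under these assumptions, `buildBlockDimensions('slack', 'send_message')` keeps the `Operation` dimension, while a URL or other user-supplied string in the same position yields a metric with `BlockType` only.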
Sequence Diagram
sequenceDiagram
participant Caller
participant LoggingSession
participant BlockExecutor
participant workflowMetrics
participant Buffer
participant CloudWatch
Caller->>LoggingSession: start() / safeStart()
LoggingSession->>workflowMetrics: "recordExecutionStarted({trigger})"
workflowMetrics->>Buffer: enqueue(Sim/Workflow, ExecutionStarted)
loop Per block
BlockExecutor->>BlockExecutor: execute block
alt Success
BlockExecutor->>workflowMetrics: "recordBlockExecuted({blockType, operation?, success:true, durationMs})"
else Error
BlockExecutor->>workflowMetrics: "recordBlockExecuted({blockType, operation?, success:false, durationMs})"
end
workflowMetrics->>Buffer: enqueue(Sim/Workflow, BlockExecuted + BlockDuration)
end
alt Normal completion
LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status, durationMs})"
else Pause
LoggingSession->>workflowMetrics: "recordExecutionPaused({trigger})"
else markAsFailed
LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status:failed}) [guarded by completionMetricEmitted]"
else Crashed worker (cron)
Note over LoggingSession,workflowMetrics: cleanup-stale-executions route
LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status:failed})"
end
workflowMetrics->>Buffer: enqueue(Sim/Workflow, ExecutionCompleted + ExecutionDuration?)
Note over Buffer,CloudWatch: Every 5s / threshold / SIGTERM
Buffer->>CloudWatch: PutMetricData per namespace (Sim/Workflow, Sim/HostedKey)
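The `[guarded by completionMetricEmitted]` note in the diagram amounts to an emit-once flag shared by every terminal path. The sketch below shows that pattern in isolation; the `SessionMetrics` class, the `Status` values, and the `CompletedEvent` shape are hypothetical stand-ins, not the `LoggingSession` implementation.

```typescript
// Hypothetical status values and event shape, for illustration only.
type Status = 'success' | 'failed' | 'cancelled'

interface CompletedEvent {
  trigger: string
  status: Status
  durationMs?: number
}

class SessionMetrics {
  // Set on the first terminal path that emits; later paths become no-ops.
  private completionMetricEmitted = false
  private readonly trigger: string
  private readonly emit: (event: CompletedEvent) => void

  constructor(trigger: string, emit: (event: CompletedEvent) => void) {
    this.trigger = trigger
    this.emit = emit
  }

  recordCompletion(status: Status, durationMs?: number): void {
    if (this.completionMetricEmitted) return
    this.completionMetricEmitted = true
    const event: CompletedEvent = { trigger: this.trigger, status }
    // Attach a duration only when one is actually known.
    if (durationMs !== undefined) event.durationMs = durationMs
    this.emit(event)
  }
}
```

If a successful completion is followed by a late failure path on the same session, only the first call reaches the emitter, so the execution is counted once.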
Reviews (2): Last reviewed commit: "feat(metrics): emit workflow execution a..."