◐ Shell
clean mode source ↗

feat(metrics): emit workflow execution and per-block metrics to CloudWatch by TheodoreSpeaks · Pull Request #4931 · simstudioai/sim

Greptile Summary

This PR generalizes the existing metrics.ts CloudWatch emitter to support multiple namespaces and wires up new Sim/Workflow metrics for execution lifecycle and per-block execution tracking. The hostedKeyMetrics API is unchanged; new workflowMetrics helpers (recordExecutionStarted, recordExecutionCompleted, recordExecutionPaused, recordBlockExecuted) are added with appropriate cardinality guards.

  • metrics.ts: The buffer is now keyed by namespace (BufferedDatum), flushMetrics() groups by namespace before calling PutMetricData, and GRAFANA_DEPLOYMENT_ENVIRONMENT is added to the ENVIRONMENT fallback chain so trigger.dev staging workers no longer collapse to production.
  • logging-session.ts: A completionMetricEmitted flag guards all terminal paths (complete, completeWithError, completeWithCancellation, cost-only fallback, markAsFailed) against double-counting; pause is tracked separately via recordExecutionPaused.
  • block-executor.ts: recordBlockMetric is called on both success and error paths inside the blockLog guard; the Operation dimension is regex-validated to prevent cardinality explosion.
  • cleanup-stale-executions/route.ts: Adds trigger to the stale-execution query and calls recordExecutionCompleted(failed) for crashed workers that never reach a LoggingSession completion path.

Confidence Score: 4/5

Safe to merge; all terminal completion paths are guarded against double-counting and the CloudWatch flush machinery is unmodified. The two findings are confined to metric accuracy, not execution correctness.

The dedup flag (completionMetricEmitted), the regex cardinality guard on Operation, and the trigger type safety (notNull DB column) are all correct. The safeStart double-emit risk is latent rather than present — nothing currently triggers it — but the pattern is fragile enough to note. The cost-only fallback's || 0 duration passes a zero ExecutionDuration data point whenever totalDurationMs is absent, which will skew CloudWatch percentile math for that path.

The two metric-emission sites in logging-session.ts — around safeStart (line 796) and the cost-only fallback duration (line 1080) — are worth a second look before the staging rollout produces misleading percentile data.

Important Files Changed

Filename Overview
apps/sim/lib/monitoring/metrics.ts Core emitter generalized to support namespace-aware buffering; new workflowMetrics API added with correct cardinality guards and GRAFANA_DEPLOYMENT_ENVIRONMENT env-var fallback.
apps/sim/lib/logs/execution/logging-session.ts Metric emission wired across all terminal paths; completionMetricEmitted flag prevents double-counting; a subtle gap exists in the safeStart fallback path (see comment).
apps/sim/executor/execution/block-executor.ts recordBlockMetric added to both success and error paths, guarded behind blockLog existence check and operation regex filter; duration is always computed before the guard.
apps/sim/app/api/cron/cleanup-stale-executions/route.ts trigger column added to select query (notNull in schema); recordExecutionCompleted(failed) correctly placed after the DB update succeeds, handling crashed-worker completions.
apps/sim/lib/monitoring/metrics.test.ts New test file covering namespace grouping, buffer drain, dimension shapes, and failure-drop behavior for the generalized emitter.
apps/sim/lib/logs/execution/logging-session.test.ts Comprehensive new test suite for metric emission: start/resume gate, all completion paths, pause/failed dedup, cost-only fallback idempotency, and cancelled-skip behavior.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant LoggingSession
    participant BlockExecutor
    participant workflowMetrics
    participant Buffer
    participant CloudWatch

    Caller->>LoggingSession: start() / safeStart()
    LoggingSession->>workflowMetrics: "recordExecutionStarted({trigger})"
    workflowMetrics->>Buffer: enqueue(Sim/Workflow, ExecutionStarted)

    loop Per block
        BlockExecutor->>BlockExecutor: execute block
        alt Success
            BlockExecutor->>workflowMetrics: "recordBlockExecuted({blockType, operation?, success:true, durationMs})"
        else Error
            BlockExecutor->>workflowMetrics: "recordBlockExecuted({blockType, operation?, success:false, durationMs})"
        end
        workflowMetrics->>Buffer: enqueue(Sim/Workflow, BlockExecuted + BlockDuration)
    end

    alt Normal completion
        LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status, durationMs})"
    else Pause
        LoggingSession->>workflowMetrics: "recordExecutionPaused({trigger})"
    else markAsFailed
        LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status:failed}) [guarded by completionMetricEmitted]"
    else Crashed worker (cron)
        Note over LoggingSession,workflowMetrics: cleanup-stale-executions route
        LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status:failed})"
    end
    workflowMetrics->>Buffer: enqueue(Sim/Workflow, ExecutionCompleted + ExecutionDuration?)

    Note over Buffer,CloudWatch: Every 5s / threshold / SIGTERM
    Buffer->>CloudWatch: PutMetricData per namespace (Sim/Workflow, Sim/HostedKey)
Loading

Reviews (2): Last reviewed commit: "feat(metrics): emit workflow execution a..." | Re-trigger Greptile