Revenue anomaly investigation swarm

02

User journey

An upstream anomaly detector flags a suspicious revenue window and pings this workflow. A coordinator agent investigates, optionally delegates to an analyst sub-agent for deeper root-cause work, and drafts a finance-grade summary artifact. The finance director reviews. On approval, the artifact is marked published in Concord and visible to the finance team in-app. Nothing is emailed or pushed externally.

User journey

flowchart LR
    Detector([Anomaly detector<br/>flags date range]) --> Webhook[/external_webhook<br/>or api_request/]
    Webhook --> Coord["Coordinator agent<br/>investigates"]
    Coord --> Analyst{Deeper analysis<br/>needed?}
    Analyst -->|yes| Sub["Analyst sub-agent<br/>focused dive"]
    Analyst -->|no| Draft
    Sub --> Draft[Draft summary artifact]
    Draft --> Review[Finance director<br/>human review]
    Review --> Published([Published artifact<br/>visible to finance team])
    style Coord fill:#EFE6F0,stroke:#7A5560
    style Sub fill:#EFE6F0,stroke:#7A5560
    style Review fill:#F5E0D2,stroke:#D97757
    style Published fill:#F1F2EC,stroke:#6B7B5A

03

Trigger / ingress

Two supported ingress paths, same command type:

external_webhook — from the anomaly detector. Payload carries a provider event_id.
api_request — manual kickoff by a finance analyst via internal UI. Payload carries a UI-generated action_id.

Idempotency

Ingress-level idempotency_key template:

idempotency_key = "revenue_investigation:{ingress}:{event_id_or_action_id}"
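A minimal sketch of how the template above could be composed for both ingress paths. The function name `build_idempotency_key` and the `VALID_INGRESS` constant are illustrative, not part of Concord.

```python
VALID_INGRESS = ("external_webhook", "api_request")


def build_idempotency_key(ingress: str, source_id: str) -> str:
    """Compose the ingress-level idempotency key.

    source_id is the provider event_id for webhooks, or the UI-generated
    action_id for manual kickoffs.
    """
    if ingress not in VALID_INGRESS:
        raise ValueError(f"unsupported ingress: {ingress!r}")
    if not source_id:
        raise ValueError("event_id or action_id must be non-empty")
    return f"revenue_investigation:{ingress}:{source_id}"
```

Because the ingress name is part of the key, a webhook event and a manual action that happen to share an identifier still map to distinct commands.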

Payload schema (key fields)

| field | type | notes |
| --- | --- | --- |
| `date_range_start` | date | required; inclusive |
| `date_range_end` | date | required; inclusive; max 31 days |
| `metric_scope` | enum | `gross \| net \| recognized \| bookings` |
| `anomaly_score` | float | 0-1, from detector; null on manual |
| `requested_by` | user_id | analyst on manual; `system:detector` on webhook |

Implicit context: workspace_id, tenant_id, trace_id.
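The payload rules can be sketched as a typed dataclass that validates on construction. This is an illustration under stated assumptions: the class names are hypothetical, and "max 31 days" is read as an inclusive day count (start and end both counted), which the schema does not spell out.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

MAX_WINDOW_DAYS = 31  # assumption: counted inclusively


class MetricScope(str, Enum):
    GROSS = "gross"
    NET = "net"
    RECOGNIZED = "recognized"
    BOOKINGS = "bookings"


@dataclass(frozen=True)
class InvestigationPayload:
    date_range_start: date
    date_range_end: date
    metric_scope: MetricScope
    anomaly_score: Optional[float]  # None on manual kickoff
    requested_by: str               # user id, or "system:detector"

    def __post_init__(self) -> None:
        if self.date_range_end < self.date_range_start:
            raise ValueError("date_range_end precedes date_range_start")
        # Both bounds are inclusive, so the span is the difference plus one.
        span = (self.date_range_end - self.date_range_start).days + 1
        if span > MAX_WINDOW_DAYS:
            raise ValueError(f"window spans {span} days; max is {MAX_WINDOW_DAYS}")
        if self.anomaly_score is not None and not 0.0 <= self.anomaly_score <= 1.0:
            raise ValueError("anomaly_score must be in [0, 1] or None")
```

Constructing `MetricScope("net")` from the raw string rejects unknown scopes with a `ValueError` before the payload object exists.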

04

Command(s)

One command type; it parents the entire swarm.

| field | value |
| --- | --- |
| `command_type` | `investigate_revenue_anomaly` |
| `ingress` | `external_webhook` \| `api_request` |
| `cancellation_mode` | `graceful` |
| `idempotency_key` | `revenue_investigation:{ingress}:{event_id_or_action_id}` |
| `status path` | `created → validated → queued → running → succeeded \| failed \| cancelled \| expired` |

The standard status machine applies. compensated is not reachable: no external writes, so no compensation chain exists.

05

Policy stack

Composed at validate time and before each tool call:

permission

cost

data_safety

connector_scope

agent_risk

memory_consent

approval_requirement

external_sharing

rate_limit

permission: only members of finance_investigators or the detector service account can trigger.
cost: hard cap of max_cost_units = 30 per SwarmRun.
data_safety: the SQL tool runs under a row-level-secured view; raw PII columns are denied at the warehouse.
connector_scope: Snowflake reader role; SELECT only, no INSERT/UPDATE/DELETE.
agent_risk: max_steps = 20 per AgentRun, max_depth = 2, max_agents = 2.
memory_consent: every memory write starts as a candidate and needs require_approval from the requester before commit.
approval_requirement: the artifact transition to published requires finance director sign-off.
external_sharing: set to deny for all egress connectors; the only "sharing" is in-app artifact visibility.
rate_limit: bucket on the connector_calls queue; caps SQL calls per minute per tenant.
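One way to picture the composition is a list of policy functions evaluated in order. The sketch below assumes deny-overrides semantics (the first deny wins, otherwise any approval requirement is surfaced), which this spec does not state. The `Verdict`, `Decision`, and `evaluate_stack` names and the context keys are hypothetical, as is treating `system:detector` as the detector service account. Only two of the nine policies are shown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Mapping, Optional, Sequence

MAX_COST_UNITS = 30


class Verdict(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


@dataclass(frozen=True)
class Decision:
    policy: str
    verdict: Verdict
    reason: str = ""


Policy = Callable[[Mapping[str, Any]], Decision]


def permission(ctx: Mapping[str, Any]) -> Decision:
    allowed = (
        "finance_investigators" in ctx.get("groups", ())
        or ctx.get("requested_by") == "system:detector"
    )
    verdict = Verdict.ALLOW if allowed else Verdict.DENY
    return Decision("permission", verdict, "" if allowed else "not an investigator")


def cost(ctx: Mapping[str, Any]) -> Decision:
    over = ctx.get("cost_units_spent", 0) > MAX_COST_UNITS
    verdict = Verdict.DENY if over else Verdict.ALLOW
    return Decision("cost", verdict, "max_cost_units exceeded" if over else "")


def evaluate_stack(policies: Sequence[Policy], ctx: Mapping[str, Any]) -> Decision:
    """Run every policy; the first deny short-circuits (assumed semantics)."""
    pending: Optional[Decision] = None
    for policy in policies:
        decision = policy(ctx)
        if decision.verdict is Verdict.DENY:
            return decision
        if decision.verdict is Verdict.REQUIRE_APPROVAL and pending is None:
            pending = decision
    return pending or Decision("stack", Verdict.ALLOW)
```

The same `evaluate_stack` call can run at validate time and again before each tool call, with the context refreshed to carry the current spend.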

06

Execution plan

Async, agentic swarm. The coordinator runs on the agent_runs queue; the optional analyst runs on the swarm_children queue. SQL goes through the connector_calls queue.

Execution plan

flowchart TB
    Ingress([Trigger]) --> Validate["Validate payload<br/>sync_function"]
    Validate --> Policy{Policy stack<br/>allow?}
    Policy -->|deny| Fail([Command failed:<br/>policy_denied])
    Policy -->|allow| Enqueue[Enqueue SwarmRun<br/>queue: agent_runs]
    Enqueue --> Coord["Coordinator AgentRun<br/>agent_tool_call loop"]
    Coord --> SQL["run_sql via Snowflake<br/>connector_call<br/>queue: connector_calls"]
    Coord --> Mem["retrieve_memory<br/>agent_tool_call"]
    Coord --> Decide{Need deeper<br/>analysis?}
    Decide -->|yes| Analyst["Analyst sub-AgentRun<br/>queue: swarm_children"]
    Decide -->|no| Draft
    Analyst --> Draft["Draft artifact<br/>sync_function"]
    Draft --> Approval["Approval: finance director<br/>human_task"]
    Approval --> Publish["Mark artifact published<br/>sync_function"]
    Publish --> Done([succeeded])
    style Policy fill:#F5E0D2,stroke:#D97757
    style Decide fill:#F5E0D2,stroke:#D97757
    style Coord fill:#EFE6F0,stroke:#7A5560
    style Analyst fill:#EFE6F0,stroke:#7A5560
    style Approval fill:#F5E0D2,stroke:#D97757
    style Done fill:#F1F2EC,stroke:#6B7B5A

Execution modes used

sync_function

async_task

connector_call

agent_tool_call

human_task

07

Effects table

Read-only investigation. No CoreEffect rows are emitted to domain_effects. The artifact is an Artifact, not an effect.

Note

Effects are side-effect plans on the outside world. This workflow performs no external writes — Snowflake access is SELECT-only, memory writes are internal Concord state, and the artifact is in-app. Therefore the effects table is empty by design.

08

Memory

Reads

The coordinator's retrieve_memory tool searches prior incidents at scope = project (the finance-investigations project). Hit shape: incident summary, date range, root cause, resolution.

Writes

On a confirmed root cause, the coordinator proposes a memory candidate:

| field | value |
| --- | --- |
| `scope` | `project` (finance-investigations) |
| `type` | `fact` |
| `commit_policy` | candidate → `memory_consent` policy → `require_approval` → commit |
| `conflict_resolution` | supersede with audit |

Rule

No memory write commits without explicit consent from the requester. The candidate is durable; the commit is gated.
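The rule above can be sketched as a two-step flow in which the candidate is persisted first and the commit refuses to run without recorded consent. The `MemoryCandidate` shape and the `approve` and `commit` helpers are hypothetical names for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MemoryCandidate:
    scope: str                 # e.g. "project"
    type: str                  # e.g. "fact"
    body: str
    requested_by: str
    status: str = "candidate"  # candidate -> committed
    approved_by: Optional[str] = None


def approve(candidate: MemoryCandidate, approver: str) -> MemoryCandidate:
    """Record consent; only the original requester may grant it."""
    if approver != candidate.requested_by:
        raise PermissionError("memory_consent: only the requester can approve")
    return replace(candidate, approved_by=approver)


def commit(candidate: MemoryCandidate) -> MemoryCandidate:
    """Commit is gated: it fails unless consent was recorded first."""
    if candidate.approved_by is None:
        raise PermissionError("memory_consent: require_approval before commit")
    return replace(candidate, status="committed")
```

Keeping the candidate immutable and returning new values makes each transition easy to log as a `memory.candidate` or `memory.committed` audit event.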

09

Artifacts

| field | value |
| --- | --- |
| `artifact_type` | `report` |
| `status path` | `draft → created → validated → published` |
| `storage` | Postgres pointer; body in object storage (`location` ref) |
| `versioning` | new artifact per command run; supersedes prior via `parent_artifact_id` |
| `archival` | 90-day visibility; archived to cold storage thereafter |
| `contents` | flagged window, metric breakdowns, retrieved prior incidents, analyst notes (if spawned), coordinator's hypothesis, confidence band |

The artifact also references a query_result sub-artifact per SQL run, for traceability.

10

Approvals

One approval. Triggered by the approval_requirement policy when the artifact transitions from validated to published.

Review packet

requested_action: publish revenue investigation artifact to finance team view
requester: command's requested_by
reason: anomaly score and flagged window
affected_data: artifact draft preview + linked query_result snapshots
triggering_policy: external_sharing + approval_requirement
expected_outcome: artifact visible in-app to finance_team group
risk_level: medium (financial reporting)
expiration: 72 hours

Approver: finance director (role-resolved, not a hardcoded user).

On rejection: artifact stays validated, command transitions to succeeded (the investigation ran), and the rejection reason becomes an audit event. No alternate publish path.

On expiration: artifact stays validated and the approval is marked expired; a follow-up approval_callback may open a new approval request after edits.
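The three approval outcomes reduce to a small lookup from outcome to resulting artifact status and audit event. The sketch below uses hypothetical names (`Outcome`, `Resolution`, `resolve`, `is_expired`) and deliberately leaves the command status out, since this section only states it for the rejection case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum

APPROVAL_TTL = timedelta(hours=72)


class Outcome(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    EXPIRED = "expired"


@dataclass(frozen=True)
class Resolution:
    artifact_status: str
    audit_event: str


_RESOLUTIONS = {
    Outcome.APPROVED: Resolution("published", "approval.approved"),
    # Rejection and expiry both leave the artifact where it was.
    Outcome.REJECTED: Resolution("validated", "approval.rejected"),
    Outcome.EXPIRED: Resolution("validated", "approval.expired"),
}


def is_expired(requested_at: datetime, now: datetime) -> bool:
    """True once the 72-hour review window has fully elapsed."""
    return now - requested_at >= APPROVAL_TTL


def resolve(outcome: Outcome) -> Resolution:
    return _RESOLUTIONS[outcome]
```

Only the approved branch moves the artifact, which matches the absence of any alternate publish path.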

Approval flow

sequenceDiagram
    participant Swarm as Swarm
    participant Concord as Concord
    participant Director as Finance Director
    participant FinanceTeam as Finance Team
    Swarm->>Concord: artifact.status = validated
    Concord->>Concord: policy: external_sharing + approval_requirement
    Concord->>Director: approval.pending<br/>review_packet
    Director->>Concord: approve | reject
    alt approved
        Concord->>Concord: artifact.status = published
        Concord-->>FinanceTeam: in-app visibility
    else rejected
        Concord->>Concord: artifact stays validated
        Concord->>Concord: audit: approval_rejected
    else expired (72h)
        Concord->>Concord: approval.status = expired
    end

11

Agent / swarm config

SwarmRun

| field | value |
| --- | --- |
| `objective` | Investigate flagged revenue window; produce finance-grade summary |
| `execution_mode` | `hierarchical` |
| `join_strategy` | `coordinator_synthesis` |
| `max_agents` | 2 (coordinator + at most one analyst) |
| `max_depth` | 2 |
| `max_total_steps` | 40 |
| `max_cost_units` | 30 |

Coordinator AgentRun

| field | value |
| --- | --- |
| `goal` | investigate window, decide if delegation needed, draft artifact |
| `allowed_tools` | `run_sql`, `retrieve_memory`, `propose_memory_write`, `spawn_analyst`, `create_artifact` |
| `allowed_connectors` | `snowflake` (reader), `postgres` (Concord state), `vector_db` (memory) |
| `memory_scope` | `project` (finance-investigations) |
| `context_scope` | `workspace_id`, `command_id`, payload |
| `max_steps` | 20 |
| `spawn_depth` | 1 (may spawn one child) |

Analyst sub-AgentRun (optional)

| field | value |
| --- | --- |
| `goal` | deeper root-cause dive on a sub-window or metric dimension |
| `allowed_tools` | `run_sql`, `retrieve_memory` (subset of parent) |
| `allowed_connectors` | `snowflake` (reader), `vector_db` (subset of parent) |
| `memory_scope` | `project` (same as parent, no widening) |
| `max_steps` | 20 |
| `spawn_depth` | 0 (cannot spawn further; `max_depth=2` reached) |

Rule

child_scope ⊆ parent_scope for tools, connectors, and memory. The analyst's allowed_tools and allowed_connectors are each a strict subset of the coordinator's; memory scope is identical (project) and may not widen.
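The subset rule maps directly onto set difference. A minimal sketch, assuming scopes are modelled as frozen sets plus a memory scope string; `AgentScope` and `scope_violations` are illustrative names, and memory scope is compared by equality because only one scope level (`project`) appears in this workflow.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class AgentScope:
    tools: FrozenSet[str]
    connectors: FrozenSet[str]
    memory_scope: str


def scope_violations(parent: AgentScope, child: AgentScope) -> List[str]:
    """Return every way the child exceeds the parent; empty means compliant."""
    problems: List[str] = []
    extra_tools = child.tools - parent.tools
    if extra_tools:
        problems.append(f"tools not held by parent: {sorted(extra_tools)}")
    extra_connectors = child.connectors - parent.connectors
    if extra_connectors:
        problems.append(f"connectors not held by parent: {sorted(extra_connectors)}")
    if child.memory_scope != parent.memory_scope:
        problems.append(f"memory scope changed: {child.memory_scope!r}")
    return problems


COORDINATOR = AgentScope(
    tools=frozenset({"run_sql", "retrieve_memory", "propose_memory_write",
                     "spawn_analyst", "create_artifact"}),
    connectors=frozenset({"snowflake", "postgres", "vector_db"}),
    memory_scope="project",
)
ANALYST = AgentScope(
    tools=frozenset({"run_sql", "retrieve_memory"}),
    connectors=frozenset({"snowflake", "vector_db"}),
    memory_scope="project",
)
```

Running the check at spawn time, before the analyst AgentRun row is created, keeps a misconfigured child from ever starting.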

Swarm hierarchy

flowchart TB
    Swarm["SwarmRun<br/>objective: investigate window<br/>execution_mode: hierarchical<br/>join_strategy: coordinator_synthesis"]
    Swarm --> Coord["Coordinator AgentRun<br/>max_steps: 20<br/>spawn_depth: 1"]
    Coord -->|optional spawn| Analyst["Analyst sub-AgentRun<br/>max_steps: 20<br/>spawn_depth: 0<br/>scope ⊆ coordinator"]
    Coord -->|synthesizes| Join["Join: coordinator_synthesis<br/>draft artifact"]
    Analyst -->|returns findings| Join
    style Swarm fill:#EFE6F0,stroke:#7A5560
    style Coord fill:#EFE6F0,stroke:#7A5560
    style Analyst fill:#EFE6F0,stroke:#7A5560
    style Join fill:#F1F2EC,stroke:#6B7B5A

Allowed tools

run_sql

retrieve_memory

propose_memory_write

spawn_analyst

create_artifact

Forbidden tools

send_email

slack_post

http_post

write_sql

file_upload_external

commit_memory_direct

Allowed connectors

snowflake (reader)

postgres (concord state)

vector_db (memory)

12

Audit events

All events are written to domain_events, each tagged with one of the purpose values below.

purpose = audit

command.created

policy.decision

approval.requested

approval.approved

approval.rejected

approval.expired

artifact.published

memory.candidate

memory.committed

purpose = agent_step

Per agent step: step_index, tool_name, latency_ms, token_usage, cost_units. Carries agent_run_id for the coordinator and analyst separately.

purpose = event

investigation.started

investigation.completed

analyst.spawned

analyst.returned

13

Cancellation model

Mode: graceful. Transitions: running → cancelling → cancelled.

Each agent step's prelude checks the cancellation flag. In-flight SQL is allowed to finish (read-only, bounded); the next step is skipped and the swarm exits.

propagate_cancellation = true: cancelling the parent command cancels the coordinator AgentRun, which cancels the analyst sub-AgentRun if active.
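A minimal sketch of the prelude check and the parent-to-child cascade, using a `threading.Event` as the cancellation flag. The `AgentRun` class here is a toy stand-in, not Concord's model, and it omits the intermediate cancelling state for brevity.

```python
import threading
from typing import Callable, List, Sequence


class AgentRun:
    def __init__(self, name: str, steps: Sequence[Callable[[], None]],
                 children: Sequence["AgentRun"] = ()) -> None:
        self.name = name
        self.steps: List[Callable[[], None]] = list(steps)
        self.children: List["AgentRun"] = list(children)
        self.cancel_flag = threading.Event()
        self.status = "queued"
        self.completed_steps = 0

    def cancel(self) -> None:
        """Set the flag here and on every child (propagate_cancellation = true)."""
        self.cancel_flag.set()
        for child in self.children:
            child.cancel()

    def run(self) -> str:
        self.status = "running"
        for step in self.steps:
            # Step prelude: check the flag before starting new work.
            if self.cancel_flag.is_set():
                self.status = "cancelled"
                return self.status
            step()  # a step that has already started always runs to completion
            self.completed_steps += 1
        self.status = "succeeded"
        return self.status
```

Because the flag is only read between steps, an in-flight SQL call finishes normally and the run stops at the next step boundary.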

Cancellation cascade

flowchart TB
    Cancel([Cancel signal]) --> Cmd["Command<br/>running → cancelling"]
    Cmd --> Coord["Coordinator AgentRun<br/>cancelling"]
    Coord --> Analyst["Analyst sub-AgentRun<br/>cancelling (if active)"]
    Coord --> SQL["In-flight SQL<br/>allowed to finish"]
    Analyst --> Done["All children<br/>cancelled"]
    SQL --> Done
    Done --> Final["Command<br/>cancelled"]
    style Cmd fill:#F5E0D2,stroke:#D97757
    style Coord fill:#EFE6F0,stroke:#7A5560
    style Analyst fill:#EFE6F0,stroke:#7A5560

Not chosen: compensate_then_stop. There are no external writes to undo.

14

Compensation graph

Skipped — not applicable. Read-only investigation: zero CoreEffect rows, no external mutations, no inverse operations to declare. The compensation graph validator passes trivially (empty manifest).

Note

If a future iteration adds an external write (e.g., publish to a finance Slack channel or file a Jira ticket on root-cause confirmation), revisit this section and declare the inverse per side effect.

15

Runtime config

Adapter

DBOS. Required capabilities: DURABLE_WORKFLOWS, DURABLE_STEPS, QUEUES, SIGNALS, SUBWORKFLOWS. SAGA_COMPENSATION_NATIVE not needed (no effects).

Queues

agent_runs

swarm_children

connector_calls

RetryPolicy

| operation | retryable | max_attempts | backoff_s |
| --- | --- | --- | --- |
| `run_sql` | yes (transient + rate_limited) | 3 | 5, 15, 45 |
| `retrieve_memory` | yes (transient) | 3 | 2, 6, 18 |
| `spawn_analyst` | no | 1 | — |
| `create_artifact` | yes (transient db) | 3 | 2, 6, 18 |
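The table can be read as a per-operation schedule. The sketch below is a plain-Python illustration, not the DBOS retry mechanism. `TransientError`, `RetryPolicy`, and `call_with_retry` are hypothetical names. It assumes `max_attempts` counts the first try, so with three attempts only the first two delays are ever slept; the table does not say which reading is intended.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    backoff_s: Tuple[float, ...] = ()


RETRY_POLICIES: Dict[str, RetryPolicy] = {
    "run_sql": RetryPolicy(3, (5, 15, 45)),
    "retrieve_memory": RetryPolicy(3, (2, 6, 18)),
    "spawn_analyst": RetryPolicy(1),  # not retryable
    "create_artifact": RetryPolicy(3, (2, 6, 18)),
}


def call_with_retry(operation: str, fn: Callable[[], T],
                    sleep: Callable[[float], None] = time.sleep) -> T:
    policy = RETRY_POLICIES[operation]
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == policy.max_attempts:
                raise  # budget exhausted; surface the failure
            sleep(policy.backoff_s[attempt - 1])
    raise AssertionError("unreachable: max_attempts must be >= 1")
```

Injecting `sleep` keeps the schedule testable without real waiting.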

Constraints

Latency: background (minutes acceptable; p95 target 5 min end-to-end excluding approval wait).
Cost: max_cost_units = 30 per SwarmRun.
Compliance: audit retention 7 years (SOX). PII never leaves the warehouse — row-level security at the view layer.
Multi-tenant: hard isolation by workspace_id on every read.

16

Test plan

Unit

Payload validation: required fields, range ≤ 31 days, valid metric_scope.
Idempotency key composition for both ingress types.
Coordinator decision function: when to spawn analyst (pure over typed dataclasses).
Review packet builder produces all required fields.
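For the coordinator decision test, a pure function over typed dataclasses might look like the sketch below. The `Findings` fields follow the rubric ideas raised under open questions (confidence threshold, metric variance, multi-dimension flag). The threshold values are placeholders chosen for illustration, not settings from this spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Findings:
    confidence: float       # coordinator's confidence in its hypothesis, 0 to 1
    metric_variance: float  # spread of the metric across the flagged window
    multi_dimension: bool   # anomaly spans more than one metric dimension


@dataclass(frozen=True)
class Rubric:
    min_confidence: float = 0.7  # placeholder threshold
    max_variance: float = 0.25   # placeholder threshold


def needs_analyst(findings: Findings, rubric: Rubric = Rubric()) -> bool:
    """Spawn the analyst when any rubric condition trips. Pure: no I/O, no state."""
    return (
        findings.confidence < rubric.min_confidence
        or findings.metric_variance > rubric.max_variance
        or findings.multi_dimension
    )
```

Passing the rubric as an argument lets a unit test cover each condition in isolation and lets the thresholds be overridden per tenant.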

Integration

Postgres + DBOS adapter: SwarmRun row, coordinator AgentRun row, optional analyst row, status transitions.
Snowflake reader role: SELECT passes, INSERT/UPDATE/DELETE denied at connector.
Memory candidate write + consent commit flow.

Workflow (end-to-end)

Happy path: webhook → investigation → no analyst → artifact drafted → approved → published.
Branch: deeper analysis triggered → analyst spawned → join via coordinator_synthesis → artifact reflects analyst findings.
Approval rejected: artifact remains validated, command succeeded, audit event recorded.
Approval expired at 72h: artifact remains validated, approval marked expired.
Cancellation mid-flight: graceful exit, no orphaned state.

Safety

Policy denial: a trigger from a caller that is neither in finance_investigators nor the detector service account → policy_denied, no SwarmRun row.
Forbidden tool call: coordinator attempts send_email → blocked at policy, audit event, agent step recorded.
Subset rule: analyst attempts a tool the coordinator lacks → blocked, agent_risk policy denial.
Memory consent: candidate without approval never commits.
External sharing: artifact transition without approval → blocked.
Cost ceiling: SwarmRun exceeds max_cost_units = 30 → terminates with agent_failed.

Boundary discipline

concord_boundary_check.py rejects dbos / temporalio imports outside the runtime adapter file.
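The actual contents of concord_boundary_check.py are not shown here. As a rough sketch of how such a check could work, the standard-library `ast` module can list top-level package imports in a source string. The `FORBIDDEN` set and the `forbidden_imports` helper are assumptions for illustration.

```python
import ast
from typing import List

FORBIDDEN = {"dbos", "temporalio"}


def forbidden_imports(source: str) -> List[str]:
    """Return the imported module names whose top-level package is forbidden."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) cannot name an external package.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        for name in names:
            if name.split(".")[0] in FORBIDDEN:
                found.append(name)
    return found
```

A wrapper script would apply this to every file except the runtime adapter and exit non-zero when the returned list is non-empty. Parsing the syntax tree rather than grepping avoids false positives from comments and string literals.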

17

Open questions / risks

Detector reliability. If the upstream detector floods the webhook (e.g., recurring false positives across many windows), the rate_limit bucket needs to be sized — current cap is TBD per-tenant per-hour. Wire to alerting if rejection rate climbs.
Analyst spawn heuristic. The coordinator's "need deeper analysis?" decision is currently model-driven. Worth defining a small rubric (confidence threshold, metric variance, multi-dimension flag) and treating the heuristic as a policy input — measurable and overrideable.
Approver fallback. If the finance director is unavailable and the 72h expiry hits, the artifact stalls. Open: define a delegation path (deputy approver) or alert-and-extend pattern. For now, expiry is terminal.
Memory pollution. Project-scope memory could accumulate low-quality candidates if consent is granted reflexively. Consider periodic memory review as a separate workflow.
Multi-investigation overlap. Two overlapping date ranges triggered separately produce two artifacts. Should one supersede the other? Currently independent; parent_artifact_id linkage is manual.
Versioning across coordinator iterations. If the coordinator agent's prompt or tool surface changes meaningfully, agent_run rows should reflect a config_version. Adapter capability WORKFLOW_VERSIONING is not in DBOS today — flag for Temporal if it becomes load-bearing.
Anomaly score absence. Manual trigger has anomaly_score = null. The coordinator's prompt should handle this gracefully — confirm in evals.