Concord design

Vendor sales daily sync

Pull yesterday's sales from a vendor REST API at 03:00 UTC, detect stale snapshots via content hash, and upsert into the Postgres warehouse partition — fully automated, idempotent end to end.

Triggerscheduled_trigger Planasync deterministic Cancellationgraceful Approvals0 Effects1

User journey

No interactive user. A scheduled trigger fires nightly; the data engineering team consumes the result by querying the warehouse the next morning and by inspecting the audit/event stream when something looks off.

Operator journey

flowchart LR Cron([03:00 UTC cron]) --> Cmd[sync_vendor_sales command] Cmd --> Run[Workflow runs
5-15 min] Run --> Outcome{Outcome} Outcome -->|fresh data| Warehouse[(Warehouse partition
yyyy-mm-dd upserted)] Outcome -->|stale snapshot| Skipped[Skip write
emit skipped event] Outcome -->|bad request| Failed[command.failed
page on-call] Warehouse --> Morning([DE team queries 09:00]) Skipped --> Morning Failed --> Pager([On-call pager]) style Outcome fill:#F5E0D2,stroke:#D97757 style Skipped fill:#F1F2EC,stroke:#6B7B5A style Warehouse fill:#F1F2EC,stroke:#6B7B5A

Trigger / ingress

Ingress type: scheduled_trigger. A cron schedule (0 3 * * * UTC) issues the command; the runtime adapter's schedule capability owns delivery.

Idempotency at ingress

Duplicate triggers must collapse to one command — defensive against double-fire of the scheduler.

idempotency_key template: sync_vendor_sales:{vendor_name}:{target_date}
target_date = the business day being pulled (UTC, the day before the cron fires)
Unique constraint on the sparse index in commands.idempotency_key guarantees second-fire becomes a no-op

Command(s)

Single command type. No fan-in, no dependencies on other commands.

command_type: sync_vendor_sales
requested_by: system:scheduler
ingress: scheduled_trigger
payload:
  vendor_name: "acme_sales"
  target_date: "2026-06-01"     # yesterday at fire time
  vendor_base_url: "https://api.acme.example/v1"
context:
  workspace_id: null            # single-tenant; see Constraints
  trace_id: <runtime-supplied>
cancellation_mode: graceful
idempotency_key: "sync_vendor_sales:acme_sales:2026-06-01"
dbos_workflow_id: <runtime-supplied>

Status path used: created → validated → queued → running → succeeded | failed | cancelled. compensated and expired are not used — there is nothing to compensate (idempotent upsert) and no expiry deadline.

Policy stack

The policy chain evaluated at validated:

permission

connector_scope

rate_limit

cost

permission — the scheduler service principal must hold warehouse.write on the target partition and connector.acme_sales.read.
connector_scope — restricts the vendor connector to GET /sales/daily and the Postgres connector to upserts on warehouse.sales_daily.
rate_limit — bucket cap on vendor calls (handful per day, but caps protect against retry storms).
cost — soft cap on retry budget; if exhausted, escalate rather than burn budget.

No data_safety, external_sharing, destructive_action, memory_consent, agent_risk, or approval_requirement — none apply (no PII flagged, no human gate, no agent).

Execution plan

Async deterministic. Five steps, all connector_call or sync_function. No agent, no human task.

Execution plan

flowchart TB Start([scheduled_trigger fires
03:00 UTC]) --> Fetch["1 · fetch_vendor_snapshot
connector_call · connector_calls queue
retry on 5xx · 30s/2m/10m"] Fetch --> Hash["2 · compute_snapshot_hash
sync_function · SHA256 of latest_snapshot_id"] Hash --> LoadPrev["3 · load_last_hash
connector_call · postgres read"] LoadPrev --> Compare{"hash == previous?"} Compare -->|yes · stale| Skip["4a · record_skipped
sync_function · emit skipped event"] Compare -->|no · fresh| Write["4b · upsert_warehouse_partition
connector_call · idempotent upsert
connector_calls queue"] Write --> Record["5 · record_hash
connector_call · persist new hash"] Skip --> Done([command.succeeded]) Record --> Done style Compare fill:#F5E0D2,stroke:#D97757 style Skip fill:#F1F2EC,stroke:#6B7B5A style Write fill:#F1F2EC,stroke:#6B7B5A style Done fill:#F1F2EC,stroke:#6B7B5A

Step table

step	execution_mode	queue	side effect?
fetch_vendor_snapshot	connector_call	connector_calls	no (read)
compute_snapshot_hash	sync_function	—	no
load_last_hash	connector_call	connector_calls	no (read)
upsert_warehouse_partition	connector_call	connector_calls	yes
record_hash	connector_call	connector_calls	incidental (state bookkeeping)

The single declared effect (warehouse_partition.write) is the upsert in step 4b. Step 5 stores the hash for next-day comparison; treated as bookkeeping, not a domain effect.

Effects table

effect_type	side_effect?	idempotency_key	compensation	counter_pure?
`warehouse_partition.write`	yes	`sync_vendor_sales:acme_sales:2026-06-01`	none (idempotent upsert)	n/a

Why no compensation

The warehouse write is an upsert keyed by (vendor_name, target_date). A retry or re-run overwrites the same partition row-for-row; partial failure leaves the partition either unchanged or in the latest good state. Nothing to reverse.

Memory

Skipped — workflow is stateless across runs. The previous-day snapshot hash needed for staleness detection is stored in an operational table (vendor_sync_state), not Concord Memory: it's not a learned preference or user-scoped fact, it's deterministic bookkeeping for an idempotent connector.

Artifacts

Skipped — no human-readable durable output. The warehouse partition itself is queried directly by downstream consumers; it's an effect, not an Artifact.

Approvals

Skipped — fully automated. No human gate anywhere in the plan; the user explicitly opted out of approvals.

Agent / swarm config

Skipped — plan is deterministic. No AgentRun or SwarmRun; no LLM in the loop.

Audit events

All rows land in the single append-only domain_events table. purpose distinguishes them.

command.created

command.validated

command.running

vendor.fetch.started

vendor.fetch.succeeded

vendor.fetch.retried

snapshot.hashed

snapshot.stale_detected

snapshot.fresh_detected

warehouse.upsert.started

warehouse.upsert.succeeded

command.succeeded

command.failed

command.cancelled

Retention: 90 days for purpose=audit. purpose=event rows used for operational metrics may be aged out sooner via policy.

Cancellation model

cancellation_mode: graceful. Transitions: running → cancelling → cancelled. Each step's prelude checks the cancellation flag and bails before doing connector work.

Cancellation cascade

flowchart LR CancelReq([cancel signal]) --> Flag["set cancellation flag
command.status = cancelling"] Flag --> Probe{"current step?"} Probe -->|fetch in flight| Finish["let fetch finish
discard result"] Probe -->|between steps| Exit["exit cleanly
no upsert"] Probe -->|upsert in flight| Complete["let upsert finish
partition is idempotent"] Finish --> Cancelled([command.cancelled]) Exit --> Cancelled Complete --> Cancelled style Probe fill:#F5E0D2,stroke:#D97757

Rule

Graceful is safe here because the only effect is an idempotent upsert. If an upsert is mid-flight when cancel arrives, completing it leaves the partition in a valid state — letting it finish is cheaper than tearing down.

Compensation graph

Skipped — no declared compensation. The single effect (warehouse_partition.write) is an idempotent upsert keyed by (vendor_name, target_date); re-running the workflow overwrites in place. requires_compensation=false at registration. The compensation graph validator will accept this only because declared_compensation is intentionally None with the idempotent-upsert justification recorded.

Runtime config

Adapter

DBOS. Required capabilities: DURABLE_WORKFLOWS, DURABLE_STEPS, QUEUES, SCHEDULES. Not required: SUBWORKFLOWS, SIGNALS (cancel uses the flag pattern, not a runtime signal), SAGA_COMPENSATION_NATIVE, WORKFLOW_VERSIONING.

Queues

connector_calls

scheduled_maintenance

All connector calls (vendor fetch, warehouse upsert, hash read/write) ride connector_calls. The cron itself dispatches via the runtime's schedule capability — no app-level queue needed for the trigger.

Retry policy

Wired per-operation through compile_policy(operation) -> dict for the runtime adapter. Error classes split into retryable and non-retryable:

transient_connector_errorretryable

rate_limitedretryable

timeoutretryable

validation_errorno retry

permanent_connector_errorno retry

policy_deniedno retry

permission_deniedno retry

For fetch_vendor_snapshot:

RetryPolicy(
  operation="fetch_vendor_snapshot",
  retryable=["transient_connector_error", "rate_limited", "timeout"],
  max_attempts=4,                       # initial + 3 retries
  backoff_seconds=[30, 120, 600],       # 30s · 2min · 10min
  requires_idempotency_key=False,       # GET is naturally idempotent
)

Vendor 4xx bad-request responses map to validation_error (or permanent_connector_error for 401/403/404) — non-retryable, fail fast, route to command.failed.

Schedule

Cron 0 3 * * * UTC, single instance lock (idempotency_key blocks double-fire even if the scheduler misbehaves).

Test plan

Unit (functional core)

compute_snapshot_hash — SHA256 over latest_snapshot_id stable across whitespace / key order.
is_stale(current_hash, last_hash) — true iff bytes equal; treats None last-hash as "fresh".
Error classifier: 5xx → transient_connector_error; 400/422 → validation_error; 401/403/404 → permanent_connector_error; 429 → rate_limited.
RetryPolicy compilation: backoff list yields [30, 120, 600] in order.

Integration (Postgres + DBOS adapter)

Idempotency key collision: two scheduler fires for the same date collapse to one command row.
Effect lifecycle: planned → executing → succeeded rows inserted for the upsert.
Audit purpose split: event, audit, agent_step rows land in domain_events with the correct purpose.

Workflow (end-to-end)

Happy path: fresh snapshot → upsert succeeds → command.succeeded → warehouse row visible.
Stale snapshot: hash matches previous → no upsert → snapshot.stale_detected emitted → command.succeeded.
Transient 5xx then success: vendor returns 503 twice, then 200 → retries fire at 30s and 2m → workflow succeeds.
Bad request: vendor returns 400 → no retry → command.failed with validation_error.
Cancel mid-fetch: cancel signal during fetch → fetch completes, no upsert, command.cancelled.

Safety

Forbidden connector op: attempt to POST to vendor → connector_scope policy denies.
Retry storm guard: 5xx loop hits max_attempts → no further retries, command.failed, cost policy emits cost-cap audit row.

Boundary discipline

concord_boundary_check.py rejects any import dbos outside the runtime adapter module.

Acceptance criterion

For 30 consecutive UTC days, at 03:00 UTC ± 5 min the workflow produces either (a) a warehouse partition matching the vendor's reported daily totals within tolerance, or (b) a snapshot.stale_detected audit row with no warehouse write. Zero duplicate writes; zero unretried 5xx that should have retried; zero retries on 4xx.

Open questions / risks

Staleness false-positive risk. Hashing latest_snapshot_id alone trusts the vendor to bump that field. If the vendor occasionally bumps the id without changing the payload (or vice versa), we either skip a real update or write a no-op. Consider hashing a content-derived field too (e.g. row count + total revenue) as a second signal.
Backoff vs cron cadence. Worst case: 30s + 2m + 10m = ~12.5 min of retry. Comfortably fits inside a daily window; revisit if the workflow's downstream SLA tightens.
Vendor cursor / pagination. Spec assumes a single GET returns the full daily snapshot. If pagination is required, the fetch step decomposes into a sub-plan; hash should cover the assembled payload, not the first page.
Schedule drift / missed runs. If the runtime misses a fire (e.g. outage), do we backfill yesterday's missed date? Current design: no automatic catch-up; operator issues an ad-hoc command with the missed target_date (same idempotency_key shape).
Multi-tenant later. Single-tenant today (workspace_id: null). If multi-tenancy lands, idempotency_key must include workspace and the partition key needs a tenant dimension.
Vendor auth rotation. Out of scope here; credentials live in connector config. A rotation that breaks auth surfaces as permanent_connector_error (401/403) and pages on-call — verify the alerting wiring.
Audit retention. 90 days agreed. Confirm the warehouse partition's own retention is longer (so audit + warehouse can be cross-checked beyond 90d if needed).