Architecture Specification

Concord

A library of contracts where every action — deterministic or agentic — moves in agreement with policy, state, and audit. A durable runtime executes; Concord declares what the work means.

Version: 0.3 · functional core
Runtime: DBOS (default)
System of record: Postgres
Scope: Semantic primitive layer

Section 01 Executive summary

Concord is a library of contracts that turn any user request, system event, webhook, scheduled job, or agent action into a durable, inspectable workflow. The library declares what work means and what governance it requires; a durable runtime — DBOS by default — executes it.

Why Concord exists

Modern applications now coordinate agents, humans, tools, connectors, memory, approvals, artifacts, and durable workflows — but each piece usually has its own local model. Agent frameworks govern the agent loop. Workflow runtimes govern execution. Connector systems expose integrations. Policy engines decide rules. Observability tools record traces.

But no shared contract explains what an action means, who authorized it, what it touched, what it produced, and how it should be audited. Concord provides that missing semantic contract layer. For the full problem statement (ten concrete pains, one per chapter), see "The problems Concord solves".

Positioning

For a longer-form treatment of what Concord is and what it isn't, including a comparison across durable runtimes, agent frameworks, BPM platforms, DevOps systems, policy engines, data DAG orchestrators, observability tools, iPaaS, and memory systems, see "What Concord is & isn't".

The core idea is simple:

Everything begins as ingress. Ingress becomes a command. A command is evaluated by policy. Policy produces a plan. A plan executes through primitives. Primitives mutate durable state. Every important transition is audited.
The core loop
flowchart LR
  A([Ingress]) --> B([Command])
  B --> C{{Policy}}
  C --> D([Plan])
  D --> E[Primitives]
  E --> F[(Durable state)]
  F --> G[/Audit/]
  style C fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px
  style F fill:#FFFFFF,stroke:#141413,stroke-width:1.5px
  style G fill:#F1F2EC,stroke:#6B7B5A

The framework must support both:

Mode A

Deterministic workflows

Fixed steps, fixed state transitions, well-defined inputs and outputs.

Mode B

Agentic workflows

Dynamic tool selection, reasoning loops, external connectors, memory retrieval, human review, adaptive plans.

Framework assumes

Postgres is always the durable source of truth. Connectors are replaceable. Execution runtimes are replaceable. Agents are callers of the primitive layer, not owners of the system of record.

Section 02 Design philosophy

Concord is built around a few strong principles.

2.1 · Durable before executable

Do not execute meaningful work before recording intent.

Bad vs. good
flowchart LR
  subgraph Bad ["Bad"]
    direction LR
    a1[User request] --> a2[Execute function] --> a3[Maybe log result]
  end
  subgraph Good ["Good"]
    direction LR
    b1[User request] --> b2[Create command] --> b3[Create audit record] --> b4[Evaluate policy] --> b5[Execute]
  end
  style a1 fill:#FAF8F2,stroke:#B85556
  style a2 fill:#FAF8F2,stroke:#B85556
  style a3 fill:#FAF8F2,stroke:#B85556
  style b1 fill:#FAF8F2,stroke:#6B7B5A
  style b2 fill:#FAF8F2,stroke:#6B7B5A
  style b3 fill:#FAF8F2,stroke:#6B7B5A
  style b4 fill:#FAF8F2,stroke:#6B7B5A
  style b5 fill:#FAF8F2,stroke:#6B7B5A

This makes the system replayable, observable, cancellable, and auditable.
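The "Good" ordering can be sketched with in-memory stand-ins for the Postgres tables. `record_then_execute`, the `commands` and `audit` containers, and the `allow` policy callback are hypothetical names for illustration only; a real implementation would persist each row to Postgres before moving on.

```python
import uuid
from typing import Any, Callable

# In-memory stand-ins for the commands and audit tables (hypothetical;
# a real implementation writes these rows to Postgres).
commands: dict[str, dict[str, Any]] = {}
audit: list[dict[str, Any]] = []


def _audit(command_id: str, event: str) -> None:
    audit.append({"command_id": command_id, "event": event})


def _set_status(command_id: str, status: str) -> None:
    # Every state change is persisted and audited, never inferred from logs.
    commands[command_id]["status"] = status
    _audit(command_id, f"state.{status}")


def record_then_execute(
    command_type: str,
    payload: dict[str, Any],
    allow: Callable[[dict[str, Any]], bool],
    execute: Callable[[dict[str, Any]], Any],
) -> str:
    # 1. Record intent before doing any work.
    command_id = str(uuid.uuid4())
    commands[command_id] = {"type": command_type, "payload": payload, "status": "created"}
    # 2. Audit the request itself.
    _audit(command_id, "command.created")
    # 3. Evaluate policy against the recorded command.
    if not allow(commands[command_id]):
        _audit(command_id, "policy.denied")
        _set_status(command_id, "cancelled")
        return command_id
    _set_status(command_id, "validated")
    # 4. Execute. A failure still leaves a durable, inspectable record.
    _set_status(command_id, "running")
    try:
        commands[command_id]["result"] = execute(payload)
    except Exception as exc:
        commands[command_id]["error"] = str(exc)
        _set_status(command_id, "failed")
    else:
        _set_status(command_id, "succeeded")
    return command_id
```

Because the command row exists before `execute` runs, an exception still leaves a record that can be inspected, retried, or cancelled.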

2.2 · Commands are the center of the system

A command is the durable representation of requested intent. Examples:

generate_report
publish_report
run_sql_validation
capture_user_preference
process_webhook_event
sync_connector_data
ask_human_approval
send_notification
call_agent_tool
Distinction

The command is not the same as the execution step. The command says what is requested. Tasks say how work is executed.

2.3 · State transitions are explicit

Every meaningful workflow state change should be validated and persisted.

created → validated → waiting_for_approval → approved → queued → running → succeeded
Principle

Never infer workflow state only from logs. Logs are evidence. State is the control surface.

2.4 · Execution is replaceable

Concord does not care whether work is executed by:

local Python function
background worker
Databricks job
serverless function
container job
external API
agent tool call
workflow engine
human operator

Execution is an adapter. The core framework only cares about: input, context, command, task run, result, side effects, state transition, audit.

2.5 · Agents are participants, not authorities

Agentic workflows can call tools, propose plans, request approvals, and write memories. But the framework owns:

policy
approval gates
state transitions
artifact writes
memory writes
audit
idempotency
connector permissions
Rule

An agent can propose. The framework decides what is allowed.
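A minimal sketch of that split, assuming a hypothetical per-role tool allowlist: the agent only emits a `ToolProposal`, and a framework-side `decide` function maps it onto the policy outcomes from 4.3. `ROLE_TOOLS`, `NEEDS_APPROVAL`, and the tool names are illustrative, not part of Concord.

```python
from dataclasses import dataclass


@dataclass
class ToolProposal:
    agent_role: str
    tool_name: str
    args: dict


# Hypothetical policy data: which tools each agent role may call, and
# which tools additionally need a human approval gate.
ROLE_TOOLS = {"analyst": {"run_sql", "write_artifact", "publish_report"}}
NEEDS_APPROVAL = {"publish_report"}


def decide(proposal: ToolProposal) -> str:
    """Framework-side decision on an agent's proposal; the agent never self-authorizes."""
    allowed = ROLE_TOOLS.get(proposal.agent_role, set())
    if proposal.tool_name not in allowed:
        return "deny"
    if proposal.tool_name in NEEDS_APPROVAL:
        return "require_approval"
    return "allow"
```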

2.6 · Postgres is the system of record

All durable framework state lives in Postgres. Postgres stores:

commands
workflow state
task runs
queue items
approvals
events
memories
artifacts
connectors
audit logs
policies
plans
agent traces
idempotency keys

This keeps the architecture portable and inspectable.

2.7 · The primitive set should be small and stable

New capabilities should usually be modeled as combinations of primitives, not new primitives.

Example composition
flowchart LR
  X["'Run a weekly connector sync'
is NOT a new primitive"]
  X --> A[schedule ingress]
  A --> B[command]
  B --> C[policy]
  C --> D[async task]
  D --> E[connector call]
  E --> F[artifact / event writes]
  F --> G[audit]
  style X fill:#F5E0D2,stroke:#D97757
  style G fill:#F1F2EC,stroke:#6B7B5A

2.8 · Contracts and mechanics

Concord declares what work means; the durable runtime executes how it runs. Every operation that gets durable execution — commands, agent runs, swarms, connector calls, retries, schedules, cancellations, effects — carries a contract written in Concord and is executed by the runtime. The contract is the meaning: what the action is, what inputs it requires, what side effects it may produce, what error classes apply, what compensations exist, who may cancel it, what audit must be recorded. The mechanics are the execution: when it runs, how it retries, how it queues, how it sleeps, how it recovers.

This separation makes the runtime swappable. Concord's domain layer imports a DurableRuntime protocol (see §41), not the runtime implementation. The default adapter is DBOS; future Temporal or Restate adapters slot in without touching the domain. The contract is what survives across runtimes.

The rule

If a question is about meaning, it belongs in Concord. If it is about execution, it belongs in the runtime. When the line is unclear, the contract wins — write the meaning in Concord first, then describe how the runtime should honor it.
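One way to picture the split: a contract object that carries only meaning (inputs, side effects, error classes, compensations, who may cancel, what to audit) and says nothing about retries, queues, or timing. `OperationContract` and its fields are a hypothetical shape for illustration; the actual runtime boundary is the DurableRuntime protocol in §41.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OperationContract:
    """Meaning only: what the action is and what governance it needs (hypothetical shape)."""
    name: str
    required_inputs: tuple[str, ...]
    side_effects: tuple[str, ...] = ()
    error_classes: tuple[str, ...] = ()
    compensations: dict[str, str] = field(default_factory=dict)  # effect -> compensation op
    cancellable_by: tuple[str, ...] = ("requester",)
    audit_events: tuple[str, ...] = ("requested", "completed")

    def missing_inputs(self, payload: dict) -> list[str]:
        return [k for k in self.required_inputs if k not in payload]

    def uncompensated_effects(self) -> list[str]:
        # Side effects with no declared compensation: a flag for contract review.
        return [e for e in self.side_effects if e not in self.compensations]
```

Retry counts, queue names, and sleep durations are deliberately absent; in this split they belong to the runtime adapter's configuration.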

Section 03 The primitive model

Concord organizes work into ten primitive families.

Family 01

Ingress

How work enters the system.

Family 02

Intent

Capture what the system is being asked to do.

Family 03

Policy & planning

Decide allowed/denied and produce a plan.

Family 04

Execution

Actually perform the work.

Family 05

Coordination

Make async / distributed work safe.

Family 06

State lifecycle

Track progress with allowed transitions.

Family 07

Human judgment

Approvals, overrides, escalations.

Family 08

Knowledge & output

Memory, artifacts, lineage.

Family 09

Connectors

Adapt to outside systems.

Family 10

Observability & governance

Explain and secure the system.

Section 04 Primitive taxonomy

4.1 · Ingress primitives

Ingress is how work enters the system. Types:

user_request
api_request
external_webhook
scheduled_trigger
file_event
table_event
connector_event
agent_action
approval_callback
retry_event
system_event

Ingress should always produce exactly one of: a command, an event record, or a rejected request with an audit entry.

Caution

Ingress should not perform expensive work directly.
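A sketch of an ingress handler that always yields exactly one of the three outcomes, each with an audit entry, and does only cheap classification. `handle_ingress`, `IngressOutcome`, and the rule that a body carrying `command_type` becomes a command are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any

# Ingress types from 4.1.
INGRESS_TYPES = {
    "user_request", "api_request", "external_webhook", "scheduled_trigger",
    "file_event", "table_event", "connector_event", "agent_action",
    "approval_callback", "retry_event", "system_event",
}


@dataclass
class IngressOutcome:
    kind: str                 # "command" | "event" | "rejected"
    record: dict[str, Any]    # command row, event row, or the rejected request
    audit: dict[str, Any]     # every outcome is audited, including rejection


def handle_ingress(ingress_type: str, body: dict[str, Any]) -> IngressOutcome:
    # Only cheap classification happens here; real work is left to execution.
    if ingress_type not in INGRESS_TYPES:
        kind, record = "rejected", {"ingress_type": ingress_type, "body": body}
    elif "command_type" in body:
        kind = "command"
        record = {
            "command_type": body["command_type"],
            "ingress": ingress_type,
            "payload": body.get("payload", {}),
            "status": "created",
        }
    else:
        kind, record = "event", {"event_type": f"{ingress_type}.received", "payload": body}
    return IngressOutcome(kind, record, {"event_type": f"ingress.{kind}", "ingress_type": ingress_type})
```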

4.2 · Intent primitives

Intent primitives capture what the system is being asked to do. Core object: Command. A command contains:

command_id
command_type
requested_by
source
payload
context
status
state
idempotency_key
created_at
updated_at

The command is the root object for most downstream records.

4.3 · Policy and planning primitives

Policy decides whether a command is allowed, denied, delayed, escalated, or routed. Policy outcomes:

allow
deny
require_approval
require_more_input
route_sync
route_async
redact
escalate

Planning turns a policy-approved command into an execution plan. A plan contains:

plan_id
command_id
execution_mode
steps
required_approvals
expected_artifacts
memory_write_intents
notification_intents
created_at

4.4 · Execution primitives

Execution primitives actually do work. Types:

sync_function
async_task
background_worker
external_job
connector_call
agent_tool_call
human_task

Execution must be tracked through task runs. A task run contains:

task_run_id
command_id
task_type
function_name
status
attempt
input
result
error
claimed_by
lease_until
started_at
completed_at

4.5 · Coordination primitives

Coordination makes async and distributed work safe. Types:

queue
idempotency
deduplication
lease
lock
retry
backoff
rate_limit
dependency
fanout
join

These primitives prevent common production failures:

  • duplicate webhook processing
  • double-published reports
  • two workers processing the same task
  • infinite retry loops
  • unbounded parallelism
  • overloaded connectors
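Two of these guards in miniature: capped exponential backoff with a hard attempt limit (no infinite retry loops) and key-based deduplication (no duplicate webhook processing). The function names and default delays are assumptions; in Concord proper, retries are runtime mechanics and deduplication is backed by Postgres unique constraints (§11), not an in-process set.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Capped exponential delays between attempts: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts - 1)]


def run_with_retry(fn: Callable[[], T], max_attempts: int = 3,
                   sleep: Callable[[float], None] = time.sleep,
                   jitter: bool = True) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    delays = backoff_delays(max_attempts)
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # bounded: give up instead of looping forever
            delay = delays[attempt - 1]
            # Full jitter spreads retries so workers do not stampede a connector.
            sleep(random.uniform(0, delay) if jitter else delay)
    raise AssertionError("unreachable")


class Deduplicator:
    """Drops repeated deliveries by key, e.g. a webhook provider event id."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def first_time(self, key: str) -> bool:
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```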

4.6 · State lifecycle primitives

Lifecycle primitives track progress. Canonical states:

created
validated
waiting_for_input
waiting_for_approval
approved
queued
running
blocked
succeeded
failed
cancelled
expired
compensating
compensated

State transitions should be checked against an allowed transition table.

4.7 · Human judgment primitives

These represent human decisions. Types:

approval_request
approval_decision
human_input_request
review_packet
override
escalation
Rule

Human approval should never be only a message in Slack or email. It should be a durable record in Postgres.

4.8 · Knowledge and output primitives

These represent durable outputs or reusable knowledge. Types:

memory
artifact
version
lineage
retrieval
summary
embedding

Memory shapes future behavior; artifacts are the outputs of work. Examples: a generated report, exported CSV, SQL query result, dashboard draft, connector sync snapshot, agent answer, or approval packet is an artifact; a captured user preference is a memory.

4.9 · Connector primitives

Connectors adapt Concord to outside systems. A connector can represent:

Databricks
Slack
Gmail
Google Drive
GitHub
Salesforce
Snowflake
S3
internal APIs
LLM provider
vector database
notification provider

A connector is not the workflow owner. It is a capability provider. Connector calls should be represented as task runs or tool calls with durable status.

4.10 · Observability and governance primitives

These explain and secure the system. Types:

audit_event
trace
metric
policy_decision
cost_record
permission_check
data_safety_check
secret_access_log

The audit log should answer: who requested, what was requested, what policy decision was made, what executed, what changed, what was produced, who approved, what connector was called, what failed.
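A sketch of how those answers can be captured as `domain_events` rows with `purpose = 'audit'` (columns as in the §9 schema) and read back as a per-command timeline. `audit_event`, `timeline`, and the event-type names are illustrative assumptions.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any


def audit_event(command_id: str, actor: str, trace_id: str, event_type: str, **details: Any) -> dict[str, Any]:
    """Build one domain_events row with purpose='audit'."""
    return {
        "event_id": str(uuid.uuid4()),
        "command_id": command_id,
        "purpose": "audit",
        "event_type": event_type,          # e.g. 'policy.decided', 'connector.called'
        "payload": json.dumps(details, sort_keys=True),
        "actor": actor,                    # who requested, approved, or executed
        "trace_id": trace_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def timeline(rows: list[dict[str, Any]], command_id: str) -> list[tuple[str, str]]:
    # Ordered by created_at; sorted() is stable, so ties keep insertion order.
    audit_rows = [r for r in rows if r["command_id"] == command_id and r["purpose"] == "audit"]
    return [(r["event_type"], r["actor"]) for r in sorted(audit_rows, key=lambda r: r["created_at"])]
```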

Section 05 Canonical workflow shape

Every workflow follows this shape:

Canonical pipeline
flowchart TB
  I([Ingress]) --> C([Command])
  C --> X[Context]
  X --> P{{Policy}}
  P --> PL([Plan])
  PL --> E[Execution]
  E --> S[State transition]
  S --> SE[Side effects]
  SE --> O[Output]
  O --> M[(Memory)]
  M --> AU[/Audit/]
  style P fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px
  style M fill:#F1F2EC,stroke:#6B7B5A
  style AU fill:#FFFFFF,stroke:#141413,stroke-width:1.5px

Optional branches:

Branching decisions
flowchart LR
  Q{"Needs immediate answer?"} -->|yes| SYNC[sync function]
  Q -->|no| LONG{"Long-running?"}
  LONG -->|yes| ASYNC["queue → async task → worker"]
  LONG -->|no| EXT{"External system?"}
  EXT -->|yes| CC[connector call]
  EXT -->|no| HJ{"Human judgment?"}
  HJ -->|yes| AP["approval request → decision → resume"]
  HJ -->|no| FR{"Future recall?"}
  FR -->|yes| MEM[memory write]
  FR -->|no| DR{"Durable result?"}
  DR -->|yes| ART[artifact write]
  DR -->|no| NOT{"Notify?"}
  NOT -->|yes| NOTIF[notification]
  NOT -->|no| FAIL{"Fails after side effects?"}
  FAIL -->|yes| COMP[compensation]
  FAIL -->|no| STOP{"Needs to stop?"}
  STOP -->|yes| CANCEL[cancellation]
  style Q fill:#F5E0D2,stroke:#D97757
  style LONG fill:#F5E0D2,stroke:#D97757
  style EXT fill:#F5E0D2,stroke:#D97757
  style HJ fill:#F5E0D2,stroke:#D97757
  style FR fill:#F5E0D2,stroke:#D97757
  style DR fill:#F5E0D2,stroke:#D97757
  style NOT fill:#F5E0D2,stroke:#D97757
  style FAIL fill:#F5E0D2,stroke:#D97757
  style STOP fill:#F5E0D2,stroke:#D97757
Section 06 High-level architecture
System overview
flowchart TB
  subgraph SOURCES ["Sources"]
    direction LR
    U[User] & A[Agent] & W[Webhook] & SC[Scheduler] & CL[Client]
  end
  SOURCES --> IA[Ingress Adapter]
  IA --> CS[Command Service]
  CS --> PE{{Policy Engine}}
  PE --> PL[Planner]
  PL --> SE[Sync Executor]
  PL --> AQ[Async Queue]
  SE --> FC[Function / Connector]
  AQ --> WK[Worker]
  FC --> SM[State Manager]
  WK --> SM
  SM --> PG[("Postgres
system of record")]
  PG --> AAM[Audit · Artifacts · Memory · Notifications]
  style PG fill:#FFFFFF,stroke:#141413,stroke-width:2px
  style PE fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px
  style AAM fill:#F1F2EC,stroke:#6B7B5A
Section 07 Layered architecture
Layers (top-down)
flowchart TB
  API["API layer
receive · authenticate · normalize"]
  CMD["Command layer
create · validate · idempotency"]
  POL["Policy layer
permissions · risk · approval"]
  PLN["Planning layer
sync/async · gates · steps"]
  EXE["Execution layer
run · retry · persist"]
  STA["State layer
enforce transitions"]
  PER["Persistence layer
Postgres tables · tx · locks"]
  CON["Connector layer
normalize external APIs"]
  AGT["Agent layer
read context · propose · summarize"]
  API --> CMD --> POL --> PLN --> EXE --> STA --> PER
  CON -.-> EXE
  AGT -.-> CMD
  style POL fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px
  style PER fill:#FFFFFF,stroke:#141413,stroke-width:1.5px
  style CON fill:#F1F2EC,stroke:#6B7B5A
  style AGT fill:#EFE6F0,stroke:#7A5560

7.1 · API layer

Receive requests, authenticate users, normalize ingress, create commands, return immediate response. Should not run expensive jobs, perform long syncs, write memory without policy, or skip command creation.

7.2 · Command layer

Create command, validate required inputs, apply idempotency, attach context, store payload, emit audit event.

7.3 · Policy layer

Check permissions, cost, risk, data safety, approval requirements, memory consent; decide route. Declarative where possible and executable where necessary.

7.4 · Planning layer

Choose sync vs async, insert approval gates, connector steps, memory writes, artifact writes, notification steps; create execution plan. For deterministic workflows, plans can be static. For agentic workflows, plans can be proposed dynamically and validated by policy.

7.5 · Execution layer

Execute function, run task, call connector, invoke agent, call external job, persist result, handle retries. Should be side-effect-aware.

7.6 · State layer

Enforce valid transitions, persist state, prevent illegal transitions, record state history.

7.7 · Persistence layer

Postgres tables, transactions, row-level locking, idempotency constraints, query APIs, retention, archival.

7.8 · Connector layer

Normalize external APIs, handle auth, handle rate limits, return structured results, avoid leaking provider-specific details upward.

7.9 · Agent layer

Read context, retrieve memory, propose action, call permitted tools, request approval if needed, summarize result.

Rule

Agent layer should never bypass policy.

Section 08 Postgres-first persistence design

Postgres is the durable system of record. Use Postgres for:

commands
events
task queue
task runs
approvals
memory
artifacts
audit
connector registrations
policy decisions
execution plans
Guidance

Use object storage or external systems only for large blobs. Store references in Postgres.
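A sketch of the reference that Postgres would keep for a large artifact, assuming an S3-style object store. `ArtifactRef`, `make_ref`, and the key layout are hypothetical; the hash lets a reader verify the fetched blob against what was recorded.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactRef:
    """What Postgres stores for a large artifact: a pointer plus integrity metadata."""
    uri: str           # location in object storage (hypothetical key layout)
    sha256: str        # content hash of the blob
    size_bytes: int
    content_type: str


def make_ref(bucket: str, command_id: str, name: str, data: bytes, content_type: str) -> ArtifactRef:
    digest = hashlib.sha256(data).hexdigest()
    # The key includes the hash, so changed bytes never overwrite an earlier version.
    return ArtifactRef(f"s3://{bucket}/{command_id}/{digest}/{name}", digest, len(data), content_type)
```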

Section 09 Core Postgres schema

The schema is compact, append-only where it can be, and extensible through JSONB columns. Each command links to a durable runtime workflow via dbos_workflow_id: runtime execution status lives in the runtime adapter, while business status lives here.

Entity relationships
erDiagram
  commands ||--o{ domain_events : emits
  commands ||--o{ domain_effects : plans
  commands ||--o{ approvals : gates
  commands ||--o{ artifacts : produces
  commands ||--o{ memory_records : writes
  commands ||--o{ command_dependencies : has_parents
  commands ||--o{ agent_runs : runs
  agent_runs ||--o{ swarm_runs : participates_in
  agent_runs ||--o{ domain_events : records_steps
  connectors ||--o{ tools : exposes
  commands {
    UUID command_id PK
    TEXT command_type
    TEXT status
    TEXT cancellation_mode
    JSONB payload
    JSONB context
    JSONB result
    TEXT idempotency_key
    TEXT dbos_workflow_id
  }
  command_dependencies {
    UUID child_command_id FK
    UUID parent_command_id FK
    TEXT required_status
    BOOLEAN propagate_cancellation
  }
  domain_effects {
    UUID domain_effect_id PK
    UUID command_id FK
    TEXT effect_type
    TEXT idempotency_key
    TEXT status
    TEXT executed_by_dbos_workflow_id
  }
  domain_events {
    UUID event_id PK
    UUID command_id FK
    UUID agent_run_id FK
    UUID swarm_run_id FK
    TEXT purpose
    TEXT event_type
    JSONB payload
    INT step_index
    NUMERIC cost_units
  }
  approvals {
    UUID approval_id PK
    UUID command_id FK
    TEXT status
    TIMESTAMPTZ expires_at
  }
  memory_records {
    UUID memory_id PK
    TEXT subject_type
    TEXT subject_id
    TEXT memory_type
    DOUBLE confidence
  }
  artifacts {
    UUID artifact_id PK
    UUID command_id FK
    TEXT artifact_type
    INT version
  }
  agent_runs {
    UUID agent_run_id PK
    UUID command_id FK
    TEXT agent_role
    TEXT status
    INT spawn_depth
  }
  swarm_runs {
    UUID swarm_run_id PK
    UUID command_id FK
    TEXT status
    TEXT join_strategy
  }

Schema

-- Commands. Carries domain status; runtime status lives in DBOS.
CREATE TABLE IF NOT EXISTS commands (
  command_id UUID PRIMARY KEY,
  command_type TEXT NOT NULL,
  requested_by TEXT NOT NULL,
  ingress TEXT NOT NULL,

  payload JSONB NOT NULL DEFAULT '{}',
  context JSONB NOT NULL DEFAULT '{}',

  status TEXT NOT NULL,
  cancellation_mode TEXT NOT NULL DEFAULT 'graceful',  -- 'graceful' | 'compensate_then_stop'
  idempotency_key TEXT NULL UNIQUE,
  dbos_workflow_id TEXT NULL UNIQUE,

  result JSONB NULL,
  error TEXT NULL,

  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at TIMESTAMPTZ NULL
);
-- Fan-in dependencies. A child waits for all parents to reach required_status.
CREATE TABLE IF NOT EXISTS command_dependencies (
  child_command_id UUID NOT NULL REFERENCES commands(command_id),
  parent_command_id UUID NOT NULL REFERENCES commands(command_id),
  required_status TEXT NOT NULL DEFAULT 'succeeded',
  propagate_cancellation BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (child_command_id, parent_command_id)
);

CREATE INDEX IF NOT EXISTS idx_command_deps_parent ON command_dependencies(parent_command_id);
-- Domain effects. Planned-upfront, transitioned through by DBOS steps.
CREATE TABLE IF NOT EXISTS domain_effects (
  domain_effect_id UUID PRIMARY KEY,
  command_id UUID NOT NULL REFERENCES commands(command_id),

  effect_type TEXT NOT NULL,
  effect_payload JSONB NOT NULL DEFAULT '{}',
  idempotency_key TEXT NULL,

  status TEXT NOT NULL DEFAULT 'planned',   -- 'planned' | 'executing' | 'succeeded' | 'failed'
  executed_by_dbos_workflow_id TEXT NULL,
  result JSONB NULL,
  error TEXT NULL,

  -- Compensation tracking
  compensates_effect_id UUID NULL REFERENCES domain_effects(domain_effect_id),
  declared_compensation TEXT NULL,

  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at TIMESTAMPTZ NULL
);

CREATE INDEX IF NOT EXISTS idx_domain_effects_command ON domain_effects(command_id);
CREATE INDEX IF NOT EXISTS idx_domain_effects_status  ON domain_effects(status);
CREATE INDEX IF NOT EXISTS idx_domain_effects_idem    ON domain_effects(idempotency_key) WHERE idempotency_key IS NOT NULL;
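-- Agent and swarm runs, referenced by domain_events below, so they must be
-- created first. Columns follow the entity diagram; constraints are a minimal assumption.
CREATE TABLE IF NOT EXISTS agent_runs (
  agent_run_id UUID PRIMARY KEY,
  command_id UUID NOT NULL REFERENCES commands(command_id),
  agent_role TEXT NOT NULL,
  status TEXT NOT NULL,
  spawn_depth INT NOT NULL
);

CREATE TABLE IF NOT EXISTS swarm_runs (
  swarm_run_id UUID PRIMARY KEY,
  command_id UUID NOT NULL REFERENCES commands(command_id),
  status TEXT NOT NULL,
  join_strategy TEXT NOT NULL
);
-- Unified domain events. Replaces command_events, domain_audit_log, agent_steps.
-- Disambiguated by purpose.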
CREATE TABLE IF NOT EXISTS domain_events (
  event_id UUID PRIMARY KEY,
  command_id UUID REFERENCES commands(command_id),
  agent_run_id UUID NULL REFERENCES agent_runs(agent_run_id),
  swarm_run_id UUID NULL REFERENCES swarm_runs(swarm_run_id),

  purpose TEXT NOT NULL,        -- 'event' | 'audit' | 'agent_step'
  event_type TEXT NOT NULL,
  payload JSONB NOT NULL DEFAULT '{}',

  actor TEXT NOT NULL,
  trace_id TEXT NOT NULL,

  -- Agent-step extensions; null for non-agent rows
  step_index INT NULL,
  tool_name TEXT NULL,
  latency_ms INT NULL,
  token_usage JSONB NULL,
  cost_units NUMERIC NULL,

  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX IF NOT EXISTS idx_domain_events_command  ON domain_events(command_id);
CREATE INDEX IF NOT EXISTS idx_domain_events_purpose  ON domain_events(purpose);
CREATE INDEX IF NOT EXISTS idx_domain_events_agent    ON domain_events(agent_run_id) WHERE agent_run_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_domain_events_audit    ON domain_events(command_id, created_at) WHERE purpose = 'audit';
Section 10 Queue model

Queue execution is a runtime concern. Concord declares semantic queue names; the durable runtime adapter handles the mechanics — claim, lease, concurrency, retry, and rate limiting.

The standard queue names a Concord catalog uses:

connector_calls
agent_runs
swarm_children
notifications
scheduled_maintenance
high_risk_operations
primitiveflow_default

Queue choice is a planning outcome. Effect type maps to queue:

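def choose_queue(effect: CoreEffect, context: dict) -> str:
    # CoreEffect is defined in §13.5. Routing here uses only the effect-type
    # prefix; `context` is accepted but not consulted by this mapping.
    if effect.effect_type.startswith("connector."):
        return "connector_calls"
    if effect.effect_type.startswith("agent.run"):
        return "agent_runs"
    if effect.effect_type.startswith("agent.spawn"):
        return "swarm_children"
    if effect.effect_type.startswith("notification."):
        return "notifications"
    # Everything else falls through to the default queue.
    return "primitiveflow_default"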

Queue registration, concurrency limits, partitioning, and per-queue rate limits are runtime configuration. The runtime adapter (default DBOSDurableRuntime) translates the queue name into its native primitive — a DBOS queue for the default adapter; a Temporal task queue or another broker for alternates.

No queue table here

Concord does not own a task_queue or worker_claims table. Runtime claim, lease, and worker assignment belong to the adapter and live in its tables, not Concord's. The domain projection of what work has been requested and what happened to it lives in domain_effects and domain_events.

Section 11 Idempotency model

Every externally triggered or user-submitted command should support idempotency. Examples:

webhook provider event id
user action id
file path + file version
scheduled job time window
approval decision id
external API request id
Rule

If idempotency_key exists, return the existing command instead of creating another one.

Postgres enforces this with:

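-- Same guarantee as the column-level UNIQUE on commands.idempotency_key (§9):
-- Postgres unique constraints already allow any number of NULL keys.
CREATE UNIQUE INDEX IF NOT EXISTS idx_commands_idempotency_key
ON commands (idempotency_key)
WHERE idempotency_key IS NOT NULL;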
Section 12 State machine

Concord enforces a domain state machine over the commands table. Runtime execution status (whether the underlying workflow is alive, recovering, or errored) lives in the runtime adapter and is joined to the command via commands.dbos_workflow_id. The two evolve independently: a runtime workflow can be "recovering" while the command is "running", and a runtime workflow can complete normally while the command sits in waiting_for_approval. The domain machine answers "what does this command mean right now?"; the runtime answers "is execution alive?"

The legal domain transitions are below. compensating and compensated are reachable via the compensate_then_stop cancellation mode (see §30).

Workflow state machine
stateDiagram-v2
  [*] --> created
  created --> validated
  created --> failed
  created --> cancelled
  validated --> waiting_for_input
  validated --> waiting_for_approval
  validated --> queued
  validated --> running
  validated --> failed
  validated --> cancelled
  waiting_for_input --> validated
  waiting_for_input --> cancelled
  waiting_for_input --> expired
  waiting_for_approval --> approved
  waiting_for_approval --> cancelled
  waiting_for_approval --> expired
  waiting_for_approval --> failed
  approved --> queued
  approved --> running
  approved --> cancelled
  queued --> running
  queued --> cancelled
  queued --> failed
  running --> succeeded
  running --> failed
  running --> cancelled
  running --> blocked
  blocked --> queued
  blocked --> running
  blocked --> failed
  blocked --> cancelled
  failed --> queued
  failed --> compensating
  failed --> cancelled
  compensating --> compensated
  compensating --> failed
  succeeded --> [*]
  cancelled --> [*]
  expired --> [*]
  compensated --> [*]

Python standard:

from enum import Enum


class WorkflowState(str, Enum):
    CREATED = "created"
    VALIDATED = "validated"
    WAITING_FOR_INPUT = "waiting_for_input"
    WAITING_FOR_APPROVAL = "waiting_for_approval"
    APPROVED = "approved"
    QUEUED = "queued"
    RUNNING = "running"
    BLOCKED = "blocked"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"
    EXPIRED = "expired"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"


ALLOWED_TRANSITIONS = {
    WorkflowState.CREATED: {
        WorkflowState.VALIDATED,
        WorkflowState.FAILED,
        WorkflowState.CANCELLED,
    },
    WorkflowState.VALIDATED: {
        WorkflowState.WAITING_FOR_INPUT,
        WorkflowState.WAITING_FOR_APPROVAL,
        WorkflowState.QUEUED,
        WorkflowState.RUNNING,
        WorkflowState.FAILED,
        WorkflowState.CANCELLED,
    },
    WorkflowState.WAITING_FOR_INPUT: {
        WorkflowState.VALIDATED,
        WorkflowState.CANCELLED,
        WorkflowState.EXPIRED,
    },
    WorkflowState.WAITING_FOR_APPROVAL: {
        WorkflowState.APPROVED,
        WorkflowState.CANCELLED,
        WorkflowState.EXPIRED,
        WorkflowState.FAILED,
    },
    WorkflowState.APPROVED: {
        WorkflowState.QUEUED,
        WorkflowState.RUNNING,
        WorkflowState.CANCELLED,
    },
    WorkflowState.QUEUED: {
        WorkflowState.RUNNING,
        WorkflowState.CANCELLED,
        WorkflowState.FAILED,
    },
    WorkflowState.RUNNING: {
        WorkflowState.SUCCEEDED,
        WorkflowState.FAILED,
        WorkflowState.CANCELLED,
        WorkflowState.BLOCKED,
    },
    WorkflowState.BLOCKED: {
        WorkflowState.QUEUED,
        WorkflowState.RUNNING,
        WorkflowState.FAILED,
        WorkflowState.CANCELLED,
    },
    WorkflowState.FAILED: {
        WorkflowState.QUEUED,
        WorkflowState.COMPENSATING,
        WorkflowState.CANCELLED,
    },
    WorkflowState.COMPENSATING: {
        WorkflowState.COMPENSATED,
        WorkflowState.FAILED,
    },
    WorkflowState.SUCCEEDED: set(),
    WorkflowState.CANCELLED: set(),
    WorkflowState.EXPIRED: set(),
    WorkflowState.COMPENSATED: set(),
}


def can_transition(from_state: WorkflowState, to_state: WorkflowState) -> bool:
    return to_state in ALLOWED_TRANSITIONS.get(from_state, set())
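A sketch of the state layer's job (enforce, persist, record history) built on the same edges, restated with plain strings so the block runs on its own. `CommandState` and `IllegalTransition` are hypothetical names; against Postgres, one common way to make the check race-safe is a conditional `UPDATE ... WHERE status = <expected current state>`.

```python
from dataclasses import dataclass, field

# Same edges as ALLOWED_TRANSITIONS above, as plain strings.
ALLOWED: dict[str, set[str]] = {
    "created": {"validated", "failed", "cancelled"},
    "validated": {"waiting_for_input", "waiting_for_approval", "queued", "running", "failed", "cancelled"},
    "waiting_for_input": {"validated", "cancelled", "expired"},
    "waiting_for_approval": {"approved", "cancelled", "expired", "failed"},
    "approved": {"queued", "running", "cancelled"},
    "queued": {"running", "cancelled", "failed"},
    "running": {"succeeded", "failed", "cancelled", "blocked"},
    "blocked": {"queued", "running", "failed", "cancelled"},
    "failed": {"queued", "compensating", "cancelled"},
    "compensating": {"compensated", "failed"},
    "succeeded": set(), "cancelled": set(), "expired": set(), "compensated": set(),
}


class IllegalTransition(ValueError):
    pass


@dataclass
class CommandState:
    command_id: str
    state: str = "created"
    history: list[tuple[str, str, str]] = field(default_factory=list)  # (from, to, actor)

    def transition(self, to_state: str, actor: str) -> None:
        # Reject before mutating, so an illegal request leaves state and history untouched.
        if to_state not in ALLOWED.get(self.state, set()):
            raise IllegalTransition(f"{self.command_id}: {self.state} -> {to_state}")
        self.history.append((self.state, to_state, actor))
        self.state = to_state
```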
Section 13 Logical domain objects

13.1 · Command

A command is durable intent.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any
from uuid import uuid4


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def new_id(prefix: str) -> str:
    return f"{prefix}_{uuid4().hex}"


@dataclass
class Context:
    user_id: str
    workspace_id: str | None = None
    app_id: str | None = None
    run_as: str = "user"
    trace_id: str = field(default_factory=lambda: new_id("trace"))
    request_id: str = field(default_factory=lambda: new_id("req"))
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Command:
    command_id: str
    command_type: str
    task_name: str
    requested_by: str
    ingress_type: str
    payload: dict[str, Any]
    context: Context
    state: str = "created"
    status: str = "created"
    idempotency_key: str | None = None
    plan: dict[str, Any] | None = None
    result: dict[str, Any] | None = None
    error: str | None = None
    created_at: str = field(default_factory=now_iso)
    updated_at: str = field(default_factory=now_iso)

13.2 · TaskSpec

A task spec describes how a task maps to primitives.

from dataclasses import dataclass, field
from typing import Any


@dataclass
class TaskSpec:
    name: str
    description: str
    command_type: str
    function_name: str

    ingress_types: list[str] = field(default_factory=list)
    required_inputs: list[str] = field(default_factory=list)

    sync_allowed: bool = False
    async_required: bool = False
    approval_required: bool = False

    memory_write_possible: bool = False
    artifact_output_possible: bool = False
    notification_required: bool = False

    policy_checks: list[str] = field(default_factory=list)
    connector_requirements: list[str] = field(default_factory=list)

    risk_level: str = "low"
    max_attempts: int = 3
    metadata: dict[str, Any] = field(default_factory=dict)

13.3 · ExecutionPlan

A plan is a validated path from intent to execution.

from dataclasses import dataclass, field


@dataclass
class PlanStep:
    step_id: str
    primitive: str
    function_name: str
    depends_on: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class ExecutionPlan:
    plan_id: str
    command_id: str
    execution_mode: str
    primitives: list[str]
    steps: list[PlanStep]

13.4 · CommandDependency

A command may declare zero or more parent commands it depends on. Resolution waits until every parent reaches its required_status (typically succeeded). The runtime adapter handles the waiting via a listener over command-completion events.

from dataclasses import dataclass


@dataclass(frozen=True)
class CommandDependency:
    child_command_id: str
    parent_command_id: str
    required_status: str = "succeeded"
    propagate_cancellation: bool = True

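Cycles are rejected at insert time: before storing the edge (child, parent), a BFS over command_dependencies walks upward from the proposed parent through its ancestors, and the insert fails if the child appears among them. Cancellation cascades when propagate_cancellation = True and the required parent status becomes unreachable.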
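An in-memory sketch of both rules, where `deps` maps each child command to its `(parent_id, required_status)` pairs as stored in `command_dependencies`. The function names are hypothetical; in Postgres the same upward walk can be expressed as a recursive CTE.

```python
from collections import deque

# deps[child] = [(parent_id, required_status), ...]
Deps = dict[str, list[tuple[str, str]]]


def is_ready(child: str, deps: Deps, status: dict[str, str]) -> bool:
    """Fan-in: the child may start only when every parent has its required_status."""
    return all(status.get(p) == req for p, req in deps.get(child, []))


def would_create_cycle(child: str, parent: str, deps: Deps) -> bool:
    """BFS upward from the proposed parent; reaching the child means a cycle."""
    if child == parent:
        return True
    seen, queue = {parent}, deque([parent])
    while queue:
        node = queue.popleft()
        for ancestor, _ in deps.get(node, []):
            if ancestor == child:
                return True
            if ancestor not in seen:
                seen.add(ancestor)
                queue.append(ancestor)
    return False
```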

13.5 · CoreEffect

An effect is a side-effect plan — a description of what the workflow intends to do to the outside world (or to durable storage outside the workflow's own state). The functional core returns effects as part of CoreResult; they are persisted upfront as domain_effects rows in status planned and transitioned through executing → succeeded | failed as the runtime fires them.

from dataclasses import dataclass
from enum import StrEnum  # Python 3.11+
from typing import Any


class EffectStatus(StrEnum):
    PLANNED = "planned"
    EXECUTING = "executing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class CoreEffect:
    effect_type: str
    payload: dict[str, Any]
    idempotency_key: str | None = None
    declared_compensation: str | None = None    # name of the compensation operation, if any
    counter_effects: bool = False               # True if this effect only undoes a prior effect

The idempotency_key is the source of truth for effect-level idempotency, separate from command-level. External APIs that support idempotency keys consume this value.
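A sketch of one effect moving through planned → executing → succeeded | failed, short-circuiting when its `idempotency_key` has already succeeded. `fire_effect` is hypothetical, rows are plain dicts keyed like `domain_effects` columns, and `completed` stands in for a lookup of succeeded rows by idempotency key.

```python
from typing import Any, Callable


def fire_effect(effect: dict[str, Any], send: Callable[[dict[str, Any]], Any],
                completed: dict[str, Any]) -> dict[str, Any]:
    """Move one planned domain_effects row through executing -> succeeded | failed."""
    key = effect.get("idempotency_key")
    if key is not None and key in completed:
        # Already fired under this key (e.g. on a replay): reuse the recorded result.
        effect.update(status="succeeded", result=completed[key])
        return effect
    effect["status"] = "executing"
    try:
        result = send(effect["effect_payload"])
    except Exception as exc:
        effect.update(status="failed", error=str(exc))
        return effect
    effect.update(status="succeeded", result=result)
    if key is not None:
        completed[key] = result
    return effect
```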

Section 14 Primitive mapping standard

Every task should be mappable to primitives.

def map_task_to_primitives(task: TaskSpec) -> list[str]:
    primitives = [
        "ingress",
        "command",
        "context",
        "policy",
        "plan",
    ]

    if task.approval_required:
        primitives.append("human_approval")

    # Sync only when the task allows it and does not require async;
    # everything else defaults to queue + async task.
    if task.sync_allowed and not task.async_required:
        primitives.append("sync_function")
    else:
        primitives.extend(["queue", "async_task"])

    if task.connector_requirements:
        primitives.append("connector_call")

    if task.artifact_output_possible:
        primitives.append("artifact_write")

    if task.memory_write_possible:
        primitives.append("memory_write")

    if task.notification_required:
        primitives.append("notification")

    primitives.extend(["state_transition", "audit"])

    return primitives

Example:

generate_report = TaskSpec(
    name="Generate Report",
    description="Generate a report asynchronously.",
    command_type="generate_report",
    function_name="generate_report",
    ingress_types=["user_request", "scheduled_trigger"],
    required_inputs=["report_type", "date_range"],
    async_required=True,
    artifact_output_possible=True,
    notification_required=True,
    policy_checks=["permission", "cost", "data_access"],
    risk_level="medium",
)

map_task_to_primitives(generate_report)

Expected output:

[
  "ingress",
  "command",
  "context",
  "policy",
  "plan",
  "queue",
  "async_task",
  "artifact_write",
  "notification",
  "state_transition",
  "audit"
]
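map_task_to_primitives reads a TaskSpec that this section does not define. Below is a hypothetical minimal definition with field names taken from the examples. The defaults are assumptions: false or empty for the flags and lists, and "low" for risk_level. They are chosen so that the generate_report example above yields exactly the expected output, with no human_approval, connector_call, or memory_write.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    # Identity and routing.
    name: str
    description: str
    command_type: str
    function_name: str
    ingress_types: list[str] = field(default_factory=list)
    required_inputs: list[str] = field(default_factory=list)
    # Execution-mode flags read by map_task_to_primitives.
    async_required: bool = False
    sync_allowed: bool = False
    approval_required: bool = False
    # Optional capabilities; any left at its default adds no primitive.
    connector_requirements: list[str] = field(default_factory=list)
    artifact_output_possible: bool = False
    memory_write_possible: bool = False
    notification_required: bool = False
    # Governance metadata.
    policy_checks: list[str] = field(default_factory=list)
    risk_level: str = "low"
```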
Section 15 Deterministic workflows

A deterministic workflow has a known plan before execution.

Generate report · deterministic
flowchart LR
  A([Generate report]) --> B[validate input]
  B --> C[run report job]
  C --> D[create artifact]
  D --> E[notify user]
  style A fill:#F5E0D2,stroke:#D97757
  style E fill:#F1F2EC,stroke:#6B7B5A

Task spec:

generate_report_task = TaskSpec(
    name="Generate Report",
    description="Generate a report from a known report template.",
    command_type="generate_report",
    function_name="generate_report",
    ingress_types=["user_request", "scheduled_trigger"],
    required_inputs=["report_type", "date_range"],
    async_required=True,
    approval_required=False,
    artifact_output_possible=True,
    notification_required=True,
    policy_checks=["permission", "cost", "data_access"],
    risk_level="medium",
)

Lifecycle:

created → validated → queued → running → succeeded
Note

The plan is created by rules, not by an agent.

Section 16 Agentic workflows

An agentic workflow can use dynamic reasoning, but it must still operate inside the primitive framework.

Agent loop
flowchart TB
  UR([User request]) --> CMD([Command])
  CMD --> RM[Retrieve memory / context]
  RM --> PROP[Agent proposes plan / tool call]
  PROP --> POL{{Policy validates}}
  POL --> EXEC[Execute approved tool]
  EXEC --> OBS[Observe result]
  OBS --> DEC{Agent decides next step}
  DEC -->|continue| PROP
  DEC -->|finish| OUT[Artifact · memory · audit]
  style POL fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px
  style DEC fill:#F5E0D2,stroke:#D97757
  style OUT fill:#F1F2EC,stroke:#6B7B5A
Important

The agent proposes actions. The framework authorizes and records actions.

Agent tool calls should become commands or child task runs. For example, an agent that wants to send an email goes through the gateway, policy, and approval before the connector is ever called:

Agent · external email
sequenceDiagram
  autonumber
  participant A as Agent
  participant G as Primitive Gateway
  participant P as Policy
  participant H as Approver
  participant C as Email Connector
  A->>G: propose send_email
  G->>G: create command
  G->>P: evaluate policy
  P-->>G: require_approval (external)
  G->>H: approval request
  H-->>G: approved
  G->>C: send email
  C-->>G: ok
  G-->>A: result + audit
Section 17 Agent action protocol

Agents should not call connectors directly. They should call the primitive gateway.

Recommended protocol (request):

{
  "action_type": "tool_call",
  "tool_name": "send_notification",
  "payload": {
    "channel": "email",
    "recipient": "finance@example.com",
    "message": "The report is ready."
  },
  "reason": "The user asked me to notify the finance team.",
  "risk_level": "medium"
}

The framework responds:

{
  "decision": "require_approval",
  "command_id": "cmd_123",
  "approval_id": "appr_456",
  "message": "Human approval required before sending external email."
}
Effect

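The agent never calls a connector itself. Every proposed action becomes a command that policy evaluates and audit records before anything executes, which keeps agents safe and auditable.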

Section 18 Connectors

A connector exposes capabilities to the framework. Examples:

postgres
databricks
github
slack
gmail
google_drive
s3
salesforce
openai
anthropic
internal_http

A connector should declare: connector_id, connector_type, name, capabilities, auth mode, rate limits, risk level, input schemas, output schemas.

Separation

A connector function should not decide policy. It should execute after policy permits it.

Section 19 Connector interface standard
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class ConnectorCall:
    connector_name: str
    capability: str
    input: dict[str, Any]
    context: dict[str, Any]
    command_id: str | None = None


@dataclass
class ConnectorResult:
    ok: bool
    output: dict[str, Any]
    error: str | None = None
    metadata: dict[str, Any] | None = None


class Connector(ABC):
    name: str
    connector_type: str

    @abstractmethod
    def capabilities(self) -> list[str]:
        pass

    @abstractmethod
    def call(self, request: ConnectorCall) -> ConnectorResult:
        pass

Example connector:

class NotificationConnector(Connector):
    name = "notification"
    connector_type = "notification"

    def capabilities(self) -> list[str]:
        return ["send_app_notification", "send_email", "send_slack"]

    def call(self, request: ConnectorCall) -> ConnectorResult:
        # Only the in-app channel is shown; send_email and send_slack are omitted for brevity.
        if request.capability == "send_app_notification":
            return ConnectorResult(
                ok=True,
                output={
                    "sent": True,
                    "channel": "app",
                    "recipient": request.input["recipient"],
                },
            )

        return ConnectorResult(
            ok=False,
            output={},
            error=f"Unsupported capability: {request.capability}",
        )
Section 20 Tool registry

A tool is a capability exposed to deterministic flows or agents. Tool metadata: name, description, input_schema, output_schema, execution_mode, risk_level, requires_approval, connector, function_name.

Examples:

validate_sql
run_query
generate_report
publish_report
retrieve_memory
write_memory
send_notification
sync_github_issue
create_google_doc
start_databricks_job
Principle

Tools are not just Python functions. They are governed capabilities.

Section 21 Policy framework

Policies should be composable. Categories:

permission
cost
data_safety
external_sharing
destructive_action
memory_consent
connector_scope
agent_risk
rate_limit
approval_requirement

Policy function standard:

from dataclasses import dataclass, field
from typing import Any


@dataclass
class PolicyResult:
    decision: str
    reasons: list[str] = field(default_factory=list)
    required_approvals: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


def external_sharing_policy(command: Command, task: TaskSpec, context: Context) -> PolicyResult:
    destination = command.payload.get("destination", "")

    if "@" in destination or destination.startswith("external:"):
        return PolicyResult(
            decision="require_approval",
            reasons=["External sharing requires approval."],
            required_approvals=["data_owner"],
        )

    return PolicyResult(decision="allow")
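Composing policies needs a rule for folding several results into one. The sketch below assumes three things. First, the most restrictive decision wins: deny over require_approval over allow. Second, reasons accumulate. Third, required approvers are merged without duplicates. The "deny" decision string and the `combine` helper are assumptions, not names defined above.

```python
from dataclasses import dataclass, field
from typing import Any

# Assumed precedence: higher rank is more restrictive and wins.
DECISION_RANK = {"allow": 0, "require_approval": 1, "deny": 2}


@dataclass
class PolicyResult:
    decision: str
    reasons: list[str] = field(default_factory=list)
    required_approvals: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


def combine(results: list[PolicyResult]) -> PolicyResult:
    """Fold several policy results into a single decision."""
    combined = PolicyResult(decision="allow")
    for result in results:
        if DECISION_RANK[result.decision] > DECISION_RANK[combined.decision]:
            combined.decision = result.decision
        combined.reasons.extend(result.reasons)
        for approver in result.required_approvals:
            if approver not in combined.required_approvals:
                combined.required_approvals.append(approver)
    return combined
```

An empty list of results yields allow in this sketch. A stricter engine might instead refuse tasks that declare no policy checks.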
Section 22 Approval architecture

Approvals are first-class. An approval request should include: approval_id, command_id, requested_by, approver, approval_type, request_payload, review_packet, status, expires_at, created_at, decided_at.

Approval statuses:

pending
approved
rejected
expired
cancelled
Approval flow
sequenceDiagram
  autonumber
  participant U as User / Agent
  participant F as Framework
  participant N as Notification
  participant A as Approver
  U->>F: command reaches approval gate
  F->>F: create approval request
  F->>N: send notification
  F->>F: state = waiting_for_approval
  A-->>F: approve / reject
  F->>F: record approval decision
  alt approved
    F-->>U: resume workflow
  else rejected / expired
    F-->>U: workflow fails / expires
  end

An approval request should include enough context for a human: what action is requested, who requested it, why it is needed, what data/artifact is affected, what policy triggered approval, what will happen after approval, risk level, expiration.

Section 23 Memory architecture

Memory is durable context that can shape future behavior. Memory should be scoped.

Scopes

user
team
organization
app
workflow
connector
dataset
project

Memory types

preference
instruction
constraint
fact
negative_preference
approval_preference
connector_preference
format_preference

Examples

  • User prefers concise executive summaries.
  • Finance reports should include YoY comparison.
  • Never send customer PII to external emails.
  • Use the analytics warehouse for ad hoc report queries.
  • Ask Alex before publishing monthly revenue reports.

Memory write rules

  • High-confidence, low-risk preference — write directly if consent exists.
  • Sensitive or broad memory — require approval.
  • Conflicting memory — mark old memory superseded or ask human.
  • Temporary memory — set expiration.

Memory retrieval should be explicit: by subject, by task type, by connector, by semantic query.

Backend

The semantics of memory (consent, scope, confidence, supersession, conflict resolution, retrieval contracts) live in Concord. The storage backend is a connector that implements a MemoryStore protocol. The default is Postgres, with optional pgvector for semantic search. Alternative backends, such as PineconeMemoryStore, WeaviateMemoryStore, or in-house stores, are also connectors.

from typing import Protocol


class MemoryStore(Protocol):
    def insert(self, memory: Memory) -> None: ...
    def get(self, memory_id: str) -> Memory | None: ...
    def search(self, scope: MemoryScope, query: str | None, limit: int) -> list[Memory]: ...
    def supersede(self, old_id: str, new_id: str) -> None: ...
    def delete(self, memory_id: str) -> None: ...

Consent, policy, and audit hooks fire in the semantic layer regardless of backend. The backend itself never sees consent — it just stores and retrieves bytes addressed by scope.

Section 24 Artifact architecture

Artifacts are durable outputs: the things the workflow produces and the user reads. They are distinct from effects (§13.5), which are side-effect plans the workflow intends to perform on the outside world. An artifact is a report, a query result, a draft document; an effect is a publish call, an email send, a job enqueue. A workflow may produce both: an artifact (the report) and an effect (publishing it). The two live in different tables, artifacts and domain_effects, because they answer different questions: what did the workflow produce, and what did it do to the outside world?

Artifact types

report
file
table
query_result
dashboard
notebook
agent_answer
approval_packet
connector_snapshot
model_output
memory_summary

Artifacts should track: location, type, version, status, metadata, lineage, creator, command_id, created_at.

Artifact statuses

draft
created
validated
published
archived
deleted
failed
Note

Artifacts should be references, not necessarily blobs. Store large payloads elsewhere and keep pointers in Postgres.

Section 25 Audit architecture

Audit is append-only and lives in a single domain_events table alongside business events and agent step traces. The purpose column ('event', 'audit', 'agent_step') disambiguates the role of each row, so one table covers three needs: business event stream, compliance audit trail, and agent step history. Compliance queries filter by purpose = 'audit'; agent observability filters by purpose = 'agent_step' ordered by step_index; product event feeds filter by purpose = 'event'. Agent-step extension columns (step_index, tool_name, latency_ms, token_usage, cost_units) are nullable for non-agent rows.

Audit event examples

command_created
input_validated
policy_evaluated
plan_created
approval_requested
approval_resolved
task_enqueued
task_claimed
task_started
task_succeeded
task_failed
task_retried
connector_called
memory_written
artifact_created
notification_sent
state_transitioned
command_cancelled
compensation_started
compensation_completed

Audit events should contain: actor, command_id, trace_id, event_type, target_type, target_id, payload, timestamp.

Principle

Audit should never be the only place where state is stored. Audit explains state. State controls workflow.

Section 26 Deterministic task mapping

To map any deterministic task, answer:

  1. What ingress created it?
  2. What command does it represent?
  3. What context applies?
  4. What policy checks apply?
  5. Is it sync or async?
  6. Does it need approval?
  7. Does it call a connector?
  8. Does it create artifacts?
  9. Does it write memory?
  10. Does it notify someone?
  11. What state lifecycle applies?
  12. What audit events must exist?

Template:

task:
  name: Generate Report
  command_type: generate_report
  ingress_types:
    - user_request
    - scheduled_trigger
  required_inputs:
    - report_type
    - date_range
  execution:
    mode: async
    function_name: generate_report
  policies:
    - permission
    - cost
    - data_access
  approval:
    required: false
  outputs:
    artifacts:
      - report
    memory: []
    notifications:
      - app
  lifecycle:
    - created
    - validated
    - queued
    - running
    - succeeded
Section 27 Agentic task mapping

To map an agentic task, answer:

  1. What user/system goal is the agent pursuing?
  2. What tools may the agent call?
  3. What memories may the agent read?
  4. What memories may the agent write?
  5. What tool calls require policy checks?
  6. What actions require approval?
  7. Can the agent create new commands?
  8. Can the agent create artifacts?
  9. What should be audited at each loop?
  10. How does the loop terminate?

Template:

agent_workflow:
  name: Investigate Revenue Anomaly
  command_type: investigate_revenue_anomaly
  allowed_tools:
    - run_sql
    - retrieve_memory
    - create_report_draft
    - request_human_input
  forbidden_tools:
    - send_external_email
    - mutate_production_table
  memory:
    read_scopes:
      - user
      - team
      - project
    write_allowed: true
    write_requires_policy: true
  approval_gates:
    - publish_report
    - send_notification_external
  termination:
    conditions:
      - final_answer_created
      - human_cancelled
      - max_steps_reached
      - policy_denied
Section 28 Agent loop standard
from dataclasses import dataclass
from typing import Any


@dataclass
class AgentAction:
    action_type: str
    tool_name: str
    payload: dict[str, Any]
    reason: str
    risk_level: str = "low"


@dataclass
class AgentStepResult:
    decision: str
    command_id: str | None
    output: dict[str, Any]
    message: str


class PrimitiveGateway:
    def submit_agent_action(
        self,
        action: AgentAction,
        context: Context,
    ) -> AgentStepResult:
        """Convert an agent action into a governed command or task.

        The implementation should:
        1. Create command.
        2. Evaluate policy.
        3. Execute or request approval.
        4. Return structured result.
        """
        raise NotImplementedError
Effect

This prevents the agent from directly executing unsafe actions.

Section 29 Sync vs async standards
Sync

Fast, low-risk, bounded

  • operation is fast
  • operation is low risk
  • operation has bounded latency
  • operation can return inside the request budget
  • operation does not need retries beyond caller retry
Async

Slow, expensive, multi-step

  • may take longer than request budget
  • calls slow external APIs
  • creates durable artifacts
  • needs retries
  • has approval gates
  • is expensive
  • has multiple steps

Examples:

  • validate input — sync
  • preview SQL — sync if small
  • generate report — async
  • sync connector — async
  • send approved notification — async or sync depending on provider
  • write memory — sync if low risk, approval-gated if sensitive
  • agent tool call — governed sync or async depending on tool
Section 30 Cancellation standard

Every long-running workflow should be cancellable. Each command carries a cancellation_mode column with one of two values.

30.1 · graceful (default)

The workflow finishes its current step and then exits cleanly. The runtime's hard workflow.cancel() is not called; instead, each step's prelude checks command.cancellation_requested and a typed Cancelled exception routes to the workflow's exit branch. Side effects already in flight inside the current step are allowed to complete; side effects beyond it are skipped.

Transitions: running → cancelling → cancelled.
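The graceful prelude check can be sketched without a runtime. The sketch assumes a hypothetical CommandState record and models steps as plain callables. `request_cancel` moves the command from running to cancelling. The loop then exits to cancelled before starting the next step, so a step already in progress always finishes.

```python
from dataclasses import dataclass
from typing import Callable


class Cancelled(Exception):
    """Raised by a step prelude when the command has asked to stop."""


@dataclass
class CommandState:
    command_id: str
    state: str = "running"
    cancellation_requested: bool = False


def request_cancel(command: CommandState) -> None:
    # running -> cancelling; the workflow notices at the next prelude.
    if command.state == "running":
        command.state = "cancelling"
        command.cancellation_requested = True


def run_graceful(command: CommandState, steps: list[Callable[[], None]]) -> str:
    """Run steps in order, checking the cancellation flag before each one."""
    try:
        for step in steps:
            # Prelude: only steps that have not begun are skipped.
            if command.cancellation_requested:
                raise Cancelled(command.command_id)
            step()
    except Cancelled:
        command.state = "cancelled"  # cancelling -> cancelled
        return command.state
    command.state = "succeeded"
    return command.state
```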

30.2 · compensate_then_stop

The workflow stops issuing new operation steps and walks the effect chain in reverse. For every domain_effects row in status succeeded with a declared compensation, the compensation is enqueued (reverse declaration order). The chain runs as a runtime sub-workflow with its own audit trail under purpose='audit'.

Transitions: running → cancelling → compensating → compensated → cancelled.

Cancellation flow
flowchart TB
  R["User / system requests cancellation"] --> S{"State allows?"}
  S -->|no| KEEP[no-op]
  S -->|yes| M{"cancellation_mode"}
  M -->|graceful| G1[current step finishes]
  G1 --> G2[future steps skip via prelude check]
  G2 --> C1["command → cancelled"]
  M -->|compensate_then_stop| K1["command → compensating"]
  K1 --> K2[walk effects in reverse]
  K2 --> K3[enqueue each compensation]
  K3 --> K4["command → compensated"]
  K4 --> C1
  C1 --> AU[/"domain_events audit"/]
  style S fill:#F5E0D2,stroke:#D97757
  style M fill:#F5E0D2,stroke:#D97757
  style AU fill:#F1F2EC,stroke:#6B7B5A

Cancellation propagates: from a swarm to its agent runs (§64), from a parent command to dependent children where propagate_cancellation is true (§13.4), and from a child workflow upward when marked terminal-on-child-failure.

Caution

Running steps should periodically check the command's cancellation state before performing irreversible side effects. graceful only protects against new steps, not against side effects mid-flight inside the currently running one.

immediate (hard cancel without grace) and checkpoint_then_stop (run to next safe checkpoint) are reserved as future modes; both are deferred until a concrete use case forces them.

Section 31 Compensation standard
Definition

Compensation is not rollback. It is a forward action that counteracts a side effect.

Examples:

  • created draft — delete draft
  • published artifact — unpublish artifact
  • sent wrong notification — send correction
  • created external ticket — close ticket
  • granted access — revoke access

Compensation works in three layers: a declarative manifest at registration, a graph validator that runs at startup, and a drift detector that runs at execution time.

Layer 1

Manifest at registration

Every operation declares the side effects it produces. Every compensation declares what it counters and what (if any) new effects it itself produces.

Layer 2

Graph validator at startup

On catalog load, Concord builds the effect/compensation graph and validates acyclicity, depth ≤ N, pure-counter invariants, and runtime capability matches.

Layer 3

Runtime drift detector

Each compensation step is wrapped to record every effect actually emitted. Drift from the declared manifest writes a compensation_drift audit row and alerts on-call.

31.1 · Manifest

Operations and their compensations carry typed declarations. Most compensations are pure-counter: their only produced effects are inverses of the parent's. Complex compensations that themselves require further compensation are allowed but must declare the cascade explicitly.

from concord.effects import operation, compensation, ExternalCall


@operation(
    produces=[ExternalCall("hotel_booking.book")],
    requires_compensation=True,
)
def book_hotel(...): ...


@compensation(
    of=book_hotel,
    produces=[ExternalCall("hotel_booking.cancel")],
    counter_effects=True,    # explicit: these are inverses, not new work
)
def cancel_hotel_reservation(...): ...

31.2 · Graph validator

At catalog load the validator rejects:

  • Operations marked requires_compensation=True with no registered compensation.
  • Compensation chains exceeding max_depth (default 2).
  • Cycles in the effect → compensation → counter-effect graph.
  • counter_effects=True declarations whose produced effects are not the inverses of the parent's.
  • Catalogs requiring SAGA_COMPENSATION_NATIVE when the active runtime adapter doesn't declare that capability (see §41).

These are registration-time errors. The app refuses to start until the catalog is internally consistent and runnable against the chosen runtime.

31.3 · Drift detector

At runtime, each compensation step is wrapped. Effects actually emitted are compared against the declared manifest; any drift writes an audit row.

@runtime.step(**compile_policy("compensation"))
def run_compensation_step(effect_id: str) -> dict:
    declared = load_manifest_for(effect_id)
    with concord.effects.intercept() as recorder:
        result = execute_compensation(effect_id)
    drift = recorder.emitted - declared.expected
    if drift:
        write_domain_event(
            event_type="compensation_drift",
            purpose="audit",
            payload={"effect_id": effect_id, "drift": list(drift)},
        )
    return result

The detector is what catches incomplete manifests, conditional side effects the declaration didn't anticipate, and genuine implementation bugs. Together with the manifest and the validator, it is good enough to call a compensation contract honest — every drift is recorded, named, and auditable.

When you need native saga support

The default DBOS adapter runs compensation chains as Concord-orchestrated sub-workflows. This is correct but weaker than native saga atomicity: if a compensation fails after its retries are exhausted, the chain halts at that point and a drift row records the partial completion. Compensation-heavy domains (financial transactions, multi-leg bookings, regulated workflows) should choose a runtime that declares SAGA_COMPENSATION_NATIVE (see §41); Temporal is the natural fit.

Section 32 Error taxonomy

Failures should be classified.

validation_error
policy_denied
approval_rejected
transient_connector_error
permanent_connector_error
rate_limited
timeout
cancelled
agent_failed
human_input_missing
unknown_error

This classification drives retry behavior.

Retryable

Transient by nature

  • transient_connector_error
  • rate_limited
  • timeout
  • temporary_database_error
Non-retryable

Logical failures

  • validation_error
  • policy_denied
  • approval_rejected
  • permission_denied
  • malformed_payload
Section 33 Retry and backoff

Retry mechanics belong to the runtime; the retry contract belongs to Concord. Concord declares, per operation, which error classes are retryable and how aggressively. The runtime adapter receives this as ordinary configuration.

The contract carries: attempt, max_attempts, run_after, last_error, error_class. Each operation has a registered RetryPolicy:

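from dataclasses import dataclass, field

# ErrorClass: the §32 error taxonomy as an enum, defined elsewhere.


@dataclass(frozen=True)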
class RetryPolicy:
    operation: str
    retryable: frozenset[ErrorClass]
    max_attempts: int = 3
    backoff_seconds: list[int] = field(default_factory=lambda: [30, 120, 600])
    requires_idempotency_key: bool = False

A single compile step translates the contract into runtime step kwargs. This is the only sanctioned way a connector step gets its retry configuration; no ad-hoc retry numbers in decorators.

from concord.retry import compile_policy

@runtime.step(**compile_policy("hotel_booking.book"))
def book_hotel_step(...): ...

Inside the step, the classifier from §32 runs before exceptions propagate. A validation_error raised inside book_hotel_step is converted to a non-retryable exception class regardless of max_attempts; a transient_connector_error raises a class the runtime knows to retry. Test the rule, not just the policy: a step that emits a non-retryable class must never retry, irrespective of decorator config.

A typical backoff schedule:

attempt 1 → retry after 30 seconds
attempt 2 → retry after 2 minutes
attempt 3 → retry after 10 minutes
attempt 4 → fail permanently
Caution

Retries should be idempotent. If a side effect may have happened, the retry should check before repeating it. Effect-level idempotency keys (on domain_effects) are the source of truth for external APIs that support them; the runtime's step-level idempotency only protects against re-execution within a workflow.

Section 34 Connector safety standards

Every connector call should record: connector name, capability, input hash or redacted input, output summary, status, latency, error, command_id, task_run_id, trace_id.
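Below is a sketch of two of those fields: a stable input hash and a redacted copy of the input. It assumes canonical JSON (sorted keys, compact separators) as the hashing form, string keys, and a small hypothetical list of sensitive key names.

```python
import hashlib
import json
from typing import Any

# Assumed key names whose values must never be logged verbatim.
SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "authorization"}


def input_hash(payload: dict[str, Any]) -> str:
    """SHA-256 over canonical JSON, so logically equal inputs hash equally."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def redact(payload: Any) -> Any:
    """Recursively replace values under sensitive keys (assumes string keys)."""
    if isinstance(payload, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
            for k, v in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item) for item in payload]
    return payload
```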

Rule

Never store raw secrets in connector configs.

Connector config should reference secrets, not contain them:

{
  "auth_mode": "oauth",
  "secret_ref": "vault://github/app/token",
  "scopes": ["repo:read", "issues:write"]
}
Section 35 Postgres transaction boundaries

Recommended transaction boundaries:

Tx 1

Command creation

Insert command · insert audit event · commit.

Tx 2

Policy and planning

Update command state · insert policy decisions · update plan · insert audit · commit.

Tx 3

Enqueue

Update command state to queued · insert task run · insert audit · commit.

Tx 4

Worker completion

Update task run result · update command result/state · insert artifacts/memories · insert audit · commit.

Caution

Avoid holding transactions while calling external APIs.
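Below is a sketch of Tx 1 and Tx 4 around an external call. It uses the standard library's sqlite3 as a stand-in for Postgres, and the table layouts are simplified and hypothetical. A `with db:` block commits on success and rolls back on error. The external call runs between two short transactions, never inside one.

```python
import sqlite3
from typing import Callable

# Simplified, hypothetical tables standing in for the Postgres schema.
SCHEMA = """
CREATE TABLE commands (command_id TEXT PRIMARY KEY, state TEXT NOT NULL, result TEXT);
CREATE TABLE domain_events (command_id TEXT, purpose TEXT, event_type TEXT);
"""


def create_command(db: sqlite3.Connection, command_id: str) -> None:
    # Tx 1: the command row and its audit row commit together or not at all.
    with db:
        db.execute("INSERT INTO commands VALUES (?, 'created', NULL)", (command_id,))
        db.execute(
            "INSERT INTO domain_events VALUES (?, 'audit', 'command_created')", (command_id,)
        )


def run_with_external_call(
    db: sqlite3.Connection, command_id: str, call_api: Callable[[], str]
) -> None:
    # Short transaction: mark the command running, then commit immediately.
    with db:
        db.execute("UPDATE commands SET state = 'running' WHERE command_id = ?", (command_id,))
    # No transaction is open while the (possibly slow) external API runs.
    result = call_api()
    # Tx 4: result, final state, and audit row commit atomically.
    with db:
        db.execute(
            "UPDATE commands SET state = 'succeeded', result = ? WHERE command_id = ?",
            (result, command_id),
        )
        db.execute(
            "INSERT INTO domain_events VALUES (?, 'audit', 'task_succeeded')", (command_id,)
        )
```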

Section 36 API standards

Submit command

POST /commands

Request:

{
  "task_name": "Generate Report",
  "payload": {
    "report_type": "monthly_revenue",
    "date_range": "2026-05"
  },
  "idempotency_key": "generate_report:monthly_revenue:2026-05"
}

Response:

{
  "command_id": "cmd_123",
  "state": "queued",
  "status": "queued",
  "trace_id": "trace_abc"
}

Get command

GET /commands/{command_id}

Response:

{
  "command_id": "cmd_123",
  "command_type": "generate_report",
  "state": "succeeded",
  "result": {
    "artifact_id": "art_456"
  }
}

Resolve approval

POST /approvals/{approval_id}/resolve

Request:

{
  "decision": "approved",
  "reason": "Looks good."
}

Agent action

POST /agent-actions

Request:

{
  "tool_name": "publish_report",
  "payload": {
    "artifact_id": "art_456",
    "destination": "external:finance@example.com"
  },
  "reason": "The user asked to publish the final report."
}

Response:

{
  "decision": "require_approval",
  "command_id": "cmd_789",
  "approval_id": "appr_123"
}
Section 37 Example · deterministic flow
Worked design

A full Concord design doc for a deterministic vendor-data sync (scheduled trigger, retry taxonomy, idempotent upsert, no human approvals) is available as a worked example: vendor-data-sync · Concord design ↗

User asks: Generate the May revenue report.

End-to-end · deterministic
sequenceDiagram
  autonumber
  participant U as User
  participant API as API
  participant POL as Policy
  participant Q as Queue
  participant W as Worker
  participant DB as Postgres
  U->>API: "Generate May revenue report"
  API->>DB: insert command (created)
  API->>POL: evaluate (permission, cost, data_access)
  POL-->>API: allow → async
  API->>DB: validated → queued
  API->>Q: enqueue task
  Q-->>W: claim task
  W->>DB: running
  W->>W: generate_report()
  W->>DB: artifact created
  W->>DB: succeeded
  W-->>U: app notification

State: created → validated → queued → running → succeeded

Section 38 Example · approval-gated flow
Worked design

A full Concord design doc for a hotel reservation flow (approval gate at >$500, vendor connector with idempotency, compensate_then_stop cancellation with vendor cancel as compensation) is available as a worked example: hotel-booking · Concord design ↗

User asks: Publish the May revenue report to finance@example.com.

End-to-end · approval-gated
sequenceDiagram
  autonumber
  participant U as User
  participant F as Framework
  participant POL as Policy
  participant A as Approver
  participant C as Connector
  participant DB as Postgres
  U->>F: publish_report
  F->>DB: command created
  F->>POL: permission + external_sharing + pii_check
  POL-->>F: require_approval (external destination)
  F->>DB: state = waiting_for_approval
  F->>A: approval request
  A-->>F: approved
  F->>DB: approved → queued → running
  F->>C: publish report
  C-->>F: ok
  F->>DB: artifact created · succeeded
  F-->>U: notification

State: created → validated → waiting_for_approval → approved → queued → running → succeeded

Section 39 Example · agentic flow
Worked design

A full Concord design doc for an agentic revenue-anomaly investigation (coordinator + optional analyst sub-agent, governed SQL tool, memory with consent gate, approval before external sharing, read-only) is available as a worked example: revenue-investigation-swarm · Concord design ↗

User asks: Investigate why revenue dropped last week and draft a summary.

Investigate revenue anomaly
flowchart TB
  U([User request]) --> CMD[command: investigate_revenue_drop]
  CMD --> POL{{Policy · permission · data_access · agent_tool_scope}}
  POL --> RM[Agent retrieves memory]
  RM --> SQL[run_sql via governed tool]
  SQL --> DRAFT[create draft artifact]
  DRAFT --> SUM[Agent summarizes result]
  SUM --> OUT[investigation artifact · memory candidate]
  subgraph ALLOWED ["Allowed tools"]
    direction LR
    T1[retrieve_memory] & T2[run_sql] & T3[create_artifact] & T4[ask_human_input]
  end
  subgraph FORBID ["Forbidden tools"]
    direction LR
    F1[publish_report] & F2[external_email] & F3[production_mutation]
  end
  SUM -.->|if external| EXT[propose send_email] --> POL2{{Policy requires approval}} --> FLOW[approval flow]
  style POL fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px
  style POL2 fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px
  style OUT fill:#F1F2EC,stroke:#6B7B5A
  style ALLOWED fill:#F1F2EC,stroke:#6B7B5A
  style FORBID fill:#FAEEEE,stroke:#B85556

State: created → validated → running → succeeded (or escalates to approval if external action proposed).

Section 40 Framework module structure

Recommended code layout:

concord/
  __init__.py

  core/
    models.py
    states.py
    errors.py
    ids.py

  persistence/
    postgres.py
    migrations/
      001_core.sql

  registry/
    tasks.py
    tools.py
    connectors.py
    policies.py

  engine/
    command_service.py
    policy_engine.py
    planner.py
    executor.py
    worker.py
    approvals.py
    memory.py
    artifacts.py
    audit.py

  connectors/
    base.py
    postgres.py
    http.py
    notification.py
    databricks.py
    github.py

  agents/
    gateway.py
    protocols.py
    memory_context.py

  api/
    routes.py
    schemas.py

  examples/
    deterministic_report.py
    approval_flow.py
    agentic_investigation.py
Section 41 Minimal service interfaces

CommandService

class CommandService:
    def submit(
        self,
        task_name: str,
        payload: dict,
        context: Context,
        idempotency_key: str | None = None,
    ) -> dict:
        raise NotImplementedError

    def get(self, command_id: str) -> dict:
        raise NotImplementedError

    def cancel(self, command_id: str, actor: str, reason: str | None = None) -> dict:
        raise NotImplementedError

PolicyEngine

class PolicyEngine:
    def evaluate(
        self,
        command: Command,
        task: TaskSpec,
        context: Context,
    ) -> PolicyResult:
        raise NotImplementedError

Planner

class Planner:
    def create_plan(
        self,
        command: Command,
        task: TaskSpec,
        policy_result: PolicyResult,
    ) -> ExecutionPlan:
        raise NotImplementedError

Executor

class Executor:
    def execute_sync(self, command: Command, plan: ExecutionPlan) -> dict:
        raise NotImplementedError

    def enqueue_async(self, command: Command, plan: ExecutionPlan) -> dict:
        raise NotImplementedError

Worker

class Worker:
    def claim_next(self) -> dict | None:
        raise NotImplementedError

    def run_once(self) -> dict | None:
        raise NotImplementedError

DurableRuntime

The durable runtime is a protocol, not an implementation. Concord's domain layer imports the protocol; the runtime is supplied at app startup. The default implementation wraps DBOS; future adapters wrap Temporal, Restate, or in-house equivalents. Each adapter publishes a capabilities set so the catalog can be validated against the runtime at registration time.

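from __future__ import annotations  # lets annotations name core model types (WorkflowSpec, StepSpec, handles) defined elsewhere

from enum import StrEnum
from typing import ClassVar, Protocol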


class RuntimeCapability(StrEnum):
    DURABLE_WORKFLOWS = "durable_workflows"
    DURABLE_STEPS = "durable_steps"
    QUEUES = "queues"
    SCHEDULES = "schedules"
    SIGNALS = "signals"
    SUBWORKFLOWS = "subworkflows"
    EFFECT_INTERCEPTION = "effect_interception"
    SAGA_COMPENSATION_NATIVE = "saga_compensation_native"
    WORKFLOW_VERSIONING = "workflow_versioning"


class DurableRuntime(Protocol):
    capabilities: ClassVar[frozenset[RuntimeCapability]]

    def submit_workflow(self, spec: WorkflowSpec) -> WorkflowHandle: ...
    def wait_for_result(self, handle: WorkflowHandle) -> WorkflowResult: ...
    def cancel(self, handle: WorkflowHandle, mode: CancellationMode) -> None: ...
    def enqueue_step(self, queue_name: str, step_spec: StepSpec) -> StepHandle: ...
    def schedule(self, schedule_spec: ScheduleSpec) -> ScheduleHandle: ...
    def signal(self, handle: WorkflowHandle, signal_name: str, payload: dict) -> None: ...

Adapter capability matrix:

Capability | DBOS | Temporal | Notes
DURABLE_WORKFLOWS |  |  | Table stakes.
DURABLE_STEPS |  | ✓ (activities) |
QUEUES |  | ✓ (task queues) |
SCHEDULES |  | ✓ (cron schedules) |
SIGNALS | partial | ✓ (signals + queries) | DBOS approval-wait pattern; Temporal more general.
SUBWORKFLOWS |  | ✓ (child workflows) |
EFFECT_INTERCEPTION |  |  | Concord wraps every step; adapter just needs hook points.
SAGA_COMPENSATION_NATIVE |  |  | DBOS runs chains as Concord-orchestrated sub-workflows; see §31.
WORKFLOW_VERSIONING |  |  | Adopt Temporal when versioning becomes load-bearing.

At catalog load, Concord checks the union of capabilities required by registered operations against the active adapter's capabilities. Missing capabilities surface as startup errors — never as first-failure runtime errors.
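The catalog check described above reduces to a set difference. The sketch assumes each registered operation declares the capabilities it needs; `OperationSpec.required_capabilities` and `MissingRuntimeCapabilities` are hypothetical names, and the enum is redeclared with `str, Enum` only so the snippet also runs before Python 3.11.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class RuntimeCapability(str, Enum):
    # Same members as the protocol above.
    DURABLE_WORKFLOWS = "durable_workflows"
    DURABLE_STEPS = "durable_steps"
    QUEUES = "queues"
    SCHEDULES = "schedules"
    SIGNALS = "signals"
    SUBWORKFLOWS = "subworkflows"
    EFFECT_INTERCEPTION = "effect_interception"
    SAGA_COMPENSATION_NATIVE = "saga_compensation_native"
    WORKFLOW_VERSIONING = "workflow_versioning"


@dataclass(frozen=True)
class OperationSpec:
    # Hypothetical: each registered operation declares the capabilities it relies on.
    name: str
    required_capabilities: frozenset[RuntimeCapability] = field(default_factory=frozenset)


class MissingRuntimeCapabilities(RuntimeError):
    """Raised while loading the catalog, before the app accepts any work."""


def validate_catalog(operations: list[OperationSpec],
                     adapter_capabilities: frozenset[RuntimeCapability]) -> None:
    required: set[RuntimeCapability] = set()
    for op in operations:
        required |= op.required_capabilities
    missing = required - adapter_capabilities
    if missing:
        # Map each missing capability to the operations that need it, so the error is actionable.
        details = {
            cap.value: sorted(op.name for op in operations if cap in op.required_capabilities)
            for cap in sorted(missing, key=lambda c: c.value)
        }
        raise MissingRuntimeCapabilities(f"runtime adapter lacks {details}")
```

At startup the second argument would be the active adapter's `capabilities` class variable, so a catalog that registers a scheduled operation on an adapter without `SCHEDULES` fails before serving traffic.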

MemoryStore

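from __future__ import annotations  # Memory and MemoryScope are the core memory models (§23)

from typing import Protocol


class MemoryStore(Protocol):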
    def insert(self, memory: Memory) -> None: ...
    def get(self, memory_id: str) -> Memory | None: ...
    def search(self, scope: MemoryScope, query: str | None, limit: int) -> list[Memory]: ...
    def supersede(self, old_id: str, new_id: str) -> None: ...
    def delete(self, memory_id: str) -> None: ...
Section 42 Extensibility model

The framework should be extensible through registries.

task registry
tool registry
policy registry
connector registry
memory extractor registry
artifact handler registry
notification handler registry
agent adapter registry

To add a new capability:

  1. Register connector if needed.
  2. Register tool/function.
  3. Register task spec.
  4. Register policies.
  5. Define output artifacts/memory/notifications.
  6. Add tests.
Effect

No core engine changes should be required for most new capabilities.
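A minimal sketch of one such registry, assuming a simple name-keyed store (`Registry`, `ToolSpec`, and `register_tool` are hypothetical names). It shows two properties the steps above rely on: duplicate names fail at registration, and a tool cannot reference a connector that has not been registered yet, which is why step 1 precedes step 2.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Name-keyed registry; duplicate names fail at registration, not at run time."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._items: dict[str, T] = {}

    def register(self, name: str, item: T) -> T:
        if name in self._items:
            raise ValueError(f"{self.kind} already registered: {name}")
        self._items[name] = item
        return item

    def get(self, name: str) -> T:
        try:
            return self._items[name]
        except KeyError:
            raise KeyError(f"unknown {self.kind}: {name}") from None

    def __contains__(self, name: object) -> bool:
        return name in self._items


@dataclass(frozen=True)
class ToolSpec:
    name: str
    connector: str | None = None


connectors: Registry[dict] = Registry("connector")
tools: Registry[ToolSpec] = Registry("tool")


def register_tool(spec: ToolSpec) -> ToolSpec:
    # A tool may only reference a connector that is already registered (step 1 before step 2).
    if spec.connector is not None and spec.connector not in connectors:
        raise ValueError(f"tool {spec.name} references unregistered connector {spec.connector}")
    return tools.register(spec.name, spec)
```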

Section 43 Future connector model

A future connector should be able to expose capabilities like:

connector:
  name: github
  type: github
  capabilities:
    - search_issues
    - create_issue
    - comment_on_pr
  auth:
    mode: oauth
    scopes:
      - issues:read
      - issues:write
  rate_limits:
    requests_per_minute: 60

Tool:

tool:
  name: create_github_issue
  connector: github
  capability: create_issue
  execution_mode: async
  risk_level: medium
  requires_approval: false
  input_schema:
    type: object
    required:
      - repo
      - title
    properties:
      repo:
        type: string
      title:
        type: string
      body:
        type: string

The planner can route tool execution through the connector registry.
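Assuming the YAML above has been parsed into plain dicts, the routing step could look like the sketch below. `route_tool_call` and `Route` are hypothetical, the `"sync"` fallback for `execution_mode` is an assumption, and the input check covers only `required` keys and string types rather than full JSON Schema; a real planner would presumably delegate that to a schema validator.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    connector: str
    capability: str
    execution_mode: str       # taken from the tool spec; "sync" if absent (assumption)
    requires_approval: bool


def route_tool_call(tool: dict, connectors: dict[str, dict], payload: dict) -> Route:
    """Resolve a tool spec to a registered connector capability and check the payload."""
    connector = connectors.get(tool["connector"])
    if connector is None:
        raise LookupError(f"connector not registered: {tool['connector']}")
    if tool["capability"] not in connector.get("capabilities", []):
        raise PermissionError(f"{tool['connector']} does not expose {tool['capability']}")
    # Minimal input check: required keys present, declared string fields hold strings.
    schema = tool.get("input_schema", {})
    missing = [key for key in schema.get("required", []) if key not in payload]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    for key, prop in schema.get("properties", {}).items():
        if key in payload and prop.get("type") == "string" and not isinstance(payload[key], str):
            raise TypeError(f"field {key!r} must be a string")
    return Route(
        connector=tool["connector"],
        capability=tool["capability"],
        execution_mode=tool.get("execution_mode", "sync"),
        requires_approval=bool(tool.get("requires_approval", False)),
    )
```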

Section 44 Future agent model

Agents should see a constrained tool catalog. For each task, define:

allowed_tools
forbidden_tools
memory_scopes
approval_gates
max_steps
max_cost
termination_conditions
Agent path through the system
flowchart LR AG[Agent] --> GW[Primitive gateway] GW --> CPP[command · policy · plan] CPP --> EX[execution] style AG fill:#EFE6F0,stroke:#7A5560 style GW fill:#F5E0D2,stroke:#D97757 style CPP fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px style EX fill:#F1F2EC,stroke:#6B7B5A
Effect

This allows future replacement of the agent framework without rewriting the workflow system.

Section 45 Security model

Security layers:

authentication
authorization
policy evaluation
connector scope
approval gates
audit
secret isolation
data safety checks
Rule

A connector credential does not imply a user may use the connector.

The framework must check:

  • who is requesting
  • what they are trying to do
  • what connector/tool is involved
  • what data will be accessed
  • what side effects may occur
  • whether approval is required
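The rule above can be sketched as a check in which the connector credential is necessary but never sufficient. The `AccessRequest` fields map onto the six questions; the grant format, the blocked data classes, and the side-effect names that trigger approval are hypothetical placeholders, not Concord defaults.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class AccessRequest:
    principal: str                                # who is requesting
    action: str                                   # what they are trying to do (kept for audit)
    connector: str                                # what connector is involved
    tool: str                                     # what tool is involved
    data_classes: frozenset[str] = frozenset()    # what data will be accessed
    side_effects: frozenset[str] = frozenset()    # what side effects may occur


@dataclass
class AccessDecision:
    outcome: str                                  # "allow" | "deny" | "require_approval"
    reasons: list[str] = field(default_factory=list)


def check_access(
    req: AccessRequest,
    grants: set[tuple[str, str]],                 # (principal, tool) pairs
    connector_has_credential: bool,
    blocked_data: frozenset[str] = frozenset({"credentials"}),
    gated_effects: frozenset[str] = frozenset({"external_write", "external_notification"}),
) -> AccessDecision:
    reasons = []
    if not connector_has_credential:
        reasons.append(f"connector {req.connector} has no credential")
    # The credential alone never authorizes: the principal needs its own grant for the tool.
    if (req.principal, req.tool) not in grants:
        reasons.append(f"{req.principal} is not granted {req.tool}")
    if req.data_classes & blocked_data:
        reasons.append(f"blocked data classes: {sorted(req.data_classes & blocked_data)}")
    if reasons:
        return AccessDecision("deny", reasons)
    # In this sketch, approval is decided by the side effects alone.
    effects = req.side_effects & gated_effects
    if effects:
        return AccessDecision("require_approval", [f"approval needed for {sorted(effects)}"])
    return AccessDecision("allow")
```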
Section 46 Data safety model

Data safety policies should classify:

PII
credentials
customer confidential data
internal-only data
external-shareable data
regulated data

Data safety outcomes:

allow
redact
require_approval
deny

Agentic workflows must apply data safety checks before:

  • external tool call
  • external notification
  • memory write
  • artifact publication
  • connector sync
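One way to combine the two lists is a lookup from (data class, checkpoint) to an outcome, where the strictest outcome across all classes present wins. The snake_case class names, the policy table, and the fallback to `require_approval` are illustrative assumptions; the checkpoint names mirror the list above.

```python
from __future__ import annotations

# Outcomes from least to most restrictive; the index doubles as a severity rank.
OUTCOME_ORDER = ["allow", "redact", "require_approval", "deny"]

CHECKPOINTS = {
    "external_tool_call",
    "external_notification",
    "memory_write",
    "artifact_publication",
    "connector_sync",
}

# Illustrative policy only: (data class, checkpoint) -> outcome.
POLICY: dict[tuple[str, str], str] = {
    ("pii", "external_tool_call"): "redact",
    ("pii", "memory_write"): "require_approval",
    ("credentials", "external_tool_call"): "deny",
    ("credentials", "memory_write"): "deny",
    ("customer_confidential", "artifact_publication"): "require_approval",
    ("internal_only", "external_notification"): "deny",
    ("external_shareable", "external_notification"): "allow",
}
DEFAULT_OUTCOME = "require_approval"  # unlisted pairs escalate to a human rather than allow


def data_safety_outcome(data_classes: set[str], checkpoint: str) -> str:
    if checkpoint not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {checkpoint}")
    if not data_classes:
        # Assumes the classifier ran and found nothing sensitive.
        return "allow"
    outcomes = [POLICY.get((dc, checkpoint), DEFAULT_OUTCOME) for dc in data_classes]
    # The strictest outcome across every class present wins.
    return max(outcomes, key=OUTCOME_ORDER.index)
```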
Section 47 Testing standards

Test at five levels.

Level 1

Unit tests

State transitions · primitive mapping · policy decisions · planner outputs · idempotency behavior.

Level 2

Integration tests

Postgres persistence · queue claiming · worker retry · approval resume · connector call recording.

Level 3

Workflow tests

Generate report end-to-end · approval-gated publish · webhook deduplication · agent tool proposal with approval.

Level 4

Safety tests

Policy denial · external sharing approval · memory consent · forbidden tool calls · duplicate webhook event.

Level 5

Boundary discipline

Import boundary check: concord/core/ and concord/domain/ must not import the runtime; only concord/runtime/<adapter>.py may. An AST scanner (concord_boundary_check.py) runs in CI and fails the build on violation.

The boundary check is the only "test" that runs on every commit before the suite — it's a 200-line AST scanner with no install footprint. Rules:

Path | Disallowed | Why
concord/core/** | dbos, temporalio | The functional core has no runtime knowledge.
concord/domain/** | dbos, temporalio | Domain speaks the protocol; never an implementation.
concord/runtime/*.py (≠ dbos.py) | dbos | Adapter isolation. One file binds the implementation.
concord/runtime/*.py (≠ temporal.py) | temporalio | Same rule shape; future adapter.
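The following is a stdlib-only sketch of such a scanner, not the project's `concord_boundary_check.py`. It encodes the rules table above, inspects absolute `import` and `from ... import` statements anywhere in a module (including function-local imports), and ignores relative imports. The function names are hypothetical.

```python
from __future__ import annotations

import ast
from pathlib import Path, PurePosixPath

RUNTIME_PACKAGES = {"dbos", "temporalio"}


def disallowed_for(rel_path: PurePosixPath) -> set[str]:
    """Runtime packages a file may not import, following the rules table."""
    parts = rel_path.parts
    if parts[:2] in {("concord", "core"), ("concord", "domain")}:
        return set(RUNTIME_PACKAGES)
    if parts[:2] == ("concord", "runtime") and len(parts) == 3:
        banned = set()
        if rel_path.name != "dbos.py":
            banned.add("dbos")
        if rel_path.name != "temporal.py":
            banned.add("temporalio")
        return banned
    return set()


def imported_roots(source: str) -> set[str]:
    """Top-level package names of every absolute import in the module."""
    roots = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            roots.add(node.module.split(".")[0])
    return roots


def check_tree(repo_root: Path) -> list[str]:
    """Return one line per violation, in path order."""
    violations = []
    for path in sorted(repo_root.glob("concord/**/*.py")):
        rel = PurePosixPath(path.relative_to(repo_root).as_posix())
        banned = disallowed_for(rel)
        if not banned:
            continue
        hits = imported_roots(path.read_text(encoding="utf-8")) & banned
        violations.extend(f"{rel}: imports {pkg}" for pkg in sorted(hits))
    return violations


def main(argv: list[str]) -> int:
    problems = check_tree(Path(argv[0] if argv else "."))
    for line in problems:
        print(line)
    return 1 if problems else 0
```

In CI, the script would end with `raise SystemExit(main(sys.argv[1:]))` so that any violation fails the build.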
Section 48 Operational dashboards

Minimum dashboards

commands by state
task runs by status
failed tasks
retry counts
approval backlog
old queued tasks
connector errors
policy denials
memory writes
artifact creation
audit volume

Operational alerts

queued task older than threshold
approval expired
connector error rate spike
task retry exhaustion
worker heartbeat missing
policy denial spike
Section 49 Development roadmap
Phases
flowchart LR P1["Phase 1
Logical core"] --> P2["Phase 2
Postgres backend"] P2 --> P3["Phase 3
Worker"] P3 --> P4["Phase 4
Approvals & notifications"] P4 --> P5["Phase 5
Connectors"] P5 --> P6["Phase 6
Agent gateway"] P6 --> P7["Phase 7
Hardening"] style P1 fill:#FAF8F2,stroke:#141413 style P2 fill:#FAF8F2,stroke:#141413 style P3 fill:#F5E0D2,stroke:#D97757 style P4 fill:#F5E0D2,stroke:#D97757 style P5 fill:#F1F2EC,stroke:#6B7B5A style P6 fill:#EFE6F0,stroke:#7A5560 style P7 fill:#F7F1E0,stroke:#B68A2E
Phase 1

Logical core

TaskSpec · Command · PolicyResult · ExecutionPlan · primitive mapper · state machine · audit model · in-memory store.

Phase 2

Postgres backend

Migrations · command / task queue / approval / audit / memory / artifact repositories.

Phase 3

Worker

Queue claiming · lease renewal · retry/backoff · task execution · result persistence · failure classification.

Phase 4

Approvals & notifications

Approval API · approval UI · notification connector · resume · expiration.

Phase 5

Connectors

Base interface · HTTP · notification · Databricks · GitHub · tool registry.

Phase 6

Agent gateway

Action protocol · tool allowlist · memory retrieval · policy-gated calls · trace events · approval-gated agent actions.

Phase 7

Hardening

Rate limits · cost accounting · data safety policies · compensation · cancellation · dashboards · retention.

Section 50 What this framework is not
See also

A longer-form treatment of Concord's positioning — including a comparison across nine adjacent categories (durable runtimes, agent frameworks, BPM platforms, DevOps systems, policy engines, data DAG orchestrators, observability tools, iPaaS / connector platforms, memory systems) — lives at What Concord is & isn't ↗.

Concord is not
  • a durable execution runtime (DBOS, Temporal, Cadence do that)
  • a distributed compute engine
  • a full BPMN engine
  • a replacement for Postgres
  • a replacement for connector APIs
  • a replacement for an LLM agent framework
  • a full data pipeline orchestrator
  • a framework that demands wholesale adoption
Concord is
  • a library of contracts (pip install concord)
  • a policy, approval, and state model
  • a connector and agent governance layer
  • an agent-safe tool gateway
  • a Postgres-backed system of record for domain meaning
  • a thin layer that delegates execution to a durable runtime
Section 51 Final vision

Concord should make every action in the system look like this:

Someone or something requested work. The request became a command. The command was evaluated by policy. The policy produced a plan. The plan executed through governed primitives. The workflow state changed explicitly. Outputs, memories, and artifacts were recorded. Every important thing was audited.

This gives you one coherent framework for:

ordinary deterministic workflows
long-running async jobs
external triggers
human approvals
memory capture
artifact creation
notifications
future connectors
agentic tool use

The key architectural decision:

Architectural truths
flowchart LR A["Postgres
owns truth"] --- B["Framework
owns primitives"] B --- C["Connectors
own capabilities"] C --- D["Workers
own execution"] D --- E["Agents
propose actions"] E --- F["Policy
decides what is allowed"] F --- G["Audit
explains what happened"] style A fill:#FFFFFF,stroke:#141413,stroke-width:1.5px style B fill:#FAF8F2,stroke:#141413 style C fill:#F1F2EC,stroke:#6B7B5A style D fill:#FAF8F2,stroke:#141413 style E fill:#EFE6F0,stroke:#7A5560 style F fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px style G fill:#FAF8F2,stroke:#141413
Part II · Addendum

Multi-agent swarms, subagent spawning, agentic execution

A governed extension to the primitive layer. Agents propose actions; Concord governs, records, executes, and audits them.

Single-agent runs, parallel swarms, hierarchical delegation, and reviewer agents are all compositions over the same primitives — no new fundamentals required.

Sections: 52 – 73
Adds: 5 Postgres tables · 2 state machines · 5 execution modes
Stance: Runtime-agnostic, Postgres-first
Section 52 Agentic design philosophy

Concord supports agentic execution as a governed extension of the primitive system. Agents propose; the framework decides.

Agents do not own execution. Agents propose actions. Concord governs, records, executes, and audits those actions.

52.1 · Agents are participants, not infrastructure

An agent is a participant that can interpret context, propose commands, call tools, delegate work, spawn subagents, produce artifacts, request memory reads/writes, ask for human approval, and evaluate or synthesize results. But every consequential action passes through the same primitive gateway used by deterministic workflows.

Agent → primitive gateway
flowchart LR AG[Agent proposes action] --> CMD([Concord command]) CMD --> POL{{Policy check}} POL --> PL([Execution plan]) PL --> EX[Sync / async execution] EX --> ST[State transition] ST --> OUT[Artifact · memory · notification · audit] style AG fill:#EFE6F0,stroke:#7A5560 style POL fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px style OUT fill:#F1F2EC,stroke:#6B7B5A

This is what makes deterministic and agentic workflows interoperable.

52.2 · Swarms are workflow structures, not special cases

A swarm is a coordinated group of agent runs serving a parent command or workflow. A swarm can be sequential, parallel, competitive, hierarchical, review-driven, or human-supervised — but it is still composed from the same primitives:

Command → Policy → Plan → AgentRun → TaskRun → Artifact → Memory → Audit

52.3 · Subagent spawning is a governed operation

An agent must not directly create uncontrolled child agents. Subagent spawning is itself a command — spawn_subagent — subject to:

permission checks
role checks
cost limits
tool scope limits
memory scope limits
connector scope limits
spawn depth limits
swarm size limits
human approval if required

52.4 · Postgres remains the system of record

All durable agentic state lives in Postgres: swarm runs, agent runs, invocations, steps, tool calls, delegated goals, parent-child relationships, artifacts, memory reads/writes, approvals, cancellations, audit.

Substitutability

The LLM/agent runtime can be replaced. The Postgres-backed execution record must remain stable.

Section 53 Core agentic concepts

Five core objects model agentic execution.

Concept

SwarmRun

A coordinated multi-agent execution that belongs to one parent command. Defines objective, participants, coordinator, join strategy, and hard limits.

Concept

AgentRun

One execution of one agent role: coordinator, planner, researcher, reviewer, worker, memory manager, domain expert, or connector-specific agent.

Concept

AgentInvocation

Records a parent agent spawning or delegating to a child agent — delegated goal, constraints, allowed tools, memory scope, budget, max steps, spawn depth.

Concept

AgentStep

One decision, action, or observation inside an agent run. The agentic equivalent of a trace event — but durable and queryable.

Concept

JoinStrategy

Defines how subagent outputs are combined: all_success, first_success, quorum, coordinator_synthesis, evaluator_selection, human_review, best_of_n, map_reduce, consensus, ranked_review.

An AgentRun may create

commands
task runs
subagent invocations
artifacts
memory candidates
approval requests
audit events

AgentStep action types

plan
reason
tool_call_proposed
tool_call_executed
command_created
artifact_read
artifact_written
memory_read
memory_write_proposed
approval_requested
subagent_spawned
child_result_observed
evaluation
synthesis
final_answer
error

Join strategies

all_success
first_success
quorum
coordinator_synthesis
evaluator_selection
human_review
best_of_n
map_reduce
consensus
ranked_review
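Three of these strategies can be decided mechanically from child outcomes, as sketched below. The sketch assumes `join` maps child results onto the SwarmRun terminal statuses from §57.1 and, as a design choice, reports `partially_succeeded` when `all_success` is not met but some children succeeded. The remaining strategies need a coordinator, evaluator, or human and are deliberately left unimplemented; `ChildResult` and `join` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChildResult:
    agent_run_id: str
    status: str                      # terminal AgentRun status
    artifact_id: str | None = None   # child output is joined through artifacts (§63)


def join(strategy: str, results: list[ChildResult], quorum: int | None = None) -> str:
    """Return the SwarmRun status reached after joining child results."""
    ok = [r for r in results if r.status == "succeeded"]
    if strategy == "all_success":
        if len(ok) == len(results):
            return "succeeded"
        return "partially_succeeded" if ok else "failed"
    if strategy == "first_success":
        return "succeeded" if ok else "failed"
    if strategy == "quorum":
        if quorum is None:
            raise ValueError("quorum strategy needs a quorum size")
        return "succeeded" if len(ok) >= quorum else "failed"
    # coordinator_synthesis, evaluator_selection, human_review, ... need another agent or a human.
    raise NotImplementedError(f"join strategy not sketched here: {strategy}")
```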
Section 54 Agentic hierarchy

54.1 · Standard hierarchy

Command → swarm → agents → join → output
flowchart TB C([Command]) --> WR[WorkflowRun] WR --> SR[SwarmRun] SR --> COORD[AgentRun · coordinator] COORD --> I1[Invocation · researcher] COORD --> I2[Invocation · analyst] COORD --> I3[Invocation · reviewer] I1 --> AR1[AgentRun · researcher] I2 --> AR2[AgentRun · analyst] I3 --> AR3[AgentRun · reviewer] AR1 --> J[JoinResults] AR2 --> J AR3 --> J J --> ART[ArtifactWrite] ART --> POL{{Policy check}} POL --> HA[HumanApproval if needed] HA --> FIN([FinalOutput]) style POL fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px style FIN fill:#F1F2EC,stroke:#6B7B5A style COORD fill:#EFE6F0,stroke:#7A5560 style AR1 fill:#EFE6F0,stroke:#7A5560 style AR2 fill:#EFE6F0,stroke:#7A5560 style AR3 fill:#EFE6F0,stroke:#7A5560

54.2 · Parent-child relationships

Every child agent should have:

parent agent run id
swarm run id
delegated goal
bounded role
tool scope
memory scope
connector scope
max step count
max runtime / timeout
cancellation parent

This enables recursive execution while preserving control.

Section 55 Swarm execution modes

Five recurring patterns. Choose by the shape of the work.

Mode 01

Sequential

Parent delegates to one child at a time. Use when each step depends on the previous one, when debuggability matters, or when cost and control outweigh latency.

Mode 02

Parallel

Parent spawns subagents concurrently. Use when work decomposes cleanly, when latency matters, or when multiple connectors can be explored at once.

Mode 03

Competitive

Multiple agents attempt the same task; an evaluator selects the best. Use when output quality matters or confidence comes from comparison.

Mode 04

Review-driven

Reviewer agents evaluate proposed outputs before publication. Use when outputs are externally visible, sensitive, or costly to get wrong.

Mode 05

Hierarchical

Agents can spawn subagents, within hard limits. Use when a coordinator cannot plan all subtasks upfront and decomposition must be flexible.

Sequential
flowchart LR COORD([Coordinator]) --> R[Researcher] --> A[Analyst] --> REV[Reviewer] --> FS([Final synthesis]) style COORD fill:#EFE6F0,stroke:#7A5560 style FS fill:#F1F2EC,stroke:#6B7B5A
Parallel
flowchart LR COORD([Coordinator]) --> RA[Researcher A] COORD --> RB[Researcher B] COORD --> A[Analyst] COORD --> REV[Reviewer] RA --> J([Join results]) RB --> J A --> J REV --> J style COORD fill:#EFE6F0,stroke:#7A5560 style J fill:#F1F2EC,stroke:#6B7B5A
Competitive
flowchart LR COORD([Coordinator]) --> S1[Solver A] COORD --> S2[Solver B] COORD --> S3[Solver C] S1 --> E{Evaluator} S2 --> E S3 --> E E --> BEST([Select best]) style COORD fill:#EFE6F0,stroke:#7A5560 style E fill:#F5E0D2,stroke:#D97757 style BEST fill:#F1F2EC,stroke:#6B7B5A
Review-driven
flowchart LR W[Workers produce artifacts] --> REV[Reviewer evaluates] REV --> Q{"Risk remains?"} Q -->|yes| HUM[Human approval] Q -->|no| PUB([Publication]) HUM --> PUB style Q fill:#F5E0D2,stroke:#D97757 style PUB fill:#F1F2EC,stroke:#6B7B5A
Hierarchical
flowchart TB COORD([Coordinator]) --> DL[Domain Lead] COORD --> REV[Reviewer] DL --> SA[Subagent A] DL --> SB[Subagent B] style COORD fill:#EFE6F0,stroke:#7A5560 style DL fill:#EFE6F0,stroke:#7A5560

Always enforce in hierarchical mode

max_depth
max_agents
max_steps_per_agent
max_total_steps
max_cost
allowed_roles
allowed_tools
allowed_connectors
Section 56 Mapping swarms to existing primitives

No new fundamental primitive is required. Swarms are compositions over existing primitives.

Agentic action | Concord mapping
Start swarm | Command → Policy → Plan → SwarmRun
Spawn subagent | Command → Policy → AgentInvocation → AgentRun
Agent tool call | Command → Policy → Sync/Async Function
Agent long-running tool | Command → Policy → Queue → Async Task
Agent asks human | Human Approval
Agent writes memory | Memory Write (candidate, policy-checked)
Agent creates output | Artifact Write
Agent notifies user | Notification
Agent stops work | Cancellation
Agent reverses side effect | Compensation
Agent result review | Evaluation / Quality Check
Agent trace | Audit / AgentStep
Section 57 Postgres schema additions

Five new tables reference the existing pf_commands, pf_task_runs, pf_artifacts, pf_memory, pf_approvals, and pf_audit_log tables.

Agentic entity relationships
erDiagram pf_commands ||--o{ swarm_runs : initiates swarm_runs ||--o{ agent_runs : contains agent_runs ||--o{ agent_invocations : parent_of agent_invocations ||--|| agent_runs : child_run agent_runs ||--o{ agent_steps : records agent_runs ||--o{ agent_messages : exchanges swarm_runs ||--o{ agent_messages : scoped_to swarm_runs { UUID swarm_run_id PK UUID command_id FK TEXT status TEXT execution_mode TEXT join_strategy INT max_agents INT max_depth INT max_total_steps } agent_runs { UUID agent_run_id PK UUID swarm_run_id FK UUID parent_agent_run_id FK TEXT agent_role TEXT status INT spawn_depth INT step_count INT max_steps } agent_invocations { UUID invocation_id PK UUID parent_agent_run_id FK UUID child_agent_run_id FK TEXT invocation_type TEXT delegated_goal } agent_steps { UUID agent_step_id PK UUID agent_run_id FK INT step_index TEXT action_type TEXT decision NUMERIC cost_units } agent_messages { UUID agent_message_id PK UUID agent_run_id FK TEXT message_role TEXT message_type TEXT content }

57.1 · swarm_runs

CREATE TABLE IF NOT EXISTS swarm_runs (
  swarm_run_id UUID PRIMARY KEY,
  command_id UUID NOT NULL,
  workflow_run_id UUID NULL,

  coordinator_agent_run_id UUID NULL,

  status TEXT NOT NULL,
  objective TEXT NOT NULL,

  execution_mode TEXT NOT NULL,
  join_strategy TEXT NOT NULL,

  max_agents INT NOT NULL DEFAULT 5,
  max_depth INT NOT NULL DEFAULT 2,
  max_total_steps INT NOT NULL DEFAULT 100,
  max_cost_units NUMERIC NULL,

  metadata JSONB NOT NULL DEFAULT '{}',

  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  started_at TIMESTAMPTZ NULL,
  completed_at TIMESTAMPTZ NULL,
  cancelled_at TIMESTAMPTZ NULL
);

Recommended statuses

created
running
joining
waiting_for_approval
succeeded
failed
cancelled
partially_succeeded
expired

57.2 · agent_runs

CREATE TABLE IF NOT EXISTS agent_runs (
  agent_run_id UUID PRIMARY KEY,

  command_id UUID NOT NULL,
  swarm_run_id UUID NULL,
  parent_agent_run_id UUID NULL,

  agent_name TEXT NOT NULL,
  agent_role TEXT NOT NULL,
  agent_version TEXT NULL,

  status TEXT NOT NULL,
  goal TEXT NOT NULL,

  allowed_tools JSONB NOT NULL DEFAULT '[]',
  allowed_connectors JSONB NOT NULL DEFAULT '[]',
  memory_scope JSONB NOT NULL DEFAULT '{}',
  context_scope JSONB NOT NULL DEFAULT '{}',

  max_steps INT NOT NULL DEFAULT 20,
  step_count INT NOT NULL DEFAULT 0,
  spawn_depth INT NOT NULL DEFAULT 0,

  model_config JSONB NOT NULL DEFAULT '{}',
  runtime_config JSONB NOT NULL DEFAULT '{}',
  metadata JSONB NOT NULL DEFAULT '{}',

  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  started_at TIMESTAMPTZ NULL,
  completed_at TIMESTAMPTZ NULL,
  cancelled_at TIMESTAMPTZ NULL
);

Recommended statuses

created
running
waiting_for_tool
waiting_for_child
waiting_for_approval
joining
succeeded
failed
cancelled
expired

57.3 · agent_invocations

CREATE TABLE IF NOT EXISTS agent_invocations (
  invocation_id UUID PRIMARY KEY,

  swarm_run_id UUID NOT NULL,
  parent_agent_run_id UUID NOT NULL,
  child_agent_run_id UUID NOT NULL,

  invocation_type TEXT NOT NULL,
  delegated_goal TEXT NOT NULL,

  constraints JSONB NOT NULL DEFAULT '{}',
  allowed_tools JSONB NOT NULL DEFAULT '[]',
  allowed_connectors JSONB NOT NULL DEFAULT '[]',
  memory_scope JSONB NOT NULL DEFAULT '{}',

  status TEXT NOT NULL DEFAULT 'created',

  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at TIMESTAMPTZ NULL
);

Invocation types

spawn
delegate
review
evaluate
synthesize
retry
fallback

57.4 · agent_steps

CREATE TABLE IF NOT EXISTS agent_steps (
  agent_step_id UUID PRIMARY KEY,

  agent_run_id UUID NOT NULL,
  command_id UUID NOT NULL,
  swarm_run_id UUID NULL,

  step_index INT NOT NULL,
  action_type TEXT NOT NULL,

  tool_name TEXT NULL,
  action_payload JSONB NOT NULL DEFAULT '{}',

  decision TEXT NOT NULL,
  observation JSONB NULL,
  output JSONB NULL,

  created_command_id UUID NULL,
  created_task_run_id UUID NULL,
  created_artifact_id UUID NULL,
  created_memory_id UUID NULL,
  created_approval_id UUID NULL,
  child_agent_run_id UUID NULL,

  latency_ms INT NULL,
  token_usage JSONB NULL,
  cost_units NUMERIC NULL,

  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

57.5 · agent_messages

Optional but useful for chat-style and collaborative agent systems.

CREATE TABLE IF NOT EXISTS agent_messages (
  agent_message_id UUID PRIMARY KEY,

  swarm_run_id UUID NULL,
  agent_run_id UUID NOT NULL,
  parent_agent_run_id UUID NULL,

  message_role TEXT NOT NULL,
  message_type TEXT NOT NULL,
  content TEXT NOT NULL,
  metadata JSONB NOT NULL DEFAULT '{}',

  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Roles & types

Roles

message_role

system
user
agent
tool
reviewer
coordinator
human
Types

message_type

instruction
observation
request
response
critique
summary
handoff

57.6 · Recommended indexes

CREATE INDEX IF NOT EXISTS idx_swarm_runs_command_id ON swarm_runs(command_id);
CREATE INDEX IF NOT EXISTS idx_swarm_runs_status     ON swarm_runs(status);

CREATE INDEX IF NOT EXISTS idx_agent_runs_swarm_run_id        ON agent_runs(swarm_run_id);
CREATE INDEX IF NOT EXISTS idx_agent_runs_parent_agent_run_id ON agent_runs(parent_agent_run_id);
CREATE INDEX IF NOT EXISTS idx_agent_runs_status              ON agent_runs(status);

CREATE INDEX IF NOT EXISTS idx_agent_invocations_parent ON agent_invocations(parent_agent_run_id);
CREATE INDEX IF NOT EXISTS idx_agent_invocations_child  ON agent_invocations(child_agent_run_id);

CREATE INDEX IF NOT EXISTS idx_agent_steps_agent_run_id ON agent_steps(agent_run_id);
CREATE INDEX IF NOT EXISTS idx_agent_steps_command_id   ON agent_steps(command_id);
CREATE INDEX IF NOT EXISTS idx_agent_steps_action_type  ON agent_steps(action_type);
Section 58 Swarm & agent state transitions

58.1 · Swarm state transitions

SwarmRun lifecycle
stateDiagram-v2 [*] --> created created --> running running --> joining running --> waiting_for_approval running --> failed running --> cancelled joining --> succeeded joining --> partially_succeeded joining --> failed waiting_for_approval --> running waiting_for_approval --> cancelled waiting_for_approval --> failed succeeded --> [*] failed --> [*] cancelled --> [*] partially_succeeded --> [*] expired --> [*]

Terminal states: succeeded, failed, cancelled, partially_succeeded, expired.

58.2 · Agent state transitions

AgentRun lifecycle
stateDiagram-v2 [*] --> created created --> running running --> waiting_for_tool running --> waiting_for_child running --> waiting_for_approval running --> joining running --> succeeded running --> failed running --> cancelled waiting_for_tool --> running waiting_for_tool --> failed waiting_for_child --> running waiting_for_child --> failed waiting_for_approval --> running waiting_for_approval --> failed joining --> succeeded joining --> failed succeeded --> [*] failed --> [*] cancelled --> [*] expired --> [*]

Terminal states: succeeded, failed, cancelled, expired.
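Both lifecycles can be enforced with a transition table. The sketch below encodes exactly the edges drawn in the two diagrams; `transition` and `IllegalTransition` are hypothetical names. Neither diagram draws an edge into `expired`, so this sketch rejects that transition too; a timeout sweeper would need an explicit edge, which is left open here.

```python
from __future__ import annotations

# Edges copied from the SwarmRun diagram (§58.1).
SWARM_TRANSITIONS: dict[str, frozenset[str]] = {
    "created": frozenset({"running"}),
    "running": frozenset({"joining", "waiting_for_approval", "failed", "cancelled"}),
    "joining": frozenset({"succeeded", "partially_succeeded", "failed"}),
    "waiting_for_approval": frozenset({"running", "cancelled", "failed"}),
}
SWARM_TERMINAL = frozenset({"succeeded", "failed", "cancelled", "partially_succeeded", "expired"})

# Edges copied from the AgentRun diagram (§58.2).
AGENT_TRANSITIONS: dict[str, frozenset[str]] = {
    "created": frozenset({"running"}),
    "running": frozenset({
        "waiting_for_tool", "waiting_for_child", "waiting_for_approval",
        "joining", "succeeded", "failed", "cancelled",
    }),
    "waiting_for_tool": frozenset({"running", "failed"}),
    "waiting_for_child": frozenset({"running", "failed"}),
    "waiting_for_approval": frozenset({"running", "failed"}),
    "joining": frozenset({"succeeded", "failed"}),
}
AGENT_TERMINAL = frozenset({"succeeded", "failed", "cancelled", "expired"})


class IllegalTransition(ValueError):
    pass


def transition(current: str, target: str,
               transitions: dict[str, frozenset[str]], terminal: frozenset[str]) -> str:
    """Return target if the edge is drawn in the diagram; raise otherwise."""
    if current in terminal:
        raise IllegalTransition(f"{current} is terminal")
    if target not in transitions.get(current, frozenset()):
        raise IllegalTransition(f"{current} -> {target} is not an allowed transition")
    return target
```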

Section 59 Standards for subagent spawning

59.1 · Spawn request contract

from dataclasses import dataclass, field
from typing import Any


@dataclass
class SpawnSubagentRequest:
    parent_agent_run_id: str
    swarm_run_id: str
    agent_name: str
    agent_role: str
    delegated_goal: str
    allowed_tools: list[str]
    allowed_connectors: list[str]
    memory_scope: dict[str, Any]
    context_scope: dict[str, Any]
    constraints: dict[str, Any] = field(default_factory=dict)
    max_steps: int = 10

59.2 · Spawn result contract

from dataclasses import dataclass, field


@dataclass
class SpawnSubagentResult:
    decision: str
    child_agent_run_id: str | None
    command_id: str | None
    reasons: list[str] = field(default_factory=list)

Allowed decisions:

allowed
denied
requires_approval
requires_more_input

59.3 · Spawn policy checks

max_swarm_size
max_spawn_depth
allowed_agent_role
allowed_tools_for_role
allowed_connectors_for_role
memory_scope_isolation
context_scope_isolation
cost_budget
step_budget
human_approval_requirement
external_side_effect_requirement
Spawn flow
sequenceDiagram autonumber participant P as Parent agent participant G as Primitive Gateway participant POL as Policy participant DB as Postgres P->>G: SpawnSubagentRequest G->>G: create command spawn_subagent G->>POL: evaluate spawn policy alt allowed POL-->>G: allowed G->>DB: insert agent_run · agent_invocation G-->>P: child_agent_run_id else denied / requires_approval POL-->>G: decision + reasons G-->>P: decision + reasons end

59.4 · Example spawn policy

def evaluate_spawn_policy(request, swarm, parent_agent):
    reasons = []

    if parent_agent["spawn_depth"] + 1 > swarm["max_depth"]:
        reasons.append("Spawn depth exceeded.")

    if swarm["current_agent_count"] >= swarm["max_agents"]:
        reasons.append("Swarm agent limit exceeded.")

    allowed_roles = parent_agent.get("allowed_child_roles", [])
    if allowed_roles and request.agent_role not in allowed_roles:
        reasons.append(f"Role not allowed: {request.agent_role}")

    parent_tools = set(parent_agent.get("allowed_tools", []))
    if not set(request.allowed_tools).issubset(parent_tools):
        reasons.append("Child requested tools outside parent scope.")

    parent_connectors = set(parent_agent.get("allowed_connectors", []))
    if not set(request.allowed_connectors).issubset(parent_connectors):
        reasons.append("Child requested connectors outside parent scope.")

    if reasons:
        return {"decision": "denied", "reasons": reasons}

    return {"decision": "allowed", "reasons": []}
Section 60 Tool calls from agents

60.1 · Tool calls must become commands

Rule

An agent tool call should not directly execute the tool. It should create a command or child task run.

Agent wants to call run_sql
→ create command: run_sql
→ policy checks permission / data scope
→ execute sync or async
→ return observation to agent
→ write audit

60.2 · Tool call contract

from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentToolCallRequest:
    agent_run_id: str
    swarm_run_id: str | None
    tool_name: str
    payload: dict[str, Any]
    reason: str
    expected_output: str
    idempotency_key: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

60.3 · Tool call routing

Routing
sequenceDiagram autonumber participant A as Agent participant G as Primitive Gateway participant POL as Policy participant E as Executor A->>G: AgentToolCallRequest G->>G: check allowed_tools G->>G: convert to command G->>POL: evaluate command policy POL-->>G: allow / deny / approve G->>E: execute sync or async E-->>G: result G->>G: persist · write AgentStep G-->>A: observation
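The routing sequence can be sketched as a small gateway class. Policy evaluation and execution are injected callables (hypothetical), the `steps` list stands in for `agent_steps` rows, and a `deny` or `require_approval` decision is only recorded and returned; opening an approval and resuming afterwards is out of scope for this sketch.

```python
from __future__ import annotations

import uuid
from collections.abc import Callable
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentToolCallRequest:
    agent_run_id: str
    swarm_run_id: str | None
    tool_name: str
    payload: dict[str, Any]
    reason: str
    expected_output: str
    idempotency_key: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


class PrimitiveGateway:
    def __init__(self, allowed_tools: dict[str, set[str]],
                 evaluate_policy: Callable[[dict], str],
                 execute: Callable[[dict], dict]) -> None:
        self.allowed_tools = allowed_tools      # agent_run_id -> permitted tool names
        self.evaluate_policy = evaluate_policy  # command -> "allow" | "deny" | "require_approval"
        self.execute = execute                  # command -> result (sync or async behind it)
        self.steps: list[dict] = []             # stand-in for agent_steps rows

    def call_tool(self, req: AgentToolCallRequest) -> dict:
        if req.tool_name not in self.allowed_tools.get(req.agent_run_id, set()):
            return self._record(req, None, "denied", {"error": "tool not in allowed_tools"})
        # The tool call becomes a command; the agent never invokes the tool directly.
        command = {
            "command_id": str(uuid.uuid4()),
            "task_name": req.tool_name,
            "payload": req.payload,
            "idempotency_key": req.idempotency_key,
            "requested_by": req.agent_run_id,
        }
        decision = self.evaluate_policy(command)
        if decision != "allow":
            return self._record(req, command["command_id"], decision, None)
        result = self.execute(command)
        return self._record(req, command["command_id"], "executed", result)

    def _record(self, req: AgentToolCallRequest, command_id: str | None,
                decision: str, output: dict | None) -> dict:
        step = {
            "agent_step_id": str(uuid.uuid4()),
            "agent_run_id": req.agent_run_id,
            "step_index": sum(1 for s in self.steps if s["agent_run_id"] == req.agent_run_id),
            "action_type": "tool_call_executed" if decision == "executed" else "tool_call_proposed",
            "tool_name": req.tool_name,
            "decision": decision,
            "created_command_id": command_id,
            "output": output,
        }
        self.steps.append(step)
        # The agent receives only the observation, never connector details.
        return {"decision": decision, "output": output}
```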
Section 61 Memory rules for swarms

Memory semantics — scope, consent, supersession, candidate writes, the MemoryStore protocol — are defined in §23 Memory architecture. This section adds only the swarm-specific deltas.

61.1 · Inheritance is subset, not union

Subset rule

Child agents do not automatically inherit parent memory access. The required invariant: child_memory_scope ⊆ parent_memory_scope. Spawn requests that violate this are rejected at policy time (see §59 Subagent spawning).

61.2 · Per-agent memory scope shape

Each AgentRun carries an explicit memory scope. The scope is bound at spawn time and immutable for the run.

{
  "subject_type": "user",
  "subject_id": "user_123",
  "allowed_memory_types": ["preference", "workflow_preference"],
  "allow_semantic_retrieval": true,
  "max_results": 10
}
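For the scope shape above, one possible reading of child ⊆ parent is: same subject, memory types a subset of the parent's, semantic retrieval only if the parent has it, and no larger `max_results`. That interpretation and the `memory_scope_violations` helper are assumptions, sketched as a check a spawn policy could call.

```python
from __future__ import annotations

from typing import Any


def memory_scope_violations(child: dict[str, Any], parent: dict[str, Any]) -> list[str]:
    """Reasons the child scope is not contained in the parent scope; empty means contained."""
    reasons: list[str] = []
    child_subject = (child.get("subject_type"), child.get("subject_id"))
    parent_subject = (parent.get("subject_type"), parent.get("subject_id"))
    if child_subject != parent_subject:
        reasons.append("child scope targets a different subject")
    extra = set(child.get("allowed_memory_types", [])) - set(parent.get("allowed_memory_types", []))
    if extra:
        reasons.append(f"memory types outside parent scope: {sorted(extra)}")
    if child.get("allow_semantic_retrieval", False) and not parent.get("allow_semantic_retrieval", False):
        reasons.append("semantic retrieval not permitted by parent")
    if child.get("max_results", 0) > parent.get("max_results", 0):
        reasons.append("max_results exceeds parent limit")
    return reasons
```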
Section 62 Connector rules for swarms

Connectors should be scoped explicitly per agent.

{
  "allowed_connectors": [
    "postgres",
    "databricks",
    "github",
    "slack"
  ]
}
Subset rule

A child agent should never gain connector access that the parent does not have. Recommended: child_connector_scope ⊆ parent_connector_scope.

For future connectors, define

connector_name
allowed_operations
credential_scope
read_scope
write_scope
rate_limit
approval_required_for_write
audit_level
Section 63 Artifact rules for swarms

Artifact semantics — types, statuses, lifecycle, and the distinction from effects — are defined in §24 Artifact architecture. This section adds only the swarm-specific delta.

63.1 · Join through artifacts, not chat

A coordinator should consume child outputs as artifacts, not as ephemeral chat messages. Every subagent result that matters is committed as a row in artifacts with a typed artifact_type; the coordinator's join strategy reads from there, not from agent message streams.

Coordinator joins artifacts
flowchart LR CA[Child agent output] --> A[Artifact] A --> COORD[Coordinator reads artifacts] COORD --> J[Join strategy] J --> FA([Final artifact]) style COORD fill:#EFE6F0,stroke:#7A5560 style FA fill:#F1F2EC,stroke:#6B7B5A

Typical swarm artifact types

research_summary
sql_result
analysis_result
review_report
risk_assessment
final_synthesis
Section 64 Cancellation semantics

Cancellation modes (graceful, compensate_then_stop) and per-command state transitions are defined in §30 Cancellation standard. This section adds only the swarm-specific delta: how cancellation cascades across a multi-agent execution.

64.1 · Cascade flow

Swarm cascade
flowchart TB CC[Cancel parent command] --> WR[cancel workflow run] WR --> SR[cancel swarm run] SR --> AR[cancel active agent runs] AR --> TR[cancel queued task runs] TR --> CI[cancel pending child invocations] CI --> AP[cancel pending approvals if appropriate] AP --> AU[/"audit all cancellations"/] style AU fill:#F1F2EC,stroke:#6B7B5A

The parent's cancellation_mode propagates: a graceful parent cancel triggers a graceful exit on each child agent run; a compensate_then_stop parent cancel triggers compensation in each child whose effects fired (reverse declaration order per child, then up the chain).

Recommended audit fields on cancel:

{
  "cancellation_mode": "graceful",
  "cancel_reason": "user_requested",
  "cancelled_by": "user_123"
}
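The cascade above can be expressed as a pure planning function that emits one audit record per cancelled object, in flowchart order, copying the parent's mode onto every record. A sketch only: the level names, the `targets` input shape, and `plan_swarm_cancellation` are hypothetical; approvals are cancelled only if the caller passes them in, matching "if appropriate".

```python
# Cascade order taken from the flowchart above; level names are illustrative.
CASCADE_ORDER = (
    "workflow_run",
    "swarm_run",
    "agent_run",
    "task_run",
    "child_invocation",
    "approval",
)
VALID_MODES = {"graceful", "compensate_then_stop"}


def plan_swarm_cancellation(targets: dict[str, list[str]], mode: str,
                            reason: str, cancelled_by: str) -> list[dict]:
    """Return one audit record per object to cancel, in cascade order.

    `targets` maps a level to the IDs still active, queued, or pending there.
    The parent's mode is copied onto every record so each child exits the
    same way the parent did.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown cancellation_mode: {mode}")
    records = []
    for level in CASCADE_ORDER:
        # Sorting keeps the output deterministic regardless of input order.
        for object_id in sorted(targets.get(level, [])):
            records.append({
                "level": level,
                "object_id": object_id,
                "cancellation_mode": mode,
                "cancel_reason": reason,
                "cancelled_by": cancelled_by,
            })
    return records
```

For example, `plan_swarm_cancellation({"agent_run": ["ar_2", "ar_1"], "swarm_run": ["sr_1"]}, "graceful", "user_requested", "user_123")` lists `sr_1` first, then `ar_1` and `ar_2`.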
Section 65 Compensation semantics

The compensation contract — the three-layer design (manifest, registration validator, runtime drift detector) — is defined in §31 Compensation standard. This section adds the swarm-specific delta: who runs the compensation chain and who proposes it.

Principle

Agents may propose compensation, but Concord executes compensation through governed commands. A compensation proposed by an agent goes through the same manifest validation and drift detection as one declared in the catalog at registration time.

65.1 · Typical swarm side-effect → compensation mapping

Side effect → Compensation
Created draft artifact → Mark artifact cancelled
Started external job → Cancel job
Sent notification → Send correction
Wrote staging table → Drop or mark stale
Created approval request → Expire approval
Wrote memory → Supersede or delete memory
Opened GitHub PR → Close PR or mark draft
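A sketch of how an agent's proposed compensation could become governed command requests rather than direct actions. The effect-type keys and compensating command names are hypothetical; the reverse ordering follows the per-child rule in §64.1, and treating an undeclared compensation as an error is an assumption.

```python
# Hypothetical command names for the compensations in the mapping above.
COMPENSATIONS = {
    "artifact.draft_created": "mark_artifact_cancelled",
    "external_job.started": "cancel_job",
    "notification.sent": "send_correction",
    "staging_table.written": "mark_staging_stale",
    "approval.requested": "expire_approval",
    "memory.written": "supersede_memory",
    "github_pr.opened": "close_pr",
}


def plan_compensation(fired_effects: list[dict]) -> list[dict]:
    """Turn the effects that actually fired into compensating command requests.

    Effects are undone in reverse declaration order. Each entry is only a
    command proposal: it still goes through policy, manifest validation,
    and audit before anything runs.
    """
    plan = []
    for effect in reversed(fired_effects):
        command_type = COMPENSATIONS.get(effect["effect_type"])
        if command_type is None:
            raise LookupError(f"no compensation declared for {effect['effect_type']}")
        plan.append({
            "command_type": command_type,
            "payload": {"target_id": effect["target_id"]},
            "compensates": effect["effect_type"],
        })
    return plan
```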
Section 66 Quality & evaluation

Swarms should support evaluation as a first-class step. Evaluation can be deterministic or agentic.

Type 01

Deterministic

  • SQL validates
  • row counts reconcile
  • JSON schema is valid
  • artifact exists
  • confidence score exceeds threshold
  • PII check passes
  • cost is below budget
Type 02

Agentic

  • reviewer agent critiques answer
  • evaluator agent ranks candidates
  • safety agent reviews external output
  • domain agent validates reasoning

Agentic evaluation must still write structured results.

{
  "evaluation_type": "reviewer_agent",
  "decision": "pass",
  "confidence": 0.91,
  "issues": [],
  "recommendation": "publish"
}
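Deterministic evaluation can write exactly the same structured shape, which lets the join step treat both evaluation types uniformly. A minimal sketch: the input keys (`row_count`, `source_row_count`, `artifact_id`, `confidence`, `cost_units`) and the `"revise"` recommendation are assumptions; only a few of the listed checks are shown.

```python
def evaluate_deterministic(result: dict, *, min_confidence: float,
                           budget_units: float) -> dict:
    """Run some of the deterministic checks above and return a structured result."""
    issues = []
    if result.get("row_count") != result.get("source_row_count"):
        issues.append("row counts do not reconcile")
    if not result.get("artifact_id"):
        issues.append("artifact missing")
    # "confidence score exceeds threshold": equal to the threshold fails.
    if result.get("confidence", 0.0) <= min_confidence:
        issues.append("confidence below threshold")
    # "cost is below budget": equal to the budget fails.
    if result.get("cost_units", 0.0) >= budget_units:
        issues.append("cost over budget")
    passed = not issues
    return {
        "evaluation_type": "deterministic",
        "decision": "pass" if passed else "fail",
        "confidence": result.get("confidence"),
        "issues": issues,
        "recommendation": "publish" if passed else "revise",
    }
```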
Section 67 Standard YAML specification

A complete swarm declared in YAML:

name: revenue_report_swarm
objective: Generate and review a monthly revenue report.
execution_mode: parallel
join_strategy: coordinator_synthesis

limits:
  max_agents: 4
  max_depth: 1
  max_total_steps: 80
  max_cost_units: 50

coordinator:
  agent_name: revenue_report_coordinator
  agent_role: coordinator
  max_steps: 20
  allowed_tools:
    - spawn_subagent
    - read_artifact
    - create_artifact
    - request_approval
  allowed_connectors:
    - postgres
    - databricks

agents:
  - agent_name: revenue_researcher
    agent_role: researcher
    delegated_goal: Gather source data and assumptions.
    max_steps: 15
    allowed_tools:
      - run_sql
      - retrieve_memory
      - create_artifact
    allowed_connectors:
      - postgres
      - databricks

  - agent_name: revenue_analyst
    agent_role: analyst
    delegated_goal: Compute metrics and produce analysis.
    max_steps: 15
    allowed_tools:
      - run_sql
      - create_artifact
    allowed_connectors:
      - postgres
      - databricks

  - agent_name: revenue_reviewer
    agent_role: reviewer
    delegated_goal: Validate the final report and identify risks.
    max_steps: 10
    allowed_tools:
      - read_artifact
      - evaluate_output
      - request_human_input
    allowed_connectors:
      - postgres
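A spec like this can be checked against its own limits before any agent runs. The sketch below assumes the YAML has already been parsed into a plain dict (only the fields the check reads are copied) and that the coordinator counts toward `max_agents`; `check_swarm_limits` is a hypothetical name.

```python
def check_swarm_limits(spec: dict) -> list[str]:
    """Return violations; an empty list means the declared agents fit the limits."""
    limits = spec["limits"]
    agents = [spec["coordinator"], *spec["agents"]]
    errors = []
    if len(agents) > limits["max_agents"]:
        errors.append(f"{len(agents)} agents exceeds max_agents={limits['max_agents']}")
    total_steps = sum(a["max_steps"] for a in agents)
    if total_steps > limits["max_total_steps"]:
        errors.append(f"sum of max_steps {total_steps} exceeds "
                      f"max_total_steps={limits['max_total_steps']}")
    # Connector subset rule (Section 62) against the coordinator's scope.
    coordinator_connectors = set(spec["coordinator"]["allowed_connectors"])
    for agent in spec["agents"]:
        extra = set(agent["allowed_connectors"]) - coordinator_connectors
        if extra:
            errors.append(f"{agent['agent_name']} has connectors outside "
                          f"coordinator scope: {sorted(extra)}")
    return errors


spec = {
    "limits": {"max_agents": 4, "max_depth": 1, "max_total_steps": 80, "max_cost_units": 50},
    "coordinator": {"agent_name": "revenue_report_coordinator", "max_steps": 20,
                    "allowed_connectors": ["postgres", "databricks"]},
    "agents": [
        {"agent_name": "revenue_researcher", "max_steps": 15,
         "allowed_connectors": ["postgres", "databricks"]},
        {"agent_name": "revenue_analyst", "max_steps": 15,
         "allowed_connectors": ["postgres", "databricks"]},
        {"agent_name": "revenue_reviewer", "max_steps": 10,
         "allowed_connectors": ["postgres"]},
    ],
}
print(check_swarm_limits(spec))  # []
```

With the values above, four agents fit `max_agents: 4` and the step budgets sum to 60, under `max_total_steps: 80`.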
Section 68 Standard Python interfaces

The core service interfaces — CommandService, PolicyEngine, Planner, Executor, Worker, DurableRuntime, MemoryStore — live in §41 Minimal service interfaces. This section adds the three agent-and-swarm interfaces that are not defined elsewhere.

68.1 · Agent runtime interface

from typing import Protocol, Any


class AgentRuntime(Protocol):
    def start_agent_run(
        self,
        agent_run_id: str,
        goal: str,
        context: dict[str, Any],
    ) -> dict[str, Any]: ...

    def resume_agent_run(
        self,
        agent_run_id: str,
        observation: dict[str, Any],
    ) -> dict[str, Any]: ...

    def cancel_agent_run(
        self,
        agent_run_id: str,
        reason: str,
    ) -> dict[str, Any]: ...
Adapter, not library binding

Concord does not depend on a specific LLM/agent library. LangGraph, custom loops, OpenAI Agents SDK, CrewAI, or other systems plug in as adapters behind this protocol — same pattern as DurableRuntime in §41.
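To show what "adapter behind this protocol" means in practice, here is a hypothetical scripted adapter of the kind a test suite might use. It replays fixed proposals instead of calling a model and never executes tools itself; every return value is a proposal the caller routes back through Concord. The class name and the return-dict shape are assumptions.

```python
from typing import Any, Protocol


class AgentRuntime(Protocol):  # same protocol as above, repeated so the sketch runs alone
    def start_agent_run(self, agent_run_id: str, goal: str, context: dict[str, Any]) -> dict[str, Any]: ...
    def resume_agent_run(self, agent_run_id: str, observation: dict[str, Any]) -> dict[str, Any]: ...
    def cancel_agent_run(self, agent_run_id: str, reason: str) -> dict[str, Any]: ...


class ScriptedAgentAdapter:
    """Hypothetical test adapter: replays a fixed list of proposals, no LLM involved."""

    def __init__(self, script: list[dict[str, Any]]):
        self._script = script
        self._cursor: dict[str, int] = {}
        self._cancelled: set[str] = set()

    def _next(self, agent_run_id: str) -> dict[str, Any]:
        if agent_run_id in self._cancelled:
            return {"status": "cancelled"}
        i = self._cursor.get(agent_run_id, 0)
        self._cursor[agent_run_id] = i + 1
        if i >= len(self._script):
            return {"status": "completed"}
        return {"status": "proposed", "proposal": self._script[i]}

    def start_agent_run(self, agent_run_id, goal, context):
        self._cursor[agent_run_id] = 0
        return self._next(agent_run_id)

    def resume_agent_run(self, agent_run_id, observation):
        return self._next(agent_run_id)

    def cancel_agent_run(self, agent_run_id, reason):
        self._cancelled.add(agent_run_id)
        return {"status": "cancelled", "reason": reason}


runtime: AgentRuntime = ScriptedAgentAdapter([{"tool": "run_sql", "args": {"query": "select 1"}}])
print(runtime.start_agent_run("ar_1", "Gather source data", {}))
print(runtime.resume_agent_run("ar_1", {"rows": 1}))
```

A LangGraph or OpenAI Agents SDK adapter would keep the same three methods and hide the library-specific loop inside them.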

68.2 · Swarm planner interface

class SwarmPlanner(Protocol):
    def create_swarm_plan(
        self,
        command: dict[str, Any],
        context: dict[str, Any],
    ) -> dict[str, Any]: ...

68.3 · Join strategy interface

class JoinStrategy(Protocol):
    def join(
        self,
        swarm_run: dict[str, Any],
        child_results: list[dict[str, Any]],
    ) -> dict[str, Any]: ...

Example join output:

{
  "decision": "succeeded",
  "summary": "All required agents completed.",
  "selected_artifact_id": "artifact_123",
  "confidence": 0.88,
  "requires_human_review": false
}
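A hypothetical JoinStrategy that produces this output shape: succeed only if every required agent succeeded, take the swarm's confidence as the weakest required child, and flag human review below a threshold. The `required_agents` and `output_agent` keys on `swarm_run`, the child-result keys, and the 0.8 default are assumptions.

```python
from typing import Any


class AllRequiredJoin:
    """Hypothetical join: all required agents must succeed."""

    def __init__(self, review_threshold: float = 0.8):
        self.review_threshold = review_threshold

    def join(self, swarm_run: dict[str, Any],
             child_results: list[dict[str, Any]]) -> dict[str, Any]:
        by_agent = {r["agent_name"]: r for r in child_results}
        missing = [
            name for name in swarm_run["required_agents"]
            if by_agent.get(name, {}).get("status") != "succeeded"
        ]
        if missing:
            return {
                "decision": "failed",
                "summary": f"Required agents did not succeed: {', '.join(missing)}.",
                "selected_artifact_id": None,
                "confidence": 0.0,
                "requires_human_review": True,
            }
        # Conservative choice: the swarm is only as confident as its weakest required child.
        confidence = min(by_agent[n]["confidence"] for n in swarm_run["required_agents"])
        return {
            "decision": "succeeded",
            "summary": "All required agents completed.",
            "selected_artifact_id": by_agent[swarm_run["output_agent"]]["artifact_id"],
            "confidence": confidence,
            "requires_human_review": confidence < self.review_threshold,
        }
```

The join reads artifact IDs from child results, consistent with §63.1: it never parses agent chat.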
Section 69 How to map any agentic task

Use this 18-point checklist.

  1. What is the parent command?
  2. Is an agent needed, or is a deterministic function enough?
  3. Is this single-agent or swarm?
  4. What is the swarm objective?
  5. Who is the coordinator?
  6. What subagent roles are allowed?
  7. Can subagents spawn further children?
  8. What is the max depth?
  9. What tools can each agent use?
  10. What connectors can each agent use?
  11. What memory can each agent read?
  12. Can any agent write memory?
  13. What artifacts should each agent produce?
  14. How are outputs joined?
  15. Does any step require human approval?
  16. What are the cancellation rules?
  17. What are the compensation rules?
  18. What audit records are mandatory?
Section 70 Example · research and report swarm

70.1 · User request

Create a monthly revenue report, check it, and prepare it for external sharing.

70.2 · Primitive mapping

End-to-end
flowchart TB IN([User request]) --> CMD["Command
generate_and_prepare_revenue_report"] CMD --> POL{{Policy}} POL --> PL([Plan]) PL --> SR[SwarmRun] SR --> COORD[AgentRun · coordinator] COORD --> R[AgentRun · researcher] COORD --> A[AgentRun · analyst] COORD --> REV[AgentRun · reviewer] R --> J["JoinStrategy
coordinator_synthesis"] A --> J REV --> J J --> ART["ArtifactWrite
draft_report"] ART --> POL2{{Policy · external_sharing}} POL2 --> HA[HumanApproval] HA --> AT["AsyncTask
publish_report"] AT --> NOT[Notification] NOT --> AUD[/Audit/] style POL fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px style POL2 fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px style COORD fill:#EFE6F0,stroke:#7A5560 style R fill:#EFE6F0,stroke:#7A5560 style A fill:#EFE6F0,stroke:#7A5560 style REV fill:#EFE6F0,stroke:#7A5560 style AUD fill:#F1F2EC,stroke:#6B7B5A

70.3 · State flow

command.created
→ command.validated
→ swarm.created
→ swarm.running
→ agent_runs.running
→ swarm.joining
→ artifact.created
→ command.waiting_for_approval
→ command.approved
→ command.queued
→ command.running
→ command.succeeded
Section 71 Operational guardrails

71.1 · Hard limits

Every swarm should have hard limits.

max_agents
max_depth
max_steps_per_agent
max_total_steps
max_runtime_seconds
max_cost_units
max_tool_calls
max_memory_reads
max_memory_writes
max_connector_calls
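A minimal sketch of enforcing these limits as counters, assuming each counter name matches a `max_*` limit above. `SwarmBudget` is hypothetical; in a real deployment the counters would have to be durable (for example, updated in a transaction step) rather than held in memory as here. An undeclared limit is treated as an error, not as "unlimited".

```python
from dataclasses import dataclass, field


@dataclass
class SwarmBudget:
    """Hypothetical in-memory counters for the hard limits above."""
    limits: dict[str, int]
    used: dict[str, int] = field(default_factory=dict)

    def charge(self, counter: str, amount: int = 1) -> bool:
        """Record usage if it stays within the limit; otherwise record nothing and return False."""
        limit_key = f"max_{counter}"
        if limit_key not in self.limits:
            # Fail closed: a counter without a declared limit is a configuration error.
            raise KeyError(f"no hard limit declared for {counter}")
        new_total = self.used.get(counter, 0) + amount
        if new_total > self.limits[limit_key]:
            return False
        self.used[counter] = new_total
        return True


budget = SwarmBudget({"max_total_steps": 80, "max_tool_calls": 2})
print([budget.charge("tool_calls") for _ in range(3)])  # [True, True, False]
```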

71.2 · Role-based tool permissions

{
  "researcher": ["run_sql", "retrieve_memory", "create_artifact"],
  "analyst":    ["run_sql", "create_artifact", "evaluate_output"],
  "reviewer":   ["read_artifact", "evaluate_output", "request_human_input"],
  "publisher":  ["publish_artifact", "request_approval"]
}

71.3 · No uncontrolled recursion

Every spawn must check:

parent_depth + 1 ≤ max_depth
current_agent_count < max_agents
child_scope ⊆ parent_scope
child_tools ⊆ parent_tools
child_connectors ⊆ parent_connectors
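The five spawn checks translate directly into a pure function that policy can call. This sketch assumes each scope can be reduced to sets of names (memory scope simplified to memory types); `AgentScope` and `check_spawn` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentScope:
    memory: frozenset[str]      # simplified: allowed memory types only
    tools: frozenset[str]
    connectors: frozenset[str]


def check_spawn(parent: AgentScope, child: AgentScope, *, parent_depth: int,
                current_agent_count: int, max_depth: int, max_agents: int) -> list[str]:
    """Return the violated spawn rules; an empty list means the spawn may proceed to policy."""
    violations = []
    if parent_depth + 1 > max_depth:
        violations.append("max_depth")
    if current_agent_count >= max_agents:
        violations.append("max_agents")
    if not child.memory <= parent.memory:
        violations.append("memory_scope")
    if not child.tools <= parent.tools:
        violations.append("tools")
    if not child.connectors <= parent.connectors:
        violations.append("connectors")
    return violations
```

Returning every violation, instead of stopping at the first, gives the audit record a complete reason for the denial.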

71.4 · No ungoverned side effects

Agents may not directly perform external writes. External writes should be commands:

send_email
publish_report
write_table
create_github_pr
post_to_slack
update_memory
grant_permission

Each goes through: Command → Policy → Approval if needed → Execution → Audit.
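A sketch of the routing decision at the primitive gateway: external writes become command requests, everything else becomes a task run. The `route_tool_call` name and the return shape are hypothetical; `requested_by` and `ingress` mirror columns of the commands table in §88.3.

```python
# External-write tools from the list above; hypothetical constant name.
EXTERNAL_WRITE_COMMANDS = {
    "send_email", "publish_report", "write_table", "create_github_pr",
    "post_to_slack", "update_memory", "grant_permission",
}


def route_tool_call(agent_run_id: str, tool_name: str, args: dict) -> dict:
    """Decide how a tool call proposed by an agent enters Concord.

    External writes are never executed directly: they become a command that
    must pass policy, any approval, execution, and audit.
    """
    if tool_name in EXTERNAL_WRITE_COMMANDS:
        return {
            "route": "command",
            "command": {
                "command_type": tool_name,
                "payload": args,
                "requested_by": f"agent_run:{agent_run_id}",
                "ingress": "agent",
            },
        }
    return {"route": "task_run", "tool_name": tool_name, "args": args}
```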

Section 72 Recommended update to core architecture

Add this section to the main Concord architecture document:

Agentic execution is modeled as a governed extension of the primitive system. A single-agent workflow is one AgentRun attached to a command. A multi-agent workflow is a SwarmRun containing many AgentRuns. A subagent spawn is a governed command that creates an AgentInvocation and child AgentRun. Agent tool calls are commands or task runs. Agent outputs are artifacts. Agent memory writes are candidate memory records. Agent observations and decisions are AgentSteps.
Section 73 Final principle

The system should support future agent runtimes and connector ecosystems without changing the core architecture.

Do not
  • encode agent framework assumptions into the database
  • let agents bypass the primitive gateway
  • let child agents expand their own authority
  • treat chat messages as the only source of truth
Do persist
  • agent runs & steps
  • invocations & spawn decisions
  • artifacts & memory candidates
  • approvals & audit events

Concord should remain:

Postgres-first
Runtime-agnostic
Connector-extensible
Agent-compatible
Policy-governed
Audit-complete
Deterministic when possible
Agentic when useful
Part III · Addendum · Decision accepted

Functional core architecture on DBOS

Concord is built on DBOS + Postgres. DBOS is the durable execution runtime; Postgres is the system of record; Concord is the semantic and functional-core layer.

This decision supersedes earlier implementation ideas that proposed custom queue, retry, lease, schedule, and outbox machinery inside Concord.

Section 74 Decision accepted · Concord on DBOS

The goal is not to recreate DBOS inside Concord. The goal is to define a clean primitive vocabulary and functional decision layer that DBOS can execute durably.

Three-layer stack
flowchart TB PF["Concord
semantic primitive layer
functional decision core"] DBOSR["DBOS
durable execution runtime
workflows · steps · queues · schedules"] PG[("Postgres
durable system of record
Concord domain + DBOS runtime tables")] PF --> DBOSR --> PG style PF fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px style DBOSR fill:#FAF8F2,stroke:#141413 style PG fill:#FFFFFF,stroke:#141413,stroke-width:1.5px

74.1 · Core decision

Use DBOS for runtime mechanics. Use Concord for meaning, governance, and architecture.

74.2 · Ownership split

DBOS owns

Runtime mechanics

  • durable workflows & steps
  • workflow recovery
  • retries · queues · schedules
  • durable sleep
  • workflow IDs / idempotency
  • Postgres transaction tracking
  • concurrency & rate limits
Concord owns

Meaning & governance

  • command taxonomy
  • policy model
  • task classification & planning semantics
  • approval, memory, artifact models
  • connector contracts
  • agent & swarm ontology
  • domain audit events
Section 75 Why DBOS is the runtime

DBOS aligns with Concord because Concord is:

Postgres-first
functional-core oriented
app-local
connector-heavy
agent-compatible
domain-audit driven

DBOS workflows provide durable execution: after an interruption, a workflow resumes from its last completed step instead of re-running the steps that already finished. DBOS workflow IDs can act as idempotency keys. Workflows are expected to be deterministic; non-deterministic work (database access, third-party APIs, randomness, local time) belongs inside DBOS steps.

Principle

Primitives decide and describe. DBOS steps execute side effects. Postgres records the domain truth.

Section 76 Design philosophy after DBOS

76.1 · Concord becomes smaller

Concord should not be a workflow engine. It becomes a semantic workflow kernel:

Given command + context + current domain state,
derive policy decisions, plans, domain events, and DBOS execution requests.

The implementation should avoid building infrastructure DBOS already supplies.

76.2 · Functional core, DBOS shell

Layer · A

Functional core

  • pure command classification
  • pure policy evaluation
  • pure plan creation
  • pure state transition derivation
  • pure connector permission checks
  • pure agent / swarm planning
Layer · B

DBOS shell

  • durable workflows & steps
  • transaction steps
  • connector calls
  • LLM calls
  • queues · retries · schedules
  • sleep / wait / resume

The functional core is unit-testable without DBOS. The DBOS shell is integration-tested with Postgres and real or mocked connectors.

76.3 · Domain events are not DBOS internals

DBOS already has runtime execution state. Concord should maintain domain events, not duplicate DBOS runtime history.

Yes · domain

Business meaning

command_created
policy_evaluated
approval_requested
approval_granted
memory_candidate_created
artifact_created
connector_invocation_succeeded
booking_confirmed
No · DBOS

Runtime mechanics

worker lease acquired
generic retry attempt
queue poll started
workflow checkpoint
task heartbeat
Rule

DBOS owns runtime mechanics. Concord owns business meaning.
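One way to enforce the rule at write time is an allowlist of domain event types. This is a sketch: the constant and function names are hypothetical. Rejected events are returned rather than silently discarded, so the shell can log them.

```python
# Domain event types from the "Yes" column above.
DOMAIN_EVENT_TYPES = frozenset({
    "command_created", "policy_evaluated", "approval_requested", "approval_granted",
    "memory_candidate_created", "artifact_created", "connector_invocation_succeeded",
    "booking_confirmed",
})


def split_domain_events(events: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split events into (domain events to persist, everything else)."""
    kept, dropped = [], []
    for event in events:
        (kept if event["event_type"] in DOMAIN_EVENT_TYPES else dropped).append(event)
    return kept, dropped
```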

Section 77 What we remove from Concord

Earlier versions included concepts now delegated to DBOS.

77.1 · Remove custom durable queue

Do not build a custom queue table, queue claiming, worker lease system, polling loop, global concurrency controller, or rate limiter. Use DBOS queues.

Concord may still define semantic queue names:

connector_calls
agent_runs
swarm_children
notifications
scheduled_jobs

77.2 · Remove custom retry runner

Use DBOS step/workflow retry behavior. Concord can still declare domain-level retry metadata:

{
  "operation": "hotel_booking.search",
  "retry_class": "transient_connector_error",
  "max_attempts": 3,
  "requires_idempotency_key": true
}
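The domain metadata can be translated into runtime settings instead of driving a custom retry loop. A sketch under assumptions: the output keys are intended to match the retry options of DBOS's `@DBOS.step` decorator (`retries_allowed`, `max_attempts`, `interval_seconds`, `backoff_rate`), while the per-class defaults and the rule refusing retries without an idempotency key are hypothetical Concord policy.

```python
# Hypothetical per-class backoff defaults.
RETRY_CLASS_DEFAULTS = {
    "transient_connector_error": {"interval_seconds": 1.0, "backoff_rate": 2.0},
    "rate_limited": {"interval_seconds": 10.0, "backoff_rate": 2.0},
}


def step_retry_options(meta: dict, has_idempotency_key: bool) -> dict:
    """Translate domain retry metadata into DBOS step retry options.

    A retried external call without an idempotency key could apply twice,
    so retries are refused when the metadata says a key is required.
    """
    if meta["max_attempts"] > 1 and meta.get("requires_idempotency_key") and not has_idempotency_key:
        raise ValueError(f"{meta['operation']}: retries require an idempotency key")
    return {
        "retries_allowed": meta["max_attempts"] > 1,
        "max_attempts": meta["max_attempts"],
        **RETRY_CLASS_DEFAULTS[meta["retry_class"]],
    }
```

In the DBOS shell the result could be applied as `@DBOS.step(**options)`; the pure function itself never imports DBOS.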

77.3 · Remove custom lease and claim primitives

Drop claimed_by, lease_until, worker tick, manual claim_next_task. Where domain-level ownership matters (an approval assigned to an approver, an agent role assigned to a runtime adapter), model it semantically. Runtime ownership belongs to DBOS.

77.4 · Remove custom schedule runner

Use DBOS schedules. Concord can still declare schedule specs (cron + queue + command_type) and let DBOS execute them.

77.5 · Remove generic effect outbox

The earlier functional design proposed a generic effect_outbox. With DBOS, this is replaced by DBOS workflows, steps, and queues.

Distinction

DBOS queue/workflow/step = executable runtime mechanism. Concord domain_effect = semantic record that an action was requested or performed. Don't use the latter as an execution engine.

Section 78 What Concord keeps

Concord remains responsible for the conceptual structure of work.

Core primitives

Ingress
Command
Context
Policy
Plan
State
Approval
Memory
Artifact
Notification
Cancellation
Compensation
Audit
AgentRun
SwarmRun
ConnectorInvocation

DBOS executes them; DBOS does not define their meaning.

78.1 · Command model

Every consequential action still becomes a command. A command is the durable representation of user, system, connector, or agent intent.

search_hotels
rank_hotel_options
create_booking_draft
book_hotel
cancel_reservation
write_user_preference
spawn_subagent
run_connector_operation
create_artifact
request_approval

78.2 · Policy, artifact, memory models stay

DBOS does not know whether a hotel booking requires approval, whether a connector write is safe, whether memory needs consent, or whether a subagent may access a connector. Concord owns those decisions.

78.3 · Agent & swarm ontology stays

DBOS can run agent workflows durably, but Concord owns: AgentRun, SwarmRun, AgentInvocation, AgentStep, JoinStrategy, ToolScope, ConnectorScope, MemoryScope, SpawnPolicy.

Roles

The agent runtime is an adapter. DBOS is the durable executor. Concord is the governance and domain model.

Section 79 Mapping Concord to DBOS
Concord concept → DBOS implementation
Command submission → DBOS workflow start
Idempotency key → DBOS workflow ID
Sync primitive → Normal function (or DBOS step if side-effectful)
Async primitive → DBOS background or queued workflow
Queue → DBOS queue
Retry → DBOS step/workflow retry settings
Schedule → DBOS schedule
Durable timer → DBOS durable sleep
External side effect → DBOS step
Postgres write → DBOS datasource transaction
Agent run → DBOS workflow
Subagent spawn → DBOS child/queued workflow + Concord AgentInvocation
Swarm → DBOS parent workflow coordinating child workflows
Approval wait → DBOS workflow waits on durable approval state/event
Connector call → DBOS step calling connector adapter
Artifact write → DBOS transaction step
Memory write → DBOS transaction step after policy
Domain audit → DBOS transaction step writing audit table
Section 80 New layered architecture

80.1 · Package layout

concord/
  core/
    types.py
    commands.py
    context.py
    classify.py
    policy.py
    planning.py
    transitions.py
    reducers.py
    validation.py

  domain/
    approvals.py
    memory.py
    artifacts.py
    connectors.py
    agents.py
    swarms.py
    audit.py

  dbos_runtime/
    workflows.py
    steps.py
    queues.py
    schedules.py
    datasource.py
    approval_waits.py
    swarm_workflows.py

  adapters/
    connectors/
      hotel.py
      github.py
      slack.py
      databricks.py
    agents/
      base.py
      langgraph_adapter.py
      custom_agent_adapter.py
      openai_agents_adapter.py

  postgres/
    schema.sql
    repositories.py
    projections.py

80.2 · Layer ownership

Dependencies
flowchart TB CORE["core/
pure Python · no DBOS · no DB · no LLMs"] DOMAIN["domain/
data contracts · repository interfaces"] DBOSR["dbos_runtime/
only layer importing DBOS"] ADAPTERS["adapters/
connectors · agent runtimes"] PG["postgres/
schema · repos · projections"] CORE --> DBOSR DOMAIN --> DBOSR DBOSR --> ADAPTERS DBOSR --> PG style CORE fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px style DBOSR fill:#FAF8F2,stroke:#141413 style PG fill:#FFFFFF,stroke:#141413 style ADAPTERS fill:#EFE6F0,stroke:#7A5560 style DOMAIN fill:#F1F2EC,stroke:#6B7B5A
Why

Keeps Concord portable. DBOS-specific code does not leak into the semantic layer.

Section 81 Functional core standard

81.1 · Core types

All pure Concord functions return values of this shape:

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class CoreEvent:
    event_type: str
    payload: dict[str, Any]


@dataclass(frozen=True)
class CoreEffect:
    effect_type: str
    payload: dict[str, Any]
    idempotency_key: str | None = None


@dataclass(frozen=True)
class CoreResult:
    status: str
    events: list[CoreEvent] = field(default_factory=list)
    effects: list[CoreEffect] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
Important

CoreEffect is a semantic description, not an execution queue. DBOS interprets it through workflows and steps.

Core → shell
flowchart LR IN[command + context + state] --> CORE[Pure Concord function] CORE --> OUT["CoreResult
status · events · effects"] OUT --> DBOS[DBOS step interprets] DBOS --> PG[(Postgres writes)] DBOS --> CON[Connector calls] DBOS --> Q[Queue enqueues] style CORE fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px style DBOS fill:#FAF8F2,stroke:#141413 style PG fill:#FFFFFF,stroke:#141413

81.2 · Example · functional policy

def evaluate_booking_policy(command: dict, context: dict, state: dict) -> CoreResult:
    payload = command["payload"]

    required = [
        "booking_draft_id",
        "hotel_name",
        "check_in_date",
        "check_out_date",
        "total_price",
        "currency",
        "cancellation_policy_summary",
    ]
    missing = [f for f in required if not payload.get(f)]

    if missing:
        return CoreResult(
            status="waiting_for_input",
            events=[CoreEvent("policy_evaluated", {"decision": "require_more_input", "missing": missing})],
            effects=[CoreEffect(
                "user.request_input",
                {"missing_fields": missing},
                idempotency_key=f"input:{command['command_id']}",
            )],
        )

    if not payload.get("user_explicitly_approved"):
        return CoreResult(
            status="waiting_for_approval",
            events=[CoreEvent("policy_evaluated", {"decision": "require_approval"})],
            effects=[CoreEffect(
                "approval.request",
                {
                    "approval_type": "hotel_booking",
                    "command_id": command["command_id"],
                    "approver": context["user_id"],
                    "approval_packet": payload,
                },
                idempotency_key=f"approval:{command['command_id']}",
            )],
        )

    return CoreResult(
        status="allowed",
        events=[CoreEvent("policy_evaluated", {"decision": "allow"})],
    )

This function does not write Postgres, call DBOS, call hotel APIs, send notifications, mutate objects, read time, or generate random IDs. It only describes intent.
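When a core function does need an ID (for an event or an effect), one option that keeps it pure is to derive the ID from its inputs instead of generating a random one. The sketch uses name-based UUIDs (`uuid.uuid5`); the namespace value and `derived_id` helper are hypothetical.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
CONCORD_NAMESPACE = uuid.UUID("6f1c2b4e-0000-4000-8000-000000000000")


def derived_id(command_id: str, purpose: str, index: int = 0) -> str:
    """Derive a stable ID from inputs instead of generating a random one.

    The same inputs always give the same UUID, so replaying the function,
    or a DBOS workflow that calls it, yields identical IDs.
    """
    return str(uuid.uuid5(CONCORD_NAMESPACE, f"{command_id}:{purpose}:{index}"))
```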

Section 82 DBOS runtime standard

82.1 · Workflow as durable shell

from dbos import DBOS, SetWorkflowID


@DBOS.workflow()
def run_command(command_id: str) -> dict:
    command = load_command_tx(command_id)

    policy_result = evaluate_policy_step(command_id)
    persist_core_events_step(command_id, policy_result["events"])

    if policy_result["status"] == "waiting_for_input":
        request_input_step(command_id, policy_result["effects"])
        return {"status": "waiting_for_input"}

    if policy_result["status"] == "waiting_for_approval":
        request_approval_step(command_id, policy_result["effects"])
        wait_for_approval_workflow(command_id)

    plan = create_plan_step(command_id)
    result = execute_plan_workflow(command_id, plan)

    finalize_command_step(command_id, result)
    return result

Intentionally thin. The workflow body only calls steps and branches on their results, which DBOS checkpoints, so a replay takes the same path; the business logic lives in the pure core.

82.2 · Step as side-effect boundary

Every non-deterministic operation goes inside a DBOS step or DBOS datasource transaction:

read/write Postgres
call connector
call LLM
send notification
create approval
write memory
write artifact
read current time
generate external IDs

82.3 · Datasource transaction

Use DBOS datasource transactions for Postgres writes that must not re-execute after workflow replay: insert command, insert approval, insert artifact, insert memory, insert audit event, insert agent step, insert connector invocation, update domain projection.

Section 83 DBOS queues standard

Concord defines semantic queue names; DBOS manages the mechanics (concurrency, partitioning, rate limiting).

Recommended queues

concord_default
connector_calls
agent_runs
swarm_children
notifications
scheduled_maintenance
high_risk_operations

Queue choice is a planning outcome

def choose_queue(effect: CoreEffect, context: dict) -> str:
    if effect.effect_type.startswith("connector."):
        return "connector_calls"
    if effect.effect_type.startswith("agent.run"):
        return "agent_runs"
    if effect.effect_type.startswith("agent.spawn"):
        return "swarm_children"
    if effect.effect_type.startswith("notification."):
        return "notifications"
    return "concord_default"
Section 84 DBOS schedules standard

Concord does not build a scheduler. Schedule specs are domain configuration; DBOS owns execution and backfill.

schedules:
  - name: expire_pending_approvals
    command_type: expire_approvals
    cron: "*/5 * * * *"
    queue: scheduled_maintenance

  - name: sync_connector_metadata
    command_type: sync_connector_metadata
    cron: "0 * * * *"
    queue: scheduled_maintenance

  - name: memory_decay_review
    command_type: review_stale_memory
    cron: "0 3 * * *"
    queue: scheduled_maintenance
Section 85 DBOS and agentic workflows

85.1 · AgentRun as DBOS workflow

Concord AgentRun row
→ DBOS workflow: run_agent(agent_run_id)
→ DBOS steps call agent runtime
→ agent proposes commands / tool calls
→ each proposed action re-enters Concord

Agents do not directly execute external side effects.

85.2 · Agent steps as domain records

Agent steps are written to Concord tables for product-level observability and replay:

agent_started
tool_call_proposed
tool_call_allowed
tool_call_denied
subagent_spawn_requested
artifact_created
memory_candidate_created
agent_completed

These are domain trace records, not DBOS runtime logs.

85.3 · Subagent spawn as DBOS child/queued workflow

Command: spawn_subagent
→ Policy: check role / tool / connector / memory / depth limits
→ DBOS workflow creates AgentInvocation
→ DBOS enqueues child AgentRun workflow
Do not

Let the agent runtime spawn unmanaged processes or coroutines.

85.4 · Swarm as parent DBOS workflow

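from dbos import DBOS, Queue

# Semantic queue name from §83; DBOS manages its concurrency and recovery.
swarm_children_queue = Queue("swarm_children")


@DBOS.workflow()
def run_swarm(swarm_run_id: str) -> dict:
    swarm = load_swarm_tx(swarm_run_id)

    # Planning runs in a step, so its output is checkpointed and replays identically.
    child_specs = plan_swarm_children_step(swarm_run_id)

    # Enqueue child AgentRun workflows (run_agent, §85.1) in the workflow body, as
    # §90 allows, rather than inside a step. Assumes each spec carries agent_run_id.
    handles = [
        swarm_children_queue.enqueue(run_agent, child["agent_run_id"])
        for child in child_specs
    ]
    results = [h.get_result() for h in handles]

    joined = join_swarm_results_step(swarm_run_id, results)
    persist_swarm_result_step(swarm_run_id, joined)

    return joined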
Section 86 Approval workflows on DBOS

Human approvals are domain state plus workflow waiting.

Approval under DBOS
sequenceDiagram autonumber participant W as DBOS workflow participant POL as Concord policy participant DB as Postgres participant N as Notification step participant U as User W->>POL: evaluate POL-->>W: approval.request effect W->>DB: create approval row (tx step) W->>N: notify approver W->>W: durable wait / poll approval state U->>DB: approve via UI/API → command: resolve_approval DB-->>W: approval_granted event W->>W: resume
Durability

The workflow must not rely only on in-memory callbacks. Approval state must be durable in Postgres.
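Reading the durable approval row can itself be a pure function, as long as the current time is passed in (read inside a DBOS step) rather than read directly. A sketch assuming status values `pending`, `granted`, and `denied` and an `expires_at` column; neither is defined in this section.

```python
from datetime import datetime, timezone


def approval_outcome(approval: dict, now: datetime) -> str:
    """Pure read of durable approval state: "granted", "denied", "expired", or "pending"."""
    status = approval["status"]
    if status in ("granted", "denied"):
        return status
    if now >= approval["expires_at"]:
        return "expired"
    return "pending"
```

While the outcome is `"pending"`, the workflow could durably sleep (DBOS provides `DBOS.sleep`) and check again from a fresh step; a crash between checks loses nothing because the state lives in Postgres.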

Section 87 Connector execution on DBOS

87.1 · Connector calls are DBOS steps

Concord records the semantic invocation; DBOS executes the step durably.

hotel_inventory.search
hotel_booking.book
github.create_pr
slack.send_message
databricks.start_job

87.2 · Connector idempotency

connector: hotel_booking
operation: book_hotel
side_effect: true
idempotency_required: true
idempotency_key_template: "book_hotel:{booking_draft_id}"
approval_required: true
Don't assume

DBOS workflow IDs and step boundaries help prevent re-execution, but external APIs still need domain idempotency keys when they support them. DBOS alone does not make third-party side effects idempotent.
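A sketch of filling `idempotency_key_template` from the command payload, using the standard library's `string.Formatter` to find placeholders. `render_idempotency_key` is hypothetical. It fails loudly on a missing or empty field, because a key such as `"book_hotel:"` would collide across unrelated bookings.

```python
import string


def render_idempotency_key(template: str, payload: dict) -> str:
    """Fill a connector's idempotency_key_template from the command payload."""
    fields = [name for _, name, _, _ in string.Formatter().parse(template) if name]
    missing = [f for f in fields if not payload.get(f)]
    if missing:
        raise ValueError(f"idempotency key template needs {missing}")
    return template.format(**payload)


print(render_idempotency_key("book_hotel:{booking_draft_id}", {"booking_draft_id": "draft_42"}))
# book_hotel:draft_42
```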

87.3 · Connector invocations table

CREATE TABLE IF NOT EXISTS connector_invocations (
  connector_invocation_id UUID PRIMARY KEY,
  command_id UUID NOT NULL,
  connector_name TEXT NOT NULL,
  operation TEXT NOT NULL,
  side_effect BOOLEAN NOT NULL DEFAULT false,
  idempotency_key TEXT NULL,
  status TEXT NOT NULL,
  request_payload JSONB NOT NULL DEFAULT '{}',
  response_payload JSONB NULL,
  error TEXT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at TIMESTAMPTZ NULL
);

This is a domain / audit record — not a queue.

Section 88 Postgres schema changes

88.1 · Keep these Concord domain tables

commands
command_events
approvals
memory_records
memory_candidates
artifacts
notifications
connector_invocations
agent_runs
swarm_runs
agent_invocations
agent_steps
domain_audit_log

88.2 · Remove these runtime tables

task_queue
task_leases
worker_claims
generic_effect_outbox
retry_queue
scheduler_jobs
manual_lock_table

Unless a table has product / domain meaning, DBOS should own its runtime equivalent.

88.3 · Updated commands table

CREATE TABLE IF NOT EXISTS commands (
  command_id UUID PRIMARY KEY,
  command_type TEXT NOT NULL,
  requested_by TEXT NOT NULL,
  ingress TEXT NOT NULL,

  payload JSONB NOT NULL DEFAULT '{}',
  context JSONB NOT NULL DEFAULT '{}',

  status TEXT NOT NULL,
  idempotency_key TEXT NULL UNIQUE,

  dbos_workflow_id TEXT NULL UNIQUE,

  result JSONB NULL,
  error TEXT NULL,

  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at TIMESTAMPTZ NULL
);

dbos_workflow_id links the Concord command to the DBOS execution.

88.4 · command_events

CREATE TABLE IF NOT EXISTS command_events (
  command_event_id UUID PRIMARY KEY,
  command_id UUID NOT NULL REFERENCES commands(command_id),

  event_type TEXT NOT NULL,
  event_payload JSONB NOT NULL DEFAULT '{}',

  actor TEXT NOT NULL,
  trace_id TEXT NOT NULL,

  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

These are domain events; the table does not replace DBOS runtime history.

88.5 · domain_effects (optional)

Add this table only if product visibility into planned effects is useful.

CREATE TABLE IF NOT EXISTS domain_effects (
  domain_effect_id UUID PRIMARY KEY,
  command_id UUID NOT NULL REFERENCES commands(command_id),

  effect_type TEXT NOT NULL,
  effect_payload JSONB NOT NULL DEFAULT '{}',
  idempotency_key TEXT NULL,

  status TEXT NOT NULL DEFAULT 'planned',

  executed_by_dbos_workflow_id TEXT NULL,
  result JSONB NULL,
  error TEXT NULL,

  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at TIMESTAMPTZ NULL
);
Do not

Use domain_effects as a runtime queue. DBOS owns runtime.

Section 89 Concord state after DBOS

Concord state is a domain projection, not a runtime scheduler.

DBOS

Execution status

Is the workflow executing, completed, errored, cancelled?

Concord

Business state

Is the booking waiting for approval, confirmed, cancelled, expired?

Examples of Concord state: command.status, approval.status, artifact.status, agent_run.status, swarm_run.status, reservation.status, memory.status.

Section 90 Deterministic workflow design rules

Because DBOS workflows must be deterministic, Concord enforces these rules.

Allowed

Inside workflow body

  • branch on workflow input
  • call DBOS steps
  • call DBOS transaction steps
  • call DBOS sleep
  • enqueue DBOS workflows
  • wait for DBOS workflow handles
  • call pure Concord functions with deterministic inputs
Forbidden

Directly in workflow body

  • database reads/writes outside transaction steps
  • HTTP / API calls
  • LLM calls
  • random number generation
  • current local time
  • non-deterministic iteration over unordered data
  • uncontrolled async races
  • agent runtime loops without step boundaries
  • connector calls

Put forbidden operations inside DBOS steps.

Section 91 Functional core design rules

91.1 · Pure functions do not import DBOS

# Good
def classify_command(command: dict, context: dict) -> CoreResult:
    ...

# Avoid
from dbos import DBOS

def classify_command(...):
    DBOS.logger.info(...)

91.2 · Core functions return values only

# Good
return CoreResult(
    status="waiting_for_approval",
    events=[...],
    effects=[...],
)

# Avoid
insert_approval(...)
send_email(...)
enqueue_worker(...)

91.3 · DBOS steps interpret core results

@DBOS.step()
def interpret_core_effect(command_id: str, effect: dict) -> dict:
    ...
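The body of such an interpreter is essentially a dispatch from `effect_type` to a shell handler. A minimal sketch with plain functions standing in for DBOS steps so the dispatch logic can be tested alone; `make_interpreter` and the handler signature are hypothetical. Unknown effect types are an error, never silently skipped.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class CoreEffect:  # same shape as §81.1, repeated so the sketch runs alone
    effect_type: str
    payload: dict[str, Any]
    idempotency_key: Optional[str] = None


Handler = Callable[[str, CoreEffect], dict]


def make_interpreter(handlers: dict[str, Handler]) -> Callable[[str, CoreEffect], dict]:
    """Build a dispatcher from effect_type to a shell handler.

    In the DBOS shell each handler would be, or call, a DBOS step or transaction.
    """
    def interpret(command_id: str, effect: CoreEffect) -> dict:
        handler = handlers.get(effect.effect_type)
        if handler is None:
            raise KeyError(f"no handler for effect type {effect.effect_type!r}")
        return handler(command_id, effect)
    return interpret


def fake_create_approval(command_id: str, effect: CoreEffect) -> dict:
    return {"approval_created_for": command_id, "key": effect.idempotency_key}


interpret = make_interpreter({"approval.request": fake_create_approval})
print(interpret("cmd_1", CoreEffect("approval.request", {}, idempotency_key="approval:cmd_1")))
```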
Section 92 Minimal DBOS workflow patterns

92.1 · Submit command

from dbos import DBOS, SetWorkflowID


def submit_command(command_type: str, payload: dict, context: dict) -> str:
    command_id = create_command_tx(command_type, payload, context)

    with SetWorkflowID(f"command:{command_id}"):
        DBOS.start_workflow(run_command, command_id)

    return command_id

92.2 · Run command

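@DBOS.workflow()
def run_command(command_id: str) -> dict:
    # Every *_step call is a DBOS step: its output is checkpointed, so on
    # recovery the workflow replays to this point without re-running it.
    classification = classify_command_step(command_id)
    persist_core_events_step(command_id, classification["events"])

    policy = evaluate_policy_step(command_id)
    persist_core_events_step(command_id, policy["events"])

    if policy["status"] == "waiting_for_approval":
        create_approval_step(command_id, policy["effects"])
        notify_approver_step(command_id)
        # Assumes the wait returns the approval type's decision value
        # (approved | rejected | request_changes) or "expired".
        decision = wait_for_approval_workflow(command_id)
        if decision != "approved":
            result = {"status": decision}
            finalize_command_step(command_id, result)
            return result

    plan = create_plan_step(command_id)
    persist_core_events_step(command_id, plan["events"])

    result = execute_plan_workflow(command_id, plan)

    finalize_command_step(command_id, result)
    return result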

92.3 · Execute connector

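@DBOS.step()
def execute_connector_step(command_id: str, connector_name: str, operation: str, payload: dict) -> dict:
    record_connector_invocation_started_tx(command_id, connector_name, operation, payload)

    connector = connector_registry.get(connector_name)
    # Steps run at least once: if the process crashes or the step retries
    # after invoke() succeeds but before the step's output is checkpointed,
    # invoke() runs again. The connector must dedupe on an idempotency key
    # (see open design decision 5).
    result = connector.invoke(operation, payload)

    record_connector_invocation_completed_tx(command_id, connector_name, operation, result)
    return result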
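Connector-side deduplication is what makes that re-run safe. `FakeHotelConnector` is a hypothetical test double. It assumes an extra `idempotency_key` argument that the `invoke(operation, payload)` call above does not take, mirroring the `book(draft_id, idempotency_key)` call in the Section 93 diagram. A retry with the same key returns the first result instead of booking a second time.

```python
from typing import Any


class FakeHotelConnector:
    """Hypothetical connector that honors idempotency keys (test double only)."""

    def __init__(self) -> None:
        self.results_by_key: dict[str, dict[str, Any]] = {}
        self.bookings_made = 0

    def invoke(self, operation: str, payload: dict, idempotency_key: str) -> dict:
        # Replay of a known key: return the stored result, no second booking.
        if idempotency_key in self.results_by_key:
            return self.results_by_key[idempotency_key]
        self.bookings_made += 1
        result = {"operation": operation, "confirmation": f"CONF-{self.bookings_made}"}
        self.results_by_key[idempotency_key] = result
        return result
```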

92.4 · Write artifact

@DBOS.step()
def write_artifact_step(command_id, artifact_type, payload):
    artifact = create_artifact_tx(
        command_id=command_id,
        artifact_type=artifact_type,
        payload=payload,
    )
    append_command_event_tx(
        command_id=command_id,
        event_type="artifact_created",
        payload={"artifact_id": artifact["artifact_id"]},
    )
    return artifact
Section 93 Example · hotel reservation under DBOS
End-to-end · search + book
sequenceDiagram
    autonumber
    participant U as User
    participant C as Concord command
    participant W as DBOS workflow
    participant POL as Policy
    participant S as DBOS step
    participant CON as Hotel connector
    participant DB as Postgres
    U->>C: hotel_reservation_assist
    C->>W: start run_command
    W->>POL: classify + evaluate
    POL-->>W: route_agent (or sync)
    W->>S: search step
    S->>CON: search hotels
    CON-->>S: results
    S->>DB: write search_results artifact
    W->>U: present ranked options
    U->>C: book_hotel(draft_id)
    W->>POL: evaluate booking policy
    POL-->>W: require_approval (price + external API)
    W->>DB: create approval row
    W->>U: approval needed notification
    U-->>DB: approve
    W->>S: book step
    S->>CON: book(draft_id, idempotency_key)
    CON-->>S: confirmation
    S->>DB: write reservation artifact + audit
    W-->>U: confirmation notification

93.1 · Why DBOS matters here

Hotel booking has external API calls, payment-sensitive side effects, approval waits, idempotency needs, retryable connector failures, durable state requirements, notification side effects, and memory candidates. DBOS handles durable execution. Concord handles what the booking means.

Section 94 Agent swarms under DBOS

94.1 · Recommended representation

Concord domain object | DBOS implementation
SwarmRun | run_swarm DBOS workflow
AgentRun | run_agent DBOS workflow
AgentStep | domain trace row in Postgres
ToolCall | Concord command executed by DBOS

94.2 · Swarm flow

Swarm under DBOS
flowchart TB
    CMD[command: run_hotel_reservation_swarm] --> W[run_swarm DBOS workflow]
    W --> COORD[create coordinator AgentRun]
    COORD --> PLAN[plan child agents]
    PLAN --> E1[enqueue child run_agent · researcher]
    PLAN --> E2[enqueue child run_agent · analyst]
    PLAN --> E3[enqueue child run_agent · reviewer]
    E1 --> J[wait for results]
    E2 --> J
    E3 --> J
    J --> JOIN[join results]
    JOIN --> ART[recommendation artifact]
    ART --> APP{"external booking?"}
    APP -->|yes| HA[approval]
    APP -->|no| DONE[done]
    HA --> DONE
    style COORD fill:#EFE6F0,stroke:#7A5560
    style E1 fill:#EFE6F0,stroke:#7A5560
    style E2 fill:#EFE6F0,stroke:#7A5560
    style E3 fill:#EFE6F0,stroke:#7A5560
    style APP fill:#F5E0D2,stroke:#D97757
    style DONE fill:#F1F2EC,stroke:#6B7B5A

94.3 · Subagent spawning rule

Agent proposes spawn_subagent
→ Concord command
→ policy checks max depth, tools, connectors, memory scope
→ DBOS workflow creates AgentInvocation
→ DBOS enqueues child AgentRun workflow
Rule

Do not spawn unmanaged agent processes.
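The policy check in the spawning rule can be sketched as a pure function that returns a list of violations. Two parts of this are assumptions: the `AgentScope` shape, and the rule that a child may hold only a subset of its parent's tools, connectors, and memory scopes (no privilege escalation). `check_spawn_subagent` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentScope:
    # Hypothetical shape of the scope granted to an AgentRun.
    depth: int
    tools: frozenset[str]
    connectors: frozenset[str]
    memory_scopes: frozenset[str]


def check_spawn_subagent(parent: AgentScope, requested: AgentScope, max_depth: int) -> list[str]:
    """Return policy violations for a spawn_subagent proposal (empty = allowed)."""
    violations = []
    if requested.depth != parent.depth + 1:
        violations.append("child depth must be parent depth + 1")
    if requested.depth > max_depth:
        violations.append(f"depth {requested.depth} exceeds max_depth {max_depth}")
    # A child may only narrow, never widen, what its parent can do.
    for name in ("tools", "connectors", "memory_scopes"):
        extra = getattr(requested, name) - getattr(parent, name)
        if extra:
            violations.append(f"{name} not held by parent: {sorted(extra)}")
    return violations
```

Returning every violation rather than stopping at the first lets the domain audit record exactly why a spawn was refused.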

Section 95 Approval waiting pattern

A human approval is durable state plus workflow waiting. Recommended first implementation:

Workflow creates approval row
Workflow durably sleeps / polls approval status
Approval UI updates approval row
Workflow resumes and continues

This is simple, Postgres-native, and consistent with the architecture. If polling latency or load becomes a problem, it can later evolve into an event-driven pattern built on DBOS workflow messaging (DBOS.send / DBOS.recv).
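The poll loop can be sketched with its I/O injected so it can be tested in memory. In a DBOS workflow, `load_status` would be a transaction step, `sleep` would be the durable `DBOS.sleep`, and `now` would be read inside a step. That wiring is assumed here, not prescribed. The expiry mirrors the approval type's `expiration.default_minutes`, and the terminal statuses are its `decision_values`.

```python
from datetime import datetime
from typing import Callable

TERMINAL_DECISIONS = {"approved", "rejected", "request_changes"}


def wait_for_approval(
    load_status: Callable[[], str],
    sleep: Callable[[float], None],
    now: Callable[[], datetime],
    expires_at: datetime,
    poll_seconds: float = 30.0,
) -> str:
    """Poll an approval row until it reaches a decision or expires."""
    while True:
        status = load_status()
        if status in TERMINAL_DECISIONS:
            return status
        if now() >= expires_at:
            return "expired"
        sleep(poll_seconds)  # durable sleep in the real workflow
```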

Section 96 What to avoid

96.1 · Avoid building a second DBOS

custom workflow runner
custom worker pool
custom durable timers
custom queue leases
custom retry scheduler
custom backfill scheduler
custom workflow recovery

96.2 · Avoid hiding DBOS behind too much abstraction

Engineers should still be able to see: this is a DBOS workflow, this is a DBOS step, this is a DBOS transaction, this is a DBOS queue, this is a Concord command, this is a Concord artifact.

96.3 · Avoid making agents privileged

Bad
  • agent directly calls booking connector
  • agent directly writes memory
  • agent directly sends email
  • agent directly spawns process
Good
  • agent proposes command
  • Concord policy evaluates
  • DBOS executes through durable steps
  • Postgres records domain events
Section 97 Responsibilities matrix
Concern | Owner
Workflow durability | DBOS
Workflow recovery | DBOS
Step replay behavior | DBOS
Queue mechanics | DBOS
Queue concurrency / rate limit | DBOS
Scheduling | DBOS
Durable sleep | DBOS
Postgres transaction tracking | DBOS datasource
Command taxonomy | Concord
Domain state | Concord
Policy decisions | Concord
Approval semantics | Concord
Memory semantics | Concord
Artifact semantics | Concord
Agent / swarm ontology | Concord
Connector contracts | Concord
Domain audit | Concord
Third-party API implementation | Connector adapters
Agent reasoning loop | Agent runtime adapter
Section 98 Implementation phases
Phases
flowchart LR
    P1["Phase 1<br/>Functional core"] --> P2["Phase 2<br/>Postgres domain schema"]
    P2 --> P3["Phase 3<br/>DBOS runtime shell"]
    P3 --> P4["Phase 4<br/>Hotel reservation agent"]
    P4 --> P5["Phase 5<br/>Swarms & subagents"]
    style P1 fill:#F5E0D2,stroke:#D97757
    style P3 fill:#FAF8F2,stroke:#141413
    style P4 fill:#F1F2EC,stroke:#6B7B5A
    style P5 fill:#EFE6F0,stroke:#7A5560
Phase 1

Functional core

Pure modules: commands, classification, policy, planning, state transitions, effect descriptors, domain event descriptors. No DBOS imports.

Phase 2

Postgres domain schema

commands, command_events, approvals, artifacts, memory_records / memory_candidates, connector_invocations, agent_runs, swarm_runs, agent_invocations, agent_steps, domain_audit_log.

Phase 3

DBOS runtime shell

run_command, run_agent, run_swarm workflows; execute_connector_step, write_artifact_step, write_memory_step, request_approval_step, notify_step.

Phase 4

Hotel reservation agent

Hotel inventory / booking connectors; search, ranking, booking draft / approval / finalization commands; reservation artifact; travel preference memory.

Phase 5

Swarms & subagents

Swarm planning, spawn_subagent command, agent run workflow, agent step recording, join strategies, scope enforcement.

Section 99 Open design decisions

Keep these explicit:

  1. How approval wait/resume is implemented in DBOS.
  2. Whether to use a domain_effects table for product visibility.
  3. How much DBOS metadata to mirror into Concord domain tables.
  4. How agent runtime adapters report intermediate steps.
  5. How connector idempotency keys are generated and enforced.
  6. How memory consent is represented in the UI.
  7. How cancellation maps from Concord domain state to DBOS workflow cancellation.
  8. How to version command payload schemas and plan schemas.
  9. How to expose workflow status to users without leaking DBOS internals.
  10. How to manage DBOS queues across environments.
Section 100 Updated one-sentence architecture
Concord is a functional semantic layer that turns commands into governed domain decisions, plans, and effects; DBOS is the durable Postgres-backed runtime that executes those effects through workflows, steps, queues, schedules, and transactions.
Section 101 Updated rule of thumb
Concord

When asking …

  • What does this task mean?
  • Is it allowed?
  • Does it need approval?
  • Sync, agentic, or swarm?
  • What memory / artifacts / audit should exist?
  • What connector scopes are allowed?
DBOS

When asking …

  • How does this run durably?
  • How is it retried?
  • How is it queued?
  • How is it scheduled?
  • How does it recover?
  • How do transactions avoid re-execution?
  • How do workers execute it?
Section 102 Source notes
Section 103 Executive summary · Domain Registry

Concord needs a first-class Domain Registry — the capability control plane that is the source of truth for every governed capability contract in the system.

The registry answers what capabilities exist, which versions are active, who owns them, what they allow, what policies apply, what connectors they touch, what memory they read and write, what artifacts they produce, which agents can use them, which workflows can invoke them, which approvals are required, and which DBOS workflows execute them.

skills
tools
connectors
command_types
policies
artifact_schemas
memory_schemas
agent_roles
swarm_templates
approval_types
evaluation_suites
workflow_types
Architectural shift

A "Skill Registry" is not a standalone module. It is one domain registry inside the larger Concord Domain Registry.

The core principle holds: capabilities can describe, propose, and guide. Only Concord commands can authorize. Only DBOS workflows and steps can execute. Only Postgres records the truth.

Section 104 Problem statement

Concord is evolving from a workflow primitive layer into a governed action contract system. The need first appears as a skill registry problem — which agent skills exist, which versions are published, which agents can use them, what tools/connectors/memory/artifacts they touch.

But skills are only one part of the capability graph. A skill often depends on many other registered things:

hotel_booking_skill
  uses tool: book_hotel
    creates command_type: book_hotel
      requires policy_bundle: hotel_booking_policy
      requires approval_type: hotel_booking_approval
      produces artifact_schema: reservation_confirmation
      invokes connector_operation: hotel_booking.book_hotel
      may write memory_schema: user.travel_preferences

Without a unified Domain Registry these relationships scatter across YAML, code, prompts, policy files, connector adapters, and agent runtime configuration — making it impossible to answer impact questions like "if we retire this connector operation, which tools break?" or "if we change this artifact schema, which commands produce it?".

Without a registry

Concord cannot fully deliver its core promise — a governed contract layer for deterministic and agentic work.

Section 105 Product decision

105.1 · New core module

Concord adds a first-class module concord.registry (or concord.domain_registry) alongside the existing core, runtime, agents, connectors, and Postgres modules.

concord/
  core/
    commands/  policies/  plans/  effects/  events/  state/
  runtime/
    dbos/  workflows/  steps/  queues/  schedules/
  registry/                          ← new module
    kernel/
    skills/  tools/  connectors/  command_types/  policies/
    artifacts/  memory/  agents/  swarms/  approvals/
    evaluations/  workflows/
  agents/
    runtime_adapters/  tool_gateway/  swarms/  subagents/
  connectors/
    base/  adapters/
  postgres/
    repositories/  migrations/  projections/
  api/
    admin/  runtime/  webhooks/

105.2 · Keep registries in Concord initially

The semantics of skills, tools, connectors, memory, artifacts, policy, approvals, agents, swarms, and command types are core to Concord's contract layer. Extracting a separate framework too early risks producing a generic catalog that does not understand Concord primitives.

105.3 · Future extraction path

The generic mechanics may later become a reusable library concord-registry covering versioned records, lifecycle state machine, artifact references, immutable version checks, generic binding resolution, compatibility constraints, audit event helpers, deprecation, retirement, rollback. Concord retains the domain semantics — what a skill, tool, connector operation, command type, approval type, memory schema, or agent role means.

Section 106 Design philosophy

106.1 · The Domain Registry is the capability graph

The registry models Concord's capability graph and answers impact questions across it.

Capability graph shape
flowchart TB
    AR[AgentRole] --> SK[Skill]
    SK --> TO[Tool]
    TO --> CT[CommandType]
    CT --> PB[PolicyBundle]
    CT --> AT[ApprovalType]
    CT --> AS[ArtifactSchema]
    CT --> MS[MemorySchema]
    CT --> DW[DBOS Workflow]
    TO --> CO[ConnectorOperation]
    CO --> CN[Connector]
    style CT fill:#F5E0D2,stroke:#D97757,stroke-width:1.5px
    style AR fill:#EFE6F0,stroke:#7A5560
    style DW fill:#FFFFFF,stroke:#141413
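Impact questions such as "if we retire this connector operation, which tools break?" reduce to a reverse traversal of this graph. The sketch below simplifies relationships to `(source, relationship_type, target)` tuples with a hypothetical `"type:name"` node naming. The real `registry_relationships` table links a source version to a target entity name plus a version constraint. `impacted_by` is a hypothetical name.

```python
from collections import defaultdict, deque


def impacted_by(relationships: list[tuple[str, str, str]], target: str) -> set[str]:
    """Return every node that directly or transitively depends on `target`."""
    # Invert the edges: for each target, who points at it?
    dependents: dict[str, set[str]] = defaultdict(set)
    for source, _relationship_type, dest in relationships:
        dependents[dest].add(source)

    seen: set[str] = set()
    queue = deque([target])
    while queue:  # breadth-first walk up the dependency edges
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

For example, retiring `hotel_booking.book_hotel` flags the `book_hotel` tool directly and the `hotel_booking` skill transitively.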

106.2 · Registries are semantic, not runtime execution

The registry defines what capabilities mean. DBOS defines how workflows execute durably. As a registry entry, book_hotel declares that it is high-risk, requires approval, produces reservation_confirmation, invokes hotel_booking.book_hotel, and uses the idempotency key template book_hotel:{booking_draft_id}. DBOS, as the runtime, executes the durable workflow with retries, recovery, and transaction tracking.

106.3 · Published versions are immutable

Rule

Every registry object that can influence behavior must be versioned. Published versions are immutable. Allowed: create / publish / deprecate / retire a version, change a binding to point to a new version, rollback a binding. Not allowed: mutate a published manifest in place, expand connector permissions in place, remove an approval requirement in place, change a schema in place.
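Immutability can be enforced at two points: a checksum over the manifest, and a guard on edits. This is a minimal sketch. The canonical-JSON SHA-256 scheme is an assumption (the `checksum` column does not specify one), and so is the choice to allow edits only while a version is still `draft`. `manifest_checksum`, `update_manifest`, and `ImmutableVersionError` are hypothetical names.

```python
import hashlib
import json


class ImmutableVersionError(Exception):
    """Raised when code tries to edit a version that is no longer a draft."""


def manifest_checksum(manifest: dict) -> str:
    """SHA-256 over canonical JSON, so key order does not change the checksum."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def update_manifest(version: dict, new_manifest: dict) -> dict:
    """Return an edited copy of a draft version; refuse anything past draft.

    Published versions change only by creating a new version and rebinding.
    """
    if version["status"] != "draft":
        raise ImmutableVersionError(
            f"{version['version']} is {version['status']}; create a new version instead"
        )
    return {**version, "manifest": new_manifest, "checksum": manifest_checksum(new_manifest)}
```

Recomputing the checksum at resolution time and comparing it with the stored value also detects out-of-band edits to a published row.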

106.4 · Specific bindings beat global bindings

Resolution applies scope-aware precedence — more specific bindings override less specific ones.

user → team → tenant → workflow_type → command_type
     → agent_role → agent_name → app → environment → global
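Precedence resolution can be sketched as choosing the matching active binding with the lowest rank in this list. Two assumptions apply. A binding matches when its `binding_target` equals the context value for its scope, and `global` always matches. The field names mirror `registry_bindings`, but `resolve_binding` itself is hypothetical. Environment filtering and tie-breaking between duplicate bindings at the same scope are left out.

```python
from typing import Optional

# Most specific first, in the order listed above.
PRECEDENCE = [
    "user", "team", "tenant", "workflow_type", "command_type",
    "agent_role", "agent_name", "app", "environment", "global",
]
RANK = {scope: i for i, scope in enumerate(PRECEDENCE)}


def resolve_binding(bindings: list[dict], context: dict) -> Optional[dict]:
    """Pick the active binding with the most specific matching scope."""
    candidates = [
        b for b in bindings
        if b["status"] == "active"
        and b["binding_scope"] in RANK
        and (b["binding_scope"] == "global"
             or context.get(b["binding_scope"]) == b["binding_target"])
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda b: RANK[b["binding_scope"]])
```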

106.5 · Runtime manifests are minimized

Agent runtimes, workflows, and tool gateways receive only the capabilities they are allowed to use. Draft versions, retired versions, other tenants' bindings, admin metadata, raw secrets, unbound skills, unallowed connectors, unallowed memory scopes — none of these leak into runtime manifests.
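A minimal sketch of the minimization filter follows. It assumes skill entries are dicts with a `status`, an optional `tenant_id`, and a `tools` list. The `PRIVATE_KEYS` names are hypothetical, and so is `minimize_manifest`. It covers only some of the exclusions listed above: unresolvable statuses, other tenants' entries, unallowed tools, and admin-only keys. Deprecated versions still resolve here, consistent with tracking deprecated version usage.

```python
from typing import Any

# Deprecated versions still resolve (and are tracked); drafts, retired,
# and rejected versions never do.
RESOLVABLE_STATUSES = {"published", "deprecated"}
# Hypothetical keys for data that must never reach a runtime manifest.
PRIVATE_KEYS = {"status", "tenant_id", "admin_metadata", "secrets"}


def minimize_manifest(
    skills: list[dict[str, Any]], allowed_tools: set[str], tenant_id: str
) -> list[dict[str, Any]]:
    minimized = []
    for skill in skills:
        if skill["status"] not in RESOLVABLE_STATUSES:
            continue
        if skill.get("tenant_id") not in (None, tenant_id):
            continue  # another tenant's entry
        public = {k: v for k, v in skill.items() if k not in PRIVATE_KEYS}
        public["tools"] = [t for t in skill.get("tools", []) if t["name"] in allowed_tools]
        minimized.append(public)
    return minimized
```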

Section 107 Registry types

Twelve registry types, each a typed view over the registry kernel.

Type 01

Skill

Reusable governed capability package: instructions, allowed tools / connectors / memory scopes / artifact scopes, evals, approval requirements, agent/runtime compatibility.

Type 02

Tool

Executable internal capability: input/output schema, sync/async mode, side-effect classification, DBOS execution mode, idempotency, policy + artifact + command + connector mappings.

Type 03

Connector

External system + operations: credential scopes, read/write scopes, rate limits, idempotency support, compensation behavior, approval requirements, adapter compatibility.

Type 04

Command type

Canonical contract for a governed action: payload schema, default policy bundle, execution mode, approval behavior, artifacts, memory behavior, connector operations, DBOS workflow mapping, idempotency, risk level. Likely the most important registry.

Type 05

Policy

Named policy checks and bundles: policy input contract, decision outputs, risk classifications, applicability rules, bundles, external policy-engine references, versioning. The registry decides where and how policy applies; it does not have to implement every policy engine.

Type 06

Artifact schema

Contract for durable outputs: schema, visibility rules, lineage requirements, retention policy, external-sharing policy, version compatibility, rendering hints.

Type 07

Memory schema

Contract for durable preferences and reusable facts: subject type, visibility, consent requirements, confidence requirements, expiration, supersession rules, retrieval rules, write policy.

Type 08

Agent role

Capability boundary for an agent: allowed skills / tools / connectors / subagent roles, max steps, max depth, memory + artifact scope, approval behavior, runtime compatibility.

Type 09

Swarm template

Governed multi-agent execution pattern: coordinator role, child roles, join strategy, spawn limits, parallelism, required artifacts and evaluations, approval gates, memory + connector scope inheritance.

Type 10

Approval type

Contract for human authorization: required fields, approver resolution, expiration, risk level, UI schema, audit requirements, resume behavior, allowed decision values.

Type 11

Evaluation suite

Contract for quality and safety checks: eval input/output schema, blocking vs advisory, applicability, version, runtime adapter.

Type 12

Workflow type

Contract for a repeatable business / application workflow: default command sequence, allowed command types, default policies, default agent roles, default skills, approval gates, artifact expectations, DBOS workflow mapping.

107.1 · Skill manifest example

name: hotel_booking
version: 1.4.2
display_name: Hotel Booking Skill
runtime:
  min_concord_version: "0.3.0"
  compatible_agent_runtimes: [concord_agent_runtime, langgraph_adapter]
lifecycle:
  owner: travel-platform
  risk_level: high
capabilities:
  - type: connector_operation
    connector: hotel_booking
    operation: book_hotel
    side_effect: true
    requires_approval: true
  - type: artifact_write
    artifact_type: reservation_confirmation
  - type: memory_write
    memory_type: user.travel_preferences
    requires_consent: true
policy:
  required_checks: [permission, connector_scope, payment_token_required,
                    cancellation_policy_disclosed, approval_required_for_booking]
tools:
  - name: book_hotel
    command_type: book_hotel
    mode: approval_gated_async
evals:
  required: [booking_terms_present, cancellation_policy_present, approval_packet_complete]

107.2 · Command type manifest example

name: book_hotel
version: 1.0.0
risk_level: high
execution:
  mode: approval_gated_async
  dbos_workflow: run_command
payload_schema:
  type: object
  required: [booking_draft_id, payment_token_ref]
policies:
  required: [permission, connector_scope, payment_token_required, approval_required_for_booking]
approval:
  required: true
  approval_type: hotel_booking_approval
artifacts:
  produces: [reservation_confirmation]
connector_operations: [hotel_booking.book_hotel]
idempotency:
  required: true
  key_template: "book_hotel:{booking_draft_id}"
compensation:
  supported: true
  command_type: cancel_reservation
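The `key_template` can be rendered with Python's standard `string.Formatter`. This is a sketch under the assumption that placeholders are plain payload field names. `render_idempotency_key` is a hypothetical name. It fails on a missing field rather than emitting a key that could collide across different bookings.

```python
import string


def render_idempotency_key(key_template: str, payload: dict) -> str:
    """Fill a template such as "book_hotel:{booking_draft_id}" from the payload."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion).
    fields = [name for _, name, _, _ in string.Formatter().parse(key_template) if name]
    missing = [name for name in fields if name not in payload]
    if missing:
        raise KeyError(f"payload missing idempotency fields: {missing}")
    return key_template.format(**{name: payload[name] for name in fields})
```

Because the key depends only on the payload, every retry of the same booking draft produces the same key.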

107.3 · Approval type manifest example

name: hotel_booking_approval
version: 1.0.0
risk_level: high
required_fields:
  - hotel_name
  - check_in_date
  - check_out_date
  - guests
  - total_price
  - currency
  - cancellation_policy_summary
  - payment_method_summary
decision_values: [approved, rejected, request_changes]
expiration:
  default_minutes: 30
ui:
  template: hotel_booking_approval_card
resume:
  on_approved: continue_workflow
  on_rejected: fail_or_replan
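An approval type manifest like this can drive validation directly. The sketch below reads `required_fields`, `expiration.default_minutes`, and `decision_values` from a dict shaped like the YAML above. Treating a decision recorded after expiry as `"expired"` is an assumption, as are the names `build_approval_packet` and `resolve_decision`.

```python
from datetime import datetime, timedelta


def build_approval_packet(approval_type: dict, fields: dict, created_at: datetime) -> dict:
    """Validate fields against an approval type manifest and stamp its expiry."""
    missing = [f for f in approval_type["required_fields"] if f not in fields]
    if missing:
        raise ValueError(f"approval packet missing required fields: {missing}")
    ttl = timedelta(minutes=approval_type["expiration"]["default_minutes"])
    return {
        "approval_type": approval_type["name"],
        "approval_type_version": approval_type["version"],
        "fields": dict(fields),
        "expires_at": created_at + ttl,
    }


def resolve_decision(approval_type: dict, packet: dict, decision: str, decided_at: datetime) -> str:
    """Accept only declared decision values; late decisions count as expired."""
    if decision not in approval_type["decision_values"]:
        raise ValueError(f"unknown decision value: {decision!r}")
    if decided_at >= packet["expires_at"]:
        return "expired"
    return decision
```

The workflow can then map the resolved decision through the manifest's `resume` block, for example `on_approved` to `continue_workflow`.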
Section 108 Registry kernel

The Domain Registry is built on a shared kernel that provides generic mechanics. Each typed registry is a thin view over the kernel.

108.1 · Core kernel objects

Object | Carries
RegistryEntity | entity_id, entity_type, name, display_name, description, owner, domain, status
RegistryVersion | version_id, entity_id, version (major.minor.patch), status, manifest, checksum, created_by, approved_by, lifecycle timestamps
RegistryArtifact | artifact_id, version_id, artifact_type, artifact_uri, checksum, metadata
RegistryBinding | binding_id, version_id, binding_scope, binding_target, environment, status, rollout_strategy, rollout_config
RegistryLifecycleEvent | lifecycle_event_id, entity_id, version_id, event_type, actor, payload, trace_id
RegistryRelationship | source_version_id, relationship_type, target_entity_type, target_entity_name, target_version_constraint, metadata

108.2 · Relationship types

uses_tool
creates_command_type
requires_policy
requires_approval_type
produces_artifact_schema
reads_memory_schema
writes_memory_schema
invokes_connector_operation
allows_agent_role
includes_agent_role
requires_evaluation_suite
uses_workflow_type
Section 109 Data model

Hybrid schema — generic kernel tables, with optional typed domain tables (skill_capabilities, tool_contracts, connector_operations, etc.) when query patterns warrant. This avoids per-type lifecycle duplication while still allowing domain-specific validation.

109.1 · registry_entities

CREATE TABLE IF NOT EXISTS registry_entities (
  entity_id UUID PRIMARY KEY,
  entity_type TEXT NOT NULL,       -- 'skill', 'tool', 'connector', 'command_type',
                                   -- 'policy', 'artifact_schema', 'memory_schema',
                                   -- 'agent_role', 'swarm_template', 'approval_type',
                                   -- 'evaluation_suite', 'workflow_type'
  name TEXT NOT NULL,
  display_name TEXT NOT NULL,
  description TEXT NULL,
  owner TEXT NOT NULL,
  domain TEXT NULL,
  status TEXT NOT NULL DEFAULT 'active',
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE(entity_type, name)
);

109.2 · registry_versions

CREATE TABLE IF NOT EXISTS registry_versions (
  version_id UUID PRIMARY KEY,
  entity_id UUID NOT NULL REFERENCES registry_entities(entity_id),
  version TEXT NOT NULL,
  version_major INT NOT NULL,
  version_minor INT NOT NULL,
  version_patch INT NOT NULL,
  status TEXT NOT NULL DEFAULT 'draft',  -- draft | validated | pending_approval |
                                         -- approved | published | deprecated |
                                         -- retired | rejected
  risk_level TEXT NOT NULL DEFAULT 'low',
  manifest JSONB NOT NULL DEFAULT '{}',
  checksum TEXT NOT NULL,
  created_by TEXT NOT NULL,
  approved_by TEXT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  validated_at TIMESTAMPTZ NULL,
  approved_at TIMESTAMPTZ NULL,
  published_at TIMESTAMPTZ NULL,
  deprecated_at TIMESTAMPTZ NULL,
  retired_at TIMESTAMPTZ NULL,
  UNIQUE(entity_id, version)
);

109.3 · registry_artifacts · registry_bindings · registry_lifecycle_events

CREATE TABLE IF NOT EXISTS registry_artifacts (
  registry_artifact_id UUID PRIMARY KEY,
  version_id UUID NOT NULL REFERENCES registry_versions(version_id),
  artifact_type TEXT NOT NULL,     -- manifest_yaml | manifest_json | skill_md |
                                   -- tool_schema | connector_schema | policy_bundle |
                                   -- prompt_template | eval_suite | example_bundle |
                                   -- package_archive
  artifact_uri TEXT NOT NULL,
  checksum TEXT NOT NULL,
  metadata JSONB NOT NULL DEFAULT '{}',
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE IF NOT EXISTS registry_bindings (
  registry_binding_id UUID PRIMARY KEY,
  version_id UUID NOT NULL REFERENCES registry_versions(version_id),
  binding_scope TEXT NOT NULL,     -- global | environment | app | tenant | team |
                                   -- user | agent_role | agent_name | workflow_type |
                                   -- command_type | connector
  binding_target TEXT NOT NULL,
  environment TEXT NOT NULL DEFAULT 'prod',
  status TEXT NOT NULL DEFAULT 'active',
  rollout_strategy TEXT NOT NULL DEFAULT 'pinned',
  rollout_config JSONB NOT NULL DEFAULT '{}',
  created_by TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE IF NOT EXISTS registry_lifecycle_events (
  registry_lifecycle_event_id UUID PRIMARY KEY,
  entity_id UUID NOT NULL REFERENCES registry_entities(entity_id),
  version_id UUID NULL REFERENCES registry_versions(version_id),
  event_type TEXT NOT NULL,
  actor TEXT NOT NULL,
  payload JSONB NOT NULL DEFAULT '{}',
  trace_id TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

109.4 · registry_relationships · registry_evaluations · registry_usage

CREATE TABLE IF NOT EXISTS registry_relationships (
  registry_relationship_id UUID PRIMARY KEY,
  source_version_id UUID NOT NULL REFERENCES registry_versions(version_id),
  relationship_type TEXT NOT NULL,
  target_entity_type TEXT NOT NULL,
  target_entity_name TEXT NOT NULL,
  target_version_constraint TEXT NULL,
  required BOOLEAN NOT NULL DEFAULT true,
  metadata JSONB NOT NULL DEFAULT '{}',
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE IF NOT EXISTS registry_evaluations (
  registry_evaluation_id UUID PRIMARY KEY,
  version_id UUID NOT NULL REFERENCES registry_versions(version_id),
  evaluation_type TEXT NOT NULL,
  status TEXT NOT NULL,
  result JSONB NOT NULL DEFAULT '{}',
  error TEXT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at TIMESTAMPTZ NULL
);

CREATE TABLE IF NOT EXISTS registry_usage (
  registry_usage_id UUID PRIMARY KEY,
  version_id UUID NOT NULL REFERENCES registry_versions(version_id),
  agent_run_id UUID NULL,
  swarm_run_id UUID NULL,
  command_id UUID NULL,
  workflow_run_id UUID NULL,
  usage_type TEXT NOT NULL,        -- resolved_for_agent_run | tool_enabled |
                                   -- command_proposed | connector_operation_enabled |
                                   -- memory_read_enabled | memory_write_proposed |
                                   -- artifact_created | subagent_spawn_enabled |
                                   -- policy_applied | approval_type_used
  payload JSONB NOT NULL DEFAULT '{}',
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Section 110 Versioning

All behavior-affecting registry objects use semantic versioning (MAJOR.MINOR.PATCH).

Patch

Doc / cosmetic

Documentation improvements, examples added, prompt wording changes that preserve behavior, bug fixes that preserve contract.

Minor

Additive

New optional capability, new optional field, new optional eval, backward-compatible schema expansion.

Major

Breaking

Changed input or output contract, removed capability, expanded connector scope, changed memory scope, changed approval behavior, changed artifact schema incompatibly, changed side-effect semantics.
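The bump rules can be checked mechanically at publish time. This sketch assumes change kinds arrive as a set of labels; the label names paraphrase the lists above and are hypothetical. `parse_version` produces the integers stored in `version_major`, `version_minor`, and `version_patch`.

```python
BREAKING = {
    "input_contract_changed", "output_contract_changed", "capability_removed",
    "connector_scope_expanded", "memory_scope_changed", "approval_behavior_changed",
    "artifact_schema_incompatible", "side_effect_semantics_changed",
}
ADDITIVE = {
    "optional_capability_added", "optional_field_added",
    "optional_eval_added", "schema_expanded_compatibly",
}


def parse_version(version: str) -> tuple[int, int, int]:
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def required_bump(changes: set[str]) -> str:
    if changes & BREAKING:
        return "major"
    if changes & ADDITIVE:
        return "minor"
    return "patch"


def check_version_bump(previous: str, proposed: str, changes: set[str]) -> bool:
    """True if `proposed` is newer than `previous` by at least the required bump."""
    old, new = parse_version(previous), parse_version(proposed)
    if new <= old:
        return False
    bump = required_bump(changes)
    if bump == "major":
        return new[0] > old[0]
    if bump == "minor":
        return new[0] > old[0] or new[1] > old[1]
    return True
```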

Section 111 Lifecycle

111.1 · Standard lifecycle

create registry version
→ validate manifest
→ validate relationships
→ validate compatibility
→ run evals
→ classify risk
→ request approval if needed
→ publish
→ bind to scopes
→ monitor usage
→ deprecate
→ retire

111.2 · State machine

draft
validated
pending_approval
approved
published
deprecated
retired
rejected
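The states above do not list their allowed transitions. The table in this sketch is an assumption inferred from the standard lifecycle: low-risk versions go from validated straight to published, high-risk versions must pass through approval, and emergency retirement is a direct published to retired move. `TRANSITIONS` and `transition` are hypothetical names.

```python
# Assumed transition table, inferred from the lifecycle above; not normative.
TRANSITIONS: dict[str, set[str]] = {
    "draft": {"validated", "rejected"},
    "validated": {"pending_approval", "published", "rejected"},
    "pending_approval": {"approved", "rejected"},
    "approved": {"published"},
    "published": {"deprecated", "retired"},  # published -> retired: emergency path
    "deprecated": {"retired"},
    "retired": set(),
    "rejected": set(),
}


def transition(current: str, target: str, risk_level: str) -> str:
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    if current == "validated" and target == "published" and risk_level == "high":
        raise ValueError("high-risk versions need explicit approval before publication")
    return target
```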

111.3 · High-risk lifecycle

High-risk objects require explicit approval before publication. High-risk changes include external write added · memory write added · payment operation added · database mutation added · subagent spawning added · connector scope expanded · approval requirement removed · artifact schema changed incompatibly · policy weakened.

111.4 · Emergency retirement

Triggered by

Security vulnerability · unsafe connector behavior · approval bypass · prompt-injection vulnerability · data leakage · memory corruption · payment-related defect.

Emergency retirement blocks new resolution, disables or migrates active bindings, notifies owners, writes lifecycle events, and creates an incident audit.

Section 112 Bindings & resolution

112.1 · Binding strategies

pinned
latest_patch
latest_minor
canary
percentage
tenant_allowlist
environment_specific
manual
Production default

pinned. Avoid binding production to an unbounded latest version: publishing a new manifest would then silently change runtime behavior, which is exactly what immutable versions and explicit bindings are meant to prevent.
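Bounded strategies such as `latest_patch` and `latest_minor` can be sketched as a filtered max that never crosses a major (breaking) boundary. `resolve_strategy` is a hypothetical name, and only three strategies are covered. Versions are compared as integer tuples. With plain string comparison, `"1.5.0"` would wrongly sort above `"1.10.1"`, which is presumably why `registry_versions` stores the parts as integers.

```python
def resolve_strategy(bound_version: str, published_versions: list[str], strategy: str) -> str:
    """Return the version a binding resolves to under a rollout strategy."""

    def key(v: str) -> tuple[int, ...]:
        return tuple(int(part) for part in v.split("."))

    bound = key(bound_version)
    if strategy == "pinned":
        return bound_version
    if strategy == "latest_patch":  # same MAJOR.MINOR
        same_line = [v for v in published_versions if key(v)[:2] == bound[:2]]
    elif strategy == "latest_minor":  # same MAJOR
        same_line = [v for v in published_versions if key(v)[0] == bound[0]]
    else:
        raise ValueError(f"strategy not covered by this sketch: {strategy}")
    # The bound version is the floor; never resolve below it.
    return max(same_line + [bound_version], key=key)
```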

112.2 · Resolution inputs

Resolution accepts entity_type, scope, target, environment, agent_role, agent_name, user_id, team_id, tenant_id, app_id, workflow_type, command_type, and context. It applies the precedence order from §106.4.

112.3 · Resolution output

Examples: runtime skill manifest · allowed tool list · connector operation contract · command type contract · policy bundle · approval packet schema · artifact schema · memory access contract · agent role manifest · swarm template manifest.

112.4 · Runtime manifest for an AgentRun

{
  "agent_run_id": "agent_run_123",
  "agent_role": "hotel_booking_agent",
  "skills": [
    {
      "name": "hotel_booking",
      "version": "1.4.2",
      "skill_version_id": "uuid",
      "instructions_ref": "artifact://...",
      "tools": [
        { "name": "create_booking_draft", "command_type": "create_booking_draft", "mode": "async" },
        { "name": "book_hotel", "command_type": "book_hotel", "mode": "approval_gated_async" }
      ],
      "memory_scope": {
        "read":  ["user.travel_preferences"],
        "write": ["user.travel_preferences"]
      },
      "artifact_scope": {
        "write": ["booking_draft", "reservation_confirmation"]
      }
    }
  ]
}
Section 113 DBOS lifecycle workflows

Registry lifecycle runs as DBOS workflows — durable, recoverable, idempotent.

Workflow

publish_registry_version

load version → validate manifest → validate relationships → validate compatibility → run eval suite → classify risk → request approval if needed → publish → write lifecycle events → notify owner.

Workflow

bind_registry_version

load version → check status is published → validate binding scope → check compatibility → write binding → write lifecycle event → invalidate resolution cache.

Workflow

rollback_registry_binding

load current binding → load target prior version → validate target is usable → update binding → write lifecycle event → notify affected owners.

Workflow

retire_registry_version

load version → find active bindings → disable or migrate bindings → mark retired → write lifecycle event → notify owners.

Workflow

resolve_agent_runtime_manifest

load active bindings → apply precedence → filter by policy → validate compatibility → resolve relationships → build minimized manifest → record usage.

Section 114 Policy & governance

The Domain Registry evaluates:

  • Can this agent use this skill?
  • Can this skill expose this tool?
  • Can this tool create this command type?
  • Can this command type invoke this connector operation?
  • Can this command type produce this artifact?
  • Can this skill read / write this memory?
  • Can this agent spawn this subagent?
  • Can this workflow use this swarm template?
  • Does this capability require approval?
  • Is this version deprecated or retired?
  • Is this version allowed in this environment?
Registry changes are commands

Registry mutations themselves go through the Concord command contract: create_registry_entity, create_registry_version, validate_registry_version, publish_registry_version, bind_registry_version, deprecate_registry_version, retire_registry_version, rollback_registry_binding. Policy applies, audit records, approval gates trigger.

Section 115 API surface
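# Register a named entity of a given type (skill, tool, connector, command type, ...).
def create_registry_entity(
    entity_type: str, name: str, display_name: str,
    owner: str, description: str | None = None, domain: str | None = None,
) -> str: ...

# Add a version of an entity, carrying its manifest and references to stored artifacts.
def create_registry_version(
    entity_type: str, name: str, version: str,
    manifest: dict, artifact_refs: list[dict], created_by: str,
) -> str: ...

# Validate a version's manifest and return the validation result.
def validate_registry_version(version_id: str, actor: str) -> dict: ...

# Publish a version; published versions are immutable.
def publish_registry_version(version_id: str, actor: str) -> dict: ...

# Bind a version to a scope and target in one environment ("pinned" by default).
def bind_registry_version(
    version_id: str, binding_scope: str, binding_target: str,
    environment: str, actor: str,
    rollout_strategy: str = "pinned", rollout_config: dict | None = None,
) -> str: ...

# Resolve the skills, tools, connectors, and command types an AgentRun may use.
def resolve_agent_runtime_manifest(
    agent_run_id: str, agent_name: str, agent_role: str,
    context: dict, environment: str,
) -> dict: ...

# Resolve the contract that governs a command type in this context and environment.
def resolve_command_type_contract(
    command_type: str, context: dict, environment: str,
) -> dict: ...

# Answer one governance question, e.g. "can this skill expose this tool?"
def can_use_capability(
    source_version_id: str, relationship_type: str,
    target_entity_type: str, target_entity_name: str, context: dict,
) -> bool: ...

# Roll a binding back to target_version_id, recording the actor and reason.
def rollback_registry_binding(
    binding_id: str, target_version_id: str, actor: str, reason: str,
) -> dict: ...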
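The toy, in-memory stand-in below suggests how publish, bind, and rollback might fit together. It is a sketch under stated assumptions, not the Concord implementation. The assumptions are: ids are sequential strings, a binding is unique per (scope, target, environment), rebinding appends to the binding's history, and a rollback is recorded as a new transition rather than an erasure. The `rollout_strategy` and `rollout_config` arguments are accepted but ignored here.

```python
import itertools


class InMemoryRegistry:
    """Toy stand-in for part of the API above; not Concord's implementation."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.versions: dict[str, dict] = {}      # version_id -> record
        self.bindings: dict[str, dict] = {}      # binding_id -> record with history
        self._binding_keys: dict[tuple, str] = {}

    def create_registry_version(self, entity_type, name, version, manifest, artifact_refs, created_by) -> str:
        version_id = f"ver_{next(self._ids)}"
        self.versions[version_id] = {"entity_type": entity_type, "name": name, "version": version,
                                     "manifest": dict(manifest), "status": "created"}
        return version_id

    def publish_registry_version(self, version_id, actor) -> dict:
        self.versions[version_id]["status"] = "published"
        return {"version_id": version_id, "status": "published", "actor": actor}

    def bind_registry_version(self, version_id, binding_scope, binding_target, environment, actor,
                              rollout_strategy="pinned", rollout_config=None) -> str:
        if self.versions[version_id]["status"] != "published":
            raise ValueError("only published versions can be bound")
        key = (binding_scope, binding_target, environment)
        binding_id = self._binding_keys.get(key)
        if binding_id is None:  # assumption: one binding per (scope, target, environment)
            binding_id = f"bind_{next(self._ids)}"
            self._binding_keys[key] = binding_id
            self.bindings[binding_id] = {"key": key, "history": []}
        self.bindings[binding_id]["history"].append(version_id)
        return binding_id

    def rollback_registry_binding(self, binding_id, target_version_id, actor, reason) -> dict:
        history = self.bindings[binding_id]["history"]
        if target_version_id not in history[:-1]:
            raise ValueError("can only roll back to a previously bound version")
        current = history[-1]
        history.append(target_version_id)  # rollback is a new transition, not an erasure
        return {"binding_id": binding_id, "from_version_id": current,
                "to_version_id": target_version_id, "actor": actor, "reason": reason}
```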
Section 116 Admin UX

The admin UI should support:

  • Listing entities by type.
  • Viewing and comparing versions.
  • Viewing manifests, relationships, declared capabilities, risk level, and eval results.
  • Publishing versions and approving high-risk versions.
  • Binding versions to scopes and viewing active bindings.
  • Rolling back, deprecating, and retiring.
  • Viewing lifecycle timelines and usage.
  • Performing impact analysis.
  • Previewing runtime manifests.

116.1 · Critical views

Domain registry dashboard
Entity detail
Version detail
Relationship graph
Binding detail
Risk review
Eval results
Usage graph
Runtime manifest preview
Impact analysis
Section 117 Observability

117.1 · Tracked behaviors

  • Versions resolved per entity type
  • Versions used per command type
  • Skills resolved per agent role
  • Tools enabled per skill
  • Connector operations invoked per tool
  • Policies applied per command type
  • Artifacts produced per command type
  • Memory writes by skill
  • Deprecated version usage
  • Retired version resolution attempts
  • Rollback frequency

117.2 · Metrics

registry_version_publish_count
registry_resolution_count
registry_resolution_latency_ms
registry_eval_failure_count
registry_binding_rollback_count
deprecated_version_usage_count
retired_version_block_count
high_risk_approval_count
impact_analysis_count
Section 118 Implementation plan
Phases
flowchart LR P1["Phase 1<br/>Registry kernel"] --> P2["Phase 2<br/>Core domain registries"] P2 --> P3["Phase 3<br/>Relationship graph"] P3 --> P4["Phase 4<br/>Evals & rollout"] P4 --> P5["Phase 5<br/>Swarm-aware registry"] P5 --> P6["Phase 6<br/>Optional extraction"] style P1 fill:#FAF8F2,stroke:#141413 style P6 fill:#F1F2EC,stroke:#6B7B5A
Phase 1

Registry kernel

Kernel tables, basic lifecycle, publish, bind, resolve, rollback. Entities supported: skill · tool · connector · command_type.
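The basic lifecycle can be held as an explicit transition table, so an illegal move (for example, publishing a retired version) fails loudly and each legal move emits a lifecycle event. The command names below come from the spec. The state names (`draft`, `validated`, `published`, `deprecated`, `retired`) and the linear path between them are assumptions for illustration.

```python
# Hypothetical state names; the spec names the commands but not the states.
TRANSITIONS = {
    ("draft", "validate_registry_version"): "validated",
    ("validated", "publish_registry_version"): "published",
    ("published", "deprecate_registry_version"): "deprecated",
    ("deprecated", "retire_registry_version"): "retired",
}


def apply_lifecycle_command(state: str, command: str, events: list) -> str:
    """Return the next state, or raise if `command` is not legal from `state`."""
    try:
        next_state = TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"{command} is not allowed from {state!r}") from None
    events.append({"command": command, "from": state, "to": next_state})  # lifecycle event
    return next_state
```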

Phase 2

Core domain registries

Add: policy · artifact schema · memory schema · agent role · approval type. Add relationship validation, runtime manifest generation, capability filtering, risk classification, high-risk approval workflow.
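Runtime manifest generation, capability filtering, and risk classification can be pictured as one pass over candidate capabilities. The sketch assumes a hypothetical `Capability` record with a two-level `risk` label and an `allowed` set of (kind, name) pairs produced by policy. Disallowed capabilities are dropped, high-risk ones are flagged for approval, and sorting keeps the output order stable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    kind: str        # "skill", "tool", "connector", "command_type"
    name: str
    version_id: str
    risk: str        # hypothetical label: "low" | "high"


def build_runtime_manifest(candidates: list[Capability], allowed: set[tuple[str, str]],
                           environment: str) -> dict:
    """Keep only allowed capabilities; flag high-risk ones as approval-gated."""
    manifest = {"environment": environment, "capabilities": [], "approval_required": []}
    for cap in sorted(candidates, key=lambda c: (c.kind, c.name)):  # stable order
        if (cap.kind, cap.name) not in allowed:
            continue  # disallowed capabilities never reach the agent
        manifest["capabilities"].append(
            {"kind": cap.kind, "name": cap.name, "version_id": cap.version_id})
        if cap.risk == "high":
            manifest["approval_required"].append(cap.name)
    return manifest
```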

Phase 3

Relationship graph

Add: registry_relationships, impact analysis, dependency resolution, version compatibility checks, graph visualization.
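Impact analysis over the relationship graph is a reverse reachability query: given a changed entity, find everything that depends on it directly or transitively. A minimal breadth-first sketch follows. It assumes edges are stored as hypothetical `(dependent, dependency)` string pairs rather than as `registry_relationships` rows.

```python
from collections import defaultdict, deque


def impact_analysis(edges: list[tuple[str, str]], changed: str) -> set[str]:
    """Everything that transitively depends on `changed`.

    Each edge is (dependent, dependency): e.g. a skill depends on a tool it exposes.
    """
    dependents: dict[str, set[str]] = defaultdict(set)
    for dependent, dependency in edges:
        dependents[dependency].add(dependent)

    impacted: set[str] = set()
    queue = deque([changed])
    while queue:  # breadth-first walk up the dependency edges
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted
```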

Phase 4

Evals & rollout

Add: registry_evaluations, eval suites, canary rollout, compatibility checks, deprecation warnings, usage tracking.
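Canary rollout has to respect the requirement that resolution be deterministic. One common approach is stable hash bucketing, where the same routing key always lands in the same bucket. The sketch assumes a hypothetical `"canary"` rollout strategy and hypothetical `rollout_config` keys (`canary_version_id`, `percent`). Only `"pinned"` appears in the spec.

```python
import hashlib


def pick_version(binding: dict, routing_key: str) -> str:
    """Choose the stable or canary version deterministically for a routing key."""
    if binding.get("rollout_strategy") != "canary":
        return binding["version_id"]  # "pinned": always the bound version
    cfg = binding["rollout_config"]
    digest = hashlib.sha256(f"{binding['binding_id']}:{routing_key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99, stable per key
    return cfg["canary_version_id"] if bucket < cfg["percent"] else binding["version_id"]
```

Salting the hash with the binding id keeps two unrelated canaries from always selecting the same runs.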

Phase 5

Swarm-aware registry

Add: swarm template + evaluation suite + workflow type registries, subagent skill constraints, parent-child scope inheritance, join-strategy compatibility.

Phase 6

Optional extraction

If the kernel stabilizes, extract concord-registry as a reusable library. Keep Concord-specific domain semantics in Concord.

Section 119 Acceptance criteria

119.1 · Functional · v1

  • Create registry entity.
  • Create registry version.
  • Validate manifest.
  • Publish version.
  • Bind version to scope.
  • Resolve version by scope.
  • Resolve skills / tools / connectors / command types for an AgentRun.
  • Prevent retired version from resolving.
  • Write lifecycle events.
  • Rollback binding.

119.2 · Technical · v1

  • All registry writes are audited.
  • Published versions are immutable.
  • Runtime manifests exclude disallowed capabilities.
  • DBOS lifecycle workflows run idempotently.
  • High-risk registry changes are blocked without approval.
  • Resolution is deterministic.
  • Postgres is the source of truth.
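One way to check "Published versions are immutable" at the application layer is to snapshot the manifest at publish time and record a content digest over canonical JSON. Any later change, even a nested one, then shows up as a digest mismatch. The sketch below assumes JSON-serializable manifests and uses hypothetical function names. In Postgres, the same guarantee might also be enforced by rejecting updates to published rows, for example with a trigger.

```python
import copy
import hashlib
import json
from types import MappingProxyType


def manifest_digest(manifest) -> str:
    """SHA-256 over canonical JSON (sorted keys, compact separators)."""
    canonical = json.dumps(dict(manifest), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def publish_manifest(manifest: dict) -> tuple[MappingProxyType, str]:
    """Snapshot the manifest at publish time and record its digest."""
    # Deep copy so later edits to the caller's dict cannot leak in; the proxy
    # blocks top-level assignment, and the digest catches any nested change.
    frozen = MappingProxyType(copy.deepcopy(manifest))
    return frozen, manifest_digest(frozen)


def verify_unchanged(manifest, recorded_digest: str) -> bool:
    return manifest_digest(manifest) == recorded_digest
```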
Section 120 Open questions
  1. How generic should the registry kernel be in v1?
  2. Should typed domain tables exist immediately, or are manifests + relationships sufficient at first?
  3. How strict should manifest validation be at first?
  4. Should SKILL.md and other large artifacts live in Git, object storage, or Postgres?
  5. Should high-risk approval be per version or per binding?
  6. How should runtime manifest caching work?
  7. How should registry impact analysis be visualized?
  8. Should deprecated versions resolve in production?
  9. How should active DBOS workflows behave if a registry version is retired mid-run?
  10. Should command type contracts be required before any command can run?
Section 121 Final recommendation

Build a new core Concord module concord.registry — the Concord Domain Registry, not only a Skills Registry.

121.1 · The twelve registries

Skill
Tool
Connector
Command type
Policy
Artifact schema
Memory schema
Agent role
Swarm template
Approval type
Evaluation suite
Workflow type

121.2 · On a shared kernel

versioned records
lifecycle state machine
artifact references
immutable published versions
generic bindings
scope resolution
relationship graph
compatibility checks
audit event helpers
rollback
deprecation
retirement

121.3 · The architecture

Concord Core defines the primitive model. Concord Registry defines capability contracts. DBOS Runtime executes them durably. Postgres records the truth.
Key framing

The Domain Registry is Concord's capability graph. Skills are one node type in that graph.