Features

Zart provides everything you need to build reliable, long-running workflows in Rust. Here is a detailed look at each feature.

1. State Recovery

Every step’s result is serialized and written to the database before the workflow advances. On any restart — planned or unplanned — completed steps are replayed from storage.

How it works:

Worker calls ctx.execute_step(MyStep { ... }).
Zart checks zart_steps for a row with (execution_id, step_name).
Hit: deserializes and returns the stored value. Step logic is never called.
Miss: runs step.run(), serializes the result, inserts the row, returns the value.

What this means for you:

A payment is never charged twice, even if your process crashes after the charge but before the receipt.
Database inserts, API calls, and emails become safe to “retry” — the retry only runs incomplete work.
No idempotency logic needed inside the step logic for most cases.

// This can crash anywhere — Zart recovers from the last completed step
ctx.execute_step(StepA).await?;
ctx.execute_step(StepB).await?; // <-- crash here
ctx.execute_step(StepC).await?; // resumes here

2. Retry Policies

Configure retry behaviour at both the task level and the individual step level.

Task-level retries via fn max_retries():

fn max_retries(&self) -> usize { 3 }

Step-level retries via ZartStep::retry_config():

struct ChargeStep { card: String, amount: i64 }

#[async_trait]
impl ZartStep for ChargeStep {
    type Output = ChargeResult;
    fn step_name(&self) -> Cow<'static, str> { Cow::Borrowed("charge-card") }
    fn retry_config(&self) -> Option<RetryConfig> {
        Some(RetryConfig::exponential(3, Duration::from_secs(2)))
    }
    async fn run(&self, _ctx: StepContext) -> Result<Self::Output, StepError> {
        stripe.charge(&self.card, self.amount).await
    }
}

// Usage:
let result = ctx.execute_step(ChargeStep { card, amount }).await?;

RetryConfig variants:

Config	Behaviour
`RetryConfig::none()`	No retries — fail immediately
`RetryConfig::fixed(n, delay)`	`n` retries, constant `delay` between each
`RetryConfig::exponential(n, base)`	`n` retries, delay doubles each attempt

Jitter is applied automatically to exponential backoff to prevent thundering herd.

3. Parallel Execution

Fan out work to multiple concurrent sub-tasks using ctx.schedule_step() and ctx.wait_all().

Each scheduled step becomes an independent task in the scheduler — it can run on a different worker node. wait_all durably suspends the parent until all children complete.

let h1 = ctx.schedule_step(EmailStep);
let h2 = ctx.schedule_step(BillingStep);
let h3 = ctx.schedule_step(ProvisionStep);

// Parent suspends here — no thread is blocked
ctx.wait_all(vec![h1, h2, h3]).await?;

If the parent process restarts while waiting, the children continue running independently and the parent resumes when they’re done.

See Parallel Steps for full documentation.

4. Durable Waits

Pause a workflow without occupying a thread or blocking a worker.

`ctx.sleep(name, duration)`

The execution is suspended and re-queued after the duration expires. Zero threads blocked. Each sleep requires a unique, stable name for replay.

ctx.sleep("approval-wait", Duration::from_secs(3600)).await?; // wait 1 hour

`ctx.sleep_until(name, timestamp)`

Sleep until an absolute point in time.

let next_monday = compute_next_monday();
ctx.sleep_until("wait-for-monday", next_monday).await?;

`ctx.wait_for_event(name, timeout)`

Suspend until an external signal arrives.

let payload: Approval = ctx.wait_for_event(
    "hr-approval",
    Some(Duration::from_secs(5 * 86400)),
).await?;

See Wait for Event for full documentation.

5. Idempotency

Zart prevents duplicate executions at the scheduling layer.

Execution IDs: every execution has a unique ID. Scheduling with an explicit ID that already exists is a no-op — the existing execution continues.

// Safe to call multiple times — only one execution is created
scheduler.schedule_with_id(
    "checkout",
    &format!("checkout-order-{order_id}"),
    CheckoutData { order_id },
).await?;

Step-level idempotency: within an execution, step names act as idempotency keys. A step that has already completed will never re-run its closure.

External call idempotency: use ctx.execution_id() to derive stable keys for external APIs:

let idempotency_key = format!("stripe-charge-{}", ctx.execution_id());
stripe.charge_idempotently(&card, amount, &idempotency_key).await?;

6. Concurrent Workers

Multiple worker instances can run against the same database with zero configuration. Zart uses SELECT … FOR UPDATE SKIP LOCKED to claim tasks — each execution is processed by exactly one worker at a time.

Worker A ─── polls ──→ claims exec-001, exec-003
Worker B ─── polls ──→ claims exec-002, exec-004
Worker C ─── polls ──→ claims exec-005 (skips locked rows)

Workers can be added or removed at any time. There is no leader election, no ZooKeeper, no coordination service.

let worker = Worker::new(scheduler, registry, WorkerConfig {
    poll_interval:        Duration::from_secs(5),
    max_tasks_per_poll:   10,
    max_concurrent_tasks: 16,  // tokio tasks per worker
    shutdown_timeout:     Duration::from_secs(30),
});

7. PostgreSQL Storage

All workflow state is persisted to PostgreSQL via the zart-postgres crate. No extra infrastructure — just a database you already have.

[dependencies]
zart-postgres = "0.1"

8. Events

External systems can deliver signals to waiting workflows via offer_event.

Rust:

scheduler.offer_event(execution_id, "payment-received", payload).await?;

HTTP:

POST /api/v1/events/{execution_id}/{event_name}

Events can carry arbitrary JSON payloads. The waiting wait_for_event call deserializes the payload into the expected type automatically.

Events time out gracefully — if no event arrives within the specified window, wait_for_event returns Err(TaskError::EventTimeout), allowing the workflow to take a fallback path.

9. Observability

Every execution and every step attempt is tracked in the database.

Execution status:

pending — scheduled, not yet started
running — actively being processed
waiting — durably paused (sleep or event wait)
completed — finished successfully
failed — exhausted retries

Step records (zart_steps table): stores the step name, attempt count, result JSON, and per-attempt error messages for debugging.

Query execution status via the Rust API:

let exec = scheduler.get_execution("exec-abc-123").await?;
println!("Status: {:?}", exec.status);
println!("Steps:  {}", exec.completed_steps);

10. Task Heartbeats

Long-running steps — external API calls, data pipelines, video processing — can take minutes or even hours. Without heartbeats, Zart’s orphan recovery would detect these tasks as “stuck” and reset them back to scheduled, causing duplicate execution.

How it works:

When a worker picks up a task, it locks the row with locked_at. A background heartbeat task runs in parallel with the handler, periodically updating locked_at to NOW(). When the handler returns (success or failure), the heartbeat is cancelled.

Worker picks up task → locked_at = T
    ↓
Heartbeat loop: every 100s → UPDATE locked_at = NOW()
    ↓ (handler runs for 20 minutes — external API call, data pipeline, etc.)
Heartbeat continues renewing...
    ↓
Handler returns → heartbeat cancelled → result persisted

Configuration:

let worker = Worker::new(scheduler, registry, WorkerConfig {
    orphan_timeout:     Duration::from_secs(300),   // 5 minutes
    heartbeat_interval: None,                        // auto: orphan_timeout / 3
    // ...
});

Setting	Default	Effect
`heartbeat_interval: None`	`orphan_timeout / 3`	Auto-computed — 2 retries before orphan reclaim
`heartbeat_interval: Some(60s)`	—	Custom 60-second interval
`heartbeat_interval: Some(0)`	—	Heartbeat disabled entirely

Failure modes:

Scenario	Behavior
Single heartbeat DB error	Logged, retried on next interval
Persistent DB failure (3+ misses)	Orphan recovery reclaims task; handler result is discarded
Handler panics	Cancellation token fires, heartbeat stops cleanly
Worker graceful shutdown	Cancellation cascades to heartbeat

Observability:

Heartbeat activity is tracked via Prometheus metrics:

zart_task_heartbeat_renewals_total — counter with success / failed / not_found status labels
zart_heartbeat_active — gauge showing currently running heartbeat loops

No schema changes required — the heartbeat only updates the existing locked_at column.