Data flow: Change apply

sequenceDiagram
  autonumber
  participant Admin as MSP admin
  participant Next as Next.js
  participant SbDB as Supabase Postgres
  participant ResyncEdge as graph-policies-read
  participant WriteEdge as graph-policies-write
  participant Graph as Microsoft Graph

  Admin->>Next: Approve + Apply on change_request
  Next->>SbDB: verify status=dry_run_complete, dry_run_at within 30min, dry_run_result.ok
  Next->>SbDB: CAS status dry_run_complete → applying
  Next->>ResyncEdge: invoke (source=pre_change)
  ResyncEdge->>Graph: GET CA policies
  ResyncEdge->>SbDB: insert policy_snapshot source=pre_change
  Next->>WriteEdge: invoke { tenant_id, change_id, payload }
  WriteEdge->>WriteEdge: change-guard (status=applying, payload match, dry_run TTL)
  WriteEdge->>Graph: PATCH /identity/conditionalAccess/policies/{id}
  Graph-->>WriteEdge: 204 No Content
  WriteEdge-->>Next: { ok: true }
  Next->>ResyncEdge: invoke (source=post_change)
  ResyncEdge->>Graph: GET CA policies
  ResyncEdge->>SbDB: insert policy_snapshot source=post_change
  Next->>SbDB: update change_request status=applied, snapshot ids
  Next->>SbDB: insert audit_log
  Next-->>Admin: page revalidate, status=applied

State machine

Statuses are defined in lib/changes/status-guard.ts. Transitions below are the operator-visible paths; cancelled is available from several pre-apply states.

draft ── runDryRun ──┬──> dry_run_blocked ── re-run dry_run ──┐
                     │                                         │
                     ├──> awaiting_approval ── approveChange ──┤
                     │         (second admin; creator cannot   │
                     │          self-approve when required)     │
                     │                                         v
                     └──> dry_run_complete ─────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    │               │               │
            (future scheduled_for)    │         approveAndApply
                    │               │         (or cron when due)
                    v               v               v
              stays complete    cancelled      applying
              until due                          │
                    │                            ├── pre_snapshot_failed → failed
                    │                            ├── graph write error → failed
                    │                            └── success → applied
                    │                                      │
                    └──────── cron picks up ───────────────┘
                                                         │
                                              rollback (applied only)
                                                         v
                                                   rolled_back

Status reference

Status	Meaning	Next steps
`draft`	Created, not dry-run yet	Run dry-run
`dry_run_blocked`	Dry-run computed `ok: false` (validation/safety gate)	Fix payload or override; re-run dry-run
`awaiting_approval`	Dry-run passed, but workspace `require_approval` or a critical change requires a second admin	A different admin calls `approveChange` → `dry_run_complete`
`dry_run_complete`	Ready to apply (within 30 min TTL)	Approve + apply, or schedule for later
`applying`	Compare-and-swap claim taken; pre-change snapshot and Graph write in flight	Completes to `applied` or `failed`
`applied`	Graph mutation succeeded; pre/post snapshots recorded (the post snapshot may be missing, with `error_message=post_snapshot_failed`)	Rollback available
`failed`	Pre-snapshot or Graph write failed before a successful apply	Re-run dry-run (allowed from `failed`), then apply again
`rolled_back`	Reverted via rollback path	Read-only; may carry `post_rollback_snapshot_failed` on `error_message`
`cancelled`	Operator cancelled before apply	Terminal

Dry-run can be re-run from: draft, dry_run_complete, awaiting_approval, dry_run_blocked, and failed. Re-running clears approval stamps and scheduled_for, and recomputes the next status from the fresh result.
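The re-run rule can be expressed as a small guard. This is a minimal sketch, not the contents of lib/changes/status-guard.ts: the status union mirrors the status reference above, while `canRerunDryRun`, `resetForRerun`, and the `approved_by` / `approved_at` fields (stand-ins for the "approval stamps") are hypothetical names.

```typescript
// Statuses taken from the status reference above.
type ChangeStatus =
  | "draft" | "dry_run_blocked" | "awaiting_approval" | "dry_run_complete"
  | "applying" | "applied" | "failed" | "rolled_back" | "cancelled";

// Hypothetical row shape; approved_by / approved_at are assumed names
// for the approval stamps mentioned in the text.
interface ChangeRow {
  status: ChangeStatus;
  approved_by: string | null;
  approved_at: string | null;
  scheduled_for: string | null;
}

// States from which a dry-run may be run or re-run, per the list above.
const RERUN_ALLOWED: ReadonlySet<ChangeStatus> = new Set<ChangeStatus>([
  "draft", "dry_run_complete", "awaiting_approval", "dry_run_blocked", "failed",
]);

function canRerunDryRun(status: ChangeStatus): boolean {
  return RERUN_ALLOWED.has(status);
}

// Re-running clears approval stamps and the schedule. The caller then sets
// the next status from the fresh dry-run result.
function resetForRerun(row: ChangeRow): ChangeRow {
  if (!canRerunDryRun(row.status)) {
    throw new Error(`dry-run not allowed from status ${row.status}`);
  }
  return { ...row, approved_by: null, approved_at: null, scheduled_for: null };
}
```

Keeping the allowed set in one place means the UI, the service, and any edge function can share the same answer to "may this row be dry-run again?".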

Approval gate

When msp.require_approval is true or the dry-run marks a critical/destructive change, runDryRun lands in awaiting_approval instead of dry_run_complete. approveChange (called by a second admin, not the creator) moves the row to dry_run_complete. approveAndApply rejects rows that are still in awaiting_approval, so approval cannot be skipped.
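The gate reduces to two small decisions: where a dry-run lands, and who may approve. The sketch below illustrates both under stated assumptions. `statusAfterDryRun` and `assertCanApprove` are hypothetical helpers, and `critical` stands in for whatever field the dry-run result actually uses to mark a critical/destructive change.

```typescript
type PostDryRunStatus = "dry_run_blocked" | "awaiting_approval" | "dry_run_complete";

// Hypothetical dry-run summary; `critical` is an assumed field name.
interface DryRunOutcome {
  ok: boolean;
  critical: boolean;
}

// Decide where runDryRun lands, following the rules described above.
function statusAfterDryRun(result: DryRunOutcome, requireApproval: boolean): PostDryRunStatus {
  if (!result.ok) return "dry_run_blocked";
  if (requireApproval || result.critical) return "awaiting_approval";
  return "dry_run_complete";
}

// Second-admin rule: only rows awaiting approval can be approved,
// and the creator may not approve their own change.
function assertCanApprove(status: string, creatorId: string, approverId: string): void {
  if (status !== "awaiting_approval") {
    throw new Error("change is not awaiting approval");
  }
  if (creatorId === approverId) {
    throw new Error("creator cannot self-approve");
  }
}
```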

Dry-run TTL

Dry-run results expire after 30 minutes (DRY_RUN_TTL_MINUTES in lib/changes/service.ts and supabase/functions/_shared/change-guard.ts). A stale dry-run cannot be applied until the dry-run is re-run.
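A freshness check along these lines would implement the TTL. It is a sketch only: `isDryRunFresh` is a hypothetical name, and treating the 30-minute boundary as inclusive and rejecting timestamps in the future are assumptions, not documented behaviour.

```typescript
// Mirrors the documented 30-minute TTL; the real constant lives in the
// files named above.
const DRY_RUN_TTL_MINUTES = 30;

// True when dry_run_at is present, parseable, not in the future, and no
// older than the TTL. `now` is injectable so the check is easy to test.
function isDryRunFresh(dryRunAt: string | null, now: Date = new Date()): boolean {
  if (dryRunAt === null) return false;
  const ranAt = Date.parse(dryRunAt);
  if (isNaN(ranAt)) return false;
  const ageMs = now.getTime() - ranAt;
  return ageMs >= 0 && ageMs <= DRY_RUN_TTL_MINUTES * 60 * 1000;
}
```

Sharing one constant between the app and the edge function, as the text describes, keeps the two checks from drifting apart.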

`applying` and compare-and-swap

Before any Graph write, the app (or scheduler) atomically updates dry_run_complete → applying. Only one caller wins; others get change_apply_conflict. If the pre-change snapshot fails while applying, the row is stamped failed so it does not stay in-flight indefinitely.
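The claim can be pictured as a conditional update that succeeds for exactly one caller. The sketch below models it in memory so it runs on its own; `ChangeStore` and `claimForApply` are hypothetical, and the SQL in the comment assumes a table named `change_request`.

```typescript
interface Row {
  id: string;
  status: string;
}

// In-memory stand-in for the table. In Postgres the same claim would be a
// single conditional UPDATE, for example:
//   UPDATE change_request SET status = 'applying'
//   WHERE id = $1 AND status = 'dry_run_complete';
// with the caller checking that exactly one row was updated.
class ChangeStore {
  private rows = new Map<string, Row>();

  put(row: Row): void {
    this.rows.set(row.id, { ...row });
  }

  statusOf(id: string): string | undefined {
    const row = this.rows.get(id);
    return row ? row.status : undefined;
  }

  // Compare-and-swap: succeeds only if the current status equals `expected`.
  compareAndSwapStatus(id: string, expected: string, next: string): boolean {
    const row = this.rows.get(id);
    if (!row || row.status !== expected) return false;
    row.status = next;
    return true;
  }
}

type ClaimResult = { ok: true } | { ok: false; code: "change_apply_conflict" };

// Only the first caller moves dry_run_complete -> applying; later callers
// see a status that no longer matches and get the conflict code.
function claimForApply(store: ChangeStore, id: string): ClaimResult {
  return store.compareAndSwapStatus(id, "dry_run_complete", "applying")
    ? { ok: true }
    : { ok: false, code: "change_apply_conflict" };
}
```

Because the status check and the write happen in one statement, an interactive apply and the scheduler cannot both proceed to the Graph write for the same row.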

`failed` recovery

A row in failed retains the last dry-run payload and error. Recovery:

Re-run dry-run (transitions back through blocked/approval/complete as appropriate).
Apply again once status is dry_run_complete and TTL is fresh.

If Graph succeeded but the post-change snapshot failed, status stays applied (with error_message=post_snapshot_failed). The change is already live in Entra, so rollback must remain possible.
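Putting the apply path and its failure handling together, the outcome mapping might look like the sketch below. All names are hypothetical: `ApplyDeps` bundles the three external steps as injected functions, `applyChange` is not the real approveAndApply, and `graph_write_failed` is an invented error string (only `pre_snapshot_failed` and `post_snapshot_failed` appear in this document).

```typescript
// Hypothetical dependency bundle; each function stands in for one step.
interface ApplyDeps {
  claim(): Promise<boolean>; // CAS dry_run_complete -> applying
  snapshot(source: "pre_change" | "post_change"): Promise<string>; // snapshot id
  graphWrite(): Promise<void>; // graph-policies-write invocation
}

interface ApplyResult {
  outcome: "applied" | "failed" | "conflict";
  error_message: string | null;
  pre_change_snapshot_id: string | null;
  post_change_snapshot_id: string | null;
}

// Run a step and convert a thrown error into a tagged result.
async function attempt<T>(fn: () => Promise<T>): Promise<{ ok: true; value: T } | { ok: false }> {
  try {
    return { ok: true, value: await fn() };
  } catch (e) {
    return { ok: false };
  }
}

const notApplied = (outcome: "failed" | "conflict", message: string): ApplyResult => ({
  outcome,
  error_message: message,
  pre_change_snapshot_id: null,
  post_change_snapshot_id: null,
});

async function applyChange(deps: ApplyDeps): Promise<ApplyResult> {
  if (!(await deps.claim())) return notApplied("conflict", "change_apply_conflict");

  const pre = await attempt(() => deps.snapshot("pre_change"));
  if (!pre.ok) return notApplied("failed", "pre_snapshot_failed");

  const write = await attempt(() => deps.graphWrite());
  if (!write.ok) return notApplied("failed", "graph_write_failed"); // invented code

  // The change is now live, so the outcome is applied even if the
  // post-change snapshot cannot be taken.
  const post = await attempt(() => deps.snapshot("post_change"));
  return {
    outcome: "applied",
    error_message: post.ok ? null : "post_snapshot_failed",
    pre_change_snapshot_id: pre.value,
    post_change_snapshot_id: post.ok ? post.value : null,
  };
}
```

The asymmetry is the point: failures before the Graph write end in failed with no snapshot ids on the row, while a failure after it keeps applied and the pre-change snapshot id so rollback still has a source.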

Change-guard on `graph-policies-write`

Every CA write through graph-policies-write calls validateChangeForWrite (supabase/functions/_shared/change-guard.ts) before touching Graph. This is the authoritative server-side gate; the edge function does not trust the caller payload alone.

Apply mode requires:

Row exists and caller payload matches stored change_request.payload (anti-tamper).
status === 'applying' (Graph writes cannot bypass the workflow from draft or dry_run_complete).
pre_change_snapshot_id is still null (set only after a successful apply).
dry_run_result.ok === true and dry_run_at within TTL.

Rollback mode requires:

status === 'applied'
pre_change_snapshot_id present (and matches caller when supplied)

A payload mismatch, wrong status, or stale dry-run returns 403 or 409 with a generic code such as change_not_applicable, payload_mismatch, or dry_run_stale.
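The apply-mode and rollback-mode rules above can be sketched as one pure function. This is an illustration, not the real validateChangeForWrite: `guardWrite`, the row shape, the key-order-insensitive payload comparison, and the mapping of each failed rule to a specific code are all assumptions. HTTP status selection is left out.

```typescript
type GuardMode = "apply" | "rollback";

// Hypothetical row shape, limited to the fields the rules above mention.
interface ChangeRequestRow {
  status: string;
  payload: unknown;
  pre_change_snapshot_id: string | null;
  dry_run_at: string | null;
  dry_run_result: { ok: boolean } | null;
}

type GuardCode = "change_not_applicable" | "payload_mismatch" | "dry_run_stale";
type GuardResult = { ok: true } | { ok: false; code: GuardCode };

const TTL_MS = 30 * 60 * 1000;

// Canonical JSON text with sorted object keys, so payloads that differ only
// in key order compare equal (adequate for plain JSON payloads).
function canonical(value: unknown): string {
  if (Array.isArray(value)) return "[" + value.map((v) => canonical(v)).join(",") + "]";
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj).sort().map((k) => JSON.stringify(k) + ":" + canonical(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

const deny = (code: GuardCode): GuardResult => ({ ok: false, code });

function guardWrite(
  mode: GuardMode,
  row: ChangeRequestRow | null,
  callerPayload: unknown,
  now: Date,
  callerSnapshotId?: string,
): GuardResult {
  if (row === null) return deny("change_not_applicable");

  if (mode === "rollback") {
    const snap = row.pre_change_snapshot_id;
    const snapOk = snap !== null && (callerSnapshotId === undefined || callerSnapshotId === snap);
    return row.status === "applied" && snapOk ? { ok: true } : deny("change_not_applicable");
  }

  // Apply mode: anti-tamper first, then workflow state, then dry-run checks.
  if (canonical(callerPayload) !== canonical(row.payload)) return deny("payload_mismatch");
  if (row.status !== "applying" || row.pre_change_snapshot_id !== null) return deny("change_not_applicable");
  if (row.dry_run_result === null || row.dry_run_result.ok !== true) return deny("change_not_applicable");
  const ranAt = row.dry_run_at === null ? NaN : Date.parse(row.dry_run_at);
  if (isNaN(ranAt) || now.getTime() - ranAt > TTL_MS) return deny("dry_run_stale");
  return { ok: true };
}
```

Keeping the guard free of I/O makes it straightforward to unit-test each rejection path without a database or Graph.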

Scheduled apply (cron)

Changes with a future scheduled_for remain at dry_run_complete until due. The changes-scheduled-apply edge function (pg_cron every 5 minutes) finds rows where status = 'dry_run_complete', scheduled_for is set, and scheduled_for <= now(), then mirrors approveAndApply:

Re-checks dry-run TTL and dry_run_result.ok
Optionally compares live policy fingerprint to the dry-run baseline (portal drift)
CAS to applying, pre-change snapshot, graph-policies-write, post-change snapshot
Per-row error isolation: one failing row does not block the others

See docs/admin/scheduled-changes.md for operator-facing detail.
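The due-row selection and per-row isolation can be sketched as follows. This is not the changes-scheduled-apply source: `selectDue`, `runScheduled`, and `RunSummary` are hypothetical, the apply step is injected as `applyOne`, and processing rows sequentially is a simplifying assumption.

```typescript
interface ScheduledRow {
  id: string;
  status: string;
  scheduled_for: string | null;
}

// Rows that are due: dry_run_complete, scheduled, and scheduled_for <= now.
function selectDue(rows: ScheduledRow[], now: Date): ScheduledRow[] {
  return rows.filter(
    (r) =>
      r.status === "dry_run_complete" &&
      r.scheduled_for !== null &&
      Date.parse(r.scheduled_for) <= now.getTime(),
  );
}

interface RunSummary {
  applied: string[];
  failed: { id: string; error: string }[];
}

// Per-row isolation: each apply is wrapped in its own try/catch so one
// failure is recorded and the loop moves on to the next row.
async function runScheduled(
  rows: ScheduledRow[],
  now: Date,
  applyOne: (row: ScheduledRow) => Promise<void>,
): Promise<RunSummary> {
  const summary: RunSummary = { applied: [], failed: [] };
  for (const row of selectDue(rows, now)) {
    try {
      await applyOne(row);
      summary.applied.push(row.id);
    } catch (e) {
      summary.failed.push({ id: row.id, error: e instanceof Error ? e.message : String(e) });
    }
  }
  return summary;
}
```

In practice `applyOne` would perform the same TTL re-check, CAS claim, snapshots, and write as the interactive path, so a row that went stale while waiting surfaces as a recorded failure rather than a silent skip.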

Why pre + post snapshots

Pre-change snapshot is the rollback source. If the PATCH succeeds but produces unexpected behaviour, rollbackChange reads this snapshot's policy doc and PATCHes back to that shape.
Post-change snapshot is the verification source. It confirms what Microsoft actually accepted. Graph's PATCH sometimes normalizes values or silently drops fields, so the snapshot records what landed, not what we sent.

Snapshots are full per-tenant captures, not policy-level. That is deliberate: a change can be rolled back even if Microsoft normalized fields we didn't touch.
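One way to use the post-change snapshot for verification is to compare the sent payload against the policy document that landed. The helper below is purely illustrative; `diffSentVsLanded` is a hypothetical name, and it checks only the top-level keys of the sent payload.

```typescript
type PolicyDoc = Record<string, unknown>;

// Compare what was sent with what the post-change snapshot recorded.
// Nested values are compared by JSON text, which is sensitive to key
// order (a simplification for this sketch).
function diffSentVsLanded(sent: PolicyDoc, landed: PolicyDoc): { dropped: string[]; changed: string[] } {
  const dropped: string[] = [];
  const changed: string[] = [];
  for (const key of Object.keys(sent)) {
    if (!(key in landed)) {
      dropped.push(key); // sent, but absent from what Graph stored
    } else if (JSON.stringify(sent[key]) !== JSON.stringify(landed[key])) {
      changed.push(key); // stored with a different value than sent
    }
  }
  return { dropped, changed };
}
```

Keys that exist only in the landed document (server-assigned ids, timestamps) are ignored here, since they were never part of the request.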

Rollback

Interactive rollback (rollbackChange in lib/changes/service.ts):

Requires status === 'applied' and a non-null pre_change_snapshot_id.
Claims the row (error_message = rollback_in_progress) to prevent double rollback.
Invokes graph-policies-write with mode: 'rollback' (change-guard validates applied + snapshot).
On Graph success, sets status = 'rolled_back' and audits.
Post-rollback resync: calls resyncTenant(..., 'post_change') to refresh the tenant snapshot after the revert. If that resync fails, status remains rolled_back but error_message is set to post_rollback_snapshot_failed. The Entra revert has already happened, so the operator should resync manually.

Non-reversible kinds (policy.delete, location.create, location.delete) and policy.create changes with a missing created-policy ID are rejected before any Graph call. Rollback of policy.create deletes the created policy via a synthesized delete payload.
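The pre-Graph rollback decision can be sketched as a planner that returns either a rejection or the action to take. `planRollback`, `RollbackPlan`, and the `created_policy_id` field are hypothetical. Treating every kind other than those named above as "restore from the pre-change snapshot" is also an assumption.

```typescript
// Kinds the text lists as non-reversible.
const NON_REVERSIBLE: ReadonlySet<string> = new Set([
  "policy.delete", "location.create", "location.delete",
]);

type RollbackPlan =
  | { action: "reject"; reason: string }
  | { action: "delete_created"; policyId: string }
  | { action: "restore_from_snapshot"; snapshotId: string };

interface AppliedChange {
  status: string;
  kind: string;
  pre_change_snapshot_id: string | null;
  created_policy_id: string | null; // hypothetical: ID returned by Graph on create
}

const reject = (reason: string): RollbackPlan => ({ action: "reject", reason });

function planRollback(change: AppliedChange): RollbackPlan {
  const snapshotId = change.pre_change_snapshot_id;
  if (change.status !== "applied") return reject("status is not applied");
  if (snapshotId === null) return reject("no pre-change snapshot");
  if (NON_REVERSIBLE.has(change.kind)) return reject("kind is not reversible");

  if (change.kind === "policy.create") {
    // Undoing a create means deleting the policy that was created.
    return change.created_policy_id === null
      ? reject("missing created policy id")
      : { action: "delete_created", policyId: change.created_policy_id };
  }

  // Assumed default: PATCH the policy back to its pre-change shape.
  return { action: "restore_from_snapshot", snapshotId };
}
```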

Rollback constraints

Cannot roll back if status ≠ applied
Cannot roll back if pre_change_snapshot_id is null (apply failed before snapshot, or never applied)
A rollback Graph failure leaves status at applied with error_message populated; an admin can retry

Why "single-admin approve" not "two-admin"

This is a product decision for the default path. The dry-run requirement, the 30-minute TTL, the pre/post snapshots, and the audit log together form the safety net. Workspaces with require_approval enabled, and any critical change, go through the awaiting_approval second-admin step instead. A customer whose compliance regime demands standing four-eyes review on every change is covered by that flag together with critical-change promotion.