Data flow: Change apply
sequenceDiagram
autonumber
participant Admin as MSP admin
participant Next as Next.js
participant SbDB as Supabase Postgres
participant ResyncEdge as graph-policies-read
participant WriteEdge as graph-policies-write
participant Graph as Microsoft Graph
Admin->>Next: Approve + Apply on change_request
Next->>SbDB: verify status=dry_run_complete, dry_run_at within 30min, dry_run_result.ok
Next->>SbDB: CAS status dry_run_complete → applying
Next->>ResyncEdge: invoke (source=pre_change)
ResyncEdge->>Graph: GET CA policies
ResyncEdge->>SbDB: insert policy_snapshot source=pre_change
Next->>WriteEdge: invoke { tenant_id, change_id, payload }
WriteEdge->>WriteEdge: change-guard (status=applying, payload match, dry_run TTL)
WriteEdge->>Graph: PATCH /identity/conditionalAccess/policies/{id}
Graph-->>WriteEdge: 204 No Content
WriteEdge-->>Next: { ok: true }
Next->>ResyncEdge: invoke (source=post_change)
ResyncEdge->>SbDB: insert policy_snapshot source=post_change
Next->>SbDB: update change_request status=applied, snapshot ids
Next->>SbDB: insert audit_log
Next-->>Admin: page revalidate, status=applied
State machine
Statuses are defined in lib/changes/status-guard.ts. Transitions below are the
operator-visible paths; cancelled is available from several pre-apply states.
draft ── runDryRun ──┬──> dry_run_blocked ── re-run dry_run ───┐
                     │                                         │
                     ├──> awaiting_approval ── approveChange ──┤
                     │    (second admin; creator cannot        │
                     │     self-approve when required)         │
                     │                                         v
                     └──> dry_run_complete ────────────────────┘
                                 │
                 ┌───────────────┼───────────────┐
                 │               │               │
      (future scheduled_for)     │        approveAndApply
                 │               │      (or cron when due)
                 v               v               v
          stays complete     cancelled       applying
             until due                           │
                 │                               ├── pre_snapshot_failed → failed
                 │                               ├── graph write error → failed
                 │                               └── success → applied
                 │                                      │
                 └──────── cron picks up ───────────────┘
                                                        │
                                             rollback (applied only)
                                                        v
                                                   rolled_back
Status reference
| Status | Meaning | Next steps |
|---|---|---|
| draft | Created, not dry-run yet | Run dry-run |
| dry_run_blocked | Dry-run computed ok: false (validation/safety gate) | Fix payload or override; re-run dry-run |
| awaiting_approval | Dry-run passed but workspace require_approval or critical change requires second admin | Different admin calls approveChange → dry_run_complete |
| dry_run_complete | Ready to apply (within 30 min TTL) | Approve + apply, or schedule for later |
| applying | Compare-and-swap claim taken; pre-change snapshot and Graph write in flight | Completes to applied or failed |
| applied | Graph mutation succeeded; pre/post snapshots recorded (post may be missing with error_message=post_snapshot_failed) | Rollback available |
| failed | Pre-snapshot or Graph write failed before a successful apply | Re-run dry-run (allowed from failed), then apply again |
| rolled_back | Reverted via rollback path | Read-only; may carry post_rollback_snapshot_failed on error_message |
| cancelled | Operator cancelled before apply | Terminal |
Dry-run can be re-run from: draft, dry_run_complete, awaiting_approval,
dry_run_blocked, and failed. Re-running clears approval stamps and
scheduled_for, and recomputes the next status from the fresh result.
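
The re-run rule can be sketched as a small guard. This is illustrative only: the status names come from the table above, but `canRerunDryRun`, `clearForRerun`, and the `approved_by` / `approved_at` field names are assumptions and may not match what lib/changes/status-guard.ts actually exports.

```typescript
// Illustrative sketch; helper and field names are assumptions.
type ChangeStatus =
  | "draft"
  | "dry_run_blocked"
  | "awaiting_approval"
  | "dry_run_complete"
  | "applying"
  | "applied"
  | "failed"
  | "rolled_back"
  | "cancelled";

// Statuses from which a dry-run may be (re-)run, per the rule above.
const RERUNNABLE_STATUSES: ReadonlySet<ChangeStatus> = new Set<ChangeStatus>([
  "draft",
  "dry_run_complete",
  "awaiting_approval",
  "dry_run_blocked",
  "failed",
]);

function canRerunDryRun(status: ChangeStatus): boolean {
  return RERUNNABLE_STATUSES.has(status);
}

// Hypothetical row shape: approved_by / approved_at stand in for "approval stamps".
interface RerunResettableFields {
  approved_by: string | null;
  approved_at: string | null;
  scheduled_for: string | null;
}

// Returns a copy with approval stamps and scheduled_for cleared, as a re-run does.
function clearForRerun(row: RerunResettableFields): RerunResettableFields {
  return { ...row, approved_by: null, approved_at: null, scheduled_for: null };
}
```

Keeping the allowed set in one place means in-flight and terminal states (applying, applied, rolled_back, cancelled) are excluded by omission rather than by scattered checks.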
Approval gate
When msp.require_approval is true or the dry-run marks a critical/destructive
change, runDryRun lands in awaiting_approval instead of dry_run_complete.
approveChange (second admin, not the creator) moves the row to dry_run_complete.
approveAndApply rejects awaiting_approval directly so approval cannot be skipped.
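
A minimal sketch of the gate described above, assuming the dry-run result exposes an `ok` flag and some marker for critical/destructive changes. The `critical` field and both function names are hypothetical; only the status names and the "approver must not be the creator" rule come from this document.

```typescript
type PostDryRunStatus = "dry_run_blocked" | "awaiting_approval" | "dry_run_complete";

interface DryRunOutcome {
  ok: boolean; // dry_run_result.ok
  critical: boolean; // hypothetical: dry-run flagged the change as critical/destructive
}

// Decide where runDryRun lands the row.
function statusAfterDryRun(outcome: DryRunOutcome, requireApproval: boolean): PostDryRunStatus {
  if (!outcome.ok) return "dry_run_blocked";
  if (requireApproval || outcome.critical) return "awaiting_approval";
  return "dry_run_complete";
}

// approveChange precondition: row is awaiting approval and the approver is a second admin.
function canApprove(status: string, creatorId: string, approverId: string): boolean {
  return status === "awaiting_approval" && creatorId !== approverId;
}

// approveAndApply precondition: only a completed dry-run may be applied,
// so a row still in awaiting_approval is rejected.
function canApproveAndApply(status: string): boolean {
  return status === "dry_run_complete";
}
```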
Dry-run TTL
Dry-run results expire after 30 minutes (DRY_RUN_TTL_MINUTES in
lib/changes/service.ts and supabase/functions/_shared/change-guard.ts).
Stale dry-runs cannot apply until re-run.
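
The freshness check amounts to a timestamp comparison. The sketch below assumes `dry_run_at` is available as a `Date`; the helper name `isDryRunFresh` is hypothetical, and the boundary handling (exactly 30 minutes counts as fresh, future timestamps do not) is an assumption rather than documented behaviour.

```typescript
const DRY_RUN_TTL_MINUTES = 30;

// True when the dry-run ran no more than DRY_RUN_TTL_MINUTES before `now`.
function isDryRunFresh(dryRunAt: Date | null, now: Date = new Date()): boolean {
  if (dryRunAt === null) return false; // never dry-run
  const ageMs = now.getTime() - dryRunAt.getTime();
  // Negative age (timestamp in the future) or NaN (invalid date) is treated as not fresh.
  return ageMs >= 0 && ageMs <= DRY_RUN_TTL_MINUTES * 60_000;
}
```

Passing `now` explicitly keeps the check deterministic in tests and lets the app and the edge function evaluate the same instant.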
applying and compare-and-swap
Before any Graph write, the app (or scheduler) atomically updates
dry_run_complete → applying. Only one caller wins; others get
change_apply_conflict. If the pre-change snapshot fails while applying, the
row is stamped failed so it does not stay in-flight indefinitely.
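
To illustrate the claim semantics, here is a sketch that uses an in-memory map as a stand-in for the change_request table. In Postgres the same effect would come from a conditional UPDATE that only matches rows still at dry_run_complete. The function names and the `pre_snapshot_failed` error_message value are assumptions; `change_apply_conflict` is the code named above.

```typescript
type ClaimStatus = "dry_run_complete" | "applying" | "failed";

interface ClaimRow {
  status: ClaimStatus;
  error_message: string | null;
}

// Hypothetical in-memory stand-in for the change_request table.
const rows = new Map<string, ClaimRow>();

type ClaimResult = { ok: true } | { ok: false; code: "change_apply_conflict" };

// Equivalent in spirit to:
//   UPDATE change_request SET status = 'applying'
//   WHERE id = $1 AND status = 'dry_run_complete'
// followed by checking that exactly one row was updated.
function claimForApply(id: string): ClaimResult {
  const row = rows.get(id);
  if (!row || row.status !== "dry_run_complete") {
    return { ok: false, code: "change_apply_conflict" };
  }
  row.status = "applying";
  return { ok: true };
}

// If the pre-change snapshot fails after the claim, release the row to failed
// so it does not stay in-flight.
function markPreSnapshotFailed(id: string): void {
  const row = rows.get(id);
  if (row && row.status === "applying") {
    row.status = "failed";
    row.error_message = "pre_snapshot_failed"; // assumed value
  }
}
```

The check and the write happen in one step, so a second caller (for example the scheduler racing an interactive apply) sees a status other than dry_run_complete and loses.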
failed recovery
A row in failed retains the last dry-run payload and error. Recovery:
- Re-run dry-run (transitions back through blocked/approval/complete as appropriate).
- Apply again once status is dry_run_complete and TTL is fresh.
If Graph succeeded but post-change snapshot failed, status stays applied
(with error_message=post_snapshot_failed) so rollback remains possible - the
change is live in Entra.
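
The outcome rules can be summarised as a pure function of which steps succeeded. This is a sketch, not the service code: `resolveApplyOutcome` is a hypothetical name, and the `pre_snapshot_failed` and `graph_write_failed` error_message values are assumptions. Only `post_snapshot_failed` is documented above.

```typescript
interface ApplySteps {
  preSnapshotOk: boolean;
  graphWriteOk: boolean;
  postSnapshotOk: boolean;
}

interface ApplyOutcome {
  status: "applied" | "failed";
  error_message: string | null;
}

function resolveApplyOutcome(steps: ApplySteps): ApplyOutcome {
  // Nothing was written to Graph yet: safe to fail and retry via a fresh dry-run.
  if (!steps.preSnapshotOk) {
    return { status: "failed", error_message: "pre_snapshot_failed" }; // assumed value
  }
  if (!steps.graphWriteOk) {
    return { status: "failed", error_message: "graph_write_failed" }; // assumed value
  }
  // The change is live in Entra, so the row must stay applied to keep rollback available.
  if (!steps.postSnapshotOk) {
    return { status: "applied", error_message: "post_snapshot_failed" };
  }
  return { status: "applied", error_message: null };
}
```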
Change-guard on graph-policies-write
Every CA write through graph-policies-write calls
validateChangeForWrite (supabase/functions/_shared/change-guard.ts) before
touching Graph. This is the authoritative server-side gate; the edge function
does not trust the caller payload alone.
Apply mode requires:
- Row exists and caller payload matches stored change_request.payload (anti-tamper).
- status === 'applying' (Graph writes cannot bypass the workflow from draft/complete).
- pre_change_snapshot_id is still null (set only after a successful apply).
- dry_run_result.ok === true and dry_run_at within TTL.
Rollback mode requires:
- status === 'applied'
- pre_change_snapshot_id present (and matches caller when supplied)
Mismatch or wrong status returns 403 / 409 with generic codes such as
change_not_applicable, payload_mismatch, or dry_run_stale.
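
The apply and rollback requirements above could be checked along these lines. This is a sketch under stated assumptions, not the real validateChangeForWrite: the function names, the row shape, and the mapping of each failed check to a particular code are all hypothetical, and HTTP status selection is omitted. The JSON.stringify comparison is a simplification that is sensitive to key order.

```typescript
interface GuardedRow {
  status: string;
  payload: unknown;
  pre_change_snapshot_id: string | null;
  dry_run_result: { ok: boolean } | null;
  dry_run_at: string | null; // ISO timestamp
}

type GuardCode = "change_not_applicable" | "payload_mismatch" | "dry_run_stale";
type GuardResult = { ok: true } | { ok: false; code: GuardCode };

const GUARD_TTL_MS = 30 * 60_000;
const deny = (code: GuardCode): GuardResult => ({ ok: false, code });

function guardApply(row: GuardedRow | null, callerPayload: unknown, now: Date): GuardResult {
  if (row === null) return deny("change_not_applicable");
  // Anti-tamper: the caller must send exactly what was dry-run and stored.
  if (JSON.stringify(row.payload) !== JSON.stringify(callerPayload)) return deny("payload_mismatch");
  if (row.status !== "applying") return deny("change_not_applicable");
  if (row.pre_change_snapshot_id !== null) return deny("change_not_applicable");
  if (row.dry_run_result?.ok !== true) return deny("change_not_applicable");
  const ageMs = row.dry_run_at === null ? NaN : now.getTime() - Date.parse(row.dry_run_at);
  // NaN (missing or unparseable timestamp) fails both comparisons and is denied.
  if (!(ageMs >= 0 && ageMs <= GUARD_TTL_MS)) return deny("dry_run_stale");
  return { ok: true };
}

function guardRollback(row: GuardedRow | null, callerSnapshotId?: string): GuardResult {
  if (row === null || row.status !== "applied") return deny("change_not_applicable");
  if (row.pre_change_snapshot_id === null) return deny("change_not_applicable");
  if (callerSnapshotId !== undefined && callerSnapshotId !== row.pre_change_snapshot_id) {
    return deny("change_not_applicable");
  }
  return { ok: true };
}
```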
Scheduled apply (cron)
Changes with a future scheduled_for remain at dry_run_complete until due.
The changes-scheduled-apply edge function (pg_cron every 5 minutes) finds
rows where status = 'dry_run_complete', scheduled_for is set, and
scheduled_for <= now(), then mirrors approveAndApply:
- Re-checks dry-run TTL and dry_run_result.ok
- Optionally compares live policy fingerprint to the dry-run baseline (portal drift)
- CAS to applying, pre-change snapshot, graph-policies-write, post-change snapshot
- Per-row error isolation - one failure does not block others
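
The selection rule and per-row error isolation described above can be sketched as follows. The row shape and the `findDueRows` / `applyDueRows` names are hypothetical, and `applyOne` stands in for the whole TTL check, CAS, snapshot, and write sequence, which is not reproduced here.

```typescript
interface ScheduledRow {
  id: string;
  status: string;
  scheduled_for: string | null; // ISO timestamp
}

// Rows where status = 'dry_run_complete', scheduled_for is set, and scheduled_for <= now.
function findDueRows(rows: ScheduledRow[], now: Date): ScheduledRow[] {
  return rows.filter(
    (r) =>
      r.status === "dry_run_complete" &&
      r.scheduled_for !== null &&
      Date.parse(r.scheduled_for) <= now.getTime(),
  );
}

interface BatchResult {
  applied: string[];
  failed: { id: string; error: string }[];
}

// Applies each due row in turn; a throw from one row is recorded and the loop continues.
async function applyDueRows(
  rows: ScheduledRow[],
  now: Date,
  applyOne: (row: ScheduledRow) => Promise<void>,
): Promise<BatchResult> {
  const result: BatchResult = { applied: [], failed: [] };
  for (const row of findDueRows(rows, now)) {
    try {
      await applyOne(row);
      result.applied.push(row.id);
    } catch (err) {
      result.failed.push({ id: row.id, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return result;
}
```

Processing rows sequentially inside a try/catch keeps one tenant's Graph error from aborting the rest of the batch.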
See docs/admin/scheduled-changes.md for operator-facing detail.
Why pre + post snapshots
- Pre-change snapshot is the rollback source. If the PATCH succeeds but produces unexpected behaviour, rollbackChange reads this snapshot's policy doc and PATCHes back to that shape.
- Post-change snapshot is the verification source. It confirms what Microsoft actually accepted (sometimes Graph's PATCH does normalization or drops fields silently - we want to record what landed, not what we sent).
Snapshots are full per-tenant captures, not policy-level. That's deliberate - it means a change can be rolled back even if Microsoft normalized fields we didn't touch.
Rollback
Interactive rollback (rollbackChange in lib/changes/service.ts):
- Requires status === 'applied' and a non-null pre_change_snapshot_id.
- Claims the row (error_message = rollback_in_progress) to prevent double rollback.
- Invokes graph-policies-write with mode: 'rollback' (change-guard validates applied + snapshot).
- On Graph success, sets status = 'rolled_back' and audits.
- Post-rollback resync: calls resyncTenant(..., 'post_change') to refresh the tenant snapshot after revert. If that resync fails, status remains rolled_back but error_message is set to post_rollback_snapshot_failed - the Entra revert already happened; the operator should manually resync.
Non-reversible kinds (policy.delete, location.create, location.delete) and
missing create IDs are rejected before Graph. policy.create rollback deletes the
created policy via a synthesized delete payload.
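
The per-kind rules can be expressed as a small planning step that runs before any Graph call. The sketch below is an assumption about shape, not the service code: `planRollback`, the reason strings, and the synthesized payload fields are hypothetical. The kind names are the ones listed above.

```typescript
// Kinds that cannot be rolled back, per the list above.
const NON_REVERSIBLE_KINDS: ReadonlySet<string> = new Set([
  "policy.delete",
  "location.create",
  "location.delete",
]);

type RollbackPlan =
  | { ok: true; action: "restore_from_snapshot" }
  | { ok: true; action: "delete_created"; payload: { kind: "policy.delete"; policy_id: string } }
  | { ok: false; reason: "non_reversible_kind" | "missing_created_id" };

function planRollback(kind: string, createdPolicyId: string | null): RollbackPlan {
  if (NON_REVERSIBLE_KINDS.has(kind)) {
    return { ok: false, reason: "non_reversible_kind" };
  }
  if (kind === "policy.create") {
    // Undoing a create means deleting what was created, which needs its ID.
    if (!createdPolicyId) return { ok: false, reason: "missing_created_id" };
    return {
      ok: true,
      action: "delete_created",
      payload: { kind: "policy.delete", policy_id: createdPolicyId }, // hypothetical payload shape
    };
  }
  // Other kinds are restored from the pre-change snapshot.
  return { ok: true, action: "restore_from_snapshot" };
}
```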
Rollback constraints
- Cannot roll back if status ≠ applied
- Cannot roll back if pre_change_snapshot_id is null (apply failed before snapshot, or never applied)
- A rollback Graph failure leaves status at applied with error_message populated; admin can retry
Why "single-admin approve" not "two-admin"
This is a product decision for the default path: the dry-run requirement, the 30-minute TTL, the pre/post snapshots, and the audit log together form the safety net. Workspaces with require_approval enabled, and changes the dry-run marks as critical, go through the awaiting_approval second-admin step instead. A customer whose compliance regime demands standing four-eyes review on every change is covered by that flag together with critical-change promotion.