Production incident response (BPMN)

The shape of an incident-response runbook is exactly what BPMN was designed for: multiple roles handing off to each other under time pressure, with a clear escalation path and a hard line between "what we did internally" and "what we told users". Generic flowcharts blur those distinctions; BPMN's lanes make them auditable.

Three lanes, three roles. L1 owns the page-acknowledgement SLA. L2 owns the technical fix. Comms owns the user-facing narrative. Every cross-lane handoff is visible as a sequence flow that crosses a lane boundary. When the post-mortem arrives, you can answer "why did Comms post the all-clear before L2 confirmed?" by reading the lane crossings.
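Reading the lane crossings can also be done mechanically. The following is a minimal sketch, assuming the diagram is available as a node-to-lane map plus a list of sequence flows; the node ids (ack, triage, fix, confirm, all_clear) and the flow list are hypothetical and are not taken from the diagram source.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    source: str
    target: str


# Hypothetical node -> lane assignment (illustrative ids, not the real diagram).
LANE_OF = {
    "ack": "L1",
    "triage": "L1",
    "fix": "L2",
    "confirm": "L2",
    "all_clear": "Comms",
}

# Hypothetical sequence flows between those nodes.
FLOWS = [
    Flow("ack", "triage"),
    Flow("triage", "fix"),
    Flow("fix", "confirm"),
    Flow("confirm", "all_clear"),
]


def cross_lane_handoffs(lane_of, flows):
    """Return (source, target, from_lane, to_lane) for each flow that changes lane."""
    return [
        (f.source, f.target, lane_of[f.source], lane_of[f.target])
        for f in flows
        if lane_of[f.source] != lane_of[f.target]
    ]


handoffs = cross_lane_handoffs(LANE_OF, FLOWS)
for source, target, from_lane, to_lane in handoffs:
    print(f"{from_lane} -> {to_lane}: {source} -> {target}")
```

In this toy model the only flow into the Comms lane comes from L2's confirm step, which is the kind of ordering a post-mortem would want to check.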

Timer intermediate event. The node T1: intermediate timer "30 min" is drawn as a double-ringed circle with a clock-face glyph inside. It says: wait 30 minutes after the patch deploys before telling users the problem is fixed. That delay isn't an arbitrary process step: it is a bake time of the kind many SRE teams use to confirm that the fix holds and no regression has surfaced. Encoding it as a timer event (rather than a vague "wait" task) gives process-mining tools a well-defined start and end point, so they can measure the actual wait distribution in production.
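As a rough illustration of what measuring the wait distribution could look like, here is a minimal sketch over a hypothetical event log. The incident ids, the activity names (patch_deployed, all_clear_posted) and the timestamps are invented for the example; a real process-mining tool would read them from its own log format.

```python
from datetime import datetime, timedelta
from statistics import median

BAKE_TIME = timedelta(minutes=30)  # the "30 min" on the timer event

# Hypothetical event log: (incident id, activity, ISO 8601 timestamp).
EVENT_LOG = [
    ("INC-1", "patch_deployed", "2024-01-10T10:00:00"),
    ("INC-1", "all_clear_posted", "2024-01-10T10:31:00"),
    ("INC-2", "patch_deployed", "2024-01-11T14:00:00"),
    ("INC-2", "all_clear_posted", "2024-01-11T14:12:00"),
    ("INC-3", "patch_deployed", "2024-01-12T09:00:00"),
    ("INC-3", "all_clear_posted", "2024-01-12T09:45:00"),
]


def bake_waits(log):
    """Map each incident to the time between patch deploy and all-clear."""
    deployed = {}
    waits = {}
    for incident, activity, timestamp in log:
        when = datetime.fromisoformat(timestamp)
        if activity == "patch_deployed":
            deployed[incident] = when
        elif activity == "all_clear_posted" and incident in deployed:
            waits[incident] = when - deployed[incident]
    return waits


waits = bake_waits(EVENT_LOG)
# Incidents where the all-clear went out before the bake time elapsed.
early = sorted(i for i, w in waits.items() if w < BAKE_TIME)
median_wait = median(waits.values())

print("median wait:", median_wait)
print("all-clear posted early for:", early)
```

With a vague "wait" task there would be no agreed pair of events to subtract; the timer event pins down both ends of the interval.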

Loop-back on the triage XOR. The default flow G1 --* "info" --> TR routes back to triage for info-level pages: the engineer keeps revisiting the page in their queue rather than escalating. Strictly speaking this is a loop between the gateway and the triage task rather than a self-loop on one node, but it is a real cycle in the process graph. The layout's DFS cycle-break detects the back edge and lays out the rest as a forward DAG so the columns don't collapse.
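The general technique behind a DFS cycle-break can be sketched as follows. This is a generic illustration of back-edge detection, not the layout engine's actual code; the START and ESC node ids, and the assumption that TR flows directly into G1, are hypothetical.

```python
WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_back_edges(graph, start):
    """Return edges (u, v) where v is still on the DFS path when u is explored."""
    color = {node: WHITE for node in graph}
    back_edges = []

    def visit(u):
        color[u] = GREY
        for v in graph[u]:
            if color[v] == GREY:
                back_edges.append((u, v))  # closes a cycle
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    visit(start)
    return back_edges


# Hypothetical fragment of the process graph around the triage gateway.
GRAPH = {
    "START": ["TR"],
    "TR": ["G1"],
    "G1": ["TR", "ESC"],  # "info" loops back to TR; ESC is an assumed escalation node
    "ESC": [],
}

back_edges = find_back_edges(GRAPH, "START")

# Dropping the back edges leaves an acyclic graph that can be ranked into columns.
forward = {
    u: [v for v in targets if (u, v) not in back_edges]
    for u, targets in GRAPH.items()
}

print("back edges:", back_edges)
print("forward DAG:", forward)
```

Once the back edge is set aside, every remaining edge points from an earlier column to a later one, and the loop can be drawn afterwards as a return arrow.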

Message flow to the monitoring system. Status-page updates go out to the monitoring system as a message flow (SP ~~> "Monitoring"). The monitoring system is rendered as a black-box pool: whether it is Datadog, PagerDuty, Sentry, or whatever your stack uses, you don't model its internals, only the message exchange.
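BPMN only allows a message flow between two separate pools, which is why the monitoring system needs its own pool even when nothing inside it is modelled. A minimal sketch of that check, assuming a simple node-to-pool map; the pool name "Incident response" is hypothetical, and the black-box pool is represented by its own name because it has no inner nodes.

```python
# Hypothetical pool membership; "Incident response" is an invented pool name.
POOL_OF = {
    "SP": "Incident response",   # status-page update task
    "TR": "Incident response",   # triage task
    "Monitoring": "Monitoring",  # black-box pool: the pool itself is the endpoint
}


def check_message_flow(source, target, pool_of=POOL_OF):
    """Return (source, target) if the flow crosses pools, else raise ValueError."""
    if pool_of[source] == pool_of[target]:
        raise ValueError(
            f"message flow {source} ~~> {target} stays inside pool {pool_of[source]!r}"
        )
    return (source, target)


flow = check_message_flow("SP", "Monitoring")
print("valid message flow:", flow)
```

A flow between SP and TR would be rejected by this check, since both sit in the same pool and should be connected by a sequence flow instead.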