Schematex
bpmn·OMG BPMN 2.0.2 / ISO/IEC 19510:2013·software, sre, devops·complexity 3/3

Production incident response (BPMN)

Three-lane BPMN of an on-call rotation handling a production page: L1 triage, L2 investigation, and a Comms lane that posts status updates. Exercises a timer intermediate event, severity-based XOR routing, and a self-loop on the triage gate.

For the on-call engineer

[Diagram: Production incident response, BPMN, left-to-right layout. 2 pools (Monitoring, Engineering), 13 flow objects. Lanes: On-call L1, On-call L2, Comms. Severity branches: P1, P2-P3, info.]

The shape of an incident-response runbook is exactly what BPMN was designed for: multiple roles handing off to each other under time pressure, with a clear escalation path and a hard line between "what we did internally" and "what we told users". Generic flowcharts blur those distinctions; BPMN's lanes make them auditable.

Three lanes, three roles. L1 owns the page-acknowledgement SLA. L2 owns the technical fix. Comms owns the user-facing narrative. Every cross-lane handoff is visible as a sequence flow that crosses a lane partition. When the post-mortem arrives, you can answer "why did Comms post the all-clear before L2 confirmed?" by reading the lane crossings.
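Reading the lane crossings can also be done mechanically. The sketch below is an illustration in Python, not part of Schematex: the node IDs other than TR, G1, SP and T1, the lane assignments, and the flow list are all assumptions made up for the example. It lists every sequence flow whose source and target sit in different lanes.

```python
# Hypothetical node -> lane assignment; only TR, G1, SP and T1 are IDs
# that appear in the source text, and their lanes here are assumed.
lane_of = {
    "ACK": "On-call L1", "TR": "On-call L1", "G1": "On-call L1",
    "INV": "On-call L2", "FIX": "On-call L2", "DEP": "On-call L2",
    "SP": "Comms", "T1": "Comms", "UPD": "Comms",
}

# Hypothetical sequence flows (source, target).
flows = [
    ("ACK", "TR"), ("TR", "G1"), ("G1", "INV"), ("INV", "FIX"),
    ("FIX", "DEP"), ("DEP", "SP"), ("SP", "T1"), ("T1", "UPD"),
]


def handoffs(flows, lane_of):
    """Return (source, target, from_lane, to_lane) for each lane-crossing flow."""
    return [
        (src, dst, lane_of[src], lane_of[dst])
        for src, dst in flows
        if lane_of[src] != lane_of[dst]
    ]


crossings = handoffs(flows, lane_of)
for src, dst, from_lane, to_lane in crossings:
    print(f"{src} -> {dst}: {from_lane} hands off to {to_lane}")
```

Under these assumed lanes the handoff list is short, which is the point: a post-mortem reviewer only has to inspect the flows that change owner.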

Timer intermediate event. T1: intermediate timer "30 min" is a double-ringed circle with a clock-face inner glyph. It says: wait 30 minutes after the patch deploys before telling users the problem is fixed. That delay isn't an arbitrary process step; it's the empirical bake time many SRE teams use to confirm a regression hasn't surfaced. Encoding it as a timer event (rather than a vague "wait" task) lets process-mining tools measure the actual wait distribution in production.
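To make the "measure the actual wait distribution" idea concrete, here is a minimal Python sketch. It assumes an event log of (incident id, activity, timestamp) rows; the log format, the incident IDs, and the timestamps are invented for illustration, and the gap between "Deploy patch" and "Update users" is used as a stand-in for the time spent at the timer.

```python
from datetime import datetime
from statistics import median

# Made-up sample log: (incident id, activity, ISO timestamp).
log = [
    ("INC-1", "Deploy patch", "2024-01-05T10:00:00"),
    ("INC-1", "Update users", "2024-01-05T10:31:00"),
    ("INC-2", "Deploy patch", "2024-01-09T14:00:00"),
    ("INC-2", "Update users", "2024-01-09T14:45:00"),
]


def timer_waits(log, before="Deploy patch", after="Update users"):
    """Return the sorted wait in minutes between two activities, per incident."""
    stamps = {}
    for case, activity, ts in log:
        stamps[(case, activity)] = datetime.fromisoformat(ts)

    waits = []
    for (case, activity), started in stamps.items():
        # Only count incidents that reached both activities.
        if activity == before and (case, after) in stamps:
            delta = stamps[(case, after)] - started
            waits.append(delta.total_seconds() / 60)
    return sorted(waits)


waits = timer_waits(log)
print("waits (min):", waits, "median:", median(waits))
```

If the measured waits drift well above the modelled 30 minutes, that is a signal the diagram and the real process have diverged.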

Self-loop on the triage XOR. The default flow G1 --* "info" --> TR routes back to triage for info-level pages — the engineer keeps revisiting the page in their queue rather than escalating. This is a real cycle in the process graph; the layout's DFS cycle-break detects the back edge and lays out the rest as a forward DAG so the columns don't collapse.
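The back-edge idea can be sketched with a standard three-colour depth-first search. This is a generic illustration in Python, not Schematex's actual layout code: the edge list is an assumed simplification of the diagram, and every node ID except G1 and TR is hypothetical.

```python
from collections import defaultdict

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished


def find_back_edges(edges, start):
    """Return the edges that point back to a node still on the DFS stack."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    color = defaultdict(int)  # every node starts WHITE
    back_edges = []

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:
                # Target is an ancestor on the current path: this closes a cycle.
                back_edges.append((node, nxt))
            elif color[nxt] == WHITE:
                visit(nxt)
        color[node] = BLACK

    visit(start)
    return back_edges


# Assumed, simplified flow list; G1 -> TR is the "info" default flow.
edges = [
    ("ALERT", "ACK"), ("ACK", "TR"), ("TR", "G1"),
    ("G1", "INV"),
    ("G1", "TR"),
    ("INV", "FIX"),
]

back = find_back_edges(edges, "ALERT")
forward = [edge for edge in edges if edge not in back]
print("back edges:", back)
```

Once the back edge is set aside, the remaining `forward` edges form a DAG, so a layered layout can assign columns to them and draw the loop afterwards.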

Message flow to the monitoring system. Status-page updates go out to the monitoring system as a message flow (SP ~~> "Monitoring"). The monitoring system is rendered as a black-box pool: whether it is Datadog, PagerDuty, Sentry, or whatever your stack uses, you don't model its internals, only the message exchange.

BPMN syntax