Incident response
What we do when something goes wrong
This page documents what OpenSense does when an incident occurs, and what we ask of customers in return. It is short on purpose.
What counts as an incident
Three categories, with different response classes.
P0 — customer data loss or exposure
- Customer data is exposed to a party that should not have it.
- Customer data is irrecoverably lost.
- The audit trail's integrity is compromised in a way that cannot be silently corrected.
Examples: a database leak, a backup that turned out unrecoverable during a real restore, an audit-log row found to have been silently edited.
P1 — ingest or alarm path down
- Ingest endpoint returning 5xx for > 5 minutes.
- Alarm dispatch paths (Telegram, email) failing for > 5 minutes.
- Dashboard down for > 15 minutes.
Examples: Postgres unreachable, Postmark account suspended, Hetzner DC network partition.
P2 — degraded but functional
- Slow ingest (p99 latency > 2 s).
- Slow report rendering (> 2 min for a monthly).
- Single device's ingest token mis-issued; one customer affected.
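The thresholds above can be read as a small decision procedure. The following is a minimal sketch, not OpenSense's actual tooling: the `Signals` fields and the `classify` function are hypothetical names introduced for illustration, and only the numeric thresholds come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Hypothetical snapshot of health signals; all field names are illustrative."""
    data_exposed: bool = False
    data_lost: bool = False
    audit_trail_compromised: bool = False
    ingest_5xx_minutes: float = 0.0
    alarm_dispatch_failing_minutes: float = 0.0
    dashboard_down_minutes: float = 0.0
    ingest_p99_seconds: float = 0.0
    monthly_report_render_minutes: float = 0.0
    mis_issued_device_tokens: int = 0


def classify(s: Signals) -> Optional[str]:
    """Return the most severe matching class, or None if nothing is wrong."""
    # P0: customer data loss or exposure, or compromised audit trail.
    if s.data_exposed or s.data_lost or s.audit_trail_compromised:
        return "P0"
    # P1: ingest or alarm path down (thresholds in minutes).
    if (
        s.ingest_5xx_minutes > 5
        or s.alarm_dispatch_failing_minutes > 5
        or s.dashboard_down_minutes > 15
    ):
        return "P1"
    # P2: degraded but functional.
    if (
        s.ingest_p99_seconds > 2
        or s.monthly_report_render_minutes > 2
        or s.mis_issued_device_tokens > 0
    ):
        return "P2"
    return None
```

Checking the classes in order of severity means an event that matches several categories is always handled under the strictest one.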
Response timelines
| Class | Acknowledgement | Status update cadence | Resolution target |
|---|---|---|---|
| P0 | 30 minutes | Hourly | Best effort, with full disclosure post-resolution |
| P1 | 30 minutes | Hourly | 8 business hours |
| P2 | 4 business hours | Once per business day | 5 business days |
Acknowledgement: we publicly confirm we know about it (status page + email to affected customers).
Status update cadence: even when there is nothing new to report, we post "still investigating" at the cadence above.
Resolution: the incident is closed and a post-mortem is published.
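As an illustration of how the wall-clock parts of the table could be checked, here is a minimal sketch for P0 and P1 only. The function and constant names are hypothetical. P2 is left out on purpose, because its targets are expressed in business hours and days, which this page does not define.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the response-timelines table (P0 and P1 rows only).
ACK_WINDOW = {"P0": timedelta(minutes=30), "P1": timedelta(minutes=30)}
UPDATE_CADENCE = {"P0": timedelta(hours=1), "P1": timedelta(hours=1)}


def ack_deadline(incident_class: str, detected_at: datetime) -> datetime:
    """Latest time by which the incident should be publicly acknowledged."""
    return detected_at + ACK_WINDOW[incident_class]


def update_is_overdue(
    incident_class: str, last_update_at: datetime, now: datetime
) -> bool:
    """True if more than one cadence interval has passed since the last update."""
    return now - last_update_at > UPDATE_CADENCE[incident_class]


detected = datetime(2026, 1, 5, 12, 0, tzinfo=timezone.utc)
deadline = ack_deadline("P1", detected)  # 12:30 UTC the same day
```

Using timezone-aware datetimes throughout avoids ambiguity when the team and the customer sit in different time zones.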
Communication channels
- Status page (opensense.murzin.digital/status): the one-second health check; we post incident updates here first.
- Affected-customer email: direct to operators on file, from alerts@. Sent from infrastructure independent of the SaaS, so it survives outages of the main service.
- Public post-mortem: within 7 business days of resolution, for any P0 or any P1 lasting > 1 h. Posted at /blog/postmortem/<date-slug> on the main site.
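The post-mortem rule is simple enough to express directly. This is a hedged sketch with hypothetical function names; it assumes that "business days" means Monday to Friday with no holiday calendar, which this page does not specify.

```python
from datetime import date, timedelta


def requires_public_postmortem(incident_class: str, duration: timedelta) -> bool:
    """Any P0, or any P1 lasting longer than one hour."""
    if incident_class == "P0":
        return True
    return incident_class == "P1" and duration > timedelta(hours=1)


def add_business_days(start: date, days: int) -> date:
    """Step forward day by day, counting only Monday to Friday (assumption)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def postmortem_due(resolved_on: date) -> date:
    """Latest publication date: 7 business days after resolution."""
    return add_business_days(resolved_on, 7)
```

For example, under these assumptions an incident resolved on Monday 2026-01-05 would have its post-mortem due by Wednesday 2026-01-14.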
What we ask of customers
- Keep an up-to-date email on file. We will not phone you; we will email.
- Optionally, opt in to Telegram for incident notices. The Telegram path is faster than email for the operator's situational awareness.
- Be patient at the start. The first 30 minutes are mostly diagnosis. We will not invent a story to fill the silence.
What we will not do during an incident
- Delete or edit historical data to fix the problem. Audit-log integrity is a hard rule.
- Blame customers. If our ingest rejected a payload because the customer's payload was wrong, we will say "the ingest rejected this payload because X" — that is a fact, not blame. We will not fabricate fault.
- Mark "resolved" before it is. We would rather have a long open incident than a closed one that recurs.
Pre-incident hygiene
We do the boring things because they pay off during an incident:
- Postgres dumps run hourly. A tested restore every month is the target (see architecture; we acknowledge the gap).
- All deployments roll back with a single command.
- Schema migrations are reviewed by a second pair of eyes (today, the founder reviews their own work, which is suboptimal; this is one of the reasons we are hiring a second engineer in 2027).
- Audit-log heads are published daily; even a P0 cannot silently rewrite history without that being externally visible.
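To show why a published head makes silent edits visible, here is a minimal sketch of one common construction, a SHA-256 hash chain. This page does not say how OpenSense actually computes its audit-log heads, so the chaining scheme, the JSON serialization, and every name below are assumptions for illustration only.

```python
import hashlib
import json
from typing import Any, Dict, Sequence

GENESIS = "0" * 64  # assumed starting value for an empty log


def entry_hash(prev_hash: str, entry: Dict[str, Any]) -> str:
    """Hash one entry together with the hash of everything before it."""
    # Canonical JSON so the same entry always serializes to the same bytes.
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def chain_head(entries: Sequence[Dict[str, Any]]) -> str:
    """Fold the whole log into a single head hash."""
    head = GENESIS
    for entry in entries:
        head = entry_hash(head, entry)
    return head


def matches_published_head(
    entries: Sequence[Dict[str, Any]], published_head: str
) -> bool:
    """Recompute the head from the log and compare it with the published value."""
    return chain_head(entries) == published_head


log = [{"id": 1, "action": "create"}, {"id": 2, "action": "update"}]
published = chain_head(log)

# Editing any earlier row changes every later hash, including the head.
tampered = [{"id": 1, "action": "delete"}, {"id": 2, "action": "update"}]
```

In a scheme like this, anyone holding a previously published head can detect a rewritten row, because the recomputed head no longer matches.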
Past incidents
(Nothing to disclose yet. This section will accumulate. We will not pretend incidents never happen; we will name them when they occur.)