Incident response

What we do when something goes wrong

This page documents what OpenSense does when an incident occurs, and what we ask of customers in return. It is short on purpose.

What counts as an incident

Three categories, with different response classes.

Customer data is exposed to a party that should not have it.
Customer data is irrecoverably lost.
The audit trail's integrity is compromised in a way that cannot be silently corrected.

Examples of the above: a database leak, a backup that proved unrecoverable during a real restore, an audit-log row found to have been silently edited.

Examples of outages: Postgres unreachable, Postmark account suspended, Hetzner DC network partition.

Class	Acknowledgement	Status update cadence	Resolution target
P0	30 minutes	Hourly	Best effort, with full disclosure post-resolution
P1	30 minutes	Every 60 minutes	8 business hours
P2	4 business hours	Once per business day	5 business days

Acknowledgement: we publicly confirm on the status page that we know about it.

Status update cadence: even when there is nothing new to report, we post "still investigating" at the cadence above.

Resolution: the incident is closed and a post-mortem is published.
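The timing rules in the table can be sketched as a small calculation. This is an illustration only, assuming the P0 and P1 figures are wall-clock times; the names `CLASSES`, `ack_deadline`, and `update_overdue` are hypothetical, and P2 is left out because its targets depend on a business-hours calendar that this page does not define.

```python
from datetime import datetime, timedelta

# Wall-clock targets copied from the table above (P0 and P1 only).
# P2 is omitted: it is measured in business hours and days, which
# would need a calendar that this page does not specify.
CLASSES = {
    "P0": {"ack": timedelta(minutes=30), "cadence": timedelta(minutes=60)},
    "P1": {"ack": timedelta(minutes=30), "cadence": timedelta(minutes=60)},
}


def ack_deadline(incident_class: str, detected_at: datetime) -> datetime:
    """Latest time by which the incident should be publicly acknowledged."""
    return detected_at + CLASSES[incident_class]["ack"]


def update_overdue(incident_class: str, last_update: datetime, now: datetime) -> bool:
    """True when the next status update is past due, even with no news."""
    return now - last_update > CLASSES[incident_class]["cadence"]
```

For example, `ack_deadline("P0", detected_at)` returns a time 30 minutes after detection, and `update_overdue` becomes true once more than 60 minutes have passed since the last posted update.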

How we communicate

Status page (opensense.murzin.digital/status): the one-second health check; we post incident updates here first.
Affected-customer email: sent directly to the operators on file, from alerts@. It is sent from infrastructure independent of the SaaS, so it survives outages of the main service.
Public post-mortem: published within 7 business days of resolution for any P0, or for any P1 lasting longer than 1 hour. Posted at /blog/postmortem/<date-slug> on the main site.
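The public post-mortem rule can be written down as a short check. The sketch below is illustrative: the function names are hypothetical, and it assumes business days are Monday to Friday, that public holidays are ignored, and that counting starts the day after resolution. None of those details are stated on this page.

```python
from datetime import date, timedelta


def needs_public_postmortem(incident_class: str, duration: timedelta) -> bool:
    """Any P0, or any P1 lasting more than 1 hour."""
    if incident_class == "P0":
        return True
    if incident_class == "P1":
        return duration > timedelta(hours=1)
    return False  # the stated rule does not require one for P2


def postmortem_due(resolved_on: date, business_days: int = 7) -> date:
    """Count forward the given number of weekdays (assumed Mon-Fri)."""
    day = resolved_on
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return day
```

Under these assumptions, an incident resolved on a Monday would have its post-mortem due on the Wednesday of the following week.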

What we ask of customers

Keep an up-to-date email on file. We will not phone you; we will email.
Optionally, opt in to Telegram for incident notices. Telegram is faster than email for keeping the operator aware of the situation.
Be patient at the start. The first 30 minutes are mostly diagnosis. We will not invent a story to fill the silence.

What we will not do

Delete or edit historical data to fix the problem. Audit-log integrity is a hard rule.
Blame customers. If our ingest rejected a payload because the customer's payload was wrong, we will say "the ingest rejected this payload because X" — that is a fact, not blame. We will not fabricate fault.
Mark "resolved" before it is. We would rather have a long open incident than a closed one that recurs.

We do the boring things because they pay off during an incident:

Postgres dumps run hourly. Restores are tested monthly (this is a target; see architecture; we acknowledge the gap).
All deployments roll back with a single command.
Schema migrations are meant to be reviewed by a second pair of eyes. Today, the founder reviews their own work, which is suboptimal; this is one of the reasons we are hiring a second engineer in 2027.
Audit-log heads are published daily; even a P0 cannot silently rewrite history without that being externally visible.
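This page does not say how audit-log heads are computed. One common construction, assumed here purely for illustration, is a hash chain: each row is hashed together with the previous head, so the final head commits to the entire history. The names `GENESIS` and `chain_head`, the use of SHA-256, and the sample rows are all hypothetical.

```python
import hashlib

GENESIS = b"\x00" * 32  # hypothetical starting value for an empty log


def chain_head(rows: list[bytes]) -> str:
    """Fold every row into a running SHA-256 digest and return the final head."""
    head = GENESIS
    for row in rows:
        # Each link commits to the previous head, so the final value
        # depends on every row that came before it.
        head = hashlib.sha256(head + row).digest()
    return head.hex()


log = [b"user=alice action=login", b"user=alice action=export"]
published = chain_head(log)  # the value that would be published that day

log[0] = b"user=mallory action=login"  # a silent edit to history
assert chain_head(log) != published  # the published head no longer matches
```

Because the published head sits outside the system, anyone holding a copy can recompute the chain later and notice that history was changed.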

Past incidents

(Nothing yet to disclose. This section will accumulate. We will not pretend we never have incidents; we will name them when they occur.)