Incident response

What we do when something goes wrong

This page documents what OpenSense does when an incident occurs, and what we ask of customers in return. It is short on purpose.

What counts as an incident

Three categories, with different response classes.

P0 — customer data loss or exposure

  • Customer data is exposed to a party that should not have it.
  • Customer data is irrecoverably lost.
  • The audit trail's integrity is compromised in a way that cannot be silently corrected.

Examples: a database leak, a backup that turned out unrecoverable during a real restore, an audit-log row found to have been silently edited.

P1 — ingest or alarm path down

  • Ingest endpoint returning 5xx for > 5 minutes.
  • Alarm dispatch paths (Telegram, email) failing for > 5 minutes.
  • Dashboard down for > 15 minutes.

Examples: Postgres unreachable, Postmark account suspended, Hetzner DC network partition.

P2 — degraded but functional

  • Slow ingest (p99 latency > 2 s).
  • Slow report rendering (> 2 min for a monthly).
  • Single device's ingest token mis-issued; one customer affected.
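The thresholds above can be read as a small decision rule. The sketch below is illustrative only: the `Signals` fields and the `classify` function are hypothetical names, not part of any OpenSense API, and it assumes the most severe matching class wins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Hypothetical health signals; field names are assumptions for illustration."""
    data_exposed_or_lost: bool = False
    audit_trail_compromised: bool = False
    ingest_5xx_minutes: float = 0.0
    alarm_dispatch_failing_minutes: float = 0.0
    dashboard_down_minutes: float = 0.0
    ingest_p99_seconds: float = 0.0
    monthly_report_render_minutes: float = 0.0
    mis_issued_device_tokens: int = 0


def classify(s: Signals) -> Optional[str]:
    """Return "P0", "P1", "P2", or None, checking the most severe class first."""
    if s.data_exposed_or_lost or s.audit_trail_compromised:
        return "P0"
    if (s.ingest_5xx_minutes > 5
            or s.alarm_dispatch_failing_minutes > 5
            or s.dashboard_down_minutes > 15):
        return "P1"
    if (s.ingest_p99_seconds > 2
            or s.monthly_report_render_minutes > 2
            or s.mis_issued_device_tokens > 0):
        return "P2"
    return None
```

For example, `classify(Signals(ingest_5xx_minutes=6))` falls into P1, while a slow p99 on its own is P2.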

Response timelines

Class | Acknowledgement  | Status update cadence | Resolution target
P0    | 30 minutes       | Hourly                | Best effort, with full disclosure post-resolution
P1    | 30 minutes       | Every 60 minutes      | 8 business hours
P2    | 4 business hours | Once per business day | 5 business days

Acknowledgement: we publicly confirm we know about it (status page + email to affected customers).

Status update cadence: even when there is "no new news", we post "still investigating" at the cadence above.

Resolution: the incident is closed and a post-mortem is published.
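As a rough sketch of how the P0/P1 timings could be tracked, the snippet below computes the acknowledgement deadline and flags an overdue status update. The function names are hypothetical, and P2 is left out because the source does not define business hours.

```python
from datetime import datetime, timedelta

# Wall-clock targets for P0 and P1, taken from the response timelines.
ACK_WITHIN = timedelta(minutes=30)
UPDATE_EVERY = timedelta(hours=1)


def ack_deadline(detected_at: datetime) -> datetime:
    """Latest time by which the incident should be publicly acknowledged."""
    return detected_at + ACK_WITHIN


def next_update_due(last_update_at: datetime) -> datetime:
    """An update is due one cadence interval after the previous one, news or not."""
    return last_update_at + UPDATE_EVERY


def is_update_overdue(last_update_at: datetime, now: datetime) -> bool:
    """True once the cadence interval has elapsed without a new post."""
    return now > next_update_due(last_update_at)
```

A "still investigating" post resets the clock in this model: the next update is due one hour after it, whether or not anything has changed.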

Communication channels

  • Status page (opensense.murzin.digital/status) — the one-second health check; we post incident updates here first.
  • Affected-customer email — direct to operators on file, from alerts@. Sent from infrastructure independent of the SaaS, so it survives outages of the main service.
  • Public post-mortem — within 7 business days of resolution, for any P0 or any P1 lasting > 1 h. Posted at /blog/postmortem/<date-slug> on the main site.
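The post-mortem rule can be sketched as two small checks. This is a minimal illustration, not the actual tooling: the function names are hypothetical, and it assumes business days are Monday to Friday with no holiday calendar.

```python
from datetime import date, timedelta


def postmortem_required(severity: str, duration: timedelta) -> bool:
    """Any P0, or any P1 that lasted longer than one hour."""
    return severity == "P0" or (severity == "P1" and duration > timedelta(hours=1))


def add_business_days(start: date, days: int) -> date:
    """Step forward day by day, counting only Monday to Friday (assumption)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def postmortem_deadline(resolved_on: date) -> date:
    """Seven business days after the resolution date."""
    return add_business_days(resolved_on, 7)
```

Under these assumptions, an incident resolved on Monday 2025-01-06 would have its post-mortem due by Wednesday 2025-01-15.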

What we ask of customers

  • Keep an up-to-date email on file. We will not phone you; we will email.
  • Optionally, opt in to Telegram for incident notices. The Telegram path is faster than email for the operator's situational awareness.
  • Be patient at the start. The first 30 minutes are mostly diagnosis. We will not invent a story to fill the silence.

What we will not do during an incident

  • Delete or edit historical data to fix the problem. Audit-log integrity is a hard rule.
  • Blame customers. If our ingest rejected a payload because the customer's payload was wrong, we will say "the ingest rejected this payload because X" — that is a fact, not blame. We will not fabricate fault.
  • Mark "resolved" before it is. We would rather have a long open incident than a closed one that recurs.

Pre-incident hygiene

We do the boring things because they pay off in the incident:

  • Postgres dumps hourly. Tested restore monthly (target — see architecture; we acknowledge the gap).
  • All deployments roll back with a single command.
  • Schema migrations are meant to be reviewed by a second pair of eyes. Today the founder reviews their own work, which is suboptimal; this is one of the reasons we are hiring a second engineer in 2027.
  • Audit-log heads are published daily; even a P0 cannot silently rewrite history without that being externally visible.
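To illustrate why a published head makes silent edits visible, here is a minimal hash-chain sketch. It is an assumption for illustration: this page does not describe how OpenSense actually computes its audit-log heads, and `chain_head` and the row layout are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder head for an empty log (assumption)


def chain_head(rows: list) -> str:
    """Fold each row into a running SHA-256 digest; the final digest is the head."""
    head = GENESIS
    for row in rows:
        # Canonical JSON so the same row always hashes to the same bytes.
        payload = json.dumps(row, sort_keys=True, separators=(",", ":"))
        head = hashlib.sha256((head + payload).encode("utf-8")).hexdigest()
    return head
```

If yesterday's head was published externally, anyone can recompute it from the current rows. Editing any earlier row changes every later digest, so the recomputed head no longer matches the published one.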

Past incidents

(Nothing yet to disclose. This section will accumulate. We will not pretend that incidents never happen; we will name them when they occur.)