Datacake - API outage – Incident details

API outage

Resolved
Major outage
Started about 6 hours ago · Lasted about 5 hours
Updates
  • Postmortem

    Summary

    During a routine deployment, we introduced a database migration that degraded the performance of our primary database more than expected. This slowed down our ingestion pipeline and caused a growing backlog of incoming measurements.

    As the backlog increased, it exposed a weakness in how our ingestion workers handle concurrency. Instead of recovering once the database performance stabilized, the system got stuck in a state where workers were effectively slowing each other down. That meant the backlog couldn’t clear itself, and data delays persisted longer than they should have.

    Throughout the incident, no measurement data was lost. However, dashboards and API responses showed outdated values until the system fully recovered.


    Root Cause

    Two factors were at play here.

    The initial trigger was the database migration, which temporarily made queries slower and reduced ingestion throughput. That part alone would have been manageable.

    The bigger issue was how our ingestion workers behaved under pressure. They were configured to handle many tasks concurrently using shared database and cache connections. Under normal conditions, this works well and is efficient. But once the backlog built up, tasks started competing for those shared resources. Instead of working through the queue faster, they got in each other’s way, turning a temporary slowdown into a sustained bottleneck.
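    The contention pattern can be sketched as follows. This is a hypothetical illustration, not our actual worker code: threaded workers funnel every query through one lock-guarded shared connection, so adding workers adds waiting rather than throughput.

    ```python
    # Hypothetical sketch of threaded workers sharing one connection.
    # The single lock serializes all queries, which is the contention
    # pattern described above (illustrative only, not production code).
    import threading

    class SharedConnection:
        """Simulated database connection: one lock guards every query."""
        def __init__(self):
            self._lock = threading.Lock()
            self.queries_served = 0

        def execute(self, _sql):
            # Every thread must acquire this single lock, so under a
            # backlog, workers spend their time waiting on each other.
            with self._lock:
                self.queries_served += 1

    def run_workers(num_threads, queries_per_thread):
        conn = SharedConnection()  # one connection shared by all workers
        def worker():
            for _ in range(queries_per_thread):
                conn.execute("INSERT INTO measurements ...")
        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return conn.queries_served

    print(run_workers(8, 100))  # all 800 queries serialized on one lock
    ```

    Scaling `num_threads` up in this model never increases effective parallelism, which is why the backlog could not clear itself once it had formed.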


    Resolution

    We first addressed the database performance issue to remove the initial trigger.

    After that, we changed how ingestion workers run. Instead of using a shared, thread-based model, we moved to process-based workers where each one has its own dedicated connections. This removed the contention entirely, and the system caught up within minutes.
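    A minimal sketch of the process-based model, again hypothetical rather than our actual code: each worker process opens its own connection in a pool initializer, so there is no shared resource to contend on.

    ```python
    # Hypothetical sketch of process-based workers. Each process sets up
    # its own "connection" once via the pool initializer, so workers
    # never compete for a shared handle (illustrative only).
    import multiprocessing

    _conn = None  # per-process connection, created once per worker

    def init_worker():
        global _conn
        # In reality this would open dedicated database and cache
        # connections; a plain dict stands in for one here.
        _conn = {"queries_served": 0}

    def handle_measurement(value):
        # Each process talks only to its own connection: no cross-worker lock.
        _conn["queries_served"] += 1
        return value * 2  # stand-in for "processed measurement"

    def process_backlog(values, processes=4):
        with multiprocessing.Pool(processes, initializer=init_worker) as pool:
            return pool.map(handle_measurement, values)

    if __name__ == "__main__":
        print(process_backlog(range(5)))
    ```

    The trade-off is more connections and memory per worker, but backlog items are processed independently, which is what let the queue drain within minutes.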


    Follow-up Actions

    We’re adding better visibility into this kind of situation, especially around queue depth and processing times, so we can react earlier if something similar happens again.
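    As a rough sketch of what such a check could look like (the thresholds and function names here are hypothetical, not our actual alerting configuration):

    ```python
    # Hypothetical ingestion health check: flag when queue depth or
    # processing lag crosses a threshold (illustrative values only).
    def check_ingestion_health(queue_depth, avg_processing_seconds,
                               max_depth=10_000, max_lag_seconds=60):
        """Return a list of alert messages; an empty list means healthy."""
        alerts = []
        if queue_depth > max_depth:
            alerts.append(f"queue depth {queue_depth} exceeds {max_depth}")
        if avg_processing_seconds > max_lag_seconds:
            alerts.append(f"processing lag {avg_processing_seconds}s "
                          f"exceeds {max_lag_seconds}s")
        return alerts
    ```

    Alerting on queue depth as well as lag matters because, as in this incident, a backlog can grow while individual tasks still complete.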

    We’re also reviewing other parts of the system for similar concurrency patterns that could behave poorly under load.

    Finally, we’ll introduce an additional staging step for database migrations to catch performance regressions before they reach production.


    If anything was unclear during the incident or you have questions, feel free to reach out.

  • Resolved

    All systems have fully recovered.

    The ingestion backlog has been cleared and incoming data is now being processed in real time again. Dashboards and API responses are up to date.

    We’ll publish a detailed postmortem shortly with more information on what happened and what we’re doing to prevent this in the future.

  • Update

    There is still a backlog of data being processed, which is currently causing ingestion delays of around 30–40 minutes.

    We’re actively working on improving queue performance to reduce this delay and will continue to keep you updated as things progress.

  • Monitoring

    We identified a number of database queries that had gotten stuck, which impacted overall performance. After clearing those, performance has returned to normal.

    There is still a backlog of data being processed, so you might notice slight delays in data ingestion until the queue has fully caught up.

    We’re keeping a close eye on it and will share further updates if needed.

  • Identified

    We have identified that the issue is caused by an upstream database. We are continuing to work on a fix for this incident.

  • Investigating

    We are currently investigating issues with the availability of our API.