Datacake - API outage – Incident details

API outage

Resolved
Major outage
Started about 6 hours ago · Lasted about 5 hours
Updates
  • Postmortem

    Summary

    During a routine deployment, we introduced a database migration that degraded the performance of our primary database more than expected. This slowed down our ingestion pipeline and caused a growing backlog of incoming measurements.

    As the backlog increased, it exposed a weakness in how our ingestion workers handle concurrency. Instead of recovering once the database performance stabilized, the system got stuck in a state where workers were effectively slowing each other down. That meant the backlog couldn’t clear itself, and data delays persisted longer than they should have.

    Throughout the incident, no measurement data was lost. However, dashboards and API responses showed outdated values until the system fully recovered.


    Root Cause

    Two factors were at play here.

    The initial trigger was the database migration, which temporarily made queries slower and reduced ingestion throughput. That part alone would have been manageable.

    The bigger issue was how our ingestion workers behaved under pressure. They were configured to handle many tasks concurrently using shared database and cache connections. Under normal conditions, this works well and is efficient. But once the backlog built up, tasks started competing for those shared resources. Instead of working through the queue faster, they got in each other’s way, turning a temporary slowdown into a sustained bottleneck.
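    The contention pattern can be sketched as follows. This is a hypothetical illustration, not our actual worker code: threaded workers funnel every query through one lock-guarded shared connection, so adding workers adds waiting rather than throughput.

    ```python
    # Hypothetical sketch of threaded workers sharing one connection.
    # The single lock serializes all queries, which is the contention
    # pattern described above (illustrative only, not production code).
    import threading

    class SharedConnection:
        """Simulated database connection: one lock guards every query."""
        def __init__(self):
            self._lock = threading.Lock()
            self.queries_served = 0

        def execute(self, _sql):
            # Every thread must acquire this single lock, so under a
            # backlog, workers spend their time waiting on each other.
            with self._lock:
                self.queries_served += 1

    def run_workers(num_threads, queries_per_thread):
        conn = SharedConnection()  # one connection shared by all workers
        def worker():
            for _ in range(queries_per_thread):
                conn.execute("INSERT INTO measurements ...")
        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return conn.queries_served

    print(run_workers(8, 100))  # all 800 queries serialized on one lock
    ```

    Scaling `num_threads` up in this model never increases effective parallelism, which is why the backlog could not clear itself once it had formed.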


    Resolution

    We first addressed the database performance issue to remove the initial trigger.

    After that, we changed how ingestion workers run. Instead of using a shared, thread-based model, we moved to process-based workers where each one has its own dedicated connections. This removed the contention entirely, and the system caught up within minutes.
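    A minimal sketch of the process-based model, again hypothetical rather than our actual code: each worker process opens its own connection in a pool initializer, so there is no shared resource to contend on.

    ```python
    # Hypothetical sketch of process-based workers. Each process sets up
    # its own "connection" once via the pool initializer, so workers
    # never compete for a shared handle (illustrative only).
    import multiprocessing

    _conn = None  # per-process connection, created once per worker

    def init_worker():
        global _conn
        # In reality this would open dedicated database and cache
        # connections; a plain dict stands in for one here.
        _conn = {"queries_served": 0}

    def handle_measurement(value):
        # Each process talks only to its own connection: no cross-worker lock.
        _conn["queries_served"] += 1
        return value * 2  # stand-in for "processed measurement"

    def process_backlog(values, processes=4):
        with multiprocessing.Pool(processes, initializer=init_worker) as pool:
            return pool.map(handle_measurement, values)

    if __name__ == "__main__":
        print(process_backlog(range(5)))
    ```

    The trade-off is more connections and memory per worker, but backlog items are processed independently, which is what let the queue drain within minutes.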


    Follow-up Actions

    We’re adding better visibility into this kind of situation, especially around queue depth and processing times, so we can react earlier if something similar happens again.
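    As a rough sketch of what such a check could look like (the thresholds and function names here are hypothetical, not our actual alerting configuration):

    ```python
    # Hypothetical ingestion health check: flag when queue depth or
    # processing lag crosses a threshold (illustrative values only).
    def check_ingestion_health(queue_depth, avg_processing_seconds,
                               max_depth=10_000, max_lag_seconds=60):
        """Return a list of alert messages; an empty list means healthy."""
        alerts = []
        if queue_depth > max_depth:
            alerts.append(f"queue depth {queue_depth} exceeds {max_depth}")
        if avg_processing_seconds > max_lag_seconds:
            alerts.append(f"processing lag {avg_processing_seconds}s "
                          f"exceeds {max_lag_seconds}s")
        return alerts
    ```

    Alerting on queue depth as well as lag matters because, as in this incident, a backlog can grow while individual tasks still complete.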

    We’re also reviewing other parts of the system for similar concurrency patterns that could behave poorly under load.

    Finally, we’ll introduce an additional staging step for database migrations to catch performance regressions before they reach production.


    If anything was unclear during the incident or you have questions, feel free to reach out.

  • Resolved

    All systems have fully recovered.

    The ingestion backlog has been cleared and incoming data is now being processed in real time again. Dashboards and API responses are up to date.

    We’ll publish a detailed postmortem shortly with more information on what happened and what we’re doing to prevent this in the future.

  • Update

    There is still a backlog of data being processed, which is currently causing ingestion delays of around 30–40 minutes.

    We’re actively working on improving queue performance to reduce this delay and will continue to keep you updated as things progress.

  • Monitoring

    We identified a number of database queries that had gotten stuck, which impacted overall performance. After clearing those, performance has returned to normal.

    There is still a backlog of data being processed, so you might notice slight delays in data ingestion until the queue has fully caught up.

    We’re keeping a close eye on it and will share further updates if needed.

  • Identified

    We have identified that the issue is caused by an upstream database. We are continuing to work on a fix for this incident.

  • Investigating

    We are currently investigating issues with the availability of our API.