API endpoints timing out, customers unable to complete checkout
Incident Report for Upscribe
Postmortem

Description

Symptom

On August 16th, 2021, in the early morning, the checkout and other API endpoints were down. The admin was also not loading due to slow API response times and timeouts.

Cause

The cause was in part a new feature, Cohorts, deployed on August 10th. The cohort calculations placed additional stress on the database server. This created a second problem, in which some upcoming charges were billed multiple times.

Resolution

As a resolution, the Cohort calculations were stopped and locked database queries were killed. This brought the database server back to normal operation and normal service resumed.

Downtime Caused

For approximately 1.25 hours (6:00 AM to 7:15 AM PT), the API endpoints were unusable and new customers were unable to check out.

Cohorts

The Cohort feature had been in development for some time and was soft-released on August 10th. After the release, we observed increased activity on the database due to the extra calculations made to aggregate the cohort data.

Whenever a customer, subscription, or subscription queue is modified, we flag the cohort/customer for recalculation. A cron job then checks for customers that need recalculation and processes them. These calculations include finding active subscriptions, aggregating customer tags, compiling lists of product and variant IDs the customer has subscribed to, and so on. Each cohort recalculation takes approximately 1 second. Recalculations had been working fine for the week leading up to this point.
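
For context, the sketch below shows the general shape of this flag-and-recalculate flow. It is a minimal illustration in Python; the table, column, and helper names (customers.needs_cohort_recalc, recalculate_cohort, and so on) are assumptions made for the example, not our actual schema or code.

    # Minimal sketch of the flag-and-recalculate flow described above.
    # Names are illustrative; `conn` is any DB-API database connection.

    BATCH_SIZE = 100  # illustrative batch size per cron run

    def flag_for_recalculation(conn, customer_id):
        """Called whenever a customer, subscription, or subscription queue changes."""
        cur = conn.cursor()
        cur.execute(
            "UPDATE customers SET needs_cohort_recalc = 1 WHERE id = %s",
            (customer_id,),
        )
        conn.commit()

    def recalculate_cohort(conn, customer_id):
        """Aggregate active subscriptions, customer tags, and subscribed
        product/variant IDs for one customer (roughly 1 second of queries)."""
        pass  # placeholder for the aggregation queries

    def cohort_recalc_cron(conn):
        """Cron entry point: process customers flagged for recalculation."""
        cur = conn.cursor()
        cur.execute(
            "SELECT id FROM customers WHERE needs_cohort_recalc = 1 LIMIT %s",
            (BATCH_SIZE,),
        )
        for (customer_id,) in cur.fetchall():
            recalculate_cohort(conn, customer_id)
            cur.execute(
                "UPDATE customers SET needs_cohort_recalc = 0 WHERE id = %s",
                (customer_id,),
            )
            conn.commit()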

Upcoming Charge Processing

We have made several optimizations to our platform to handle the large number of upcoming charges that need to be processed in under 24 hours. This processing results in hundreds of customer, subscription, and subscription queue record modifications per minute. On the morning of the incident, those modifications created a large backlog of Cohort recalculations that needed processing.

As the cohorts started recalculating, they had to access tables also used by the upcoming charge processor. This compounded the stress on the database server: two busy processes were competing for the same resources, each slowing the other down.

Database Server CPU

The database server CPU reached and maintained 100% usage approximately 12 minutes after the day’s batch of upcoming charges started processing. Under these conditions, read/write operation latency increases. When write/commit operations are delayed, recently written data can no longer be trusted to be visible: a customer or subscription may be updated successfully through an API request, but in the background the change sits in a queue and has not yet been saved to the table’s row. Reading the record a few seconds later may therefore not reflect the change that was already submitted.
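
The toy simulation below illustrates this read-after-write gap. It does not use our actual stack; it simply models a "database" whose commits lag several seconds behind the API call that queued them.

    import queue
    import threading
    import time

    committed = {"subscription_42_status": "active"}  # what is actually on disk
    pending_writes = queue.Queue()                    # writes waiting to commit

    def slow_commit_worker(delay_seconds=5.0):
        """Background 'database' applying queued writes with high commit latency."""
        while True:
            key, value = pending_writes.get()
            time.sleep(delay_seconds)  # simulated commit latency at 100% CPU
            committed[key] = value

    def api_update(key, value):
        """The API call reports success as soon as the write is queued."""
        pending_writes.put((key, value))
        return {"ok": True}

    def api_read(key):
        return committed[key]

    threading.Thread(target=slow_commit_worker, daemon=True).start()
    api_update("subscription_42_status", "cancelled")
    print(api_read("subscription_42_status"))  # still "active": change not committed yet
    time.sleep(6)
    print(api_read("subscription_42_status"))  # now "cancelled"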

Slow API Response Times

With the database CPU pinned by the cohort and upcoming charge processing running in the background, API requests had to compete for the same database resources, and response times slowed. As early morning turned into late morning, API traffic increased, putting further stress on the database server.

Multiple Charges

The background process that processes upcoming charges runs at regular intervals. It “asks” the database for 5 pending charges that are flagged as unprocessed. It loops through these records, creates charges and orders, then flags each charge as processed.

Because of the write/commit latency of the database, the “flag as processed” operations were queued but not actually saved to the upcoming charge records. This means that the next time the processor ran and asked for 5 pending charges, the same records from the previous batch were returned. These were looped through again: charges and orders were created, and the charges were flagged as processed once more.

Eventually, as the database committed the queued “processed” flags, the processor moved on to the next 5 pending charges. But because of the commit delay, several upcoming charges had already been processed multiple times.
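
The sketch below shows the shape of this processing loop and where the reprocessing happens. Table, column, and helper names are assumptions made for illustration, not our actual code.

    # Sketch of the upcoming-charge processor described above. Names are
    # illustrative; `conn` is any DB-API database connection.

    BATCH_SIZE = 5

    def create_charge_and_order(conn, charge_id):
        """Hypothetical helper: bills the customer and creates the order."""
        pass

    def process_upcoming_charges(conn):
        cur = conn.cursor()
        cur.execute(
            "SELECT id FROM upcoming_charges WHERE processed = 0 LIMIT %s",
            (BATCH_SIZE,),
        )
        batch = [row[0] for row in cur.fetchall()]
        for charge_id in batch:
            create_charge_and_order(conn, charge_id)
            cur.execute(
                "UPDATE upcoming_charges SET processed = 1 WHERE id = %s",
                (charge_id,),
            )
            conn.commit()
            # If this commit is delayed by a saturated database, the next
            # run's SELECT still sees processed = 0 for these rows and
            # returns the same batch, so the charges are created again.

A common way to guard this kind of worker against delayed commits is to claim each charge atomically before billing it (for example, UPDATE upcoming_charges SET processed = 1 WHERE id = %s AND processed = 0, proceeding only when a row was actually claimed). We mention this as a general pattern, not as the specific fix we have shipped.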

Monitoring

Upscribe has several monitoring systems that analyze API request throughput, response times, database CPU usage, etc. 

Starting at 6:00 AM Pacific, the following alarms were triggered:

  • Email Sending Anomaly
  • Upcoming Notifications Timing Out
  • API Latency High
  • API 5XX Errors High
  • Error Rate Increase Detected

Resolution

At 7:21 AM Pacific, the database server was manually rebooted. This cleared pending write/commit operations and allowed service to resume for a short time. However, after approximately 5 minutes the CPU usage returned to 100% and performance degraded again.

At 9:04 AM Pacific, the Cohort cron was disabled. Locked database queries were killed to relieve the CPU stress on the system. Other crons (maintenance, tracking) were paused to reduce the number of operations on the database server. By 9:20 AM, CPU usage had returned to normal levels.

Cohort recalculations will remain disabled until we can investigate further and resolve the stress they put on the database server.

Next Steps

  • Create a process to kill deadlocked queries after X threshold (a rough sketch follows this list)
  • Create an alarm/monitor for locked database queries
  • Optimize the cohort recalculation logic to prevent CPU spikes
  • Revise the quality assurance process for any new feature that may impact infrastructure
  • Set up a process to communicate issues proactively at the time of downtime or outage, not only at discovery and/or resolution
  • Add developer operations capacity to provide additional support for internal monitoring and systems
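
As a rough illustration of the first two next steps, the sketch below finds and kills queries that have been running longer than a threshold. It assumes a MySQL-compatible database and uses placeholder credentials and a placeholder threshold; a PostgreSQL equivalent would query pg_stat_activity and call pg_terminate_backend() instead.

    # Rough sketch only: assumes MySQL; credentials and threshold are placeholders.
    import mysql.connector

    THRESHOLD_SECONDS = 120  # placeholder for the "X threshold" above

    def kill_long_running_queries():
        conn = mysql.connector.connect(
            host="db.internal", user="ops", password="...", database="upscribe"
        )
        cur = conn.cursor()
        cur.execute(
            "SELECT id, time, info FROM information_schema.processlist "
            "WHERE command = 'Query' AND time > %s",
            (THRESHOLD_SECONDS,),
        )
        for query_id, runtime, sql_text in cur.fetchall():
            print(f"Killing query {query_id} ({runtime}s): {sql_text}")
            cur.execute(f"KILL QUERY {query_id}")
        cur.close()
        conn.close()

    if __name__ == "__main__":
        kill_long_running_queries()
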
Posted Aug 17, 2021 - 15:25 PDT

Resolved
A new feature, Cohorts, soft-released 6 days prior, contributed to increased load on the system. This caused API endpoints to time out and some upcoming charges to be billed multiple times. The incident has been resolved.
Posted Aug 16, 2021 - 01:00 PDT