API Endpoints Down
Incident Report for Upscribe
Postmortem

Symptom

In the early morning of February 1st, API endpoints were down, impacting Checkout, Merchant Portal, and Customer Portal.

Cause

The cause appears to be a routine API endpoint request that generated a costly database query. This query ran for at least 8 hours and most likely locked resources needed by other API requests.
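
A query that runs for hours can be flagged by comparing each running query's elapsed time against a limit. The Python below is a minimal sketch under stated assumptions, not Upscribe's actual tooling: RunningQuery and find_long_running are hypothetical names, the 30-second limit is illustrative, and the list of running queries would have to come from whatever the database engine exposes (this report does not name the engine).

```python
from dataclasses import dataclass


# Hypothetical record for one running query; the field names are
# illustrative and not taken from Upscribe's systems.
@dataclass
class RunningQuery:
    query_id: int
    runtime_seconds: float


def find_long_running(queries, max_runtime_seconds):
    """Return queries whose runtime exceeds the limit, longest first."""
    over = [q for q in queries if q.runtime_seconds > max_runtime_seconds]
    return sorted(over, key=lambda q: q.runtime_seconds, reverse=True)


if __name__ == "__main__":
    sample = [
        RunningQuery(1, 0.4),
        RunningQuery(2, 8 * 3600),  # comparable to the 8-hour query
        RunningQuery(3, 45.0),
    ]
    # Assumed limit of 30 seconds; a real limit would be tuned to the workload.
    for q in find_long_running(sample, max_runtime_seconds=30):
        print(q.query_id, q.runtime_seconds)
```

Returning the longest-running queries first means an operator or automated process would see the most likely culprit at the top of the list.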

Resolution

As a resolution, the database was rebooted to clear the high volume of queries running against it. This returned the database server to its usual state, and normal service resumed.

Downtime Caused

For approximately 2.5 hours (6:16 AM - 8:50 AM PT), the API endpoints were unusable, and new customers were unable to check out or access the Customer Portal.

High Connection Count

The database reached and maintained a maximum connection count of 5,000 for a sustained period of time. As a result, new customers were unable to complete checkout. 

Slow API Response Times

API queries running in the background caused a spike in database connections. These queries also competed with incoming requests for database resources, which slowed API response times. As early morning turned into late morning, API traffic increased, putting further stress on the database server.

Monitoring

Upscribe has several monitoring systems that analyze API request throughput, response times, database CPU usage, etc. 

From 3:25 AM to 6:11 AM Pacific Time the following alarms were signaled:

  • CPU Usage High
  • Upcoming Notifications Timing Out
  • API Latency High
  • API 5XX Errors High

At 6:16 AM Pacific Time, the following alarm was signaled:

  • API and Checkout Down or Slow

Resolution

At 8:47 AM Pacific Time, the database server was manually rebooted. This cleared pending write/commit operations and allowed service to resume.

New subscriptions from the prior day were processed successfully and were completed prior to the outage. 

Next Steps

  • Lower the threshold used by the process that kills deadlocked queries.
  • Create alarms for normal and high connection counts over a sustained period of time (100 in 10 minutes and 1,000 in 1 minute).
  • Update communication process to proactively disseminate information during periods of downtime and outage throughout investigation and resolution.
  • Re-evaluate the priority of each alarm so that similar issues are escalated and resolved before downtime occurs.
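
The connection-count alarms proposed in the list above can be sketched as a sustained-threshold check. This Python is a minimal sketch, not a description of Upscribe's monitoring: it assumes "100 in 10 minutes and 1,000 in 1 minute" means a connection count at or above 100 held for 10 minutes, or at or above 1,000 held for 1 minute. AlarmRule, RULES, fired_alarms, and the rule names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmRule:
    name: str
    min_connections: int
    sustained_seconds: int


# Thresholds taken from the next steps above; the rule names are illustrative.
RULES = [
    AlarmRule("connections_elevated", 100, 10 * 60),
    AlarmRule("connections_high", 1000, 60),
]


def fired_alarms(samples, rules=RULES):
    """Return the names of rules that fire for the given samples.

    samples: list of (timestamp_seconds, connection_count), sorted by time.
    A rule fires when an unbroken run of samples stays at or above the
    rule's count and that run spans at least sustained_seconds.
    """
    fired = []
    for rule in rules:
        run_start = None
        for ts, count in samples:
            if count >= rule.min_connections:
                if run_start is None:
                    run_start = ts
                if ts - run_start >= rule.sustained_seconds:
                    fired.append(rule.name)
                    break
            else:
                # Any dip below the threshold resets the sustained window.
                run_start = None
    return fired


if __name__ == "__main__":
    # Samples every 30 seconds holding 1,200 connections for 90 seconds.
    spike = [(t, 1200) for t in range(0, 120, 30)]
    print(fired_alarms(spike))
```

Requiring the count to hold for the full window, rather than alarming on a single sample, keeps brief spikes from paging anyone while still catching a sustained climb well before a 5,000-connection ceiling.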
Posted Feb 02, 2022 - 09:57 PST

Resolved
Due to an API endpoint request creating a costly query, resources for other API requests were locked, impacting Checkout, the Customer Portal, and the Merchant Portal. After a reboot of the database, normal service resumed. This issue has been resolved.
Posted Feb 01, 2022 - 06:30 PST