In the early morning of February 1st, API endpoints were down, impacting Checkout, the Merchant Portal, and the Customer Portal.
The cause appears to have been a routine API request that generated a costly query. That query ran for at least 8 hours and most likely held locks on resources needed by other API requests.
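The report does not name the database engine, so as an illustration only, a long-running query like this could be surfaced by a periodic check along these lines. The `QueryInfo` shape and the threshold value are assumptions for this sketch; real deployments would read the equivalent fields from the engine's activity view (e.g. `pg_stat_activity` in PostgreSQL or `information_schema.processlist` in MySQL).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class QueryInfo:
    # Hypothetical record shape; actual field names depend on the database.
    pid: int
    query: str
    started_at: datetime

def long_running(queries, now, threshold=timedelta(hours=1)):
    """Return the queries that have been running longer than the threshold."""
    return [q for q in queries if now - q.started_at > threshold]
```

Running a check like this every few minutes and alarming on any hit would have flagged the stuck query hours before connections were exhausted.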
As a resolution, the database was rebooted to clear the high volume of queries against it. This returned the database server to normal operation, and service resumed.
For approximately 2.5 hours (6:16 AM - 8:50 AM PT), the API endpoints were unusable, and new customers were unable to complete checkout or access the Customer Portal.
The database reached its maximum connection count of 5,000 and remained saturated for a sustained period. As a result, new customers were unable to complete checkout.
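A simple guard against this failure mode is to alarm before connections reach the configured maximum. A minimal sketch follows; the 5,000 limit comes from this incident, while the 80% warning threshold is an assumed value, not Upscribe's actual configuration.

```python
MAX_CONNECTIONS = 5000  # connection limit observed in this incident

def connection_alarm(current, limit=MAX_CONNECTIONS, warn_ratio=0.8):
    """Classify connection usage: 'ok', 'warn' near the limit, 'saturated' at it."""
    if current >= limit:
        return "saturated"
    if current >= limit * warn_ratio:
        return "warn"
    return "ok"
```

Paging on "warn" rather than "saturated" gives operators a window to kill the offending query before new checkouts start failing.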
The background API queries spiked database connections and competed with foreground requests for database resources, slowing API response times. As early morning turned into late morning, API traffic increased, putting further stress on the database server.
Upscribe has several monitoring systems that analyze API request throughput, API response times, database CPU usage, and other metrics.
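Monitors like these can be modeled as threshold checks over a metrics snapshot. The sketch below is hypothetical: the metric names and threshold values are illustrative assumptions, not Upscribe's actual alarm configuration.

```python
# Hypothetical thresholds; not Upscribe's actual alarm configuration.
THRESHOLDS = {
    "api_p95_response_ms": 1000,  # alarm if 95th-percentile latency exceeds 1s
    "db_cpu_percent": 90,         # alarm if database CPU is above 90%
    "db_connections": 4000,       # alarm as connections near the 5,000 limit
}

def signaled_alarms(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics whose current value exceeds its threshold."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]
```

During the incident window, a snapshot with elevated connections and CPU would have returned multiple alarm names at once, matching the cluster of alarms described below.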
From 3:25 AM to 6:11 AM Pacific Time the following alarms were signaled:
At 6:16 AM Pacific Time, the following alarm was signaled:
At 8:47 AM Pacific Time, the database server was manually rebooted. This cleared pending write/commit operations and allowed service to resume.
New subscriptions from the prior day were processed successfully and were completed prior to the outage.