Unusual amount of failed requests

Update

January 17, 2019 at 4:11 AM UTC

# Incident Timeline & Post-mortem

## Pre-Event Situation

This morning, the systems were all running without issue. The error rate was less than 1 out of every 500 requests. The system load was normal.

## Summary

At 1:57pm Central Time on January 16th, 2019, ResaleAI experienced an increase in system errors that resulted in degraded system performance. The core of the issue was related to the way database connections were shared across multiple services.

![](https://cl.ly/e51991/Screenshot-2019-01-17T02%3A38%3A39.033Z.png)

## Increased error rate

- **1:57pm** _error rate increased to 2%_
- **2:00pm** _error rate peaked at 95%_
- **2:01pm** _automated system posts to an internal Slack channel about a sustained increase in error rates in the North East, signifying a likely service interruption_
- **2:02pm** _error rate decreased to 76%_
- **2:03pm** _automated system posts to an internal Slack channel about a sustained increase in error rates in Canada, signifying a widespread system interruption_
- **2:03pm** one of our stores used an internal Slack channel to notify us of an issue
- **2:03pm** first customer chat comes in with a report of the problem (8 different stores initiated chat conversations within 10 minutes)
- **2:04pm** _error rate dropped to 47%_
- **2:06pm** _error rate increased to 86%_
- **2:07pm** 10 minutes since first error. Active conversation in the office: Whitney is responding to customer chats, BJ is actively diagnosing the source of the issue
- **2:08pm** first post to the Facebook user group asking about system status (3 separate threads posted to the Facebook group within 6 minutes)
- **2:12pm** internal Slack channel dedicated to emergency response activated to coordinate response efforts
- **2:13pm** basic understanding of the problem identified
  - automations and scheduled jobs stopped to free up resources
  - small configuration adjustment deployed
  - update to status page initiated
- **2:14pm** _error rate drops to 36%_
- **2:15pm** _error rate drops to 11%_
- **2:16pm** _error rate returns to normal at 0.02%_
- **2:20pm** automations and background jobs reactivated
- **2:24pm** ResaleAI status page update posted

## Back to normal

This was an unusual event. We see no indication that any issues have persisted. At this time we are marking the issue as resolved.

![](https://cl.ly/68e529/Screenshot-2019-01-17T02%3A56%3A36.390Z.png)

Resolved

January 17, 2019 at 4:11 AM UTC

This incident has been resolved.

Monitoring

January 16, 2019 at 8:24 PM UTC

We noticed an increase in failed requests to RAI. We made a small configuration change in response and are monitoring the situation. Service seems to be restored, but stores may need to restart RAI.

ResaleAI - Unusual amount of failed requests – Incident details

Experiencing partially degraded performance