Affected
Operational from 8:24 PM to 4:11 AM
- Update
# Incident Timeline & Post-mortem

## Pre-Event Situation

This morning, the systems were all running without issue. The error rate was less than 1 out of every 500 requests. The system load was normal.

## Summary

At 1:57pm Central Time on January 16th, 2019, ResaleAI experienced an increase in system errors that resulted in degraded system performance. The core of the issue was related to the way database connections were shared across multiple services.

## Increased error rate

- **1:57pm** _error rate increased to 2%_
- **2:00pm** _error rate peaked at 95%_
- **2:01pm** _automated system posts to an internal Slack channel about a sustained increase in error rates in the North East, signifying a likely service interruption_
- **2:02pm** _error rate decreased to 76%_
- **2:03pm** _automated system posts to an internal Slack channel about a sustained increase in error rates in Canada, signifying a widespread system interruption_
- **2:03pm** one of our stores used an internal Slack channel to notify us of an issue
- **2:03pm** first customer chat comes in with a report of a problem (8 different stores initiated chat conversations within 10 minutes)
- **2:04pm** _error rate dropped to 47%_
- **2:06pm** _error rate increased to 86%_
- **2:07pm** 10 minutes since first error. Active conversation in the office: Whitney is responding to customer chats, BJ is actively diagnosing the source of the issue
- **2:08pm** first post to the Facebook user group asking about system status (3 separate threads posted to the Facebook group within 6 minutes)
- **2:12pm** internal Slack channel dedicated to emergency response activated to coordinate response efforts
- **2:13pm** basic understanding of the problem identified
  - automations and scheduled jobs stopped to free up resources
  - small configuration adjustment deployed
  - update to status page initiated
- **2:14pm** _error rate drops to 36%_
- **2:15pm** _error rate drops to 11%_
- **2:16pm** _error rate returns to normal at 0.02%_
- **2:20pm** automations and background jobs reactivated
- **2:24pm** ResaleAI status page update posted

## Back to normal

This was an unusual event. We see no indication that any issues have persisted. At this time we are marking the issue as resolved.
- Resolved
This incident has been resolved.
- Monitoring
We noticed an increase in failed requests to RAI. We made a small configuration change in response and are monitoring the situation. Service appears to be restored, but stores may need to restart RAI.