At approximately 00:10 UTC, one of our servers running RabbitMQ (the software that Telerivet uses internally to queue messages and other tasks) began experiencing very high CPU usage, very slow response times, and intermittent errors when queueing or dequeueing messages.
During this time, messages could still be queued (with intermittent errors), and only 2% of API requests failed; however, the slow response times and intermittent errors from RabbitMQ caused the worker processes that dequeue messages to fall progressively further behind.
Switching to a standby server in our RabbitMQ cluster did not resolve the issue. Eventually, we restarted the RabbitMQ process, at which point CPU usage returned to normal, the intermittent errors stopped, and the worker processes quickly caught up.
Telerivet has not yet identified a particular bug or configuration issue in RabbitMQ that caused this incident. In the next few days, we will upgrade RabbitMQ to the latest release and perform additional testing to try to reproduce the behavior outside of Telerivet's production environment.
Posted May 11, 2016 - 05:32 UTC
The message queue returned to normal at approximately 00:59 UTC, and messages queued during the delay have been sent. We are continuing to investigate the root cause of the delays and intermittent errors so that we can prevent the issue from happening again.
Posted May 11, 2016 - 01:15 UTC
Telerivet is currently observing long response times and intermittent errors with the message queue, and we are working to resolve the issue.