At approximately 00:10 UTC, one of our servers running RabbitMQ (the software that Telerivet uses internally to queue messages and other tasks) began experiencing very high CPU usage, very slow response times, and intermittent errors when queueing or dequeueing messages.
During this time, messages could still be queued (with intermittent errors), and only 2% of API requests failed; however, the slow response times and intermittent errors from RabbitMQ caused the worker processes that dequeue messages to fall progressively further behind.
Switching to a standby server in our RabbitMQ cluster did not resolve the issue. Eventually, we restarted the RabbitMQ process, at which point CPU usage returned to normal, the intermittent errors stopped, and the worker processes quickly caught up.
Telerivet has not yet identified a specific bug or configuration problem in RabbitMQ that caused this issue. In the next few days, we will upgrade RabbitMQ to the latest release and perform additional testing to try to reproduce the behavior outside of Telerivet's production environment.