This incident report describes three separate service interruptions, as well as the follow-up actions Telerivet has taken to further improve the reliability of our systems.
Server Hardware Failure – Saturday 31 January 2015, 05:10 to 05:16 UTC
The first partial service interruption lasted approximately six minutes from 05:10 to 05:16 UTC, causing errors whenever users attempted to send a message.
This first interruption was caused by an unexpected hardware failure on one of the servers in our message queue cluster, preventing this server from responding to any network requests starting at 05:10 UTC.
Telerivet’s automated failover systems actively monitor the availability of the message queue service, as well as several other internal services. These services are designed with redundancy so that Telerivet can quickly and automatically recover from a server failure by failing over to a standby server. The interruption in connectivity was detected at 05:12 UTC and the system attempted to fail over to another message queue server. The automated failover process for the message queue involved making an API request to Amazon Route 53, which provides DNS records for telerivet.com. Unfortunately, the Route 53 API returned a “service unavailable” error, preventing the failover process from completing at 05:12 UTC.
After a short wait, the automated failover system tried again at 05:15 UTC. This time, the Amazon Route 53 API worked properly. After the automated failover process was complete, Telerivet returned to normal operation.
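The retry behavior described above can be sketched as follows. This is a minimal illustration, not Telerivet's actual failover code: `update_dns` is a hypothetical callable standing in for whatever request repoints the DNS record at the standby server, and the attempt count and delays are assumptions chosen for the example.

```python
import time
from typing import Callable


class FailoverError(Exception):
    """Raised when the DNS update could not be completed after all attempts."""


def fail_over_with_retry(
    update_dns: Callable[[], None],
    attempts: int = 5,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call update_dns until it succeeds; return the number of attempts used.

    update_dns is a hypothetical callable that repoints a DNS record at the
    standby server and raises an exception if the DNS provider's API fails.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            update_dns()
            return attempt
        except Exception as exc:
            if attempt == attempts:
                raise FailoverError(
                    f"DNS update failed after {attempts} attempts"
                ) from exc
            sleep(delay)
            delay *= 2  # back off exponentially before the next attempt
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A production system would likely also catch only the specific error types that are safe to retry.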
Network Hardware Failure – Saturday 31 January 2015, 07:03 to 08:38 UTC
The second service interruption lasted approximately 95 minutes from 07:03 to 08:38 UTC, causing nearly every user of the web app or API to receive the error message “Couldn't connect to the database”.
This second interruption was triggered by network issues in Telerivet’s primary datacenter, caused by a malfunctioning network switch, which failed in an unusual way that prevented the datacenter from automatically failing over to a backup network switch. (For more details, read the data center’s full incident report.) The logs from our monitoring tools showed that the network switch was fixed around 08:27 UTC.
After the network hardware was fixed, Telerivet’s API servers and web servers were still unable to connect to the database for an additional 11 minutes. This occurred due to a quirk in the MariaDB database, which (as we later learned) by default will block a host from making future connections after 100 consecutive aborted connections from that host. As a result of the network interruptions starting at 07:03 UTC, Telerivet’s active web and API servers quickly reached the limit of 100 aborted connections, so they became blocked from connecting to our primary MariaDB server.
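To illustrate why the servers stayed blocked after the network recovered, here is a toy model of per-host error counting. It is a simplified sketch of the behavior described above, not MariaDB's implementation; the class and method names are invented for the example, and only the default limit of 100 comes from this report.

```python
class HostCache:
    """Toy model of per-host connection-error blocking (not MariaDB's code)."""

    def __init__(self, max_connect_errors: int = 100) -> None:
        self.max_connect_errors = max_connect_errors
        self.errors: dict[str, int] = {}

    def is_blocked(self, host: str) -> bool:
        return self.errors.get(host, 0) >= self.max_connect_errors

    def record_attempt(self, host: str, succeeded: bool) -> bool:
        """Record one connection attempt; return True if it was accepted."""
        if self.is_blocked(host):
            # A blocked host is refused even if the network is healthy again.
            return False
        if succeeded:
            self.errors[host] = 0  # a successful connection resets the count
            return True
        self.errors[host] = self.errors.get(host, 0) + 1
        return False

    def flush(self) -> None:
        """Clear all counters, as restarting the database service did."""
        self.errors.clear()
```

In this model, a host that accumulates 100 consecutive failures is refused from then on, regardless of network health, until the counters are cleared. MariaDB also documents a `FLUSH HOSTS` statement for clearing its host cache without a restart.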
Although other web and API servers were available on standby (which had not been blocked by MariaDB), and other MariaDB hosts were also available on standby (which had not blocked any web or API servers), Telerivet’s automated failover systems did not make the standby hosts active, because all of the active servers appeared to be working correctly when considered in isolation.
Consequently, Telerivet’s system administrators needed to diagnose and resolve the problem manually. One of Telerivet’s sysadmins restarted the MariaDB service at 08:38 UTC, which reset the list of blocked hosts and restored Telerivet to normal operation.
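The gap in this section was that each server looked healthy on its own while the connection between them was broken. One way to check the path end to end is sketched below. This is an illustrative assumption, not a description of Telerivet's monitoring: `can_connect` is a hypothetical probe that reports whether a given web or API server can open a connection to a given database host, and the host names are made up.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical probe: (client_host, database_host) -> True if a connection opens.
Probe = Callable[[str, str], bool]


def choose_active_pair(
    clients: List[str],
    databases: List[str],
    can_connect: Probe,
) -> Optional[Tuple[str, str]]:
    """Return the first (client, database) pair that can actually connect.

    Both lists are in preference order, with the currently active host first,
    so the active pair is kept whenever it works end to end.
    """
    for client in clients:
        for database in databases:
            if can_connect(client, database):
                return client, database
    return None  # no working pair: escalate to a human
```

For example, if the active web server is blocked only by the active database, the check selects a standby database instead:

```python
blocked = {("web-active", "db-active")}
pair = choose_active_pair(
    ["web-active", "web-standby"],
    ["db-active", "db-standby"],
    lambda client, database: (client, database) not in blocked,
)
# pair == ("web-active", "db-standby")
```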
Intermittent Network Interruptions – various times from January 28 - February 7 and February 22
Starting on January 28, Telerivet also experienced a handful of intermittent network interruptions typically lasting less than 30 seconds and occurring once or twice per day, which also resulted in the error “Couldn't connect to the database”. These network interruptions are likely unrelated to the server and network hardware failures on 31 January. Although these intermittent network interruptions actually started a few days earlier, we first detected that they were a recurring problem on 31 January while investigating the other service interruptions.
Our investigation found that packet loss appeared to occur only between certain pairs of servers. Because the network interruptions were unpredictable and infrequent, diagnosing the problem involved making one change to our server infrastructure at a time and waiting a couple of days to see whether it fixed the problem.
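Narrowing packet loss down to specific server pairs can be done by aggregating probe results per pair. The sketch below assumes a hypothetical record format of `(source, destination, reply_received)` and an arbitrary loss threshold. It is not the tooling Telerivet used.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ProbeResult = Tuple[str, str, bool]  # (source, destination, reply_received)
Pair = Tuple[str, str]


def loss_by_pair(results: Iterable[ProbeResult]) -> Dict[Pair, float]:
    """Return the fraction of lost probes for each unordered server pair."""
    sent: Dict[Pair, int] = defaultdict(int)
    lost: Dict[Pair, int] = defaultdict(int)
    for src, dst, ok in results:
        first, second = sorted((src, dst))  # treat a->b and b->a as one pair
        pair = (first, second)
        sent[pair] += 1
        if not ok:
            lost[pair] += 1
    return {pair: lost[pair] / sent[pair] for pair in sent}


def lossy_pairs(results: Iterable[ProbeResult], threshold: float = 0.01) -> List[Pair]:
    """Return the pairs whose loss rate exceeds the threshold, sorted by name."""
    rates = loss_by_pair(results)
    return sorted(pair for pair, rate in rates.items() if rate > threshold)
```

Because the interruptions were short and rare, a check like this is only informative when probes run continuously over several days.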
These network interruptions were mostly resolved by February 5, although there were a small number of additional network interruptions on February 7 and 22 as we continued experimenting with changes to our server infrastructure during weekends when Telerivet usage is somewhat lower.
The root cause of this packet loss has not yet been identified. However, migrating a small subset of our servers to new hardware appears to have caused these network interruptions to stop. It is possible that the network interruptions could have been caused by a bug in the hardware or virtualization software used by the affected servers.
Follow-up Actions Taken
We know that our customers rely on Telerivet to be available all the time. For nearly 3 years, Telerivet has earned an excellent record and reputation for reliability, in large part because of our significant work to build systems and processes for redundancy, monitoring, alerting, and automatic failover.
Occasional hardware problems like these are expected and would not typically result in significant downtime. In this case, they unfortunately coincided with additional unrelated problems: the outage of the Amazon Route 53 API, and the MariaDB behavior that blocked our own hosts after repeated aborted connections. It was also highly unusual that 3 unrelated hardware problems occurred at nearly the same time. Telerivet's hardware has generally been highly reliable, and we would normally expect to see 3 problems over the course of one year, instead of 3 in one day.
However, these service interruptions highlighted several issues that we have worked to address over the past few weeks:
max_connect_errors setting that caused MariaDB to block our own servers after 100 aborted connections.
To customers who were affected, we sincerely apologize for the impact these service interruptions had on your business or organization.
For customers who have reported being impacted by these outages, we have added a credit to your account equal to approximately 10% of your monthly service plan price.
We hope you enjoy our new status page – if you want to receive notifications of any future outages, go to http://status.telerivet.com and click the "Subscribe to Updates" button at the top of the page.