Network and Database Outage

Incident Report for Telerivet

Postmortem

This incident report describes 3 separate service interruptions as well as the followup actions Telerivet has taken to further improve the reliability of our systems.

Server Hardware Failure – Saturday 31 January 2015, 05:10 to 05:16 UTC

The first partial service interruption lasted approximately six minutes, from 05:10 to 05:16 UTC, and caused errors whenever users attempted to send a message.

This first interruption was caused by an unexpected hardware failure on one of the servers in our message queue cluster, preventing this server from responding to any network requests starting at 05:10 UTC.

Telerivet’s automated failover systems actively monitor the availability of the message queue service, as well as several other internal services. These services are designed with redundancy so that Telerivet can quickly and automatically recover from a server failure by failing over to a standby server. The interruption in connectivity was detected at 05:12 UTC and the system attempted to fail over to another message queue server. The automated failover process for the message queue involved making an API request to Amazon Route 53, which provides DNS records for telerivet.com. Unfortunately, the Route 53 API returned a “service unavailable” error, preventing the failover process from completing at 05:12 UTC.

After a short wait, the automated failover system tried again at 05:15 UTC. This time, the Amazon Route 53 API worked properly. After the automated failover process was complete, Telerivet returned to normal operation.
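The retry behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual failover code: `update_dns`, `DnsApiUnavailable`, and the attempt and wait parameters are hypothetical names chosen for the example, and no real Route 53 client call is shown.

```python
import time
from typing import Callable


class DnsApiUnavailable(Exception):
    """Hypothetical error raised when the DNS provider's API cannot be reached."""


def fail_over_with_retry(
    update_dns: Callable[[], None],
    max_attempts: int = 5,
    initial_wait_seconds: float = 1.0,
    backoff_factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call update_dns until it succeeds; return the number of attempts used.

    update_dns is a hypothetical callable that points the service's DNS
    record at the standby server and raises DnsApiUnavailable on failure.
    """
    wait = initial_wait_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            update_dns()
            return attempt
        except DnsApiUnavailable:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error so a human is alerted.
            sleep(wait)  # Wait before trying again.
            wait *= backoff_factor  # Lengthen the wait after each failure.
```

Starting with a short wait keeps a brief API outage from extending the failover by minutes, while the growing wait avoids hammering an API that is still down.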

Network Hardware Failure – Saturday 31 January 2015, 07:03 to 08:38 UTC

The second service interruption lasted approximately 95 minutes from 07:03 to 08:38 UTC, causing nearly every user of the web app or API to receive the error message “Couldn't connect to the database”.

This second interruption was triggered by network issues in Telerivet’s primary datacenter, caused by a malfunctioning network switch, which failed in an unusual way that prevented the datacenter from automatically failing over to a backup network switch. (For more details, read the datacenter’s full incident report.) The logs from our monitoring tools showed that the network switch was fixed around 08:27 UTC.

After the network hardware was fixed, Telerivet’s API servers and web servers were still unable to connect to the database for an additional 11 minutes. This occurred due to a quirk in the MariaDB database, which (as we later learned) by default will block a host from making future connections after 100 consecutive aborted connections from that host. As a result of the network interruptions starting at 07:03 UTC, Telerivet’s active web and API servers quickly reached the limit of 100 aborted connections, so they became blocked from connecting to our primary MariaDB server.

Although other web and API servers were available on standby (which had not been blocked by MariaDB), and other MariaDB hosts were also available on standby (which had not blocked any web or API servers), Telerivet’s automated failover systems did not make the standby hosts active, because all of the active servers appeared to be working correctly when considered in isolation.

Consequently, Telerivet’s system administrators had to diagnose and resolve the problem manually. One of Telerivet’s sysadmins restarted the MariaDB service at 08:38 UTC, which reset the list of blocked hosts and restored Telerivet to normal operation.
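To illustrate why the block outlasted the network failure, here is a toy model of the per-host blocking behavior described above. It is a simplified sketch, not MariaDB's implementation: the class and method names are hypothetical, and the threshold of 100 is taken from this report.

```python
from typing import Dict


class HostBlocker:
    """Toy model of per-host connection-error blocking (not MariaDB's code)."""

    def __init__(self, max_connect_errors: int = 100) -> None:
        self.max_connect_errors = max_connect_errors
        # Maps host name -> number of consecutive aborted connections.
        self._errors: Dict[str, int] = {}

    def is_blocked(self, host: str) -> bool:
        return self._errors.get(host, 0) >= self.max_connect_errors

    def record_connection(self, host: str, aborted: bool) -> bool:
        """Record one attempt; return True if the connection was accepted."""
        if self.is_blocked(host):
            # Rejected outright, even if the network is healthy again.
            return False
        if aborted:
            self._errors[host] = self._errors.get(host, 0) + 1
            return False
        self._errors[host] = 0  # A successful connection resets the count.
        return True

    def flush(self) -> None:
        """Forget all counts, as restarting the service did in this incident."""
        self._errors.clear()
```

In this model, a host that reaches the limit during a network failure stays rejected after the network recovers, until the counts are cleared. MariaDB also documents a `FLUSH HOSTS` statement that clears the host cache without restarting the service.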

Intermittent Network Interruptions – various times from January 28 - February 7 and February 22

Starting on January 28, Telerivet also experienced a handful of intermittent network interruptions typically lasting less than 30 seconds and occurring once or twice per day, which also resulted in the error “Couldn't connect to the database”. These network interruptions are likely unrelated to the server and network hardware failures on 31 January. Although these intermittent network interruptions actually started a few days earlier, we first detected that they were a recurring problem on 31 January while investigating the other service interruptions.

Our investigation found that packet loss appeared to occur only between certain pairs of servers. Because the network interruptions were unpredictable and infrequent, diagnosing the problem involved trying a particular change to our server infrastructure and then waiting a couple of days to see whether it fixed the problem.

These network interruptions were mostly resolved by February 5, although there were a small number of additional network interruptions on February 7 and 22 as we continued experimenting with changes to our server infrastructure during weekends when Telerivet usage is somewhat lower.

The root cause of this packet loss has not yet been identified. However, migrating a small subset of our servers to new hardware appears to have caused these network interruptions to stop. It is possible that the network interruptions could have been caused by a bug in the hardware or virtualization software used by the affected servers.

Followup Actions Taken

We know that our customers rely on Telerivet to be available all the time. For nearly 3 years, Telerivet has earned an excellent record and reputation for reliability, in large part because of our significant work to build systems and processes for redundancy, monitoring, alerting, and automatic failover.

Occasional hardware problems like these are expected and would not typically result in significant downtime, but in this case they unfortunately coincided with additional unrelated problems, such as the outage of the Amazon Route 53 API and the MariaDB behavior that blocked hosts it treated as misbehaving. It was also highly unusual that 3 unrelated hardware problems occurred at nearly the same time. Telerivet's hardware has generally been highly reliable, and we would normally expect to see 3 problems over the course of one year, instead of 3 in one day.

However, these service interruptions highlighted several issues that we have worked to address over the past few weeks:

We updated several configuration values for MariaDB to fix poor default values, such as the max_connect_errors setting that caused MariaDB to block our own servers after 100 aborted connections.
Our automated failover system now checks whether each API and web server can connect to the database and message queue. This allows Telerivet to recover automatically from most failures caused by connectivity issues between certain pairs of hosts, even when each host is working correctly in isolation.
Our automated failover system now retries failed requests to the Amazon Route 53 API after a shorter wait time, and in some cases will be able to proceed even if the Amazon Route 53 API is unavailable.
We added additional metrics and alerts to our internal server monitoring tools, including the average response time and error rate from our web and API servers.
We updated the networking settings on our servers to use a second IP address for internal communication with our other servers. The new configuration allows our servers to communicate with each other on the same Ethernet network without passing packets through an intermediate router, reducing the chance of network interruptions affecting communication between our servers.
We have improved our status page to make it easier for Telerivet to communicate with customers about service interruptions. The new status page makes it easy for customers to see the current and historical status of the Telerivet service, and subscribe to be notified of any issues. The status page currently contains 6 real-time public metrics to make Telerivet’s performance more transparent -- uptime, average response time, and error rate, for both the Telerivet API and web app.
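
The pairwise connectivity check mentioned in the list above could look something like the following sketch. The function name, the host names in the usage note, and the `can_connect` probe are all hypothetical; the report does not describe how the real failover system chooses hosts.

```python
from typing import Callable, Iterable, Optional, Tuple


def choose_working_pair(
    app_servers: Iterable[str],
    backends: Iterable[str],
    can_connect: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Return the first (app_server, backend) pair that can actually connect.

    Candidates are tried in the order given, so list active hosts first
    and standby hosts after them. can_connect is a hypothetical probe that
    reports whether the app server can reach the backend.
    """
    backend_list = list(backends)  # Allow the backends to be scanned repeatedly.
    for app_server in app_servers:
        for backend in backend_list:
            if can_connect(app_server, backend):
                return app_server, backend
    return None  # No pair works; a human needs to be alerted.
```

For example, if the only failing link is between a hypothetical `web-active` and `db-primary`, calling the function with `["web-active", "web-standby"]` and `["db-primary", "db-standby"]` returns `("web-active", "db-standby")`. Checking each host in isolation would have reported all four hosts as healthy.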

To customers who were impacted, we sincerely apologize for the disruption these service interruptions caused to your business or organization.

For customers who have reported being impacted by these outages, we have added a credit to your account equal to approximately 10% of your monthly service plan price.

We hope you enjoy our new status page – if you want to receive notifications of any future outages, go to http://status.telerivet.com and click the "Subscribe to Updates" button at the top of the page.

Posted Feb 23, 2015 - 20:34 UTC

Resolved

The network issue was resolved and database connectivity has been restored.

Posted Jan 31, 2015 - 08:38 UTC

Update

A network issue in Telerivet's primary datacenter caused the database servers to become unreachable.

Posted Jan 31, 2015 - 07:03 UTC

Update

Telerivet automatically failed over to a standby message queue server, restoring normal operation.

Posted Jan 31, 2015 - 05:16 UTC

Identified

Telerivet's primary message queue server stopped responding due to a hardware issue.

Posted Jan 31, 2015 - 05:10 UTC