On October 2nd, 2017, DevResults experienced intermittent availability between 07:01 UTC and 12:24 UTC, with complete unavailability between 11:54 UTC and 12:24 UTC. We understand how important DevResults is to our partners, and we are deeply sorry for this event and any issues it may have caused. Below is an explanation of the cause, what we've done to correct the issue and prevent it from happening in the future, and a detailed timeline of the event.
Root cause of the event
The interruption in service was caused by severe disk I/O errors that prevented one of our DNS servers from operating properly and eventually caused it to crash. Once the DNS server could no longer answer DNS queries from other servers, the DevResults application could no longer establish new connections to the production database, even though the production database itself did not encounter any availability issues.
Because of a feature called connection pooling, requests that reused a database connection established before the DNS server failed worked as expected. Requests that arrived when no connection was available in the pool required a new connection, which could not be established because the host name of the database server could not be resolved.
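To make this failure mode concrete, here is a minimal Python sketch, not our actual application code; the host name, port, and pool are simplified placeholders. It shows why connections established before the outage kept working (reusing an open TCP connection needs no DNS lookup) while a new connection fails as soon as the host name cannot be resolved.

```python
import socket

DB_HOST = "db.internal.example"   # hypothetical database host name
DB_PORT = 1433                    # hypothetical database port


class ConnectionPool:
    """Toy connection pool: hand back idle sockets, dial new ones on demand."""

    def __init__(self):
        self._idle = []  # sockets established before the DNS outage

    def get(self):
        if self._idle:
            # Reusing an existing TCP connection needs no DNS lookup at all,
            # so these requests succeed even while the DNS server is down.
            return self._idle.pop()
        # An empty pool forces a brand-new connection; create_connection()
        # must resolve DB_HOST first, so with the DNS server down this
        # raises socket.gaierror and the request fails.
        return socket.create_connection((DB_HOST, DB_PORT), timeout=5)

    def release(self, conn):
        self._idle.append(conn)


pool = ConnectionPool()
try:
    conn = pool.get()
    # ... run query, then pool.release(conn) ...
except socket.gaierror as exc:
    print(f"could not resolve {DB_HOST}: {exc}")
```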
This intermittent behavior, in which some requests succeeded and others failed, prevented our monitoring systems from reporting the issue as severe downtime until hours after the DNS server originally failed. Only after the production application was restarted did all of our monitoring systems begin reporting issues.
DevResults was designed from the start to tolerate server failures like this, and we had a secondary DNS server running and active. However, the virtual network configuration for the production servers did not list the secondary DNS server, so it was never used.
Corrective actions and future work
Once we discovered that the virtual networks listed only a single DNS server, we immediately added our secondary DNS server so that this issue cannot recur.
Additionally, the production application's only dependency on this internal network is that the production servers are joined to our Active Directory domain, so we also added two public DNS servers to the virtual network configuration. This allows the production application to tolerate a complete failure of our internal network.
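In practice this fallback ordering is handled by the operating system's resolver using the DNS server list on the virtual network, not by application code. The Python sketch below, written with the third-party dnspython package (2.x API) and placeholder addresses rather than our real servers, only illustrates the behavior we now rely on: if one server fails to answer, the next one in the list is tried.

```python
import dns.exception
import dns.resolver  # third-party "dnspython" package

# Placeholder addresses, not our real servers.
NAMESERVERS = [
    "10.0.0.4",  # internal primary (the server that failed)
    "10.0.0.5",  # internal secondary, now listed in the virtual network
    "8.8.8.8",   # public resolver
    "8.8.4.4",   # public resolver
]


def resolve(host: str) -> str:
    """Try each configured nameserver in turn and return the first answer."""
    for server in NAMESERVERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]
        resolver.lifetime = 2.0  # seconds before moving on to the next server
        try:
            answer = resolver.resolve(host, "A")
            return answer[0].to_text()
        except dns.exception.DNSException:
            continue  # this server failed or timed out; try the next one
    raise RuntimeError(f"no configured DNS server could resolve {host}")


print(resolve("devresults.com"))
```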
Our long-term intention has been to replace our internal Active Directory domain with Microsoft's Azure Active Directory, and we will likely accelerate those plans. That migration should add even greater resiliency against issues like this.
Because we believed DevResults had no dependencies on these non-production servers, we had no monitoring configured to alert DevResults' technical staff when those servers encountered issues. Until we can migrate to Microsoft's Azure Active Directory, we will be rolling out monitoring to all of our internal systems.
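As a rough illustration of the kind of internal monitoring we are rolling out, the Python sketch below periodically checks that each internal server answers on its service port and raises an alert when one does not. The host names, ports, and alerting hook are placeholders, and a production check for a DNS server would issue a real DNS query rather than just opening a TCP connection.

```python
import socket
import time

# Placeholder internal hosts and ports, not our real configuration.
INTERNAL_CHECKS = [
    ("dns1.internal.example", 53),   # primary internal DNS server
    ("dns2.internal.example", 53),   # secondary internal DNS server
    ("dc1.internal.example", 389),   # Active Directory domain controller (LDAP)
]


def alert(message: str) -> None:
    """Placeholder: in production this would page the on-call engineer."""
    print(f"ALERT: {message}")


def check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


while True:
    for host, port in INTERNAL_CHECKS:
        if not check(host, port):
            alert(f"{host}:{port} is not answering")
    time.sleep(60)  # probe every minute
```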
We will also improve our existing monitoring so that it is not subject to the kind of intermittent behavior that occurred here. We will continue using connection pooling because it is very beneficial to our application's performance, but our monitoring checks will bypass it.
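The sketch below shows the idea behind that change, using Python and the third-party pyodbc package as a stand-in for our actual monitoring stack; the connection string values are placeholders. With pooling disabled, every health check opens a brand-new connection, so a DNS or connectivity failure is detected immediately instead of being masked by connections that were established before the failure.

```python
import pyodbc  # third-party ODBC bindings; a stand-in for our real monitoring code

pyodbc.pooling = False  # disable ODBC connection pooling; must be set before the first connection

# Placeholder connection string values.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=db.internal.example;"
    "DATABASE=devresults;UID=monitor;PWD=secret"
)


def database_health_check() -> bool:
    """Return True only if a fresh connection (DNS resolution, TCP connect,
    and authentication included) can be established right now."""
    try:
        conn = pyodbc.connect(CONN_STR, timeout=5)
    except pyodbc.Error:
        return False
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        cursor.fetchone()
        return True
    except pyodbc.Error:
        return False
    finally:
        conn.close()


if not database_health_check():
    print("ALERT: cannot establish a new connection to the production database")
```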
The timeline
07:01 UTC - One of the DNS servers on our internal network reports an I/O error on its virtual disk.
07:05 UTC - NewRelic (one of our monitoring systems) detects the first error in which the production application cannot connect to the production database because the server cannot resolve the database's host name.
08:48 UTC - The DNS server reports many more I/O errors until 08:49 UTC. We believe the machine crashed at this point, as no other events are reported until 12:25 UTC, when the machine reports that it is starting up.
09:58 UTC - One of our staff members encounters the issue and alerts our technical staff. An engineer responds at 10:07 UTC that they will look into the issue when they arrive at work. At the time, neither staff member is aware of the underlying problem; because DevResults still appears to be working properly, each assumes it is an issue with what the other is doing rather than a severe error affecting the entire system. As a result, the engineer's response matches the understood severity of the event rather than the actual severity.
10:30 UTC - NewRelic opens an incident after it detects an elevated rate of errors on DevResults.
10:38 UTC - NewRelic updates the on-going incident after it detects that application performance is below our acceptable threshold.
10:44 UTC - Our help desk site receives a support ticket from a user detailing that they have been encountering several errors.
10:46 UTC - NewRelic opens a new downtime incident after it fails to get a successful response from DevResults in the past 2 minutes.
10:48 UTC - NewRelic closes the downtime incident as it has been able to re-establish a successful connection to DevResults.
11:38 UTC - The engineer who was alerted at 09:58 UTC sends a message to our internal chat that the site is down and that they are investigating.
11:44 UTC - NewRelic opens another downtime incident after again receiving an invalid response from the application.
11:45 UTC - The engineer determines that the underlying issue is that DevResults is intermittently unable to connect to the database server and decides to restart the production application.
11:54 UTC - The production application begins restarting.
12:01 UTC - The production application has restarted and is fully ready to serve requests, though at 12:00 UTC three DNS-related warnings are recorded in the event log. Since the application has been restarted, the connection pool is now empty and all requests for connections to the production database will require new connections.
12:02 UTC - Pingdom (one of our other monitoring systems) detects DevResults is down and opens an incident. This is the first time that Pingdom has detected the issue.
12:03 UTC - The engineer still cannot access DevResults, so they attempt to connect remotely to the production application to view event logs, but they are unable to do so because the DNS server is not responding to queries.
12:12 UTC - The engineer has determined that the DNS server is the underlying issue and decides to restart the virtual machine.
12:13 UTC - The engineer finds that the virtual machine is in a faulted state and cannot be started normally. To resolve this, the engineer resizes the virtual machine, which forces it onto a new virtualization host and triggers an automatic restart. Because of this, it takes some time for the virtual machine to start up, since it has to move to new host hardware.
12:25 UTC - The DNS server begins starting up.
12:27 UTC - The DNS server has finished starting up and all DevResults sites are responding to requests properly.
12:28 UTC - NewRelic and Pingdom close their downtime incidents.
12:30 UTC - The DNS server's event log records that NTFS has finished repairing itself.
15:17 UTC - Post-mortem analysis begins to make sure the entire event is properly understood.
15:37 UTC - Engineers determine that while the system should have been capable of tolerating the outage of the DNS server, the configuration of the virtual network only listed a single DNS server, which was the server that failed.
15:40 UTC - The virtual network configuration is updated to include the fallback DNS server, as well as two additional public DNS servers.