Configuration Error Likely Culprit in Massive Google Outage
by Brent Whitfield
Google has revealed some preliminary findings as it continues to investigate Sunday's massive Google Cloud outage, which took down the majority of its services and several high-profile third-party dependents for around four hours.
According to Benjamin Treynor Sloss, Google's VP of Engineering, the root cause of the event appears to be some incorrectly applied server configuration updates in Eastern USA. These were supposed to be applied to a small group of servers in one region but the updates were actually sent out to a larger number of servers located in neighboring regions.
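To illustrate the kind of safeguard that can catch a misdirected update, here is a minimal Python sketch of a pre-rollout scope check. It is not Google's tooling: the `Server` type, the `check_rollout_scope` function, the region names and the server limit are all hypothetical, chosen only to show the idea of refusing a change whose targets fall outside the intended region or exceed the intended group size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    region: str  # hypothetical region label, not a real Google region name


def check_rollout_scope(targets, intended_region, max_servers):
    """Return a list of problems; an empty list means the rollout is in scope."""
    problems = []
    # Flag any target that sits outside the region the change was meant for.
    out_of_region = sorted(s.name for s in targets if s.region != intended_region)
    if out_of_region:
        problems.append(f"outside {intended_region}: {', '.join(out_of_region)}")
    # Flag a target list larger than the small group that was intended.
    if len(targets) > max_servers:
        problems.append(f"{len(targets)} targets exceeds limit of {max_servers}")
    return problems


# Illustrative data only: one server from a neighboring region has slipped in.
targets = [
    Server("a1", "us-east-a"),
    Server("a2", "us-east-a"),
    Server("b1", "us-east-b"),
]
problems = check_rollout_scope(targets, intended_region="us-east-a", max_servers=2)
for problem in problems:
    print("BLOCKED:", problem)
```

A check like this only helps if it runs before the change is pushed and if a non-empty result actually stops the rollout.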
When the servers failed, over half of Google's capacity in the East was lost. As Google tried to squeeze all of its traffic into the remaining capacity, the resultant congestion affected multiple Google Cloud and API dependent services.
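The arithmetic behind that congestion is simple, and the short Python sketch below works through it. The 60% starting utilization is an assumed figure for illustration only, since Google has not published one; the function name is likewise hypothetical. The point is that losing half of the capacity doubles the load on whatever remains.

```python
def utilization_after_loss(utilization_before, fraction_lost):
    """Utilization of the surviving capacity once the same traffic is squeezed into it."""
    if not 0 <= fraction_lost < 1:
        raise ValueError("fraction_lost must be in [0, 1)")
    return utilization_before / (1 - fraction_lost)


# Assumed figures: network at 60% utilization, half of the capacity lost.
after = utilization_after_loss(0.6, 0.5)
print(f"Demand is {after:.0%} of remaining capacity")
if after > 1:
    print("Remaining capacity is oversubscribed, so some traffic must be delayed or dropped")
```

Under these assumed numbers, any network running above 50% utilization becomes oversubscribed once half of its capacity disappears.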
Google has said that the full results of the event post-mortem will be revealed in due course.
How events unfolded
At around 3.25pm EDT (12.25pm PT) on Sunday 2nd June, millions of users of Google Cloud and multiple dependent third-party services were affected by outages and severe disruption. Although most affected users were based in the East of the USA, other areas throughout the country and as far afield as Brazil and Europe also experienced downtime and delays.
In response to the congestion, a triage system kicked in, prioritizing latency-sensitive workloads over less time-critical traffic.
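The triage idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how Google's network actually makes the decision: the `triage` function, the workload names, the costs and the capacity figure are all hypothetical. Latency-sensitive requests are admitted first, and anything that no longer fits in the remaining capacity is dropped.

```python
def triage(requests, capacity):
    """Admit latency-sensitive requests first, then others while capacity remains.

    requests: list of (name, latency_sensitive, cost) tuples.
    Returns (served, dropped) as lists of request names.
    """
    # sorted() is stable, so requests keep their original order within each class.
    ordered = sorted(requests, key=lambda r: not r[1])
    served, dropped = [], []
    remaining = capacity
    for name, _sensitive, cost in ordered:
        if cost <= remaining:
            served.append(name)
            remaining -= cost
        else:
            dropped.append(name)
    return served, dropped


# Hypothetical workloads and costs, with capacity for only 6 units of traffic.
requests = [
    ("batch-backup", False, 5),
    ("search-query", True, 2),
    ("video-upload", False, 4),
    ("mail-fetch", True, 3),
]
served, dropped = triage(requests, capacity=6)
print("served:", served)
print("dropped:", dropped)
```

In this example the two latency-sensitive requests are served and the two bulk transfers are dropped, which mirrors the trade-off described above: interactive traffic survives at the expense of large, delay-tolerant transfers.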
By the time Google engineers knew something was wrong, just seconds after the misapplied configurations, it may have already been too late to provide a timely response. According to one Google employee, the communication channels used by the engineers themselves were taken down by the outage, hampering the disaster recovery efforts.
In addition, the fixes to the servers had to be applied through the very same congested network the problem had caused.
When Google failed to provide its 7pm EDT (4pm PT) update, users began to wonder how long the outage would last. Shortly afterwards, more than four-and-a-half hours after the initial problems, Google users were relieved to see the following message on their dashboards:
“The problem with (Google's services) should be resolved. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.”
Google then sent out a further message:
“We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.
“We will provide a detailed report of this incident once we have completed our internal investigation.”
The hit list: which services were affected?
Practically all Google Cloud services were either taken down completely or disrupted. These included the following:
- Gmail (1% of users thought to be affected)
- Google Analytics
- Google App Engine
- Google Cloud Storage (30% traffic reduction was reported)
- Google Docs
- Google Drive
- Google Photos
- Google Search
- G Suite Status Dashboard
- YouTube (10% traffic reduction); the hashtag #YouTubeDown trended on Twitter
In addition, several third parties which are dependent on Google Cloud for functionality were affected. These included:
- Apple cloud-based services (including iCloud Mail, iCloud Drive and iMessage)
- Rocket League
Nest outage stirs fears around Internet of Things
Of the many dependent services affected on Sunday, the Nest failures seem to have generated the most controversy. Nest smart homes make use of connected devices, the so-called Internet of Things, to help homeowners manage their homes remotely.
Nest devices include baby monitor cameras, remote door locks and automatic thermostats. Although the locks and thermostats have manual overrides, the outage has reinforced fears about the potential for smart homes to be taken offline or even hacked.
Speculation likely to continue despite Google update
Despite Google's partial explanation, speculation is likely to continue as to the cause of the outage. While some have suggested that a parallel outage at business ISP Level 3 could have compounded problems, more fanciful explanations have blamed China and even Google itself, the latter supposedly acting in response to the threat of antitrust proceedings announced by the Department of Justice on Friday.
About the Author
Brent Whitfield is the CEO of DCG Technical Solutions Inc., based in Los Angeles, CA since 1993. DCG provides IT consulting for Los Angeles area businesses that need to remain competitive and productive while being sensitive to limited IT budgets. Brent writes and blogs frequently and has been featured in Fast Company, CNBC, Network Computing, Reuters, and Yahoo Business. https://www.dcgla.com was recognized among the Top 10 Fastest Growing MSPs in North America by MSP Mentor.