Google suffered a global outage of its services, with users reporting that they could not use Gmail, YouTube, and most of its cloud services. For a company held to high industry standards, it was a week Google would rather have avoided.
On December 14, Google Cloud services suffered an outage for about 47 minutes. The services were not down, but users across the globe could not authenticate and access them.
A second outage, this time affecting Gmail, followed on Monday and Tuesday and lasted a combined 6 hours and 41 minutes. Gmail bounced emails sent to some addresses in locations across the globe, including the United States, Australia, New Zealand, and Europe. The bounce-back error message read: ‘the email account that you tried to reach does not exist.’ On the internet outages list, network admins reported unexpected bounce-backs, averaging 10% of their test emails. Other network admins said that they noticed bounces when sending from G Suite to consumer Gmail.
According to reports from Google, the 47-minute outage happened during the migration of the user ID service. The user ID service is responsible for maintaining unique identifiers for each account and handling OAuth authentication credentials. The migration was also introducing a new quota management system. The problem was that parts of the old system remained in place, which caused the usage of the user ID service to be reported incorrectly.
When the grace period on the enforcement of quota restrictions expired, the reported usage of the user ID service stood at zero. The system uses a distributed database to store account data and is designed to reject authentication requests when it detects outdated data. Acting on what it took to be zero usage, the quota management system reduced the storage available to the database. With no room for new writes, nearly all read operations quickly became outdated, leading to authentication errors. Two major product groups were affected. The first was Google Cloud Platform, which meant that BigQuery, Cloud Console, Cloud Storage, and Google Kubernetes Engine returned authentication errors. The second was Google Workspace, formerly G Suite, which includes Gmail, Calendar, Meet, Docs, and Drive. Google Workspace has two billion users, and users worldwide experienced authentication errors. Some of Google's internal tools were affected as well, so Google staff felt the outage too.
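The failure chain described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Google's implementation: the class `AccountStore`, the `max_staleness_s` limit, and all the numbers are hypothetical. It shows how a store that refuses stale reads starts failing authentication shortly after its quota is cut to zero.

```python
class StaleDataError(Exception):
    """Raised when stored account data is too old to trust."""


class AccountStore:
    """Toy account store (hypothetical): writes consume quota, reads reject stale data."""

    def __init__(self, quota_bytes, max_staleness_s):
        self.quota_bytes = quota_bytes
        self.max_staleness_s = max_staleness_s
        self.used_bytes = 0
        self.last_write_time = None

    def write(self, now, size_bytes):
        # A write succeeds only if it still fits inside the quota.
        if self.used_bytes + size_bytes > self.quota_bytes:
            return False
        self.used_bytes += size_bytes
        self.last_write_time = now
        return True

    def authenticate(self, now):
        # Reject the request if the data has not been refreshed recently enough.
        if (self.last_write_time is None
                or now - self.last_write_time > self.max_staleness_s):
            raise StaleDataError("account data is outdated")
        return True


def simulate():
    """Cut the quota to zero at tick 8 and record each authentication outcome."""
    store = AccountStore(quota_bytes=100, max_staleness_s=5)
    results = []
    for now in range(20):
        if now == 8:
            # Assumption: automation believes usage is zero and removes the quota.
            store.quota_bytes = 0
        store.write(now, size_bytes=1)
        try:
            store.authenticate(now)
            results.append((now, "ok"))
        except StaleDataError:
            results.append((now, "auth error"))
    return results
```

In this model, logins keep succeeding for a few ticks after the quota is cut, then fail once the last successful write is older than the staleness limit. That delay is one reason a change like this can look harmless at first.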
Although Google had safety checks in place to detect and address unintended quota changes, they did not cover the edge case of zero reported usage. A key lesson from this incident is that edge cases must be considered, even the ones that seem impossible.
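One way to picture a check that covers the zero-usage edge case is the sketch below. It is an illustration only: the function name `safe_new_quota` and the `headroom` and `max_cut_fraction` parameters are assumptions made for this example, not part of any Google system. The idea is to treat a reported usage of zero as a likely reporting fault and to cap how far a single adjustment can shrink a quota.

```python
def safe_new_quota(current_quota, reported_usage, headroom=1.5, max_cut_fraction=0.5):
    """Return an adjusted quota, or raise if the input looks untrustworthy.

    All parameters are hypothetical:
      headroom          multiplier applied to reported usage
      max_cut_fraction  largest share of the current quota one change may remove
    """
    if reported_usage <= 0:
        # Zero usage for a live service is more likely a reporting fault
        # than a genuinely idle service, so refuse to act on it.
        raise ValueError("reported usage is zero; refusing to shrink quota")
    proposed = reported_usage * headroom
    floor = current_quota * (1 - max_cut_fraction)
    # Never drop below the floor in a single step.
    return max(proposed, floor)
```

Raising an error instead of silently keeping the old quota is a design choice here: it forces a human or a monitoring system to look at why usage was reported as zero.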
The second wave of outages affected Gmail, where users started experiencing delivery errors. These errors were traced back to recent code changes in an underlying configuration system, which caused invalid domain names to be provided to the SMTP inbound service. When the Gmail accounts service checked these addresses, it could not find a valid user and generated SMTP error 550. SMTP error 550 is usually treated as a permanent error, and many automated mailing systems respond to it by removing the user from their lists. A simple code change on Monday reverted the problem and corrected the situation.
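To see why a 550 reply is so damaging, it helps to look at how a sender might interpret SMTP reply codes. In SMTP, the first digit of the code signals the outcome: 4yz replies are transient failures and 5yz replies are permanent ones. The sketch below is a simplified illustration; `prune_list` is a hypothetical list-hygiene routine, not the behavior of any particular mailing system.

```python
def classify_smtp_reply(code):
    """Map an SMTP reply code to a delivery outcome using its first digit."""
    if not 200 <= code <= 599:
        raise ValueError(f"not an SMTP reply code: {code}")
    first_digit = code // 100
    if first_digit == 2:
        return "delivered"
    if first_digit == 3:
        return "intermediate"
    if first_digit == 4:
        return "retry_later"  # transient failure: the sender should try again
    return "permanent_failure"  # 5yz: the sender should not retry as-is


def prune_list(subscribers, replies):
    """Hypothetical list hygiene: drop addresses whose last reply was permanent.

    Addresses with no recorded reply are assumed to have been delivered (250).
    """
    return [
        addr for addr in subscribers
        if classify_smtp_reply(replies.get(addr, 250)) != "permanent_failure"
    ]
```

With this logic, a valid subscriber whose mail bounced with 550 during the incident would be dropped from the list. A more cautious sender might require several permanent failures over a period of time before removing an address.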
On Tuesday, December 15, another issue was reported: the configuration system was updated, and the bouncing emails started all over again. Google didn’t say whether it was a bug, the same change, or a re-applied configuration.
Early this week, Google published preliminary details of the cause of Monday’s global outage of its services, which included Gmail, YouTube, and Google Cloud Platform services. According to Google, the crux of the issue, referred to as ‘Google Cloud Infrastructure Components incident 20013’, was the diminished capacity of Google’s central identity-management system, which blocked access to services that require users to log in.
An outage similar to Google’s caused a significantly longer failure at Amazon in late November. A Virginia data center belonging to Amazon, another of the world’s biggest IT companies, was reportedly down for more than 90 minutes. The Amazon outage had a far-reaching impact on users across the globe and took down many other services and websites that rely on Amazon Web Services (AWS), the company’s cloud computing arm. Affected companies and services included the photo-sharing site Flickr, the podcasting service Anchor, the streaming service Roku, and the logistics business Shipt.
Google has already taken steps to address the identified problems and get things moving again. The first step was to disable the quota management system in one data center, which quickly improved the situation; it was subsequently disabled everywhere. This fix helped restore services, although some lingering problems remained.
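The "one data center first, then everywhere" pattern can be sketched as a staged rollout. This is a generic illustration, not Google's tooling: `staged_rollout`, `apply_change`, and `is_healthy` are hypothetical names, and a real system would wait and observe between stages.

```python
def staged_rollout(datacenters, apply_change, is_healthy):
    """Apply a change to one data center first; continue only if it looks healthy.

    Returns the list of data centers the change was applied to.
    """
    if not datacenters:
        return []
    canary, rest = datacenters[0], datacenters[1:]
    apply_change(canary)
    if not is_healthy(canary):
        # Stop here: the change reached only the first data center.
        return [canary]
    for dc in rest:
        apply_change(dc)
    return [canary, *rest]
```

The same shape works in both directions: for rolling out a fix, as described above, and for limiting the reach of a risky change before it goes global.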
Google has committed to the following measures to prevent similar outages in the future:
- Review the quota management automation to prevent fast implementation of global changes.
- Update existing configuration difference tests to identify unexpected changes to the SMTP service configuration before applying any changes.
- Improve monitoring and alerting to help catch incorrect configurations.
- Improve internal service logging for faster and more accurate diagnosis of errors.
- Improve the reliability of tools and procedures for generating external communication during outages, especially those that affect internal tools.
- Implement improved write failure resilience into the user ID service database.
- Implement restrictions on configuration changes that affect production resources globally.
- Improve the resilience of Google Cloud Platform services to limit the impact on the data plane when the user ID service fails.
- Improve the static analysis tools for configuration differences so that they project differences in production behavior more accurately.
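As an illustration of the configuration difference tests mentioned in the list above, here is a minimal sketch. It assumes configurations can be represented as flat dictionaries and that the author of a change declares which keys they intend to modify. The function names and the example keys are hypothetical.

```python
def diff_config(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = old.keys() | new.keys()
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }


def unexpected_changes(old, new, expected_keys):
    """Return the changed keys that were not declared as intended changes."""
    return sorted(set(diff_config(old, new)) - set(expected_keys))


# Hypothetical example: the author meant to change only the timeout,
# but the inbound domain changed as well.
old_cfg = {"smtp_inbound_domain": "mail.example.com", "timeout_s": 30}
new_cfg = {"smtp_inbound_domain": "invalid.example", "timeout_s": 60}
surprises = unexpected_changes(old_cfg, new_cfg, expected_keys=["timeout_s"])
```

A pre-deployment check could block the rollout whenever `surprises` is not empty, so that an unintended change to something like an inbound domain name is caught before it reaches production.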
Google later apologized for the inconvenience caused to its users and thanked them for their patience and support. It assured users that system reliability is a top priority and that it will keep improving its systems and the user experience.
Follow on LinkedIn: Alessandro Civati
Originally published at https://lirax.org on December 29, 2020.