Google Cloud Services Disrupted by UPS Failure During Six-Hour Outage

Google has disclosed that a recent six-hour outage in one of its cloud regions was caused by a failure of the uninterruptible power supplies (UPS) meant to keep its servers running. The outage began on March 29 and left more than 20 Google Cloud services degraded or entirely unavailable in the us-east5-c zone, part of the us-east5 region centered on Columbus, Ohio.
According to Google's incident report, the root cause of the outage was a loss of utility power in the affected zone. Major cloud providers, commonly referred to as hyperscalers, are designed to ride out such disruptions: uninterruptible power supplies provide immediate battery backup the moment the grid fails, bridging the gap until diesel generators can spin up and take over.
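To make that intended failover chain concrete, here is a minimal Python sketch of the priority order such a facility relies on. This is an illustration only, not Google's actual control logic; the function and state names are hypothetical.

```python
from enum import Enum, auto

class PowerSource(Enum):
    UTILITY = auto()
    UPS_BATTERY = auto()
    GENERATOR = auto()

def select_power_source(utility_ok: bool, ups_ok: bool, generator_ok: bool):
    """Return the highest-priority healthy source in the intended chain,
    or None if every layer has failed."""
    if utility_ok:
        return PowerSource.UTILITY
    if ups_ok:
        # Batteries are only meant to bridge the short gap while the
        # diesel generators spin up.
        return PowerSource.UPS_BATTERY
    if generator_ok:
        return PowerSource.GENERATOR
    return None

# The scenario Google describes: utility down AND the UPS batteries dead.
# In this incident the failed UPS also blocked the downstream path, so
# even reaching the generators required bypassing the UPS entirely.
print(select_power_source(utility_ok=False, ups_ok=False, generator_ok=True))
# -> PowerSource.GENERATOR (in the ideal chain; reality needed a manual bypass)
```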
However, in this instance, Google's UPS units suffered what the company described as a critical battery failure, leaving them unable to supply any backup power. Worse, the incident report indicates that the failed units actively blocked power from reaching Google's data racks, forcing engineers to bypass the UPS entirely before power could be restored.
Google's engineering team became aware of the outage at 12:54 PM Pacific Time. Generators came online roughly two hours later, at 2:49 PM, after which most Google Cloud services began to recover, though some required manual intervention and took longer to fully restore.
Google has issued a public apology, committed to preventing a recurrence, and outlined several remediation measures:
- Hardening the power-failure recovery paths within its cloud clusters so that services restore more predictably and quickly after power outages.
- Auditing systems that failed to switch over to backup power automatically, to identify and close the gaps behind those failures.
- Working with its UPS vendor to diagnose and resolve the problems in the battery backup infrastructure.
One can only imagine how tense that conversation with the UPS vendor will be.
Hyperscalers like Google usually deliver resilient services, but this incident is a stark reminder that even the most robust plans can falter. It underlines the value of regularly testing disaster-recovery capabilities and reviewing outage-response procedures, in public cloud infrastructure as much as anywhere else. For companies that depend on cloud services, that kind of preparedness is not optional.
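As one small example of what such testing can look like, the sketch below scans an instance inventory for services pinned to a single zone, exactly the kind of exposure a zonal power failure like this one turns into a full outage. The inventory format and service names are hypothetical; in practice this data would come from your cloud provider's asset or inventory APIs.

```python
def single_zone_services(inventory):
    """Flag services whose instances all sit in one zone: a single zonal
    outage (like us-east5-c's) would take each of them down completely."""
    zones_by_service = {}
    for inst in inventory:
        zones_by_service.setdefault(inst["service"], set()).add(inst["zone"])
    return sorted(svc for svc, zones in zones_by_service.items() if len(zones) == 1)

# Hypothetical inventory; fields and values are illustrative only.
instances = [
    {"service": "checkout", "zone": "us-east5-a"},
    {"service": "checkout", "zone": "us-east5-b"},
    {"service": "billing",  "zone": "us-east5-c"},
    {"service": "billing",  "zone": "us-east5-c"},
]

print(single_zone_services(instances))  # -> ['billing']
```

A check like this belongs in a scheduled job rather than a one-off audit, since deployments drift back toward single-zone concentration over time.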