AWS - Asia Pacific (Sydney): SI-20200122
Incident Report for Snowflake
Postmortem

Summary

On Jan 22, 2020, between 13:50 and 23:50 PST, Snowflake customers in AWS - Asia Pacific (Sydney) could not provision warehouses and experienced a high rate of query failures.

Root Cause

The root cause of this issue was the failure of the AWS data store that holds Virtual Private Cloud (VPC) network configuration, which impacted operations in the AWS Sydney region. As a result, Snowflake services could not provision warehouses to execute queries.

Resolution

First, our engineers responded to alerts on a burst of warehouse provisioning failures in the AWS - Asia Pacific (Sydney) region. The log messages clearly indicated the source of the problem: “Cannot connect using AWS API to provision warehouses.” Around this time, AWS notified us of the ongoing problem in AWS Sydney, and the incident was also posted on https://status.aws.amazon.com/
We also opened a case with AWS to track the progress of the resolution.
Ultimately, our cloud provider AWS took a series of steps to fix the data store problem that caused the outage, restoring the VPC configuration data.
Finally, the data store was fully restored by 20:49 PST on Jan 22, 2020.

Improvement/Changes

First, we apologize for the inconvenience caused by this incident.

AWS has outlined several improvements in its processes and code which are scheduled for completion by Feb 28, 2020.

The Snowflake engineering team is monitoring the progress of these improvements and following up on them in detail during our meetings with AWS engineering.

Note: The information contained in this report is confidential and is intended solely to promote safety and reduce customer risk.

Posted Feb 14, 2020 - 15:28 PST

Resolved
The issue with the AWS platform is now fixed. All Snowflake services are healthy.

We will post a detailed RCA in the next seven business days.

We apologize for the inconvenience. If you have any questions or see any related issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.
Posted Jan 23, 2020 - 00:24 PST
Monitoring
All fixes by AWS are now complete. Snowflake engineering will monitor our services for the next 30 minutes to make sure the problem does not recur.

We apologize for the inconvenience. If you have any questions or see any related issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.
Posted Jan 22, 2020 - 23:49 PST
Update
We are continuing to work with our cloud provider AWS towards resolving the problem. As of now, the first part of the fix, restoration of the affected data store by AWS, is complete. AWS is working on the second part, which will expose the data store for read and write operations. We do not have an ETA for this part of the resolution yet.

We will provide an update as soon as we know more.

We will provide an update in 90 minutes.
Posted Jan 22, 2020 - 23:26 PST
Update
We have identified the problem with the AWS Platform that is resulting in the partial outage of Snowflake Services. Our cloud provider AWS has informed us that the fix for the issue is taking longer than planned. Providing an ETA for resolution is a challenge, but they have given us an estimate of 3 hours.

We will provide an update as soon as the issue is resolved.

We will provide an update in 90 minutes.
Posted Jan 22, 2020 - 21:34 PST
Identified
We have identified the problem with the AWS Platform that is resulting in the degraded performance of Snowflake Services. We will provide an update as soon as the issue is resolved.

We will provide an update in 90 minutes.
Posted Jan 22, 2020 - 18:13 PST
Investigating
We are investigating an issue with one of the Snowflake services. We will provide more information as soon as we have identified the problem.

We will provide an update in 90 minutes.
Posted Jan 22, 2020 - 16:53 PST
This incident affected: AWS - Asia Pacific (Sydney) (Snowflake Data Warehouse (Database), Snowpipe (Data Ingestion)).