AZURE East US2: SI-20190528
Incident Report for Snowflake
Postmortem

Summary

On May 28, 2019, at 17:25 PDT, a few customers could not use Snowflake services: DML queries such as Insert, Update, and Delete operations returned the error “Error 300002 Incident xxxxxxx. Please contact Snowflake support”. This issue interrupted scheduled ETL/ELT workloads and other operations that use the Snowflake UI.

The issue with writing to cloud storage lasted about eighteen minutes, from 16:04 to 16:22 PDT. However, we kept the incident open on the Snowflake Status Page until 22:33 PDT while we took the following actions:
1. Identifying a fix for the problem
2. Developing a recovery solution for the customers impacted by this problem

Resolution

Our priority was to mitigate the problem, and we took a series of steps as follows:
• First, we reviewed the logs and system health to understand the impact of the problem. This step showed us that the problem was not an ongoing event and lasted for eighteen minutes only.
• Second, we debugged and identified the source of the problem and then identified the list of customers and the tables impacted by this incident.
• Third, we spent the bulk of our time defining solutions to recover the impacted customers.
• Simultaneously, another team of engineers started investigating why the “Re-try” attempt to store metadata in cloud storage failed.
• Fourth, once we were confident in the fix and in the steps to recover the few impacted customers, we closed the incident and reached out to those customers.

Root Cause

The root cause of this issue was a software bug in the Cloud Services layer that surfaced during a “Re-try” operation that writes the metadata of DML operations to cloud storage. The re-try logic was triggered during a small window of eighteen minutes, between 16:04 and 16:22 PDT on May 28, 2019, resulting in errors that impacted a few of our customers.

Improvement/Changes

First, we apologize for the inconvenience caused by this incident.

We have performed a thorough RCA (Root Cause Analysis) of the problem and identified the following areas for improvement to help prevent it from recurring:
1. Enhance code to add additional data validation checks in the write code path for all cloud environments.
2. Create additional test cases to handle multiple platform-specific behaviors.
3. Improve audit trail and logging.
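
As a generic illustration of item 1 only, the sketch below shows one way a validation check might sit in a write path that retries. It is not Snowflake's code: the in-memory dictionary standing in for cloud storage, the function and exception names, and the SHA-256 comparison are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict


class TransientStorageError(Exception):
    """Hypothetical error type for a write failure that is safe to retry."""


class ValidationError(Exception):
    """Raised when the stored bytes do not match the bytes that were sent."""


def checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest of the payload."""
    return hashlib.sha256(data).hexdigest()


def write_with_validation(
    store: Dict[str, bytes],
    key: str,
    payload: bytes,
    put: Callable[[Dict[str, bytes], str, bytes], None],
    max_attempts: int = 3,
) -> int:
    """Write payload under key, retrying transient failures.

    After every successful put, including one made on a retry, the stored
    bytes are read back and compared with the original payload. Returns the
    number of attempts used.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    expected = checksum(payload)
    for attempt in range(1, max_attempts + 1):
        try:
            put(store, key, payload)
        except TransientStorageError:
            if attempt == max_attempts:
                raise  # retries exhausted, surface the failure
            continue  # retry the write
        stored = store.get(key)
        if stored is None or checksum(stored) != expected:
            # Fail loudly instead of treating a bad write as committed.
            raise ValidationError(f"stored data for {key!r} failed validation")
        return attempt
    raise RuntimeError("unreachable")


def make_flaky_put(failures: int) -> Callable[[Dict[str, bytes], str, bytes], None]:
    """Build a hypothetical put() that fails a fixed number of times first."""
    state = {"remaining": failures}

    def put(store: Dict[str, bytes], key: str, payload: bytes) -> None:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientStorageError("simulated transient failure")
        store[key] = payload

    return put


if __name__ == "__main__":
    storage: Dict[str, bytes] = {}
    used = write_with_validation(storage, "metadata/1", b"dml-metadata", make_flaky_put(1))
    print(f"write validated after {used} attempt(s)")
```

The point of the sketch is that the read-back check runs on the retry path as well as on the first attempt, so a write that succeeds only after a retry is validated the same way as any other.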

If you have any questions or issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.

Posted 4 months ago. May 30, 2019 - 18:32 PDT

Resolved
This issue has been resolved. See Postmortem for RCA report.
Posted 4 months ago. May 28, 2019 - 22:33 PDT
Update
This issue has been resolved. See Postmortem for RCA report.
Posted 4 months ago. May 28, 2019 - 21:12 PDT
Identified
This issue has been resolved. See Postmortem for RCA report.
Posted 4 months ago. May 28, 2019 - 20:20 PDT
Update
This issue has been resolved. See Postmortem for RCA report.
Posted 4 months ago. May 28, 2019 - 18:59 PDT
Investigating
This issue has been resolved. See Postmortem for RCA report.
Posted 4 months ago. May 28, 2019 - 17:20 PDT
This incident affected: AZURE - East US 2 (Virginia) (Snowflake Data Warehouse (Database), Snowpipe (Data Ingestion)).