AWS - US West (Oregon): SI-20190516
Incident Report for Snowflake
Postmortem

Summary

On May 16, 2019, at 20:15 PDT, a few customers could not use Snowflake services; requests returned the error “Unable to get the result due to an internal error. Please contact Snowflake support.” This issue interrupted query execution requests.

Resolution

Our first task during the incident was to mitigate the problem and minimize the exposure to our customers.
- First, we moved the customers from the affected instances to other instances.
- Second, we debugged the problem and identified the root cause of the connection-closing problem.
- We then focused on resolving this problem and deployed an emergency fix in the Front End layer, which accepts client close connection requests.
- The rollout of the fix took a little longer than anticipated because we were being very careful to minimize the exposure to our customers.

Root Cause

The root cause of this issue was a software bug in the Cloud Services layer, which serves as the front end for handling all service requests. The bug surfaced while we were performing a routine update on the downstream service that holds the Application Metadata. The update consisted of two changes:
1. Increase the disk size
2. Increase the memory
This change was being made over many days, and the operation was successful in other regions without any incident.
However, in the AWS - US West (Oregon) region, the operation succeeded on a couple of instances before it failed on an instance that required a restart. After this restart, the front end had difficulty processing client close connection requests.

Improvement/Changes

First, we apologize for the inconvenience caused by this incident.

We have performed a thorough RCA (Root Cause Analysis) of the problem and identified areas for improvement to help us:
1. Prevent the problem from recurring:
i. Better controls in managing the front end server to redirect traffic from faulty instances
ii. Improved error handling for sessions in the front end server
iii. Improved audit trail and logging

2. Reduce the time needed to identify the root cause and roll out emergency patches:
i. Additional logging and alerts to detect issues early
ii. Additional tools to help with isolating the problem instances
iii. Updated internal processes to manage Service Incidents and provide streamlined communication via the Status Page

If you have any questions or issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.

Note: The information contained in this report is confidential and is intended solely to promote safety and reduce customer risk.

Posted about 1 month ago. May 22, 2019 - 11:24 PDT

Resolved
This issue has been resolved. See Postmortem for RCA report.
Posted about 1 month ago. May 17, 2019 - 02:58 PDT
Update
This issue has been resolved. See Postmortem for RCA report.
Posted about 1 month ago. May 17, 2019 - 02:28 PDT
Monitoring
This issue has been resolved. See Postmortem for RCA report.
Posted about 1 month ago. May 17, 2019 - 00:34 PDT
Update
This issue has been resolved. See Postmortem for RCA report.
Posted about 1 month ago. May 16, 2019 - 22:27 PDT
Identified
This issue has been resolved. See Postmortem for RCA report.
Posted about 1 month ago. May 16, 2019 - 21:15 PDT
Investigating
This issue has been resolved. See Postmortem for RCA report.
Posted about 1 month ago. May 16, 2019 - 20:54 PDT
This incident affected: AWS - US West (Oregon) (Snowflake Data Warehouse (Database), Snowpipe (Data Ingestion)).