AWS - US West (Oregon): SI-20190513
Incident Report for Snowflake
Postmortem

Summary

On May 13, 2019, at 08:28 PDT, a small number of customers were unable to use Snowflake services and reported the following symptoms:
Cannot read Schema
Cannot execute queries
A high number of random errors while executing queries

Resolution

First, we isolated the problem machines and removed them from the cluster.
Second, we increased the size of the cluster from 2 to 4 to handle the backlog.
These two actions cleared the backlog, and we did not see a recurrence of the issue.

Root Cause

The root cause was exceptions in our code while reading data from or writing data to the metadata database. The exceptions were caused by two machines that were in a non-responsive state. As a result, subsequent job requests were not scheduled to warehouses, and a backlog queue built up.

Improvement/Changes

First, we apologize for the inconvenience caused by this incident.

We are working on improving the resiliency of our code so that it can handle corner cases in which machines hang without giving any indication that they are down.
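
For illustration only, one common way to detect a machine that hangs silently is a heartbeat timeout: each machine reports in periodically, and any machine that has been quiet for longer than a threshold is excluded from scheduling. The sketch below is a generic example of that technique and is not Snowflake's actual implementation. The class name `HeartbeatMonitor`, its methods, the machine names, and the 30-second timeout are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical tracker: records the last heartbeat per machine."""

    timeout_seconds: float
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, machine_id, now=None):
        # Store the time of the most recent heartbeat for this machine.
        self.last_seen[machine_id] = time.monotonic() if now is None else now

    def unresponsive(self, now=None):
        # A machine is treated as hung once it has been silent for longer
        # than the timeout, even if it never reported itself as down.
        now = time.monotonic() if now is None else now
        return sorted(
            machine
            for machine, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

    def schedulable(self, now=None):
        # Only machines with a recent heartbeat should receive new jobs.
        hung = set(self.unresponsive(now))
        return sorted(m for m in self.last_seen if m not in hung)


# Usage with explicit timestamps (assumed values) so the result is deterministic.
monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.record_heartbeat("machine-a", now=0.0)
monitor.record_heartbeat("machine-b", now=0.0)
monitor.record_heartbeat("machine-a", now=40.0)  # machine-b stays silent
hung_machines = monitor.unresponsive(now=45.0)
healthy_machines = monitor.schedulable(now=45.0)
```

In this sketch, a scheduler would consult `schedulable()` before dispatching a job, so a silent machine stops receiving work instead of causing a backlog to build up behind it.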

If you have any questions or issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.

Posted May 28, 2019 - 16:56 PDT

Resolved
This issue has been resolved. See Postmortem for RCA report.
Posted May 13, 2019 - 09:15 PDT
Update
This issue has been resolved. See Postmortem for RCA report.
Posted May 13, 2019 - 09:05 PDT
Investigating
This issue has been resolved. See Postmortem for RCA report.
Posted May 13, 2019 - 08:54 PDT
This incident affected: AWS - US West (Oregon) (Snowflake Data Warehouse (Database), Snowpipe (Data Ingestion)).