AWS - US East (N. Virginia): SI-20200218
Incident Report for Snowflake
Postmortem

Summary

Between 17:15 and 17:40 PST on Feb 18, 2020, some Snowflake customers in AWS - US East (N. Virginia) could not connect to Snowflake services and/or experienced intermittent query failures.

Root Cause

The root cause of this issue was the failure of a backup job, which had a cascading effect on the metadata data store and impacted Snowflake operations.

Resolution

  • First, Snowflake engineers worked to mitigate the problem by aborting the failed backup job and taking the steps outlined below to protect the metadata data store from regressing further.
  • We started to throttle the incoming query load. Affected customers experienced the throttling as one of the following:
    Rejected query connections
    Degraded performance
    Queuing of queries
  • Next, we ran through the steps as defined in production playbooks built to restore and normalize performance.
  • Finally, after completing all the steps outlined in the playbook, we reset the throttling of incoming query load to normal and closely monitored the systems for any residual performance issues after the recovery process was complete.
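
The throttling behavior described above can be pictured with a minimal admission-control sketch. This is an illustration only, not Snowflake's implementation: the `QueryThrottle` class, its limits, and the query IDs are all hypothetical. It shows how lowering a concurrency limit leads to queued or rejected queries, and how resetting the limit returns the service to normal.

```python
from collections import deque
from enum import Enum


class Outcome(Enum):
    ADMITTED = "admitted"
    QUEUED = "queued"
    REJECTED = "rejected"


class QueryThrottle:
    """Hypothetical admission controller (illustrative, not Snowflake code)."""

    def __init__(self, max_running, max_queued):
        self.max_running = max_running
        self.max_queued = max_queued
        self.running = set()
        self.queue = deque()

    def submit(self, query_id):
        # Run immediately if there is spare capacity.
        if len(self.running) < self.max_running:
            self.running.add(query_id)
            return Outcome.ADMITTED
        # Otherwise wait in the queue if there is room.
        if len(self.queue) < self.max_queued:
            self.queue.append(query_id)
            return Outcome.QUEUED
        # Both limits reached: the query is turned away.
        return Outcome.REJECTED

    def _promote(self):
        # Move queued queries into the running set while capacity allows.
        while self.queue and len(self.running) < self.max_running:
            self.running.add(self.queue.popleft())

    def finish(self, query_id):
        self.running.discard(query_id)
        self._promote()

    def set_limits(self, max_running, max_queued):
        # Called to tighten limits during an incident or reset them afterwards.
        self.max_running = max_running
        self.max_queued = max_queued
        self._promote()
```

With limits of one running and one queued query, a third concurrent submission is rejected; raising the limits again admits new queries immediately. Degraded performance is not modeled here, since it depends on how much capacity each admitted query receives.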

Improvement/Changes

First, we apologize for the inconvenience caused by this incident.

  • Elevate the existing alert that monitors the size of the metadata data store to Level 2 (Preventive maintenance window) [ Done ]
  • Enhance backup code to handle exceptions more gracefully [ Done ]
  • Elevate backup failure alerts to Level 1 for immediate attention and extend notification of this alert to more on-call engineering teams, in addition to current standard monitoring practices [ Q1 2020 ]
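
As a rough illustration of the last two items, the sketch below wraps a backup step so that a failure is caught, cleaned up, and raised as a Level 1 alert instead of propagating. Every name in it (`run_backup`, `backup_step`, `cleanup`, `send_alert`) is an assumption made for the example; the report does not describe the actual backup code or alerting interface.

```python
import logging

logger = logging.getLogger("backup")


def run_backup(backup_step, cleanup, send_alert):
    """Run a backup step; on failure, clean up and alert.

    All three arguments are hypothetical callables supplied by the caller.
    Returns True on success and False on failure.
    """
    try:
        backup_step()
    except Exception as exc:
        logger.exception("backup failed")
        try:
            # Release whatever the failed job was holding.
            cleanup()
        except Exception:
            logger.exception("cleanup after failed backup also failed")
        # Level 1 is taken from the report: immediate attention.
        send_alert(level=1, message=f"Backup failed: {exc}")
        return False
    return True
```

The point of the design is that a failed backup ends in a known state and an explicit page, rather than leaving partial work behind unnoticed.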

Note: The information contained in this report is confidential and is intended solely to promote safety and reduce customer risk.

Posted Mar 03, 2020 - 13:30 PST

Resolved
The issue is now resolved.

We will post a detailed RCA in the next seven business days.

We apologize for the inconvenience. If you have any questions or see any related issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.
Posted Feb 18, 2020 - 18:13 PST
Monitoring
We applied a fix to resolve the issue. We are now monitoring the system for any further recurrence of the problem, and we will be in this state for 30 minutes.
Please reach out to Support at support@snowflake.com if you continue to experience any problems related to this incident.
Posted Feb 18, 2020 - 17:41 PST
Identified
We have identified the problem with the Snowflake Service that is affecting the overall performance of query execution. We will provide an update in 90 minutes.
Posted Feb 18, 2020 - 17:27 PST
This incident affected: AWS - US East (N. Virginia) (Snowflake Data Warehouse (Database)).