Azure East US 2 (Virginia) - SI-20190711
Incident Report for Snowflake
Postmortem

Summary

Between 21:49 PDT on July 10, 2019, and 16:08 PDT on July 11, 2019, a subset of Snowflake customers in Azure - East US 2 (Virginia) experienced performance degradation while executing queries and workloads.

Resolution

  • First, we responded to internal alerts for performance degradation and kept the services from entering a critical state by adjusting several of them to handle the incoming load gracefully. We also disabled some housekeeping tasks to divert resources to customer workloads.
  • Second, we suspended throughput on a limited number of metastore processes and spread their workload over the other, healthy processes.
  • Third, we added metastore processes to spread the load more widely and mitigate the throughput limitations we faced with the disk services.
  • Fourth, we monitored the changes for the next 24 hours and gradually reopened the production workload until it returned to pre-incident performance levels.
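
The second and third steps amount to moving work off a few processes and onto healthy ones. The report does not describe how this was done, so the following is only a minimal sketch of one common approach (least-loaded reassignment), not Snowflake's implementation. The function name `rebalance`, the process names, and the idea of discrete "work items" are all hypothetical.

```python
from __future__ import annotations

import heapq


def rebalance(loads: dict[str, list[str]], suspended: set[str]) -> dict[str, list[str]]:
    """Move work items from suspended processes to the least-loaded healthy ones.

    `loads` maps a (hypothetical) process name to its list of work items.
    Returns a new mapping that contains only the healthy processes.
    """
    # Copy the healthy processes so the caller's data is left untouched.
    healthy = {name: list(items) for name, items in loads.items() if name not in suspended}
    if not healthy:
        raise RuntimeError("no healthy processes left to absorb the load")

    # Min-heap keyed on current item count; the name breaks ties deterministically.
    heap = [(len(items), name) for name, items in healthy.items()]
    heapq.heapify(heap)

    for name in sorted(suspended):
        for item in loads.get(name, []):
            count, target = heapq.heappop(heap)  # least-loaded healthy process
            healthy[target].append(item)
            heapq.heappush(heap, (count + 1, target))
    return healthy


if __name__ == "__main__":
    before = {"meta-1": ["a", "b", "c"], "meta-2": ["d"], "meta-3": ["e", "f"]}
    print(rebalance(before, suspended={"meta-1"}))
```

Adding processes, as in the third step, would correspond to adding empty entries to `loads` before rebalancing, so that new work lands on them first.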

Root Cause

The root cause was performance degradation of the metastore, which hit the throughput limits of its disks.

Improvement/Changes

First, we apologize for the inconvenience caused by this incident. We have identified the following steps to make our infrastructure more resilient:

  • We added more hosts and adjusted our capacity calculation to account for the high latency and limited throughput of Azure managed disks.
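
To illustrate what a throughput-aware capacity calculation could look like, here is a rough sketch. It is not the calculation Snowflake uses, which the report does not describe. The function `hosts_needed`, its parameters, and the numbers in the example are assumptions chosen for illustration; real per-disk limits depend on the disk type and size. The sketch also models only throughput ceilings (IOPS and MB/s), not latency.

```python
import math


def hosts_needed(peak_iops: float, peak_mbps: float,
                 disk_iops_limit: float, disk_mbps_limit: float,
                 disks_per_host: int = 1, headroom: float = 0.25) -> int:
    """Estimate the host count that keeps peak load under the disk limits.

    `headroom` is the fraction of each disk's limit held in reserve, so a
    value of 0.25 plans for at most 75% utilization at peak.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    if disks_per_host < 1:
        raise ValueError("disks_per_host must be at least 1")

    # Usable per-host budget after reserving headroom.
    usable_iops = disk_iops_limit * disks_per_host * (1 - headroom)
    usable_mbps = disk_mbps_limit * disks_per_host * (1 - headroom)

    # Whichever limit is hit first determines the host count.
    by_iops = math.ceil(peak_iops / usable_iops)
    by_mbps = math.ceil(peak_mbps / usable_mbps)
    return max(1, by_iops, by_mbps)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from this incident.
    print(hosts_needed(peak_iops=20000, peak_mbps=600,
                       disk_iops_limit=5000, disk_mbps_limit=200))
```

Taking the maximum over both limits matters because a workload can saturate IOPS long before it saturates bandwidth, or the reverse.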

Note: The information contained in this report is confidential and is intended solely to promote safety and reduce customer risk.

Posted Jul 15, 2019 - 18:22 PDT

Resolved
The issue is now resolved.

We will post a preliminary Root Cause Analysis report within the next 48 business hours and follow up with a detailed RCA in the next seven business days.

We apologize for the inconvenience. If you have any questions or see any related issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.
Posted Jul 11, 2019 - 18:16 PDT
Update
Snowflake Services have been stabilized. However, as part of our steps to maintain the systems, we are phasing them back in gradually until they operate at 100%. As a result, you may see slowness on some queries. We will provide another update in the next 90 minutes or less.
Posted Jul 11, 2019 - 16:07 PDT
Monitoring
Snowflake Services have stabilized, and at this time, we are deferring the action of "Recycling the Services." We will continue to monitor the services closely for any performance degradation.
Posted Jul 11, 2019 - 11:15 PDT
Identified
Snowflake services will be taken offline to perform planned emergency maintenance. The services will not be available for a few minutes. We will provide an update as soon as the services are restored.
Posted Jul 11, 2019 - 10:24 PDT
This incident affected: AZURE - East US 2 (Virginia) (Snowflake Data Warehouse (Database), Snowpipe (Data Ingestion)).