Between 21:49 PDT on July 10, 2019, and 16:08 PDT on July 11, 2019, a subset of Snowflake customers in Azure - East US2 (Virginia) experienced performance degradation while executing queries and workloads.
- First, we responded to internal alerts for performance degradation and prevented the services from entering a critical state by adjusting various services to gracefully handle the incoming load. We also disabled some housekeeping tasks to divert resources to customer workloads.
- Second, we throttled the throughput-limited metastore processes and spread their workload over other healthy processes.
- Third, we added metastore processes to spread the load across more processes and mitigate the disk throughput limitations we faced.
- Fourth, we monitored the changes for the next 24 hours and gradually restored production workloads to pre-incident performance levels.
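The second and third steps above rely on a simple relationship: for a fixed incoming request rate, spreading the load across more identical processes reduces the throughput each process (and its disk) must sustain roughly proportionally. The sketch below is a hypothetical illustration of that arithmetic only; the function name and figures are invented and do not reflect Snowflake internals.

```python
# Hypothetical sketch: average per-process load when a fixed request rate
# is spread evenly across N metastore processes. Names and numbers are
# illustrative, not actual Snowflake values.
def per_process_load(total_requests_per_sec: float, num_processes: int) -> float:
    """Average request rate each process must serve."""
    if num_processes <= 0:
        raise ValueError("need at least one process")
    return total_requests_per_sec / num_processes

# Doubling the process count halves the load each process must absorb,
# which also halves the throughput demand on each process's disk.
load_before = per_process_load(10_000, 4)  # 2500.0 req/s per process
load_after = per_process_load(10_000, 8)   # 1250.0 req/s per process
```

In practice the distribution is rarely perfectly even, so the real benefit depends on how well the workload balances, but the direction of the effect is the same.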
The root cause was metastore performance degradation caused by hitting the throughput limits of the underlying disks.
First, we apologize for the inconvenience caused by this incident. We have identified the following steps to make our infrastructure more resilient:
- Added more hosts and adjusted our capacity calculations to take into account the latency and throughput limits of Azure managed disks.
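A capacity calculation of this kind can be sketched as sizing the fleet so that aggregate disk throughput, with headroom, covers peak demand. The following is a minimal illustration under assumed numbers; the per-disk limit, headroom factor, and function name are hypothetical, not actual Azure or Snowflake figures.

```python
import math

# Hypothetical capacity sketch: how many hosts are needed so that each
# host's managed disk runs at no more than `headroom` of its throughput
# cap at peak load. All values below are illustrative assumptions.
def hosts_needed(peak_mb_per_sec: float,
                 disk_limit_mb_per_sec: float,
                 headroom: float = 0.7) -> int:
    """Smallest host count keeping per-disk utilization <= headroom."""
    usable_per_host = disk_limit_mb_per_sec * headroom
    return math.ceil(peak_mb_per_sec / usable_per_host)

# e.g. a 2000 MB/s peak against 200 MB/s disks at 70% target utilization
# requires ceil(2000 / 140) = 15 hosts.
fleet_size = hosts_needed(2000, 200)
```

Building the headroom factor into the calculation is what keeps a latency or throughput cap on the disk from becoming the bottleneck before the fleet is scaled out.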
Note: This is a preliminary RCA, as the investigation into the issue is still in progress. The final RCA will be published within seven business days of the incident.
Note: The information contained in this report is confidential and is intended solely to promote safety and reduce customer risk.