AZURE - West Europe: SI-20191106
Incident Report for Snowflake
Postmortem

Summary
Between 20:27 PST on Nov 06, 2019, and 16:31 PST on Nov 07, 2019, our cloud provider notified Snowflake of an ongoing disruption in the AZURE West Europe region that could expose Snowflake to extended latency while provisioning warehouses. We monitored our systems for any issues and, as part of our due diligence, posted an update on the Snowflake Status page.

Note: Due to the preventive steps taken by Snowflake engineering, Snowflake was able to minimize the exposure, and our customers did not see any visible impact on their workloads.

Resolution

  • First, Snowflake stopped releasing servers from our pre-provisioned warehouse pool so that we could continue to meet customers’ demand for warehouses.
  • Second, we contacted AZURE regarding the issue and also found that AZURE had posted an outage notice against AZURE Europe Storage Services.
  • At this point, we started working closely with AZURE engineers while they took multiple steps to resolve the issue.
    Please read the RCA posted by the Microsoft AZURE team at https://status.azure.com/en-us/status/history/ (reproduced below for your convenience).

Begin Microsoft AZURE RCA
Summary of Impact: Between 02:40 and 10:55 UTC on 07 Nov 2019, a subset of customers using Storage in West Europe experienced service availability issues. In addition, resources with dependencies on the impacted storage scale units may have experienced downstream impact in the form of availability issues, connection failures, or high latency.

Root Cause: Every Azure region has multiple storage scale units that serve customer traffic. We distribute and balance load across the different scale units and add new scale units as needed. The automated load-balancing operations occur in the background to ensure all the scale units are running at healthy utilization levels and are designed to be impactless for customer facing operations. During this incident, we had just enabled three storage scale units to balance the load between them, to keep up with changing utilization on the scale units. A bug in this process resulted in backend roles crashing across the scale units participating in the load balancing operations, causing them to become unhealthy. It also impacted services dependent on storage in the region.

Mitigation: Engineers mitigated the impact to all but one scale unit by deploying a platform hotfix. Mitigation to the remaining scale unit was delayed due to compatibility issues identified when applying the fix but has since been completed.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
  • The fix has already been deployed to the impacted scale units in the region. We are currently deploying the fix globally to our fleet.
  • We have performed cross-scale-unit load-balancing operations numerous times before without any adverse effect. In the wake of this incident, we are again reviewing our procedures, tooling and service for such load balancing operations. We have paused further load balancing actions in this region until this review is completed.
  • We are rigorously reviewing all our background management processes and deployments to prevent any further impact to customers in this region.
  • We are reviewing gaps in our validation procedures so that we can catch these issues in our validation environment.

End Microsoft AZURE RCA

Note: The information contained in this report is confidential and is intended solely to promote safety and reduce customer risk.

Posted Nov 14, 2019 - 11:18 PST

Resolved
The issue is now resolved.

We will post a preliminary Root Cause Analysis report within the next 48 business hours and follow up with a detailed RCA in the next seven business days.

We apologize for the inconvenience. If you have any questions or see any related issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.
Posted Nov 07, 2019 - 04:31 PST
Update
All Snowflake services are operational at this time.

However, we are continuing to monitor the ongoing incident as the Microsoft AZURE team has not cleared the service incident with the AZURE West Europe region on their Status Page https://status.azure.com/en-us/status.

We will provide an update in 90 minutes. Please reach out to Support at support@snowflake.com if you continue to face any problems caused by this incident.
Posted Nov 06, 2019 - 23:48 PST
Monitoring
Dear Snowflake customer,

We are tracking a service availability issue in the Azure West Europe deployment. The issue is currently under investigation by the Microsoft Azure team and affects only a subset of Azure's customers. Snowflake's services are operational as of now.

A detailed description of the issue is available here: https://status.azure.com/en-us/status. We will provide an update as soon as the issue is resolved.

Regards,
Support & Services
Posted Nov 06, 2019 - 21:50 PST
This incident affected: AZURE - West Europe (Netherlands) (Snowflake Data Warehouse (Database)).