AWS - US East (N. Virginia) : SI-20200115
Incident Report for Snowflake
Postmortem

Summary

Between 09:30 PST on Jan 14, 2020 and 17:30 PST on Jan 15, 2020, some Snowflake customers in AWS - US East (N. Virginia) experienced intermittent query failures when running queries on large and extra-large warehouses.

Resolution

  • First, as soon as we identified the problem zone, Snowflake engineers mitigated the problem by relocating customers to a healthier Availability Zone in AWS - US East (N. Virginia).
  • In parallel, our engineers started to debug the problem and built a program to reproduce it on demand. This program helped us narrow the root cause down to a networking problem within AWS infrastructure.
  • Second, we engaged AWS engineers and worked with them to troubleshoot the problem in the AWS environment using the program developed by Snowflake engineering.
  • Next, Snowflake engineers monitored the systems for any network-related errors and worked with customers directly to find workarounds, such as running queries on smaller warehouses.
  • Finally, our cloud provider AWS took a series of steps to remediate the problem:

    • Isolated the problematic network router and removed it from service.
    • Continued to monitor the network infrastructure for any further loss of traffic.

Root Cause

The root cause of this issue was an incorrect routing configuration in the AWS - US East (N. Virginia) data center that resulted in communication failures between machines in a large warehouse. The loss of network connectivity caused queries to fail non-deterministically.
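
One plausible reading of why only the larger warehouses were affected is that a query spanning more machines has more machine-to-machine connections that can fail. The Python sketch below illustrates that arithmetic only. The node counts per warehouse size, the per-connection failure probability, and the assumption that every machine talks to every other machine independently are all hypothetical and are not taken from this report.

```python
from math import comb


def pair_count(nodes: int) -> int:
    """Number of distinct machine-to-machine pairs among `nodes` machines."""
    return comb(nodes, 2)


def query_failure_probability(nodes: int, p_link_fail: float) -> float:
    """Chance that at least one pairwise connection fails during a query.

    Assumes (hypothetically) that every pair of machines communicates and
    that each connection fails independently with probability p_link_fail.
    """
    return 1.0 - (1.0 - p_link_fail) ** pair_count(nodes)


# Hypothetical node counts per warehouse size, for illustration only.
HYPOTHETICAL_SIZES = {"small": 2, "medium": 4, "large": 8, "x-large": 16}

# Hypothetical per-connection failure probability, for illustration only.
P_LINK_FAIL = 0.001

if __name__ == "__main__":
    for name, nodes in HYPOTHETICAL_SIZES.items():
        p = query_failure_probability(nodes, P_LINK_FAIL)
        print(f"{name:8s} nodes={nodes:2d} pairs={pair_count(nodes):3d} "
              f"P(query hits a failed connection)={p:.4f}")
```

Under these assumptions the number of pairs grows quadratically with machine count, so the failure probability rises with warehouse size. This is consistent with the workaround of running queries on smaller warehouses, but it is not a statement of the actual failure mechanism.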

Improvement/Changes

First, we apologize for the inconvenience caused by this incident.

  • AWS has outlined several improvements to more easily detect these issues, all of which have been completed except for an improved active-monitoring system scheduled for completion in Q1 2020.
  • The Snowflake team has already implemented logging improvements that give on-call engineers a faster way to find specific network connections and isolate the impacted compute resources.

Note: The information contained in this report is confidential and is intended solely to promote safety and reduce customer risk.

Posted Feb 05, 2020 - 14:31 PST

Resolved
The issue with intermittent network failure is now fixed. All Snowflake Services are healthy.

We will post a preliminary Root Cause Analysis report within the next 48 business hours and follow up with a detailed RCA in the next seven business days. We apologize for the inconvenience.

If you have any questions or see any related issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.
Posted Jan 15, 2020 - 19:07 PST
Monitoring
AWS applied a fix to resolve the intermittent network connectivity issue. We are now monitoring the system for any further recurrence of the problem.

We will provide an update in 90 minutes.
Posted Jan 15, 2020 - 17:30 PST
Identified
We have identified the problem with the AWS platform that is resulting in intermittent networking issues. We are now working with our cloud provider AWS to debug the problem further.

We will provide an update in 90 minutes.
Posted Jan 15, 2020 - 16:41 PST
Investigating
We are investigating an intermittent networking issue with one of our cloud providers. We are working with our cloud service provider to troubleshoot and resolve the issue.

We will provide an update within 90 minutes.
Posted Jan 15, 2020 - 14:16 PST
This incident affected: AWS - US East (N. Virginia) (Snowflake Data Warehouse (Database)).