AZURE - US East 2 (Virginia) : SI-20200311
Incident Report for Snowflake
Postmortem

Summary
Between 03:00 PDT on March 10, 2020, and 22:36 PDT on March 12, 2020, some Snowflake customers using Snowflake ODBC drivers in an embedded or standalone environment could not connect to Snowflake AZURE - US East 2 (Virginia).

Root Cause
The root cause of this issue was an outage in the AZURE Online Certificate Status Protocol (OCSP) responder service, combined with customers using an older version of the ODBC driver in the range of 2.19.7 to 2.19.15. Version 2.19.16 contains the fix.

Technical Details About Online Certificate Status Protocol (OCSP)
Snowflake drivers use OCSP to perform certificate revocation checks for SSL certificates. The OCSP infrastructure also includes a Snowflake-run OCSP Response Cache server that proactively fetches and caches OCSP responses for a predetermined set of URLs. This entire infrastructure, however, depends on the correct behavior of the OCSP responders. The drivers are designed with a Fail Open mode to keep connections working if the OCSP responder fails. A driver in Fail Open mode overrides any failure to obtain a valid OCSP response and continues with the connection, as opposed to the Fail Closed behavior, where the connection is dropped.
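The difference between the two modes can be sketched as a small decision function. This is a minimal illustration in Python, not the drivers' actual code: the names OcspStatus and should_connect are hypothetical, and the sketch assumes that a certificate confirmed as revoked is refused in both modes, which this report does not state.

```python
from enum import Enum


class OcspStatus(Enum):
    """Hypothetical outcomes of a certificate revocation check."""
    GOOD = "good"                # responder confirmed the certificate is valid
    REVOKED = "revoked"          # responder confirmed the certificate is revoked
    UNAVAILABLE = "unavailable"  # no valid response (outage, timeout, bad reply)


def should_connect(status: OcspStatus, fail_open: bool) -> bool:
    """Return True if the connection should proceed."""
    if status is OcspStatus.GOOD:
        return True
    if status is OcspStatus.REVOKED:
        # Assumption: a confirmed revocation is refused regardless of mode.
        return False
    # No valid OCSP response was obtained, so the mode decides:
    # Fail Open continues with the connection, Fail Closed drops it.
    return fail_open


# During a responder outage, only Fail Open lets the connection through.
print(should_connect(OcspStatus.UNAVAILABLE, fail_open=True))
print(should_connect(OcspStatus.UNAVAILABLE, fail_open=False))
```

In these terms, a driver whose Fail Open mode does not take effect behaves as if fail_open were False whenever the responder is unavailable.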

Resolution

  • First, Snowflake engineers started to go through a series of steps from production playbooks to narrow down the source of the problem. Our telemetry logs showed that some connections using the latest driver were successful. We then filed a critical priority case with the Microsoft Azure team and shared our findings.
  • Via our support case, Microsoft notified us of a significant outage with their OCSP responder services. They established a Critical Situation team and worked closely with the Snowflake Cloud Services team to fix the problem.
  • Snowflake ODBC drivers are designed to handle OCSP responder failures gracefully by using the Fail Open mode. The Fail Open option introduced in version 2.19.xx did not work as designed due to a bug present in versions 2.19.7 through 2.19.15. The bug was fixed in version 2.19.16.
  • Around this time, Microsoft Azure OCSP service was still not functioning correctly. The Snowflake Status Page was updated with the outage notice.
  • The Snowflake engineering team continued to explore other options to mitigate the problem and took the following steps:
1. Advised customers to upgrade standalone drivers to the latest driver, 2.20.5.
2. Developed a workaround to replace the Snowflake ODBC driver bundled into client Business Intelligence tools with the latest driver, 2.20.5.
3. Worked with our business partners Microsoft Power BI, Tableau, and QLIK Sense to make emergency releases of their online platforms to support the last few affected customers using the online solutions.
4. Updated the OCSP Response Cache server maintained by Snowflake with the failed OCSP URLs as an extended workaround while the drivers were being upgraded.
  • While Snowflake Engineering was working on the workarounds, our Cloud Engineering team was working continuously with Microsoft on alternate solutions to minimize the exposure. Partial remediation of the Microsoft OCSP issues was seen at 17:40 PDT on March 10, 2020.
  • Full Microsoft OCSP recovery was announced at 18:56 PDT on March 12, 2020. Snowflake is expecting an RCA from Microsoft related to the OCSP failures, due the week ending March 27, 2020.
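For customers checking whether an installed driver falls in the affected range, the comparison can be sketched as follows. This is a hedged illustration in Python: the helper names parse_version and is_affected are hypothetical, and the bounds are taken from the resolution notes above (bug present in 2.19.7 through 2.19.15, fixed in 2.19.16).

```python
from typing import Tuple

# Bounds taken from the resolution notes: 2.19.7 through 2.19.15 are affected.
AFFECTED_FIRST = (2, 19, 7)
AFFECTED_LAST = (2, 19, 15)


def parse_version(text: str) -> Tuple[int, ...]:
    """Convert a dotted version such as '2.19.10' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_affected(version: str) -> bool:
    """Return True if the driver version lies inside the affected range."""
    return AFFECTED_FIRST <= parse_version(version) <= AFFECTED_LAST


for candidate in ("2.19.6", "2.19.7", "2.19.15", "2.19.16", "2.20.5"):
    print(candidate, is_affected(candidate))
```

Versions are compared as integer tuples rather than strings because a plain string comparison would sort "2.19.7" after "2.19.15".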

Improvement/Changes

First, we apologize for the inconvenience caused by this incident. We have identified and planned the following improvements to our service and process to avoid such incidents in the future:

  • Early Detection - Enhance alerting to generate a Level 1 alert, monitored by on-call engineers, that tracks driver connectivity failures and enables earlier detection [Q2 2020]
  • Enhance the Snowflake QA testing framework to simulate OCSP responder failures and exercise the Fail Open mode in order to prevent future regressions [April 15, 2020]
  • BI Partners:

    • All partners embedding our drivers have been contacted. Power BI and Tableau Online are planning to release hotfixes within the next 2 weeks.
  • Azure OCSP Responder Service improvements

    • We are continuing to work with the Azure team on improving reliability. [March 27, 2020]

If you have any questions or issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.

Note: The information contained in this report is confidential and is intended solely to promote safety and reduce customer risk.

Posted Mar 24, 2020 - 17:55 PDT

Resolved
The issue is now resolved. A detailed RCA will be posted in the next seven business days.

We apologize for the inconvenience.

If you have any questions or see any related issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.
Posted Mar 12, 2020 - 22:36 PDT
Update
Snowflake Engineering has a workaround to unblock customers using the following tools:
1. Power BI Native
2. Tableau Desktop

We are working with the affected customers directly via Support Cases.

We are continuing to work with the Power BI and Tableau online services to upgrade their Snowflake ODBC drivers to the latest version.

In addition, we are continuing to work with the AZURE team to mitigate the root cause with the OCSP services.

We apologize for the inconvenience.

If you have any questions or see any related issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.
Posted Mar 12, 2020 - 12:38 PDT
Update
We are continuing to work with the AZURE team to resolve the problem. At this time, the issue is impacting a few of our customers.

We apologize for the inconvenience. Please reach out to Support if you need any assistance with the workaround suggested in the case you reported for this problem.

We will provide an update as soon as we have more details to share.
Posted Mar 12, 2020 - 03:42 PDT
Update
We are continuing to work with the AZURE engineering team on the issue, and Snowflake engineering has helped the AZURE team narrow down the problem further.

We apologize for the inconvenience. Please reach out to Support if you need any assistance with the workaround suggested in the case you reported for this problem.

We will provide an update as soon as we have more details to share.
Posted Mar 11, 2020 - 23:35 PDT
Identified
During the monitoring phase, we ran some additional tests that failed, so we are continuing to experience issues related to Azure OCSP services.

Our engineers are working on the problem and will post an update as soon as we have additional information.
Posted Mar 11, 2020 - 18:11 PDT
Monitoring
The issue with AZURE OCSP services is now resolved.

We are now monitoring the system for any further recurrence of the problem, and we will be in this state for 90 minutes.

Please reach out to Support at support@snowflake.com if you continue to face any problems caused by this issue.
Posted Mar 11, 2020 - 16:43 PDT
Update
We are continuing to investigate the network issues with AZURE OCSP services.

We will provide an update in 90 minutes.
Posted Mar 11, 2020 - 14:44 PDT
Update
We are continuing to investigate the network issues with AZURE OCSP services.

We will provide an update in 90 minutes.
Posted Mar 11, 2020 - 11:32 PDT
Identified
There is an ongoing network issue with Azure OCSP services, which the Snowflake drivers use to check certificate revocation status as part of authentication.

Due to this failure, a few Snowflake customers are unable to connect.

The latest Snowflake drivers will circumvent this issue by using a fallback method that skips this check temporarily. We advise upgrading to the latest version available from https://sfc-repo.snowflakecomputing.com/odbc/index.html.

We apologize for the inconvenience.

If you have any questions or see any related issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.
Posted Mar 11, 2020 - 09:49 PDT
This incident affected: AZURE - US East 2 (Virginia) (Snowflake Data Warehouse (Database)).