AWS - US West, AWS - US East: SI-20190501
Incident Report for Snowflake
Postmortem

Summary

On May 01, 2019 at 02:16 PDT, a small number of customers using older JDBC driver versions (3.6.0 through 3.6.5) could not connect to Snowflake; connection attempts failed with the error “validity out of range.”

Resolution

  • In response to internal alerts and customer-reported cases, we first verified the drivers by running tests internally and didn’t find any issues.
  • Next, we collected logs from customer deployments and found errors that helped us narrow the problem down to cached entries.
  • At this point, we opened a case with AWS to help resolve this.
  • While AWS was working on the problem, we started debugging our cache refresh logic and found that a Lambda function that periodically refreshes the cache had failed.
  • We then ran the refresh job manually, which updated all the caches used by JDBC clients.
  • We verified the code in production and in customers’ deployments with the updated cache.
  • Bottom line: the root cause was the stale cache used by the JDBC drivers.

Root Cause

Snowflake JDBC client drivers with versions between 3.6.0 and 3.6.5 use an OCSP cache to check certificate validity as part of establishing a TLS connection with Snowflake services and storage buckets on AWS S3. The expected behavior in client drivers is:
1. Validate against the certificate in the cache
2. Validate against Digicert if the cached certificate check fails
The cache is updated periodically by a backend Lambda function in AWS.
However, the Lambda job failed to run; as a result, the cache data became invalid.
These older JDBC client drivers did not implement the second step (validating against Digicert), so the certificate check failed once the cache was stale. The root cause is therefore the failure of the backend Lambda function that periodically refreshes the cache.
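
The two-step behavior described above can be sketched roughly as follows. This is an illustrative Python model, not the JDBC driver's actual code: the names `CachedOcspEntry`, `check_certificate`, `query_ca`, and `fallback_enabled` are hypothetical, and it assumes that a "stale" entry is one whose OCSP validity window (the standard thisUpdate/nextUpdate fields) no longer covers the current time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict


@dataclass
class CachedOcspEntry:
    """Hypothetical cache record: revocation status plus its validity window."""
    status: str            # e.g. "good" or "revoked"
    this_update: datetime  # start of the validity window
    next_update: datetime  # end of the validity window


def is_within_validity(entry: CachedOcspEntry, now: datetime) -> bool:
    """True when `now` falls inside the entry's validity window."""
    return entry.this_update <= now <= entry.next_update


def check_certificate(
    cert_id: str,
    cache: Dict[str, CachedOcspEntry],
    query_ca: Callable[[str], CachedOcspEntry],
    now: datetime,
    fallback_enabled: bool = True,
) -> str:
    # Step 1: use the cached entry if it is still inside its validity window.
    entry = cache.get(cert_id)
    if entry is not None and is_within_validity(entry, now):
        return entry.status

    # Assumption: with no fallback (modeling the older drivers), a stale or
    # missing cache entry is fatal.
    if not fallback_enabled:
        raise ValueError("validity out of range")

    # Step 2: ask the certificate authority directly and refresh the cache.
    fresh = query_ca(cert_id)
    if not is_within_validity(fresh, now):
        raise ValueError("validity out of range")
    cache[cert_id] = fresh
    return fresh.status
```

Calling `check_certificate` with `fallback_enabled=False` and an expired cache entry raises the error, which mirrors the failure mode in this incident; with the fallback enabled, the same call succeeds as long as the certificate authority returns a current response.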

Improvement/Changes

First, we apologize for the inconvenience caused by this incident.
Second, we are now running the Lambda function to refresh the OCSP cache every hour and monitoring the schedule.
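
As a rough illustration of what monitoring the schedule could look like, the sketch below flags a refresh as overdue when the last successful run is older than the hourly interval plus a tolerance. The function name `refresh_is_overdue` and the 15-minute grace period are assumptions for illustration; only the one-hour interval comes from this report.

```python
from datetime import datetime, timedelta, timezone

REFRESH_INTERVAL = timedelta(hours=1)  # from the report: refresh every hour
GRACE_PERIOD = timedelta(minutes=15)   # assumed tolerance, not from the report


def refresh_is_overdue(
    last_success: datetime,
    now: datetime,
    interval: timedelta = REFRESH_INTERVAL,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Return True when the last successful refresh is older than interval + grace."""
    return now - last_success > interval + grace


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical timestamp of the last successful cache refresh.
    last_success = now - timedelta(hours=2)
    if refresh_is_overdue(last_success, now):
        print("ALERT: OCSP cache refresh is overdue")
```

A check like this would be run independently of the refresh job itself, so that a job that never starts is still detected.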

Update as of May 16, 2019:

The root cause was that the Lambda server exceeded the upper limit on the number of jobs that can be executed concurrently. As a result, the OCSP cache refresh function was affected.

Fixes Applied: We have increased the limit on the number of concurrent jobs on the Lambda server. In addition, the latest version of the JDBC client handles the cache issue more gracefully by connecting to the Certificate Authority directly.

If you have any questions or issues, please send feedback to support@snowflake.com or submit a support request ticket via the Snowflake Lodge Portal.

Posted 20 days ago. May 03, 2019 - 23:25 PDT

Resolved
This issue has been resolved. See Postmortem for RCA report.
Posted 23 days ago. May 01, 2019 - 08:05 PDT
Monitoring
This issue has been resolved. See Postmortem for RCA report.
Posted 23 days ago. May 01, 2019 - 07:54 PDT
Identified
This issue has been resolved. See Postmortem for RCA report.
Posted 23 days ago. May 01, 2019 - 07:22 PDT
Investigating
This issue has been resolved. See Postmortem for RCA report.
Posted 23 days ago. May 01, 2019 - 03:58 PDT
This incident affected: AWS - US East (N. Virginia) (Snowflake Data Warehouse (Database), Snowpipe (Data Ingestion)) and AWS - US West (Oregon) (Snowflake Data Warehouse (Database), Snowpipe (Data Ingestion)).