AWS - US East (N. Virginia): INC0111066
Incident Report for Snowflake
Postmortem

Snowflake Engineering has completed the postmortem of this service incident. A detailed Root Cause Analysis (RCA) is available on the Snowflake Community site:
https://community.snowflake.com/s/article/INC0110718-INC0110871-INC0111066-INC0111213

Posted Jul 16, 2024 - 15:53 PDT

Resolved
Current status: We've confirmed that the issue has been remediated following the implementation of the fix and extended monitoring of the affected infrastructure through periods of peak connection requests. If you experience additional issues or have questions, please open a support case via Snowflake Community.
Customer experience: Users in the affected region(s) intermittently could not sign in or use the Snowflake service via Snowsight. As the issue was intermittent, attempting to reload the page or sign in again may have succeeded.
Incident start time: 13:30 UTC July 01, 2024
Incident end time: 00:15 UTC July 02, 2024
Preliminary root cause: The catalyst for impact has been attributed to a significant increase in connection requests against a primary database supporting Snowsight application infrastructure. Our initial analysis indicated that the increase in requests was due to organic usage increases over time, and during each occurrence of impact (INC0110718, INC0110871, and INC0111066) we implemented a number of optimizations to reduce load and improve performance of the database infrastructure. However, further investigation identified a previously implemented change that resulted in compounded traffic and subsequent connection requests being initiated against the affected database infrastructure over time. During periods of peak traffic, the problem was exacerbated and caused the service to experience intermittent periods of degraded performance while core services were throttled to manage the significant increase in requests.
A root cause analysis (RCA) document will be published within seven business days.
Posted Jul 02, 2024 - 14:51 PDT
Monitoring
Current status: We've implemented a fix for this issue and will monitor the environment during peak hours on July 2 and beyond to ensure all services function properly.
Customer experience: Users in the affected region(s) intermittently could not sign in or use the Snowflake service via Snowsight. As the issue was intermittent, attempting to reload the page or sign in again may have succeeded.
Incident start time: 13:30 UTC July 01, 2024
Incident end time: 00:15 UTC July 02, 2024
Preliminary root cause: An update for Snowsight, intended to improve overall query performance, caused an unexpected performance degradation.
Posted Jul 01, 2024 - 18:11 PDT
Update
Current status: The Engineering team needs to carry out remediation steps that will require momentary service unavailability between 00:15 and 00:30 UTC. These steps are necessary to restore service to business-as-usual (BAU) levels. We'll provide another update within 60 minutes.
Customer experience: Users in the affected region(s) were intermittently unable to sign in or use the Snowflake service via Snowsight. As the issue was intermittent in nature, attempting to reload the page or sign in again may have succeeded.
Incident start time: 13:30 UTC July 01, 2024
Posted Jul 01, 2024 - 17:05 PDT
Update
Current status: As of 22:10 UTC, system performance had improved, and engineers continued to optimize the request load on the affected system to restore performance to BAU levels. We'll provide another update within 60 minutes.
Customer experience: Users in the affected region(s) were intermittently unable to sign in or use the Snowflake service via Snowsight. As the issue was intermittent in nature, attempting to reload the page or sign in again may have succeeded.
Incident start time: 13:30 UTC July 01, 2024
Posted Jul 01, 2024 - 15:26 PDT
Update
Current status: Engineers are actively optimizing request load on the affected system to balance performance more effectively in response to the increase in connections. Error rates have largely subsided as of 21:00 UTC. We'll provide another update within 60 minutes.
Customer experience: Users in the affected region(s) may be intermittently unable to sign in or use the Snowflake service via Snowsight. As the issue is intermittent in nature, attempting to reload the page or sign in again may succeed.
Incident start time: 13:30 UTC July 01, 2024
Posted Jul 01, 2024 - 14:30 PDT
Update
Current status: We have observed another occurrence of impact starting at 20:22 UTC. Engineers are actively engaged and investigating the health of the affected systems. We'll provide another update within 60 minutes.
Customer experience: Users in the affected region(s) may be intermittently unable to sign in or use the Snowflake service via Snowsight. As the issue is intermittent in nature, attempting to reload the page or sign in again may succeed.
Incident start time: 13:30 UTC July 01, 2024
Posted Jul 01, 2024 - 13:29 PDT
Update
Current status: Following the latest optimizations and mitigations that we implemented, the service has remained stable since 16:35 UTC. We are continuing to investigate background processes and contributing sources of load within the affected database infrastructure while we closely monitor the environment. We'll provide another update within two hours.
Customer experience: Users in the affected region(s) may be intermittently unable to sign in or use the Snowflake service via Snowsight. As the issue is intermittent in nature, attempting to reload the page or sign in again may succeed.
Incident start time: 13:33 UTC July 01, 2024
Posted Jul 01, 2024 - 12:00 PDT
Update
Current status: We are continuing to observe periods of intermittent service instability and are actively investigating the latest occurrence between 16:20 and 16:35 UTC. In parallel, we have identified an additional background process that may be contributing to system load during periods of increased connection requests, and we are continuing to tune and optimize throttling and self-healing mechanisms to balance performance during these periods more effectively. We'll provide another update within 90 minutes.
Customer experience: Users in the affected region(s) may be intermittently unable to sign in or use the Snowflake service via Snowsight. As the issue is intermittent in nature, attempting to reload the page or sign in again may succeed.
Incident start time: 13:33 UTC July 01, 2024
Posted Jul 01, 2024 - 10:29 PDT
Update
Current status: Our latest investigation efforts identified that a self-healing mechanism was aggressively restarting affected systems after database connection increases occurred, significantly contributing to the observed periods of intermittent impact. We've implemented an additional optimization to the throttling and self-healing mechanisms used by the affected database infrastructure, and we are currently monitoring the environment to ensure stability. We'll provide another update within 60 minutes.
Customer experience: Users in the affected region(s) may be intermittently unable to sign in or use the Snowflake service via Snowsight. As the issue is intermittent in nature, attempting to reload the page or sign in again may succeed.
Incident start time: 13:33 UTC July 01, 2024
Posted Jul 01, 2024 - 09:26 PDT
Identified
Current status: On July 1, 2024, starting at 13:33 UTC, we began observing a recurrence of the intermittent periods of impact associated with INC0110871. Based on the identified increase in connections affecting database performance, we are implementing additional optimizations to reduce impact while we continue our investigation to further isolate and remediate the source of the issue. We'll provide another update within 60 minutes.
Customer experience: Users in the affected region(s) may be intermittently unable to sign in or use the Snowflake service via Snowsight. As the issue is intermittent in nature, attempting to reload the page or sign in again may succeed.
Incident start time: 13:33 UTC July 01, 2024
Posted Jul 01, 2024 - 08:12 PDT
Investigating
Current status: We're investigating an issue with Snowflake Data Cloud. We'll provide an update within 60 minutes.
Customer experience: Users in the affected region(s) may be intermittently unable to sign in to the Snowflake service via Snowsight.
Incident start time: 13:35 UTC July 01, 2024
Posted Jul 01, 2024 - 07:01 PDT
This incident affected: AWS - US East (N. Virginia) (Snowsight).