Post Mortem: 2024-10-18, Scaleway Elements Partial Loss of VPC Connectivity in FR-PAR-1
Summary
On Friday, October 18, 2024, the Scaleway Elements cloud ecosystem experienced a network incident that caused cascading effects in various products in the FR-PAR-1 Availability Zone. The incident started at 6:20 UTC and had two main impact periods:
- From 6:20 to 6:21 UTC: around a minute of instability for a very limited number of instances hosted in a single rack. Some customers might have experienced heavy packet loss for all types of network connectivity for a subset of their instances.
- From 7:50 to 8:01 UTC: unavailability of VPC network connectivity for around 25% of the hypervisors in the FR-PAR-1 AZ, with subsequent impact on VPC-dependent products. Public internet connectivity was not impacted during this period. Elastic Metal VPC connectivity was not impacted.
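As a rough illustration, the two impact periods above can be turned into durations. This is a minimal sketch based only on the minute-granularity timestamps in this report; the helper names (`utc`, `window_minutes`) are hypothetical and the resulting figures are approximate.

```python
from datetime import datetime, timedelta

DAY = "2024-10-18"


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp on the incident day (all times are UTC)."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


# Impact periods as stated in the summary (minute granularity only).
IMPACT_WINDOWS = {
    "rack instability": (utc("06:20"), utc("06:21")),
    "VPC connectivity loss": (utc("07:50"), utc("08:01")),
}


def window_minutes(windows):
    """Return the length of each window in whole minutes."""
    return {
        name: int((end - start) / timedelta(minutes=1))
        for name, (start, end) in windows.items()
    }


durations = window_minutes(IMPACT_WINDOWS)
total_impact_minutes = sum(durations.values())

# Time between the end of the first period and the start of the second.
gap_minutes = int(
    (IMPACT_WINDOWS["VPC connectivity loss"][0]
     - IMPACT_WINDOWS["rack instability"][1]) / timedelta(minutes=1)
)

print(durations, total_impact_minutes, gap_minutes)
```

With these inputs, the two periods add up to roughly 12 minutes of customer-visible impact, separated by about an hour and a half without customer impact.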
Scaleway Elements Infrastructure
The Scaleway Elements cloud ecosystem is built on top of stacked layers of infrastructure:
- Data Centers: electrical power systems, cooling systems, optical fiber cabling and physical security.
- Hardware and physical network infrastructure: servers, data center fabric networks, backbone routers, and inter-datacenter links.
- Virtualized multi-tenant cloud foundation:
  - Virtual machines running on top of hypervisors.
  - Virtualized software-defined network, providing multi-tenant VPC networks and VPC edge services, such as DHCP and DNS within the VPC.
- High-level PaaS products: K8S, Database, Load Balancer, Serverless, Observability and many more, running on top of VM instances and using VPC networks for internal communication.
These layers are running on top of each other: the higher layers are dependent on the lower layers of the infrastructure.
In a high-load, massive-scale environment consisting of many thousands of physical machines, hardware failures are routine events that happen several times per week. The vast majority of them are invisible to customers because we build our infrastructure in a redundant fashion. All layers have their own redundancy and failover mechanisms, which make them resilient to failures in the lower layers. All critical systems have at least two instances (often more), deployed in an active-active fashion with a 50% load threshold, so if a critical instance fails, the remaining capacity is able to handle 100% of the load.
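The 50% threshold follows from simple arithmetic, sketched below. The sketch assumes that load is spread evenly and redistributes evenly among the surviving instances; both function names are hypothetical and only illustrate the reasoning.

```python
def load_after_failures(instances: int, load_per_instance: float, failed: int = 1) -> float:
    """Load on each surviving instance, as a fraction of its capacity.

    Assumes the total load is redistributed evenly among the survivors.
    """
    if not 0 < failed < instances:
        raise ValueError("failed must be at least 1 and less than instances")
    total_load = instances * load_per_instance
    return total_load / (instances - failed)


def max_safe_load(instances: int, tolerated_failures: int = 1) -> float:
    """Highest per-instance load that still survives the given number of failures."""
    if not 0 < tolerated_failures < instances:
        raise ValueError("tolerated_failures must be at least 1 and less than instances")
    return (instances - tolerated_failures) / instances


# Two active-active instances at 50% each: the survivor ends up at 100%.
print(load_after_failures(instances=2, load_per_instance=0.5))
# For a pair, the threshold that tolerates one failure is 50%.
print(max_safe_load(instances=2))
```

Under the same assumptions, a group of four instances could run at up to 75% each and still absorb the loss of one of them.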
Timeline
- At 6:20 UTC, one of the Top of Rack (ToR) switches experienced a software crash. Due to the unstable state of its software modules, it took around 50 seconds for the network protocols to fail over all traffic of the rack to the second ToR switch.
- During the convergence time, traffic to and from the hypervisors in the impacted rack experienced a high percentage of packet drops of a non-deterministic nature.
- By 6:21 UTC, all traffic was fully rerouted to the backup device and the instability was resolved.
- At 6:38 UTC, the crashed switch completed its automatic reboot, restoring normal redundant ToR operation.
- However, this instability caused a cascading effect on one of the building blocks of the virtualized network infrastructure: a BGP route reflector (RR) used for VPC services. This software was hosted on one of the hypervisors in the impacted rack.
- The RR software stack became unstable and got stuck in an abnormal state which could not be resolved by the auto-restart process.
- At this point, customers didn’t experience any impact on VPC services, as the backup RR was operating normally. However, RR redundancy was lost.
- At 7:50 UTC, the second RR experienced a critical condition and also got stuck in a non-operational state.
- At this point, customers experienced a disruption of VPC connectivity for a subset of Scaleway Elements virtual products in the FR-PAR-1 AZ. Around 25% of the hypervisors lost VPC connectivity with the rest of the region.
- Both RR software stacks have health-check monitoring and an auto-restart mechanism which should have addressed this type of failure. The health-check monitoring successfully detected the anomaly; however, the auto-restart mechanism failed.
- The impact was detected by the Scaleway Customer Excellence and SRE teams, and an incident was opened, with subsequent notification of technical top management.
- Both VPC route reflectors were fixed by a manual action.
- By 8:01 UTC VPC connectivity was fully restored.
- By 8:07 UTC, the situation was back to its nominal state, with redundancy operating normally.
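The failure pattern in this timeline (a health check that detects the problem, an auto-restart that cannot clear it, and a redundant pair that silently degrades) can be sketched generically. The code below is an illustrative sketch only, not the actual RR supervision code: `redundancy_state`, `supervise`, and their parameters are hypothetical names, and the real health checks and restart logic are not described in this report.

```python
from typing import Callable, List


def redundancy_state(healthy: int) -> str:
    """Classify a redundant pair by its number of healthy instances."""
    if healthy >= 2:
        return "redundant"
    if healthy == 1:
        return "degraded"  # no customer impact yet, but no spare left
    return "outage"


def supervise(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    escalate: Callable[[str], None],
    max_restarts: int = 3,
) -> str:
    """Try a bounded number of restarts, then hand over to a human."""
    if is_healthy():
        return "healthy"
    for attempt in range(1, max_restarts + 1):
        restart()
        if is_healthy():
            return f"recovered after {attempt} restart(s)"
    # Restarts did not help: escalate instead of retrying forever.
    escalate(f"still unhealthy after {max_restarts} restart(s)")
    return "escalated"


# Example: a process stuck in a state that restarts cannot clear.
pages: List[str] = []
restart_calls: List[int] = []
result = supervise(
    is_healthy=lambda: False,
    restart=lambda: restart_calls.append(1),
    escalate=pages.append,
    max_restarts=3,
)
print(result, len(restart_calls), pages)
```

In terms of this timeline, the pair of route reflectors would be "degraded" between the first RR failure and 7:50 UTC, and in "outage" from 7:50 UTC until the manual fix.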
VPC Dependent Products Impact
Managed Database and Redis
Impacted during both periods:
- Short period of connectivity loss in the impacted rack (6:20-6:21 UTC). Some Database customers experienced HTTP 500 errors for connections to their databases.
- VPC connectivity loss for around 25% of FR-PAR-1 hypervisors (7:50-8:01 UTC). Impacted customers could not use VPC to connect to their managed databases.
Serverless Functions and Containers
During the VPC connectivity loss period (7:50-8:01 UTC), one of the serverless compute nodes was unavailable. A subset of customers with workloads on this node may have experienced service disruptions, including potential 500 errors when calling their functions/containers while their workloads were being rescheduled.
Kapsule
During the VPC connectivity loss period (7:50-8:01 UTC), there was a network partition between nodes, preventing applications running on different nodes from communicating.
The infrastructure hosting control planes uses VPC and was impacted too, causing some control-plane unavailability. Unfortunately, some nodes were replaced by the autohealing mechanism because they were unable to report their availability, causing workloads to be rescheduled or restarted.
By 8:04 UTC, almost all clusters had recovered.
Lessons Learned and Further Actions
- We have fixed the autohealing mechanism, which failed to recover the route reflectors from their stale state:
https://github.com/FRRouting/frr/pull/17163
- We are planning to introduce software version diversity for route reflectors to avoid multiple instances being impacted by a single bug.
- We plan to investigate in depth the software issue that caused the stale state.
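The idea behind software version diversity can be expressed as a simple check, sketched below under stated assumptions. The instance names and version labels are placeholders, and `lacks_version_diversity` is a hypothetical helper, not an existing tool.

```python
from typing import Dict


def lacks_version_diversity(versions: Dict[str, str]) -> bool:
    """True when several redundant instances all run the same software version.

    In that situation, a single software bug can affect every instance at once.
    """
    return len(versions) > 1 and len(set(versions.values())) == 1


# Placeholder inventories: instance name -> software version label.
same_version = {"rr-1": "version-A", "rr-2": "version-A"}
mixed_versions = {"rr-1": "version-A", "rr-2": "version-B"}

print(lacks_version_diversity(same_version))
print(lacks_version_diversity(mixed_versions))
```

A check of this kind only tells whether a single version is shared across the redundant group; it does not say whether a given bug is present in more than one version.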