On March 10th, 2025 at 16:22 UTC, a network operation led to an outage in the FR-PAR region affecting all products.
Duration and impacted products: the impact was localized to the FR-PAR region. The faulty configuration was rolled back within 4 minutes; depending on the product, returning to normal operation took from minutes to dozens of minutes (around 25 minutes for Object Storage).

Trigger
Scaleway's Network team was connecting a new datacenter to our backbone in the FR-PAR region. During this operation, a human error in a router configuration led to rerouting 1Tbps+ of traffic between OpCore DC3 and DC5 via a single 100Gbps link (event timestamp: 16:23 UTC).
As a result, the 100Gbps link was saturated and most of the Internet traffic was effectively blackholed, causing severe public Internet connectivity disruption between the FR-PAR-2 AZ and the rest of the PAR region. The error was quickly detected and rolled back within 4 minutes: at 16:27 UTC the correct configuration was restored and the network had converged back to a stable state.
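To put the numbers in perspective, here is a back-of-the-envelope sketch (illustrative only; the 1Tbps figure is the order of magnitude mentioned above, not a measured value) of how oversubscribed the single link was:

```python
# Illustrative calculation: how oversubscribed was the single 100 Gbps link
# while it carried the rerouted inter-DC traffic?

offered_load_gbps = 1000   # >1 Tbps of inter-DC traffic (order of magnitude from the report)
link_capacity_gbps = 100   # the single remaining 100 Gbps link

oversubscription = offered_load_gbps / link_capacity_gbps
excess_fraction = 1 - link_capacity_gbps / offered_load_gbps

print(f"Oversubscription: {oversubscription:.0f}x")                  # -> 10x
print(f"Traffic that cannot fit on the link: {excess_fraction:.0%}")  # -> 90%
```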
Only public Internet connectivity was impacted during this incident. VPC communication was not impacted, neither within FR-PAR-1 and FR-PAR-2, nor between FR-PAR-1, FR-PAR-2, and FR-PAR-3. Dedibox RPN was not impacted.
Other products suffered side effects due to the activation of rollback and switchover mechanisms. However, because connectivity for the whole FR-PAR-2 AZ was impacted, these mechanisms were unsuccessful. The network was not down per se, only heavily congested, because the protection mechanisms kept switching back and forth. This led to overall instability of many products.
Once connectivity was restored (4 minutes after the initial trigger event), it took from minutes to dozens of minutes, depending on the product, to return to normal operation.
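To illustrate the flapping behaviour described above, here is a minimal, purely illustrative simulation (hypothetical names, thresholds, and timings; not how any specific Scaleway product is implemented) of how automatic switchover can oscillate when an AZ is only intermittently reachable, and how a hold-down timer dampens it:

```python
# Purely illustrative simulation: why switchover mechanisms can "flap" when an
# AZ is only intermittently reachable, and how a hold-down timer dampens it.
# All names and thresholds are hypothetical.

def simulate(reachability, hold_down_ticks=0):
    """reachability: per-tick booleans for "is FR-PAR-2 reachable?"."""
    active = "FR-PAR-2"
    switches = 0
    ticks_since_switch = hold_down_ticks  # allow an immediate first switch
    for reachable in reachability:
        desired = "FR-PAR-2" if reachable else "FR-PAR-1"
        if desired != active and ticks_since_switch >= hold_down_ticks:
            active = desired
            switches += 1
            ticks_since_switch = 0
        else:
            ticks_since_switch += 1
    return switches

# Heavily congested link: reachability oscillates on every tick.
flapping_link = [i % 2 == 0 for i in range(60)]

print(simulate(flapping_link, hold_down_ticks=0))   # switches on almost every tick
print(simulate(flapping_link, hold_down_ticks=10))  # far fewer switchovers
```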
Scaleway Object Storage was one of the impacted products:
Object Storage being a regional product, its inter-AZ data replication was cut during the connectivity issue. Due to a software bug in the Object Storage stack, the data replication was not rate-limited, and it saturated some links when connectivity was restored, inducing latency and packet drops (see the rate-limiting sketch below).
The whole FR-PAR region experienced a severe disruption of the Object Storage service for around 25 minutes (up until ~16:50 UTC).
No customer data loss was experienced.
At around 16:50 UTC normal operation was restored.
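The snippet below is a minimal token-bucket sketch of the kind of rate limiting that was missing on the replication path. It is illustrative only; the names, numbers, and API are hypothetical and do not reflect Scaleway's actual implementation.

```python
import time

class TokenBucket:
    """Minimal token bucket to cap replication catch-up traffic."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def throttle(self, nbytes: int) -> None:
        """Block until `nbytes` of replication traffic may be sent."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# Cap replication catch-up traffic at ~1 GB/s so it cannot saturate inter-AZ links.
bucket = TokenBucket(rate_bytes_per_s=1e9, burst_bytes=64e6)
for chunk in [b"x" * 8_000_000] * 5:   # pending replication chunks (dummy data)
    bucket.throttle(len(chunk))
    # send_to_remote_az(chunk)         # hypothetical send call
```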
Improving change management: hardening the peer-review and production communication process.
Improving production procedures by implementing more failsafes and tests before applying network configuration changes at the backbone level (an illustrative example of such a pre-apply check is sketched after this list).
The Object Storage synchronization software issue has been identified; a fix has been developed, tested, and is being deployed.
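As an illustration of the kind of failsafe mentioned above, the sketch below checks, before committing a routing change, that the projected traffic on every link stays below its capacity. The data model and names are hypothetical and do not represent Scaleway's actual tooling.

```python
# Illustrative pre-apply failsafe (hypothetical data model, not Scaleway tooling):
# refuse a routing change if it would push any link past a utilization limit.

def check_change(projected_traffic_gbps: dict[str, float],
                 link_capacity_gbps: dict[str, float],
                 max_utilization: float = 0.8) -> list[str]:
    """Return the links that the change would push past the utilization limit."""
    return [
        link
        for link, traffic in projected_traffic_gbps.items()
        if traffic > max_utilization * link_capacity_gbps[link]
    ]

# The March 10th change, roughly: >1 Tbps projected onto a single 100 Gbps link.
violations = check_change(
    projected_traffic_gbps={"dc3-dc5-link1": 1000.0},
    link_capacity_gbps={"dc3-dc5-link1": 100.0},
)
if violations:
    raise SystemExit(f"Refusing to apply change, would saturate: {violations}")
```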
We apologize for the incident and for the delay in publishing this postmortem.