[fr-par] - Network instabilities impacting multiple products

Incident Report for Scaleway

Postmortem

Postmortem: Scaleway global outage in the FR-PAR region, 2025-03-10

On March 10th 2025 at 16:22 UTC, a network operation led to an outage in the FR-PAR region affecting all products.

Duration and impacted products:

Trigger

Network: 16:22 to 16:27 UTC

Impact localized to FR-PAR region

DNS authoritative and cache: 16:23 to 17:20 UTC
Object storage: 16:23 to 18:00 UTC
Kubernetes Kapsule: 16:23 to 18:00 UTC
Virtual Instances: 16:23 to 17:10 UTC
Transactional Email: 16:23 to 16:27 UTC
Dedibox - dedicated servers (DC5): 16:23 to 16:27 UTC
Elastic Metal (FR-PAR-2): 16:23 to 16:27 UTC
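
For readers who want the impact windows as durations, the arithmetic can be sketched as follows. This is only an illustration built from the times listed above; the dictionary and function names are hypothetical and not part of any Scaleway tooling.

```python
from datetime import datetime

# Impact windows (UTC) copied from the list above.
WINDOWS_UTC = {
    "DNS authoritative and cache": ("16:23", "17:20"),
    "Object storage": ("16:23", "18:00"),
    "Kubernetes Kapsule": ("16:23", "18:00"),
    "Virtual Instances": ("16:23", "17:10"),
    "Transactional Email": ("16:23", "16:27"),
    "Dedibox - dedicated servers (DC5)": ("16:23", "16:27"),
    "Elastic Metal (FR-PAR-2)": ("16:23", "16:27"),
}


def minutes_between(start: str, end: str) -> int:
    """Return the number of whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


durations = {
    product: minutes_between(start, end)
    for product, (start, end) in WINDOWS_UTC.items()
}

for product, minutes in durations.items():
    print(f"{product}: {minutes} min")
```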

Incident Trigger

Scaleway's Network team was connecting a new datacenter to our backbone in the FR-PAR region. During this operation, a human error in a router configuration rerouted more than 1 Tbps of traffic between OpCore DC3 and DC5 onto a single 100 Gbps link (event timestamp: 16:23 UTC).

As a result, the 100 Gbps link was saturated and most Internet traffic was effectively blackholed. This caused severe public Internet connectivity disruption between the FR-PAR-2 AZ and the rest of the PAR region. The error was quickly detected and rolled back within 4 minutes: the correct configuration was restored at 16:27 UTC.

The network took until 16:27 UTC to converge and stabilize.
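
To give a sense of scale, the figures above imply roughly a 10x oversubscription of the link. The sketch below works through that arithmetic. It takes 1 Tbps as a lower bound for the offered load and ignores retransmissions and protocol back-off, so it is an approximation rather than a measurement; the function name is hypothetical.

```python
def oversubscription(offered_gbps: float, capacity_gbps: float) -> tuple[float, float]:
    """Return (oversubscription ratio, minimum fraction of offered traffic dropped)."""
    if offered_gbps <= 0 or capacity_gbps <= 0:
        raise ValueError("rates must be positive")
    ratio = offered_gbps / capacity_gbps
    # A link can carry at most its capacity; anything beyond that is dropped.
    dropped = max(0.0, 1.0 - capacity_gbps / offered_gbps)
    return ratio, dropped


# 1 Tbps (lower bound from the report) pushed onto a single 100 Gbps link.
ratio, dropped = oversubscription(offered_gbps=1000.0, capacity_gbps=100.0)
print(f"oversubscription: {ratio:.0f}x, at least {dropped:.0%} of offered traffic dropped")
```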

Impacts

Only public Internet connectivity was impacted during this incident. VPC communication was not impacted, either within FR-PAR-1 and FR-PAR-2 or between PAR1, PAR2 and PAR3. Dedibox RPN was not impacted.

Other products suffered side effects from the activation of rollback and switchover mechanisms. Because connectivity for the whole FR-PAR-2 AZ was impacted, these mechanisms were unsuccessful. The network was not down per se, only heavily congested, and as a result the protection mechanisms kept switching back and forth. This led to overall instability of many products.

Connectivity was restored 4 minutes after the initial trigger event, but restoring normal operation took from a few minutes to several dozen minutes after the initial event, depending on the product.

Scaleway Object Storage was one of the impacted products:

Because Object Storage is a regional product, inter-AZ data replication was cut during the connectivity issue. Due to a software bug in the Object Storage stack, data replication was not rate-limited, and it saturated some links when connectivity was restored, inducing latency and packet drops.

The whole FR-PAR region experienced severe degradation of the Object Storage service for around 25 minutes (until ~16:50 UTC).
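
As an illustration of what rate-limiting replication traffic can look like, here is a minimal token-bucket sketch. It is a generic technique and not a description of the Object Storage implementation or of the fix being deployed; the class, its parameters, and the fake clock used in the example are all hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: bursts up to `capacity` bytes, sustained `rate` bytes per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the elapsed time, never exceeding the bucket capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_consume(self, nbytes: float) -> bool:
        """Return True and spend tokens if `nbytes` may be sent now, else False."""
        self._refill()
        if nbytes <= self._tokens:
            self._tokens -= nbytes
            return True
        return False


class FakeClock:
    """Deterministic clock so the example does not depend on wall time."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
bucket = TokenBucket(rate=100.0, capacity=200.0, clock=clock)

sent = [bucket.try_consume(150.0), bucket.try_consume(150.0)]  # second chunk must wait
clock.now += 1.0  # one second later, 100 more tokens are available
sent.append(bucket.try_consume(150.0))
print(sent)
```

With a limiter of this kind in front of the replication sender, a backlog accumulated during an outage is drained at a bounded rate instead of all at once.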

No customer data loss was experienced.

At around 16:50 UTC normal operation was restored.

Root cause identified and measures being taken:

Improving change management: hardening the peer-review and production communication process.

Improving production procedures by implementing more failsafes and tests before applying network configuration changes at the backbone level.

The Object Storage synchronization software issue has been identified; a fix has been developed and tested and is being deployed.
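
One kind of pre-apply test that could serve as such a failsafe is a capacity check on the projected traffic per link. The sketch below is purely illustrative: the link names, the 400 Gbps capacity, the 80% utilization limit, and the function itself are assumptions, not Scaleway's actual procedure or topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    capacity_gbps: float


def overloaded_links(links: list[Link], projected_gbps: dict[str, float],
                     max_utilization: float = 0.8) -> list[str]:
    """Return names of links whose projected load exceeds the utilization limit."""
    offenders = []
    for link in links:
        load = projected_gbps.get(link.name, 0.0)
        if load > link.capacity_gbps * max_utilization:
            offenders.append(link.name)
    return offenders


# Hypothetical topology: the change would move ~1 Tbps onto a 100 Gbps link.
links = [Link("dc3-dc5-a", 100.0), Link("dc3-dc5-b", 400.0)]
projected = {"dc3-dc5-a": 1000.0, "dc3-dc5-b": 0.0}

blocked = overloaded_links(links, projected)
if blocked:
    print(f"change rejected, overloaded links: {blocked}")
```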

We apologize for the incident and for the delay in publishing this postmortem.

Posted Apr 17, 2025 - 17:11 CEST

Resolved

The situation is back to normal on all products.
We are closing this incident.
A postmortem is available in the history of this ticket.

Posted Mar 10, 2025 - 18:42 CET

Monitoring

Public network connectivity was almost fully blackholed in PAR2/DC5 between 17:22 and 17:28 CET, which affected all products.
The situation is slowly coming back to normal.
The root cause is a configuration issue on a router for a new DC in PAR.

Posted Mar 10, 2025 - 18:19 CET

Identified

The root cause has been identified and we are working to recover impacted services.
We'll share more details about this issue as soon as possible.

Posted Mar 10, 2025 - 18:00 CET

Investigating

Network instabilities are impacting multiple products. We are already working on it. Thank you for your patience.

Posted Mar 10, 2025 - 17:42 CET

This incident affected: Elements - AZ (fr-par-1, fr-par-2, fr-par-3) and Dedibox - Datacenters (DC5).