Status impact color legend
  • Black impact: None
  • Yellow impact: Minor
  • Red impact: Major
  • Blue impact: Maintenance

Performance degradation for resources using VPC

Incident Report for Scaleway

Resolved

This incident has been resolved.
Posted Dec 12, 2025 - 15:08 CET

Update

Patch deployed in FR-PAR, now monitoring
Posted Dec 09, 2025 - 15:28 CET

Update

The situation is now stable: we have been able to work around all impacts and have the means to identify and fix issues quickly. We have also been able to reproduce the issues in our testing environments and are now working on fixing the root causes. We will provide more details in the coming days. Thank you for your patience.
Posted Dec 02, 2025 - 12:03 CET

Update

Here is a status checkpoint on the VPC network issues in the FR-PAR region over the last few days.

Starting Tuesday the 25th, a few connectivity issues in VPC networks were reported to our customer support.
Some cases resolved themselves after restarting the impacted nodes and/or updating IP/MAC information.
At the time, these did not appear to be out-of-the-ordinary issues.

On Wednesday the 26th, more cases were raised and escalated, raising awareness of a potentially more widespread bug that could disrupt some edge-case customers. Engineers were engaged to collect data and investigate.
At 12:45 CET, one of the VPC Edge routers crashed. A priority-2 incident was opened to engage more resources and deepen the analysis.

Very high levels of BUM (Broadcast, Unknown-unicast, Multicast) traffic were also identified, crossing AZs and products.
In an attempt to mitigate, the impacted VPC Edge router was repaired and re-added to the production pool, to ensure enough computing resources were available to handle the BUM traffic.
This reduced the overall BUM level across all impacted VPCs, leading us to believe that the situation was under control.

On Wednesday afternoon, more cases of latency and connectivity issues were reported to customer support.
Investigations continued and pinpointed desynchronization issues between the VPC Edge routers and our information systems, which could explain the higher-than-usual BUM levels.
We decided to restart the VPC Edge router services in fr-par-2 to clear inconsistencies and refresh the data on the routers.
After a few perturbations, the situation was stable by Wednesday evening.
We kept monitoring BUM levels and decided to accelerate a deployment of new devices that was already being prepared (more CPU and memory, to avoid saturation of the VPC Edge routers in failover scenarios).

On Thursday the 27th, new devices were added in the morning to better handle the BUM traffic.
While this proved useful in reducing impacts, it did not significantly reduce global BUM levels.
In the afternoon, a workaround was deployed to protect our VPC Edge routers from this excessive BUM traffic:
a rate limiter was applied per customer network, preventing saturation of the VPC Edge routers and ensuring that legitimate BUM traffic is handled with the right priority.
This sharply reduced BUM traffic, as we had hoped, stopping what we believe was a snowball effect.
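A common way to implement this kind of per-network rate limiting is a token bucket. The sketch below is illustrative only (the bucket parameters and the `TokenBucket` / `allow` names are our own, not Scaleway's implementation): each customer network gets its own budget, so one noisy VPC cannot saturate the shared routers.

```python
from dataclasses import dataclass

@dataclass
class TokenBucket:
    """Per-network token bucket: BUM frames beyond the budget are dropped."""
    rate: float          # tokens (frames) replenished per second
    burst: float         # maximum bucket size
    tokens: float = 0.0  # current token count
    last: float = 0.0    # timestamp of the last update, in seconds

    def allow(self, now: float) -> bool:
        # Replenish tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token for this frame
            return True
        return False            # budget exhausted: drop the frame

# One bucket per customer network (hypothetical parameters).
buckets = {"vpc-a": TokenBucket(rate=100.0, burst=200.0, tokens=200.0)}
```

The design choice that matters here is the per-network scoping: rate limiting globally would let one misbehaving network consume the entire budget, while per-network buckets isolate the blast radius.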

By Thursday evening, the rate-limiter deployment was complete everywhere, and operations were paused to monitor the situation.
One conclusion of the day is that our maintenance model for VPC Edge routers seems to create ripple effects on customer networks, although it was designed to be fully transparent.
On Friday the 28th, investigations were still ongoing to understand the root causes: high levels of BUM traffic are supposed to be handled transparently, and desynchronization is not expected behavior.
Data collection on one of the VPC Edge routers caused a crash around 11:00 CET, with ripple effects like those we had seen the day before.
Data collection and analysis continued during the day, including customer debug sessions to better understand some specific cases.
In the afternoon, at around 15:50 CET, a human error caused another VPC Edge router to become unavailable, but this time we were able to restore it faster and with fewer ripple effects.

In summary, we have been able to significantly reduce BUM levels globally. We believe the excess traffic is related to desynchronization of our VPC Edge routers: when a router no longer knows the MAC and/or IP of a destination, it falls back to broadcasting.
These desynchronizations are probably also the cause of our maintenance-model issues (traffic not being treated equally/fairly by the redundant VPC Edge routers).
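The "unknown MAC" fallback mentioned above is standard layer-2 forwarding behavior, which a simplified sketch makes concrete (names and table layout are ours, not Scaleway's internals):

```python
def forward(fdb: dict[str, str], dst_mac: str, ports: list[str]) -> list[str]:
    """Return the list of ports a frame is sent out on.

    fdb maps each known MAC address to the single port it was learned on.
    A destination missing from the table (e.g. after a router loses
    synchronization) is flooded to every port, which is how
    desynchronization translates into large amounts of BUM traffic.
    """
    if dst_mac in fdb:
        return [fdb[dst_mac]]  # known unicast: forwarded to one port
    return list(ports)         # unknown unicast: flooded everywhere

fdb = {"aa:aa:aa:aa:aa:01": "port1"}
ports = ["port1", "port2", "port3"]
```

This is why a desynchronized forwarding table is so costly: every frame to a forgotten destination is multiplied by the number of ports instead of taking a single path.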

Our teams are working on reducing the desynchronization issues (short term) and on fixing the underlying cause once identified (mid term).
We have decided to stop all actions on the VPC Edge routers for the weekend; all investigation work will be done off the routers, using the data collected so far, to avoid creating more perturbations.

We believe the situation is stable for almost all customers, but our VPC infrastructure in the FR-PAR region cannot yet be considered fully stable, and we deeply apologize for this.

We want to assure you that all of our teams are focused on fixing these issues.
If you still encounter problems, we encourage you to share any collected data with our customer support team.
Posted Nov 28, 2025 - 22:13 CET

Update

From 10:26 CET to 11:11 CET, a piece of VPC equipment triggered errors.
You may have seen packet loss during this time.
Posted Nov 28, 2025 - 11:56 CET

Monitoring

Our teams have deployed a fix on FR-PAR-1.
The situation is now under monitoring and stable.
Posted Nov 27, 2025 - 21:37 CET

Update

Our teams have deployed a fix on FR-PAR-2 and are continuing to work on deploying one on FR-PAR-1 as well.
Posted Nov 27, 2025 - 19:40 CET

Update

We are still deploying a fix; during this deployment you may see some problems with DNS resolution inside Kapsule clusters.
Posted Nov 27, 2025 - 17:49 CET

Update

We are deploying a fix to stabilize the broadcast traffic.
Posted Nov 27, 2025 - 16:47 CET

Update

We are still experiencing high levels of broadcast traffic, causing some perturbations. A fix is being tested at the moment, and we will start deploying it within the next hour.
Posted Nov 27, 2025 - 15:25 CET

Identified

In order to stabilize broadcast traffic, we are adding more computing resources to better handle the load; some maintenance operations will occur today.
Posted Nov 27, 2025 - 11:05 CET

Monitoring

The maintenance announced in the previous message was not necessary and has been canceled.
We are monitoring the situation.
Posted Nov 26, 2025 - 19:20 CET

Investigating

From 18:00 CET, we are restarting a few VPC services on fr-par-1 to fix some edge cases.
Posted Nov 26, 2025 - 18:07 CET

Monitoring

Maintenance completed successfully.
Posted Nov 26, 2025 - 17:36 CET

Update

From 16:30 CET, we will restart some internal VPC services to fix some edge cases. A few perturbations may occur during the restart but should not last.
Posted Nov 26, 2025 - 16:18 CET

Update

We are continuing to investigate this issue.
Posted Nov 26, 2025 - 15:50 CET

Investigating

A few customers are still impacted by perturbations; the VPC team is working to fix these isolated cases.
Posted Nov 26, 2025 - 15:38 CET

Monitoring

At 13:58 CET, we noticed a performance degradation for products linked to VPC. A fix has been applied.

We are currently monitoring the situation.
Posted Nov 26, 2025 - 15:03 CET
This incident affected: Elements - Products (Instances, Private Network, Kubernetes Kapsule, Databases) and Elements - AZ (fr-par-1, fr-par-2).