[NETWORK] VPC DNS high latency on fr-par-2
Incident Report for Scaleway
Postmortem

Report relating VPC services incident on FR-PAR-2

Following a core network maintenance operation, a bug within the BGP daemon used by our VPC services was triggered. This led to a connectivity loss and instabilities on Kapsule clusters in FR-PAR-2 and in between FR-PAR-2 and other AZ.

Timeline (UTC)

27/03/2024

07h02 - Beginning of core network maintenance 07h37 - VPC incident 09h02 - VPC services are back on track with degraded connectivity 10h05 - Root cause is identified 10h11 - Root cause is isolated, but connectivity remains degraded 10h22 - Root cause is repaired, VPC services are now functional 10h27 - Connectivity is restored 16h00 - Core network maintenance complete VPC services are at this point functional without redundancy

28/03/2024

10h44 - Redundancy is restored after patch update 11h04 - Redundancy is lost during the patch update on the second half of the infrastructure 11h15 - 11h20 - Attempt at restoring redundancy 13h26 - Restoring redundancy on the entire infrastructure with patch application. VPC services are at this point functional with redundancy.

We will provide a more in depth analysis in a few days after a more thorough analysis of the collected data.

Posted Mar 29, 2024 - 12:24 CET

Resolved
The incident has been resolved. A public summary will be available soon.
Posted Mar 29, 2024 - 09:26 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 27, 2024 - 11:37 CET
Investigating
We are currently investigating this issue.
Posted Mar 27, 2024 - 11:09 CET
Update
We are continuing to monitor for any further issues.
Posted Mar 27, 2024 - 10:54 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 27, 2024 - 10:41 CET
Update
We are continuing to investigate this issue.
Posted Mar 27, 2024 - 10:00 CET
Investigating
DNS resolutions for services within a VPC are currently very slow and might end up in a timeout.
It might prevent especially k8s clusters to boot properly when using VPC DNS

DHCP was impacted as well for a shorter period of time but with less latencies than DNS.
Posted Mar 27, 2024 - 10:00 CET
This incident affected: Elements - AZ (fr-par-2) and Elements - Products (Private Network, Observability).