[nl-ams-1] degraded performance due to abnormal temperature

Incident Report for Scaleway

Postmortem

Summary

Due to an issue with the cooling systems at one of our datacenter providers, temperature in a specific room exceeded safe operating thresholds. As a safety measure, several servers automatically shut down to prevent hardware damage and data loss.

The incident initially disrupted our Block Storage service, with cascading effects on Instance hypervisors. Downstream products such as Kapsule (our managed Kubernetes product), Managed Databases and Public Gateways were also impacted.

Timeline and impact

All reference times are in UTC. For readers in France, the Netherlands, and Poland, local time (CEST) is UTC+2.

Tuesday, July 1, 2025

13:33 UTC (15:33 UTC+2)

Internal monitoring detected a temperature rise in one datacenter room in Amsterdam.

14:00 UTC (16:00 UTC+2)

The datacenter provider confirmed the outage and reported a cooling failure in the affected room.

14:52 UTC (16:52 UTC+2)

Despite preemptive efforts on our end to reduce the power demand in the affected room, the temperature rose to the point where our Block Storage services became unavailable, leading to issues with Instances, Kapsule, Managed Databases, Load Balancers, and Public Gateways.

15:00 UTC (17:00 UTC+2)

As a precaution, we began shutting down servers to protect customer data and our infrastructure.

16:43 UTC (18:43 UTC+2)

Cooling systems were restored and temperatures began to decrease.

18:46 UTC (20:46 UTC+2)

Once temperatures returned to safe operating levels and the cooling systems were confirmed stable, we began safely restarting affected systems.

20:34 UTC (22:34 UTC+2)

Block Storage services were fully back to normal.

21:49 UTC (23:49 UTC+2)

All hypervisors were fully back to normal.

23:56 UTC (01:56 UTC+2, Jul 2)

Most of the impacted services were back to normal.

Wednesday, July 2, 2025

00:30 UTC (02:30 UTC+2)

Kapsule nodes (Instances) were fully back to normal.

00:43 UTC (02:43 UTC+2)

Public Gateways were fully back to normal.

01:23 UTC (03:23 UTC+2)

Managed Databases were fully back to normal.

Mitigation and Remediation

We are actively working with both the datacenter provider and the cooling system vendor to assess possible upgrades that would improve resilience against extreme weather conditions.

Posted Jul 03, 2025 - 18:12 CEST

Resolved

This incident has been resolved.
Posted Jul 02, 2025 - 02:42 CEST

Update

The situation is back to normal and our teams are keeping an eye on it. We are closing this incident.
Posted Jul 02, 2025 - 02:42 CEST

Update

Most services are back online. We are working to clear the few remaining side effects.
Posted Jul 02, 2025 - 01:57 CEST

Update

All servers are now up and running. Our teams are working on restoring the services that still have issues.
Posted Jul 02, 2025 - 00:07 CEST

Update

The backend is now fully up and stable, and customer services will begin to come back progressively. We are still monitoring temperatures and ramping up the cooling load with caution.
Posted Jul 01, 2025 - 22:41 CEST

Update

We are adding more servers back into production, mostly backend for now. Once we are confident all services are ready and the cooling remains stable, customer services will restart progressively. We thank you for your patience.
Posted Jul 01, 2025 - 21:49 CEST

Update

We are seeing improvements in temperatures. We will begin to power up a few internal servers and check how the cooling holds under load.
Posted Jul 01, 2025 - 20:46 CEST

Monitoring

Our datacenter provider has informed us that the situation has stabilized; they expect temperatures to improve slowly in the coming hours.
We are checking our own sensors for now. Once we are confident the cooling is sufficient, we will begin to power on stopped services.
Posted Jul 01, 2025 - 19:22 CEST

Investigating

Our datacenter provider in nl-ams-1 is unable to provide an update regarding the cooling issue in one of our rooms, so we cannot yet share an ETA for restoring services on our side. We will provide updates as soon as we have them.
Posted Jul 01, 2025 - 18:35 CEST

Monitoring

We continue to monitor the situation closely for any further issues.
Posted Jul 01, 2025 - 18:21 CEST

Identified

To prevent further issues, we will preemptively shut down several services in the datacenter. This may result in temporary downtime for your service. We are doing our best to recover the situation as quickly as possible.
Posted Jul 01, 2025 - 17:42 CEST

Investigating

Due to elevated temperatures in the Netherlands, some of our services are currently experiencing degraded performance. Our teams are actively implementing mitigation measures and closely monitoring the situation to prevent further impact.
Posted Jul 01, 2025 - 16:30 CEST
This incident affected: Elements - Products (Instances, Elastic Metal, Object Storage, Block Storage, Kubernetes Kapsule, Databases) and Elements - AZ (nl-ams-1).