[DC5] - Servers down in Room 1

Incident Report for Scaleway

Resolved

The issue is currently resolved. Any further updates will be posted here.

Posted Oct 24, 2023 - 18:17 CEST

Update

The issue has stabilized, but investigations into the previous cases are still in progress, with the help of the server manufacturer, to understand the root cause.

Posted Oct 23, 2023 - 11:56 CEST

Update

This is a very quick follow-up concerning our favourite bizarre production incident.

- A small number of previously unaffected servers continue to exhibit errant behaviour; we continue to replace them as necessary.
- We have sent hardware samples to our supplier for analysis, along with other information such as environmental data, as well as the findings from our own investigations.
- We will communicate the results of this investigation when there’s more to report.

Posted Jun 28, 2023 - 12:36 CEST

Update

Most impacted servers have been replaced.
Our teams are working on the last few cases. We have sent collected data to our supplier for analysis.
We are working on reproducing various conditions related to the issue in order to pinpoint the root cause.

We will provide an update no later than Monday 5:00PM UTC+2.

Posted Jun 21, 2023 - 10:14 CEST

Update

This is a follow-up regarding the continuing incident affecting a small number of servers at the DC5 data center.

- While this is still an open incident, only a handful of newly-misbehaving servers have been detected since 2023-06-15 (Thursday) at 17:00 UTC+2.

- Remediation actions are ongoing, and only a small fraction of the total Quanta X10E-9N machines remain down. The newly racked replacement blades have not exhibited the errant behavior.

- We do not yet know the root cause of the failures; however, we continue to work with our hardware supplier and are developing a few hypotheses (which will need to be verified).

Posted Jun 16, 2023 - 18:50 CEST

Update

Here is an update on the situation and some information on what happened:

- On the morning (UTC+2) of 2023-06-12 (Monday) an incident was identified. Over the preceding weekend, a series of Quanta X10E-9N machines, representing roughly 0.5% of the total machines in DC5, began to exhibit errant behavior resulting in a complete crash of the machines.
The rising number of failures triggered our incident process on Monday.

- The errant behavior is specific to this type of hardware—and only in DC5. Identical machines in our other data centers are not exhibiting this behavior, nor are any other types of machines.

- Speaking frankly, the behavior is remarkably inconsistent. We have not yet identified a pattern related to physical position, electrical connection, uptime, climate conditions, firmware version, or serial numbers (for example). That said, we were and are engaged in remediation actions, consisting largely of swapping the hard drives from the misbehaving machines into new hardware.

- We are currently working with the hardware supplier to understand the root cause of this issue and expect to have an explanation and action plan for prevention on 2023-06-16 (Friday).

Posted Jun 14, 2023 - 10:27 CEST

Update

We are still investigating the anomaly.

If you notice an anomaly on your server, please contact support so that we can take action on it.

Posted Jun 13, 2023 - 10:15 CEST

Update

Some servers are still down. Please contact our support again if your server is still unavailable.

Posted Jun 12, 2023 - 18:24 CEST

Update

Pro-6-x and Pro-5-x servers in DC5 Room 1 are experiencing issues: unavailability, IPMI down, unexpected restarts.
We are currently investigating this issue.

Posted Jun 12, 2023 - 09:33 CEST

Update

We are continuing to investigate this issue.

Posted Jun 12, 2023 - 09:33 CEST

Investigating

Pro-6-x and Pro-5-x servers in DC5 Room 1 have IPMI down.
We are currently investigating this issue.

Posted Jun 12, 2023 - 07:53 CEST

This incident affected: Dedibox - Datacenters (DC5).