[Serverless Functions/Containers] [fr-par] abnormal 503 errors when calling custom domains

Incident Report for Scaleway

Resolved

This incident has been resolved.

Posted Aug 27, 2024 - 10:18 CEST

Monitoring

A fix was deployed yesterday (2024-08-22) at 2:45 PM UTC. We monitored overnight, and the rate of 503 errors when calling custom domains dropped to 0%. As a precaution, we will keep monitoring over the weekend to confirm the fix is effective and, if so, we will close the incident next Monday (2024-08-26).

To summarize the root cause: the 503 errors occurred during periods of high activity, when our gateways were being reconfigured to handle new functions/containers or custom domains. During these reconfiguration periods, congestion could occur, and internal traffic linked to custom domains could be terminated abruptly as a result, hence the 503 errors.

We again apologize for the issues this long-running incident might have caused on your side, and we thank you for your patience.

Posted Aug 23, 2024 - 10:27 CEST

Update

We are still seeing periodic, unpredictable 503 errors for custom domain users. We can't correlate these with any existing metrics/logs we have. The most likely leads so far are:

- the networking configuration inside our infrastructure
- the behavior of our gateways under high load

We have spent the last few weeks tweaking the network configuration and are running several tests to ensure no regressions occur as a result. In the meantime, we are also trying to reproduce the issue in different environments.

Sorry for the inconvenience. Rest assured that we are doing our best.

Posted Jul 29, 2024 - 18:42 CEST

Investigating

Sadly, the configuration change we applied yesterday (https://status.scaleway.com/incidents/6vrd790qhzym) did not solve the issues. We are still investigating. Sorry about any inconvenience.

Posted Jul 11, 2024 - 09:32 CEST

Monitoring

We have changed our Cilium configuration (see the previous message and the scheduled maintenance). We are now monitoring.

Posted Jul 10, 2024 - 08:55 CEST

Update

The instabilities might be linked to our Cilium setup, so we are going to update its configuration. We have created a scheduled maintenance for tomorrow (2024-07-10): https://status.scaleway.com/incidents/6vrd790qhzym.

Posted Jul 09, 2024 - 11:31 CEST

Update

We have significantly extended our monitoring/logging to help us find out what is causing these errors. So far, we have noticed that the 503 errors are not constant throughout the day, but appear sporadically, in batches. For example, since yesterday (2024-07-04 12:00 PM), some requests in the following time ranges (in UTC) have been affected:

- 2024-07-04 15:49 => 2024-07-04 17:00
- 2024-07-04 19:10 => 2024-07-04 20:10
- 2024-07-05 01:30 => 2024-07-05 03:30

The pattern always seems to be the same: a peak of 503 errors (~2 or 4% of total requests) at the beginning of the time range, then a decrease in the 503 rate until it eventually reaches 0% (no errors).
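
For users who want to check whether their own traffic shows the same pattern, the sketch below computes the 503 rate per fixed time window from a list of request records. It is only an illustration: the function name `error_rate_per_window`, the `(timestamp, status)` record shape, and the 10-minute window are assumptions, not something our monitoring exposes.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def error_rate_per_window(
    records: Iterable[Tuple[datetime, int]],
    window_minutes: int = 10,
) -> Dict[datetime, float]:
    """Return the share of 503 responses per time window.

    records: (timestamp, HTTP status) pairs, e.g. parsed from access logs.
    window_minutes: window size; assumed to divide 60 evenly.
    """
    totals: Dict[datetime, int] = defaultdict(int)
    errors: Dict[datetime, int] = defaultdict(int)
    for ts, status in records:
        # Round the timestamp down to the start of its window.
        start_minute = ts.minute - ts.minute % window_minutes
        bucket = ts.replace(minute=start_minute, second=0, microsecond=0)
        totals[bucket] += 1
        if status == 503:
            errors[bucket] += 1
    return {bucket: errors[bucket] / totals[bucket] for bucket in sorted(totals)}
```

A peak followed by a decaying rate across consecutive windows would match the batches described above.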

With our extended monitoring, we are still trying to correlate these issues with other metrics or logs. Sorry about the inconvenience.

Posted Jul 05, 2024 - 10:36 CEST

Investigating

This morning's hotfix has not permanently solved the issue: we have been seeing 503 errors again since 12:21 PM (UTC). We are still investigating.

Posted Jun 26, 2024 - 17:38 CEST

Monitoring

We applied a hotfix at 08:52 UTC. So far, we no longer see 503 errors on our side, but we are still actively monitoring.

We still need to apply the fix permanently though. We will keep you updated.

Posted Jun 26, 2024 - 13:01 CEST

Update

Quick update to inform our users that we are still working on it, and are actively trying to find a solution. Sorry for any inconvenience.

Posted Jun 25, 2024 - 19:07 CEST

Update

The issue has been escalated to the networking team to investigate possible connectivity issues between hosts.

Posted Jun 19, 2024 - 17:30 CEST

Identified

We have identified the issue. Inside our infrastructure, some TCP connections are terminated unexpectedly, leading to 503 errors for clients making HTTP calls over these connections. This only affects custom domains because their traffic is routed differently from that of default endpoints. On the user side, retrying on a 503 should help mitigate the issue, as we have seen that TCP connections are unlikely to break for two consecutive HTTP requests.
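
As an illustration of the client-side retry mentioned above, here is a minimal Python sketch. The helper name `call_with_retry`, its parameters, and the backoff values are assumptions chosen for the example; `send` stands for whatever function performs the HTTP call in your client and returns a `(status, body)` pair.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and retry only when it returns HTTP 503."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        # Return on any non-503 answer, or once the attempts are exhausted.
        if status != 503 or attempt == max_attempts:
            return status, body
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Automatic retries are generally only safe for idempotent requests; for calls with side effects, consider whether a repeated request is acceptable before applying this kind of wrapper.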

Our monitoring has shown that this affects around 100 custom domains, for 0.19% of total requests. For most affected clients, the rate of 503 errors can go up to 2%, but we have seen it fluctuate over time.

We are still not sure about the root cause, but are working on it. Sorry for any inconvenience.

Posted Jun 18, 2024 - 09:38 CEST

Update

We are still investigating.

Our tests have confirmed that only calls to custom domains, over HTTP and HTTPS, might periodically end in 503 errors. These 503 errors have the following body: "upstream connect error or disconnect/reset before headers. reset reason: connection termination".
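
If your own function or container can also return 503 responses, the body quoted above can help tell the two cases apart. The following Python sketch is one possible check; the names `GATEWAY_RESET_BODY` and `is_gateway_reset` are illustrative assumptions, and only the message text itself comes from this report.

```python
# Body observed on the 503 errors described in this incident.
GATEWAY_RESET_BODY = (
    "upstream connect error or disconnect/reset before headers. "
    "reset reason: connection termination"
)


def is_gateway_reset(status: int, body: str) -> bool:
    """Return True when a response looks like the 503 from this incident."""
    return status == 503 and GATEWAY_RESET_BODY in body
```

A substring match is used rather than strict equality, on the assumption that a client library might add surrounding whitespace or wrap the body.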

The 503 errors are sporadic, but are likely to happen in batches as the global load on our infrastructure (number of requests/number of connections) increases. We have some hypotheses to test before communicating further.

As a reminder if you are affected: if possible, you can use the default provided endpoint (*.functions.fnc.fr-par.scw.cloud) instead of your custom domains. If not possible, retrying in case of 503 is unfortunately the only way to mitigate the issue while we are investigating.
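
For clients that can reach both addresses, one way to combine the two mitigations is to call the custom domain first and fall back to the default endpoint on a 503. The Python sketch below assumes a caller-supplied `fetch` function returning a `(status, body)` pair; `fetch_with_fallback` and both URLs in the usage note are hypothetical placeholders.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)


def fetch_with_fallback(
    fetch: Callable[[str], Response],
    custom_url: str,
    default_url: str,
) -> Response:
    """Try the custom domain first; on a 503, call the default endpoint."""
    status, body = fetch(custom_url)
    if status == 503:
        # Per the report, only custom domains are affected, so the default
        # endpoint is expected to answer normally.
        return fetch(default_url)
    return status, body
```

For example, `custom_url` could be `https://api.example.com/hello` and `default_url` an address under `functions.fnc.fr-par.scw.cloud`; substitute the values shown for your own function or container.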

Sorry for any inconvenience.

Posted Jun 13, 2024 - 17:44 CEST

Update

We are still investigating. A few 503 errors are still returned when calling custom domains. From what we have seen, HTTPS calls are more likely to end in 503 errors. Sorry for any inconvenience.

Posted Jun 12, 2024 - 14:12 CEST

Investigating

Some fr-par clients (1/10th of all clients) calling their functions/containers through a custom domain might encounter an abnormal number of HTTP 503 errors. It seems to only affect HTTP calls to the custom domains, and not calls made directly to the default endpoint, but we are still investigating.

From what we have seen so far, those clients should see a 503 error rate below 4%. However, this number can change over time (sometimes it is less than 0.1%).

If possible, clients experiencing these 503 errors can try to use the default provided endpoint instead of their custom domains (*.functions.fnc.fr-par.scw.cloud). If not possible, retrying in case of 503 is the only way to mitigate the issues while we are investigating.

Sorry for any inconvenience.

Posted Jun 07, 2024 - 19:01 CEST

This incident affected: Elements - Products (Serverless Functions).