Partial Outage - Central services slow response times, 50Xs
Incident Report for Topsort
Postmortem

What happened?

The Central Services API received an increase in traffic to the catalog service endpoints. The additional traffic drove memory consumption up on all Central Services pods, which caused every pod to restart. The pods could not come fully back online because each restart was met with the same elevated traffic.

As a result, all Central Services endpoints had increased latency and returned 500-type errors. The Manager was also impacted because of its reliance on Central Services, and it displayed 500-type errors as well.

What was done?

Stricter rate limiting was applied to the catalog service API to prevent heavy traffic on memory-intensive endpoints. The number of pods handling Central Services requests was increased.
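
For readers unfamiliar with rate limiting, the idea can be sketched with a token bucket. This is only an illustration under stated assumptions: the report does not say which mechanism or limits were actually used, and the class name, the parameters and the numbers below are hypothetical.

```python
import time


class TokenBucket:
    """Illustrative token bucket (hypothetical, not the production limiter).

    Allows bursts of up to `capacity` tokens and refills at `rate` tokens
    per second. A request is admitted only if enough tokens remain.
    """

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable clock, useful for testing
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill based on the time elapsed since the last call, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)

        # Admit the request only if the bucket can cover its cost.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a setup like this, a request that is not admitted would typically be rejected early (for example with an HTTP 429) instead of reaching the pods, and a memory-intensive endpoint could be charged a higher `cost` per request than a cheap one.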

What is being worked on?

We are investigating the high memory usage on the catalog service endpoints so that we can remediate the underlying issue and remove the stricter rate limiting.

Posted Feb 16, 2023 - 20:30 UTC

Resolved
This incident has been resolved.
Posted Feb 16, 2023 - 20:12 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 16, 2023 - 19:10 UTC
Identified
The Central Services API is returning some 50Xs and has higher latencies than expected for all customers. The Management Portal has been impacted as well.
Auctions and Events are not impacted.
We have identified the issue and are working to repair it.
Posted Feb 16, 2023 - 18:48 UTC
This incident affected: Management Portal (Management Portal - US-East-2, Management Portal - EU-west-1, Management Portal - Staging - US-East-2) and Management APIs (Catalog API, Campaign API, Billing API).