Topsort Status - Partial Outage - Central services slow response times, 50Xs

Incident Report for Topsort

Postmortem

What happened?

The Central Services API received an increase in traffic to the catalog service endpoints. The additional traffic caused all Central Services pods to consume large amounts of memory and restart. Because each restarted pod was immediately hit by the same elevated traffic, the pods were unable to come fully back online.

As a result, all Central Services endpoints had increased latency and returned 500-type errors. The Manager was impacted due to its reliance on Central Services and also displayed 500-type errors.

What was done?

Stricter rate limiting was applied to the catalog service API to prevent heavy traffic on endpoints with high memory consumption. The number of pods handling Central Services requests was increased.
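
For readers unfamiliar with the technique, the following is a minimal sketch of how rate limiting can shield memory-heavy endpoints. It uses a token bucket, which is one common approach and is not necessarily the mechanism used here. The class name `TokenBucket`, its parameters, and the idea of charging a higher `cost` for expensive endpoints are all illustrative assumptions, not details of the actual Central Services configuration.

```python
import time


class TokenBucket:
    """Illustrative token-bucket rate limiter (hypothetical, not the production limiter)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable time source, useful for testing
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if a request with the given cost may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        # Caller would typically respond with HTTP 429 (Too Many Requests).
        return False
```

In a sketch like this, a memory-heavy endpoint could be charged a larger `cost` per request than a lightweight one, so that it exhausts the bucket sooner and is throttled first.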

What is being worked on?

We are investigating the high memory usage on the catalog service endpoints so that we can remediate the issue and remove the stricter rate limiting.

Posted Feb 16, 2023 - 20:30 UTC

Resolved

This incident has been resolved.

Posted Feb 16, 2023 - 20:12 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 16, 2023 - 19:10 UTC

Identified

The Central Services API is returning some 50Xs and has higher latencies than expected for all customers. The Management Portal has been impacted as well.
Auctions and Events are not impacted.
We have identified the issue and are working to repair it.

Posted Feb 16, 2023 - 18:48 UTC

This incident affected: Management Portal (Management Portal - US-East-2, Management Portal - EU-west-1, Management Portal - Staging - US-East-2) and Management APIs (Catalog API, Campaign API, Billing API).