What happened?
The Central Services API received a surge of traffic to the catalog service endpoints. This surge caused all Central Services pods to consume excessive memory and restart. With each restart, the pods were hit by the same volume of increased traffic before they could fully come back online, preventing recovery.
As a result, all Central Services endpoints experienced increased latency and returned 500-type errors. The Manager was also impacted due to its reliance on Central Services and likewise displayed 500-type errors.
What was done?
Stricter rate limiting was applied to the catalog service API to reduce traffic to the memory-intensive endpoints, and the number of pods handling Central Services requests was increased.
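The kind of rate limiting described above is commonly implemented as a token bucket: each client gets a budget that refills at a fixed rate, and requests beyond it are rejected before they reach the expensive endpoint. The sketch below is a generic illustration only; the actual limiter, thresholds, and enforcement point used for the catalog service are not specified in this report.

```python
import time

class TokenBucket:
    """Token-bucket limiter: sustains `rate` requests/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over budget; caller would respond with HTTP 429

# Example: 5 req/sec sustained, bursts of up to 10.
bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(15)]  # 15 back-to-back requests
```

In this burst of 15 immediate requests, roughly the first 10 (the burst capacity) are allowed and the remainder are rejected, which is what shields a memory-heavy endpoint from the kind of traffic spike described here.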
What is being worked on?
Investigation into the high memory usage of the catalog service endpoints is underway so that the root cause can be remediated and the stricter rate limiting removed.