Surviving Retry Storms: Investigating Webhooks Queue Latency Issues

Background jobs are a powerful tool for processing high volumes of data while maintaining application responsiveness. Gusto uses Sidekiq as its background job queueing system across many different services, including our Webhooks service.

While the ability to process data asynchronously through background jobs provides many performance benefits for an application, it can also be a source of performance issues. 

We recently found ourselves in this conundrum with the Notifications queue in our Webhooks service. This post will go through what we did to investigate and define the problem, and what we implemented to solve it.

Webhooks service architecture

Gusto’s Webhooks service is a Ruby on Rails application that is responsible for processing, storing, and delivering event notifications to external applications. At a high level, the architecture is made up of these core pieces:

  1. Stream consumer (Kafka):
    • This is the main entry point into our Webhooks service. It listens for events from the Kafka broker in our main service and enqueues a job to the Sidekiq Consumer queue for each event
  2. Consumer queue (Sidekiq):
    • Jobs in this queue persist event details to our database and look for any applications subscribed to that event. If a subscription exists, a job is enqueued to the Notifications queue
  3. Webhooks Database (PostgreSQL): Stores event and notification attempt details
  4. Notifications queue (Sidekiq):
    • Jobs in this queue manage sending notifications to subscribed applications for a given event

Simplified sequence diagram of our Webhooks service
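To make the flow concrete, here is a minimal sketch of what a Consumer-queue job could look like. The class, model, and scope names (EventConsumerJob, Event, Subscription, NotificationJob) are illustrative, not our actual code.

```ruby
# Illustrative sketch of a Consumer-queue job: persist the event, then
# fan out one notification job per subscribed application.
class EventConsumerJob
  include Sidekiq::Job
  sidekiq_options queue: :consumer

  def perform(event_payload)
    # Persist the event details to the webhooks database.
    event = Event.create!(
      topic: event_payload["topic"],
      payload: event_payload
    )

    # Hypothetical scopes: find applications subscribed to this event type.
    Subscription.active.subscribed_to(event.topic).find_each do |subscription|
      NotificationJob.perform_async(event.id, subscription.id)
    end
  end
end
```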

Queue latency issues

It all started with a queue latency monitor. 

On Gusto Embedded, we’ve configured a series of Datadog monitors to alert us to various issues. These include queue latency monitors, which alert us when the latency of our Sidekiq queues exceeds a certain threshold. For our Demo environment, that threshold is 60 seconds.

Queue latency alert for the Notifications queue in the Webhooks service

Before diving into the root cause analysis, it’s important to understand what is meant by queue latency. It is defined as “the difference in seconds since the oldest job in the queue was enqueued”. In other words, queue latency measures how long a job waits in the queue before it is processed. Note that this is different from the latency of the job itself, which measures how long a single job takes to execute.
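Sidekiq exposes this number directly through its API; for example (the queue name here is assumed):

```ruby
require "sidekiq/api"

# Seconds since the oldest job in the "notifications" queue was enqueued.
# A healthy queue hovers near zero; a backed-up queue climbs quickly.
Sidekiq::Queue.new("notifications").latency
```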

Queue latency indicates the queue’s processing capacity and overall health: if it spikes, the queue cannot adequately handle the volume of enqueued jobs. This can happen for a number of reasons, including an insufficient number of workers for the queue volume, jobs that take too long to finish, or issues with the servers managing the queues (which, for Sidekiq, are Redis servers).

Over the course of several months, the queue latency monitor for the Notifications queue in our Demo environment became noisy. During the worst instances, this resulted in large delays in partners receiving webhook notifications. 

Thankfully, the alerts were only happening in our Demo environment, which we offer to partners as a sandbox for testing. While we were meeting our queue latency SLOs in production, we wanted to mirror those performance goals in Demo. If our Webhooks service is slow and unreliable during testing, it does not inspire confidence in the production experience. We decided to invest the time to handle this situation more robustly.

Root cause analysis

We set out to answer the question: Why was queue latency spiking?

First we turned to the Notifications queue logs. We noticed spikes in failures happening roughly around the time that our alerts were triggered. It appeared that our bottleneck was related to the jobs themselves.

So, what was going on with these jobs? The logs showed that the majority of the failures were happening after 10 seconds. That 10-second number, which is also the connection timeout limit we enforce on external HTTP requests, was our first clue.

We knew that the Notification jobs involved an HTTP request to the subscribed partner application. Perhaps we were sending repeated notifications to offline or “dead end” subscriptions. 
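For context, the delivery step inside each notification job looks roughly like the sketch below. We use Ruby’s standard Net::HTTP here for illustration; the actual HTTP client and configuration in our service differ, but the important part is the enforced timeout.

```ruby
require "net/http"
require "uri"
require "json"

# Hypothetical delivery helper: POST the event payload to the
# subscription's registered URL with explicit timeouts, so a dead
# endpoint fails after ~10 seconds instead of hanging indefinitely.
def deliver_notification(subscription_url, payload)
  uri = URI.parse(subscription_url)

  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: 10,  # seconds allowed to establish the connection
                  read_timeout: 10) do |http|
    http.post(uri.request_uri, payload.to_json,
              "Content-Type" => "application/json")
  end
rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNREFUSED
  # Offline or unreachable endpoints surface here after the timeout,
  # which is exactly what we were seeing in the job failure logs.
  raise
end
```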

Once we had our hypothesis, we wanted to test it out. If a handful of offline subscriptions were to blame for the spikes in notification failures, then ceasing to send them notifications should improve queue latency. Luckily, that’s exactly what happened: we found the offending subscriptions, set them to an “unreachable” state, and within minutes, queue latency started to improve.

We had confirmation that these offline subscriptions were causing the issues. Next we needed to understand how common these types of subscriptions were in our Demo environment. Were there just a handful accounting for the majority of connection timeouts, or was this a more widespread issue? How many jobs was each offline subscription receiving? 

Our metrics revealed that over 90% of our Demo notification attempts resulted in failures. Yikes! 

Datadog chart showing failure rates for webhook notifications in Demo (dark blue is a failure)

We also noticed a pattern with the registered subscription URLs: many of them were ngrok URLs or links to temporary servers. We started developing a sense that most subscriptions in Demo were not actively being maintained; they were likely spun up for a single testing session and abandoned after testing was finished.

However, because applications tied to these offline subscriptions were still actively in development, events were still being triggered, and our system was still attempting to send notifications about them. We’d retry failed notifications multiple times using an exponential backoff scheme, even though they were guaranteed to fail. This cycle would repeat until each event had exhausted its retry limits. These failures were expensive because each one had to hit our 10-second timeout before failing. With enough of these event failures happening in succession, we found ourselves in a retry storm, rapidly building bloat on our Notifications queue until it could no longer keep up with the traffic.
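The retry behavior itself comes from Sidekiq. A notification job roughly along these lines (class names, retry count, and the backoff formula here are illustrative; Sidekiq’s default backoff has a similar polynomial shape) will keep re-entering the queue until its retries are exhausted:

```ruby
# Illustrative notification job: each delivery attempt to an offline
# endpoint burns ~10 seconds before timing out, and the raised error
# triggers Sidekiq's retry mechanism with exponential backoff.
class NotificationJob
  include Sidekiq::Job
  sidekiq_options queue: :notifications, retry: 10

  # Optional override of the backoff schedule, shown here to make the
  # exponential growth explicit (delay in seconds per retry count).
  sidekiq_retry_in { |count| (count**4) + 30 }

  def perform(event_id, subscription_id)
    subscription = Subscription.find(subscription_id)
    event        = Event.find(event_id)

    # Raises on connection timeouts (see the delivery sketch above).
    deliver_notification(subscription.url, event.payload)
  end
end
```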

A rapid spike in queue latency of our webhooks Notifications queue

What did we implement?

At first, we tried horizontally scaling our Notifications queue, doubling the Sidekiq worker pods in hopes that this extra parallel processing power would mitigate the issue. While it helped, it did not stop the recurring queue latency alerts. We needed a more robust strategy.

With such a high notification failure rate, it was clear we were over-delivering notifications that were largely bound to fail. We needed a way to detect offline subscriptions so that we would no longer attempt to deliver to them.

We began looking into an “auto-deactivation” strategy. What if our system could detect the beginnings of these failure bursts, flag the responsible subscriptions, and “deactivate” them to prevent a retry storm? We decided to use Redis to keep track of notification failures for a given subscription over a sliding window of time (let’s say, five minutes). Essentially, we implemented a notification failure rate limiter. 
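Conceptually, the tracker looks something like the sketch below: each failure is recorded in a per-subscription Redis sorted set keyed by timestamp, entries older than the window are trimmed away, and crossing the threshold flags the subscription for deactivation. The key scheme, threshold, and window length here are illustrative, not our production values.

```ruby
require "redis"
require "securerandom"

# Sketch of a notification failure "rate limiter" backed by Redis.
class SubscriptionFailureTracker
  WINDOW_SECONDS    = 5 * 60  # five-minute sliding window
  FAILURE_THRESHOLD = 50      # hypothetical cutoff

  def initialize(redis: Redis.new)
    @redis = redis
  end

  # Records a failure and returns true if the subscription has crossed
  # the failure threshold within the sliding window.
  def record_failure(subscription_id)
    key = "webhooks:failures:#{subscription_id}"
    now = Time.now.to_f

    @redis.multi do |tx|
      tx.zadd(key, now, "#{now}:#{SecureRandom.uuid}")   # record this failure
      tx.zremrangebyscore(key, 0, now - WINDOW_SECONDS)  # drop entries outside the window
      tx.expire(key, WINDOW_SECONDS)                     # let idle keys clean themselves up
    end

    @redis.zcard(key) >= FAILURE_THRESHOLD
  end
end
```

A caller in the notification failure path would then mark the subscription unreachable (or, initially, just log) whenever `record_failure` returns true.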

Mitigating risk

Deactivating a webhook subscription can disrupt a partner’s system, since they rely on these event notifications for timely updates. We wanted to make sure that our auto-deactivation mechanism was not overly aggressive, and that it was flagging truly offline subscriptions. At first, we released only the “rate limiter” tracking piece of the system and logged to Datadog whenever a subscription exceeded our failure threshold. Eventually, we gathered enough data to confirm that the flagged subscriptions were truly offline servers. Feeling confident about our failure threshold parameters, we then turned on the “deactivation” piece in Demo.
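Wired together, the two phases might look something like this (building on the tracker sketch above; the environment-variable switch and model method are hypothetical stand-ins for however you gate rollouts):

```ruby
# Hypothetical wiring inside the notification failure path: the tracking
# piece always runs, but the deactivation step is gated behind a switch
# that starts out disabled.
def handle_notification_failure(subscription)
  tracker = SubscriptionFailureTracker.new

  return unless tracker.record_failure(subscription.id)

  # Phase 1: surface the signal so it can be studied in Datadog.
  Rails.logger.warn(
    "webhooks.subscription_exceeded_failure_threshold subscription_id=#{subscription.id}"
  )

  # Phase 2: deactivate only once the flagged subscriptions were confirmed
  # to be genuinely offline.
  subscription.mark_unreachable! if ENV["WEBHOOKS_AUTO_DEACTIVATION"] == "enabled"
end
```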

Other considerations

Alongside this auto-deactivation change, we also explored other options. 

Sidekiq autoscaling 

We considered autoscaling the queue’s worker resources to dynamically handle spikes in job volume. In theory, this could have helped, but in reality, it had some limitations. Autoscaling can take several minutes to even detect the need for more resources. Spinning up new worker instances and initializing them also takes time. Since the problem we were solving tended to escalate rapidly, we couldn’t afford to wait several minutes to trigger and build the necessary resources.

Plus, adding more workers did not address the root cause. Scaling our resources would have only taken us so far without addressing the subscriptions that were creating these bursts of failures—not all traffic problems can be addressed by adding more lanes.

Priority-based queues

One strategy we did end up implementing was creating a second, lower-priority queue for the failing notification retry jobs. If the main customer concern was not being able to receive timely notifications because the queue was backed up, then we wanted to make every effort to ensure that the main queue had minimal bottlenecks. Once a notification failed, we retried it in the lower-priority queue, which helped us maintain a “fast lane” for sending notifications to subscriptions that were actively online.
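As a sketch, the split could look like the variation below on the earlier notification job: first attempts run on the main queue with in-place retries turned off, and any failure is re-enqueued onto a lower-priority retry queue (Sidekiq’s process configuration can then weight the main queue so it is polled more often). Class names, queue names, and retry counts are illustrative.

```ruby
# First-attempt deliveries: no in-place retries, so failures can't back
# up the "fast lane".
class NotificationJob
  include Sidekiq::Job
  sidekiq_options queue: :notifications, retry: 0

  def perform(event_id, subscription_id)
    subscription = Subscription.find(subscription_id)
    event        = Event.find(event_id)

    deliver_notification(subscription.url, event.payload)
  rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNREFUSED
    # Re-enqueue the failed delivery onto the low-priority retry queue.
    NotificationRetryJob.perform_async(event_id, subscription_id)
  end
end

# Retries: same delivery logic on a lower-priority queue, with Sidekiq's
# backoff-driven retries handling subsequent failures.
class NotificationRetryJob
  include Sidekiq::Job
  sidekiq_options queue: :notifications_retry, retry: 10

  def perform(event_id, subscription_id)
    subscription = Subscription.find(subscription_id)
    event        = Event.find(event_id)

    deliver_notification(subscription.url, event.payload)
  end
end
```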

 

Conclusion 

Investigating problems like queue latency and other performance issues can feel overwhelming. It’s important to clearly define the problem. By tracking the right metrics and digging into the right data, you can build a more nuanced picture of what might be happening. Once that comes into view, you can arrive at a solution that makes sense for your unique system and situation.

When our metrics pointed us to the source of the bottleneck in our Notifications queue, we still needed to understand how partners were using these subscriptions. We were initially treating our Demo environment as a clone of our Production environment, but the two were being used in very different ways. After we made that distinction, the patterns our partners were following began to reveal themselves, and we were able to build a solution that benefitted both our system and our partners’ experiences.

We are happy to report that since the introduction of this auto-deactivation mechanism, our Notifications queue has not exceeded the 60-second queue latency threshold, and we have not received any new queue latency alerts in Demo 🎉. We also reduced the notification delivery failure rate by almost 20x! We drastically cut wasted resources in our Webhooks service, spared our engineers a recurring headache, and ensured a much better testing experience for our partners.

Raquel Silva is a software engineer on the Gusto Embedded team, where she focuses on improving the developer experience for our partner developers. Before joining Gusto in 2024, she worked at different startups, including HomeLight and Accompany. She is proudly born and raised in Miami, and currently local to San Francisco.