{"id":807,"date":"2025-04-08T13:28:22","date_gmt":"2025-04-08T20:28:22","guid":{"rendered":"https:\/\/embedded.gusto.com\/blog\/?p=807"},"modified":"2025-04-08T13:30:49","modified_gmt":"2025-04-08T20:30:49","slug":"retry-storms-webhook-queue-latency","status":"publish","type":"post","link":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/","title":{"rendered":"Surviving Retry Storms: Investigating Webhooks Queue Latency Issues"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Background jobs are a powerful tool for processing high volumes of data while maintaining application responsiveness. Gusto uses <\/span><a href=\"https:\/\/sidekiq.org\/\"><span style=\"font-weight: 400;\">Sidekiq<\/span><\/a><span style=\"font-weight: 400;\"> as its background job queueing system across many different services, including our webhooks service.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While the ability to process data asynchronously through background jobs provides many performance benefits for an application, it can also be a source of performance issues.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We recently found ourselves in this conundrum with the Notifications queue in our Webhooks service. This post will go through what we did to investigate and define the problem, and what we implemented to solve it.<\/span><\/p>\n<h2><b><i>Webhooks service architecture<\/i><\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Gusto\u2019s Webhooks service is a Ruby on Rails application that is responsible for processing, storing, and delivering event notifications to external applications. 
At a high level, the architecture is made up of these core pieces:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Stream consumer (Kafka):<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">This is the main entry point into our Webhooks service. It listens for events produced by the Kafka broker in our main service and enqueues a job to the Sidekiq Consumer queue for each event<\/span><\/li>\n<\/ul>\n<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Consumer queue (Sidekiq):<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Jobs in this queue persist event details to our database and look for any applications subscribed to that event. If a subscription exists, a job is enqueued to the Notifications queue<\/span><\/li>\n<\/ul>\n<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Webhooks Database (PostgreSQL): Stores event and notification attempt details<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Notifications queue (Sidekiq):<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Jobs in this queue manage sending notifications to subscribed applications for a given event<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-818 lazyload\" data-src=\"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-graphic-1.png\" alt=\"webhooks notifications sequence diagram\" width=\"3963\" height=\"3138\" data-srcset=\"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-graphic-1.png 3963w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-graphic-1-379x300.png 379w, 
https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-graphic-1-796x630.png 796w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-graphic-1-150x119.png 150w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-graphic-1-768x608.png 768w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-graphic-1-1536x1216.png 1536w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-graphic-1-2048x1622.png 2048w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-graphic-1-1400x1109.png 1400w\" data-sizes=\"(max-width: 3963px) 100vw, 3963px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 3963px; --smush-placeholder-aspect-ratio: 3963\/3138;\" \/><\/p>\n<p><i><span style=\"font-weight: 400;\">Simplified sequence diagram of our Webhooks service<\/span><\/i><\/p>\n<h2><b><i>Queue latency issues<\/i><\/b><\/h2>\n<p><span style=\"font-weight: 400;\">It all started with a queue latency monitor.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">On <\/span><a href=\"https:\/\/embedded.gusto.com\"><span style=\"font-weight: 400;\">Gusto Embedded<\/span><\/a><span style=\"font-weight: 400;\">, we\u2019ve configured a series of DataDog monitors to alert us of various issues. These include <\/span><i><span style=\"font-weight: 400;\">queue latency<\/span><\/i><span style=\"font-weight: 400;\"> monitors, which alert us when the latency of our Sidekiq queues exceeds a certain threshold. 
For our Demo environment, that threshold is 60 seconds.<\/span><\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-813 lazyload\" data-src=\"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-01-31-at-12.12.07\u202fPM.png\" alt=\"webhooks queue latency alert slack notification\" width=\"671\" height=\"430\" data-srcset=\"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-01-31-at-12.12.07\u202fPM.png 671w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-01-31-at-12.12.07\u202fPM-425x272.png 425w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-01-31-at-12.12.07\u202fPM-150x96.png 150w\" data-sizes=\"(max-width: 671px) 100vw, 671px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 671px; --smush-placeholder-aspect-ratio: 671\/430;\" \/><\/p>\n<p><i><span style=\"font-weight: 400;\">Queue latency alert for the Notifications queue in the Webhooks service<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">Before diving into the root cause analysis, it\u2019s important to understand what is meant by queue latency. <\/span><a href=\"https:\/\/www.rubydoc.info\/gems\/sidekiq\/6.1.3\/Sidekiq%2FQueue:latency\"><span style=\"font-weight: 400;\">It is defined as<\/span><\/a><span style=\"font-weight: 400;\"> \u201cthe difference in seconds since the oldest job in the queue was enqueued\u201d. In other words, queue latency measures how long the oldest job still in the queue has been waiting to be picked up. 
Note that this is different from the latency of the job <\/span><i><span style=\"font-weight: 400;\">itself<\/span><\/i><span style=\"font-weight: 400;\">, which measures how long a single job takes to execute.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Queue latency indicates the queue\u2019s processing capacity and overall health, so if it spikes, the queue cannot adequately handle the volume of enqueued jobs. This could happen for a number of reasons, including an insufficient number of workers to handle the queue volume, jobs that take too long to finish, or issues with the servers managing the queues (which, for Sidekiq, are Redis servers).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Over the course of several months, the queue latency monitor for the Notifications queue in our Demo environment became noisy. In the worst instances, partners experienced long delays in receiving webhook notifications.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Thankfully, the alerts were only happening in our Demo environment, which we offer to partners as a sandbox environment for testing. While we were meeting our queue latency SLOs in production, we wanted to mirror those performance goals in Demo. If our Webhooks service is slow and unreliable during testing, it does not inspire confidence for the production experience. We decided to invest the time to handle this situation more robustly.\u00a0<\/span><\/p>\n<h2><b><i>Root cause analysis<\/i><\/b><\/h2>\n<p><span style=\"font-weight: 400;\">We set out to answer the question: <\/span><i><span style=\"font-weight: 400;\">Why was queue latency spiking?<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">First we turned to the Notifications queue logs. We noticed spikes in failures happening roughly around the time that our alerts were triggered. 
It appeared that our bottleneck was related to the jobs themselves.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">So, what was going on with these jobs? The logs showed that the majority of the failures were happening after 10 seconds. That 10-second figure, which is also the connection timeout limit we enforce on external HTTP requests, was our first clue.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We knew that the Notification jobs involved an HTTP request to the subscribed partner application. Perhaps we were sending repeated notifications to offline or \u201cdead end\u201d subscriptions.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Once we had our hypothesis, we wanted to test it out. If a handful of offline subscriptions were to blame for the spikes in notification failures, then ceasing to send them notifications should improve queue latency. Luckily, that\u2019s exactly what happened: we found the offending subscriptions, set them to an &#8220;unreachable&#8221; state, and within minutes, queue latency started to improve.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We had confirmation that these offline subscriptions were causing the issues. Next we needed to understand how common these types of subscriptions were in our Demo environment. Were there just a handful accounting for the majority of connection timeouts, or was this a more widespread issue? How many jobs was each offline subscription receiving?\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Our metrics revealed that over 90% of our Demo notification attempts resulted in failures. 
Yikes!\u00a0<\/span><\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-815 lazyload\" data-src=\"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-08-at-12.59.47\u202fPM.png\" alt=\"failure rates for webhook notifications in Demo\" width=\"1578\" height=\"494\" data-srcset=\"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-08-at-12.59.47\u202fPM.png 1578w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-08-at-12.59.47\u202fPM-425x133.png 425w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-08-at-12.59.47\u202fPM-1200x376.png 1200w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-08-at-12.59.47\u202fPM-150x47.png 150w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-08-at-12.59.47\u202fPM-768x240.png 768w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-08-at-12.59.47\u202fPM-1536x481.png 1536w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-04-08-at-12.59.47\u202fPM-1400x438.png 1400w\" data-sizes=\"(max-width: 1578px) 100vw, 1578px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1578px; --smush-placeholder-aspect-ratio: 1578\/494;\" \/><\/p>\n<p><i><span style=\"font-weight: 400;\">Datadog chart showing failure rates for webhook notifications in Demo (dark blue is a failure)<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">We also noticed a pattern with the registered subscription URLs\u2013many of them were <\/span><i><span style=\"font-weight: 400;\">ngrok <\/span><\/i><span style=\"font-weight: 400;\">URLs or links to temporary servers. 
We started developing a sense that most subscriptions in Demo were not actively being maintained: they were likely spun up for a single testing session and abandoned after testing was finished.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, because applications tied to these offline subscriptions were still actively in development, events were still being triggered, and our system was still attempting to send notifications about them. We\u2019d retry failed notifications multiple times using an exponential backoff scheme, even though they were guaranteed to fail. This cycle would repeat until each event had exhausted its retry limit. These failures were expensive because each one had to wait out our full timeout limit. With enough of these event failures happening in succession, we found ourselves in a \u201cretry storm\u201d, rapidly building bloat on our Notifications queue until it could no longer keep up with the traffic.\u00a0<\/span><\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-814 lazyload\" data-src=\"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-01-31-at-12.06.47\u202fPM.png\" alt=\"rapid spike in latency of our webhooks Notifications queue\" width=\"979\" height=\"286\" data-srcset=\"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-01-31-at-12.06.47\u202fPM.png 979w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-01-31-at-12.06.47\u202fPM-425x124.png 425w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-01-31-at-12.06.47\u202fPM-150x44.png 150w, https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/Screenshot-2025-01-31-at-12.06.47\u202fPM-768x224.png 768w\" data-sizes=\"(max-width: 979px) 100vw, 979px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" 
style=\"--smush-placeholder-width: 979px; --smush-placeholder-aspect-ratio: 979\/286;\" \/><\/p>\n<p><i><span style=\"font-weight: 400;\">A rapid spike in queue latency of our webhooks Notifications queue<\/span><\/i><\/p>\n<h2><b><i>What did we implement?<\/i><\/b><\/h2>\n<p><span style=\"font-weight: 400;\">At first, we tried horizontally scaling our Notifications queue, doubling the Sidekiq worker pods in hopes that this extra parallel processing power would mitigate the issue. While it helped, it did not stop the recurring queue latency alerts. We needed a more robust strategy.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">With such a high notification failure rate, it was clear we were repeatedly attempting deliveries that were largely bound to fail. We needed a way to detect offline subscriptions so that we would no longer attempt to deliver to them.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We began looking into an \u201cauto-deactivation\u201d strategy. What if our system could detect the beginnings of these failure bursts, flag the responsible subscriptions, and \u201cdeactivate\u201d them to prevent a retry storm? We decided to use Redis to keep track of notification failures for a given subscription over a sliding window of time (let\u2019s say, five minutes). Essentially, we implemented a notification failure rate limiter.\u00a0<\/span><\/p>\n<h2><b><i>Mitigating risk<\/i><\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Deactivating a webhook subscription can disrupt a partner\u2019s system, since they rely on these event notifications for timely updates. We wanted to make sure that our auto-deactivation mechanism was not overly aggressive, and that it was flagging truly offline subscriptions. At first, we released only the \u201crate limiter\u201d tracking piece of the system and logged to Datadog whenever a subscription exceeded our failure threshold. 
Eventually, we gathered enough data to confirm that the flagged subscriptions were truly offline servers. Feeling confident about our failure threshold parameters, we then turned on the \u201cdeactivation\u201d piece in Demo.\u00a0<\/span><\/p>\n<h2><b><i>Other considerations<\/i><\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Alongside this auto-deactivation change, we also explored other options.\u00a0<\/span><\/p>\n<p><strong><i>Sidekiq autoscaling\u00a0<\/i><\/strong><\/p>\n<p><span style=\"font-weight: 400;\">We considered autoscaling the queue\u2019s worker resources to dynamically handle spikes in job volume. In theory, this could have helped, but in reality, it had some limitations. Autoscaling can take several minutes to even detect the need for more resources. Spinning up new worker instances and initializing them also takes time. Since the problem we were solving tended to escalate rapidly, we couldn\u2019t afford to wait several minutes to trigger and build the necessary resources.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Plus, adding more workers did not address the root cause. Scaling our resources would have only taken us so far without addressing the subscriptions that were creating these bursts of failures\u2014not all traffic problems can be addressed by adding more lanes.<\/span><\/p>\n<p><strong><i>Priority-based queues<\/i><\/strong><\/p>\n<p><span style=\"font-weight: 400;\">One strategy we <\/span><i><span style=\"font-weight: 400;\">did <\/span><\/i><span style=\"font-weight: 400;\">end up implementing was creating a second, lower-priority queue for retries of failed notifications. If the main customer concern was not being able to receive timely notifications because the queue was backed up, then we wanted to make every effort to ensure that the main queue had minimal bottlenecks. 
Once a notification failed, we retried it in the lower-priority queue, which helped us maintain a \u201cfast lane\u201d for sending notifications to subscriptions that were actively online.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b><i>Conclusion<\/i><\/b><i><span style=\"font-weight: 400;\">\u00a0<\/span><\/i><\/h2>\n<p><span style=\"font-weight: 400;\">Investigating problems like queue latency and other performance issues can feel overwhelming. It\u2019s important to clearly define the problem. By tracking the right metrics and digging into the right data, you can build a more nuanced picture of what might be happening. Once that comes into view, you can arrive at a solution that makes sense for your unique system and situation.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When our metrics pointed us to the source of the bottleneck in our Notifications queue, we still needed to understand how partners were using these subscriptions. We were initially treating our Demo environment as a clone of our Production environment, but the two were being used in very different ways. After we made that distinction, the patterns our partners were following began to reveal themselves, and we were able to build a solution that benefitted both our system and our partners\u2019 experiences.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We are happy to report that since the introduction of this auto-deactivation mechanism, our Notifications queue has not exceeded the 60-second queue latency threshold, and we have not received any new queue latency alerts in Demo \ud83c\udf89. We also saw the notification delivery failure rate drop by almost 20x! We drastically reduced wasted resources in our Webhooks service, spared our engineers a recurring headache, and ensured a much better testing experience for our partners. 
<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Background jobs are a powerful tool for processing high volumes of data while maintaining application responsiveness. Gusto uses Sidekiq as&#8230;<\/p>\n","protected":false},"author":21,"featured_media":808,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[4],"tags":[],"class_list":["post-807","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-developer-perspective"],"acf":{"exclude_from_embedded_resources":false,"popularity":0,"essentiality":0},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Surviving Retry Storms: Investigating Webhooks Queue Latency Issues - Embedded Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Surviving Retry Storms: Investigating Webhooks Queue Latency Issues - Embedded Blog\" \/>\n<meta property=\"og:description\" content=\"Background jobs are a powerful tool for processing high volumes of data while maintaining application responsiveness. 
Gusto uses Sidekiq as...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/\" \/>\n<meta property=\"og:site_name\" content=\"Embedded Blog\" \/>\n<meta property=\"article:published_time\" content=\"2025-04-08T20:28:22+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-08T20:30:49+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-header-1120x630.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1120\" \/>\n\t<meta property=\"og:image:height\" content=\"630\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Raquel Silva\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Raquel Silva\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Surviving Retry Storms: Investigating Webhooks Queue Latency Issues - Embedded Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/","og_locale":"en_US","og_type":"article","og_title":"Surviving Retry Storms: Investigating Webhooks Queue Latency Issues - Embedded Blog","og_description":"Background jobs are a powerful tool for processing high volumes of data while maintaining application responsiveness. 
Gusto uses Sidekiq as...","og_url":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/","og_site_name":"Embedded Blog","article_published_time":"2025-04-08T20:28:22+00:00","article_modified_time":"2025-04-08T20:30:49+00:00","og_image":[{"width":1120,"height":630,"url":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-header-1120x630.png","type":"image\/png"}],"author":"Raquel Silva","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Raquel Silva","Est. reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/#article","isPartOf":{"@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/"},"author":{"name":"Raquel Silva","@id":"https:\/\/embedded.gusto.com\/blog\/#\/schema\/person\/621162d90f98db18ab606e08a733c55b"},"headline":"Surviving Retry Storms: Investigating Webhooks Queue Latency Issues","datePublished":"2025-04-08T20:28:22+00:00","dateModified":"2025-04-08T20:30:49+00:00","mainEntityOfPage":{"@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/"},"wordCount":1678,"image":{"@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/#primaryimage"},"thumbnailUrl":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-header.png","articleSection":["Developer Perspective"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/","url":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/","name":"Surviving Retry Storms: Investigating Webhooks Queue Latency Issues - Embedded 
Blog","isPartOf":{"@id":"https:\/\/embedded.gusto.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/#primaryimage"},"image":{"@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/#primaryimage"},"thumbnailUrl":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-header.png","datePublished":"2025-04-08T20:28:22+00:00","dateModified":"2025-04-08T20:30:49+00:00","author":{"@id":"https:\/\/embedded.gusto.com\/blog\/#\/schema\/person\/621162d90f98db18ab606e08a733c55b"},"breadcrumb":{"@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/#primaryimage","url":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-header.png","contentUrl":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-header.png","width":3840,"height":2160,"caption":"Blog header graphic \u2014 Surviving retry storms"},{"@type":"BreadcrumbList","@id":"https:\/\/embedded.gusto.com\/blog\/retry-storms-webhook-queue-latency\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/embedded.gusto.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Surviving Retry Storms: Investigating Webhooks Queue Latency Issues"}]},{"@type":"WebSite","@id":"https:\/\/embedded.gusto.com\/blog\/#website","url":"https:\/\/embedded.gusto.com\/blog\/","name":"Embedded 
Blog","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/embedded.gusto.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/embedded.gusto.com\/blog\/#\/schema\/person\/621162d90f98db18ab606e08a733c55b","name":"Raquel Silva","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2024\/06\/Raquel-Silva-113x150.jpg","url":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2024\/06\/Raquel-Silva-113x150.jpg","contentUrl":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2024\/06\/Raquel-Silva-113x150.jpg","caption":"Raquel Silva"},"description":"Raquel Silva is a software engineer on the Gusto Embedded team, where she focuses on improving the developer experience for our partner developers. Before joining Gusto in 2024, she worked at different startups, including HomeLight and Accompany. 
She is proudly born and raised in Miami, and currently local to San Francisco.","url":"https:\/\/embedded.gusto.com\/blog\/author\/raquel-silva\/"}]}},"images":{"large":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2025\/04\/EMB-BLOG-retrystorms-header-1120x630.png"},"authorDetails":{"id":21,"name":"Raquel Silva","avatar":"https:\/\/embeddedblog.wpengine.com\/wp-content\/uploads\/2024\/06\/Raquel-Silva-113x150.jpg"},"_links":{"self":[{"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/posts\/807","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/comments?post=807"}],"version-history":[{"count":0,"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/posts\/807\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/media\/808"}],"wp:attachment":[{"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/media?parent=807"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/categories?post=807"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/embedded.gusto.com\/blog\/wp-json\/wp\/v2\/tags?post=807"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}