Action Queue Hiccup - July 2021

What happened?

At 1:55 PM (ET) on Thursday, July 15th, we were notified internally of a potential problem with outgoing messages.

We take these kinds of concerns seriously, and delivery problems are also the most common issue our dealers report. Since a large percentage of our notifications go out via email, and email has a host of its own deliverability problems, we spend a fair amount of time making sure this system works, works well, and works reliably, and above all that we get alerted immediately whenever it isn't doing what is expected.

What existed at the time?

We run our queue through two independent instances, with duplication prevention in place; monitoring for queue corruption, queue halts, and queue delays (or simply taking longer than normal to process events); and the standard load balancing, system redundancy, and backups that would normally be expected. When this event occurred, not only did both of our redundant queue processors stall at the same time, but the mechanisms that monitor them and generate and send alerts stalled along with them, leaving us entirely in the dark about what was happening until someone noticed it the traditional way and informed us.
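As a purely illustrative sketch (hypothetical names, not our production code), the duplication-prevention idea looks roughly like this: two independent processors walk the same backlog, and a shared record of claimed event IDs keeps any single event from being acted on twice.

    import threading

    # Hypothetical sketch of two redundant queue processors sharing one
    # backlog. A lock-protected set of claimed event IDs prevents the
    # same event from being handled by both instances.
    backlog = [
        {"id": 1, "type": "lead-email"},
        {"id": 2, "type": "sms-invite"},
        {"id": 3, "type": "api-post"},
    ]
    claimed = set()               # event IDs already taken by either instance
    claim_lock = threading.Lock()

    def process(instance: str) -> None:
        for event in backlog:
            with claim_lock:
                if event["id"] in claimed:
                    continue      # the other instance got there first
                claimed.add(event["id"])
            # Real work (send the email/SMS, post to the API) happens here.
            print(f'{instance} handled event {event["id"]} ({event["type"]})')

    threads = [threading.Thread(target=process, args=(name,))
               for name in ("queue-a", "queue-b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()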

Outside of regular maintenance, this was the very first time the queue had stalled since we rolled it out in 2019, and so it became our first real-world test, something usually (and ideally) left to the imagination.

What worked well?

Aside from stopping, the queue mechanism itself actually worked very well. It collected and logged every actionable event (not just notifications, but logins, field changes, email and cell phone validation tests, and more), and it stowed these events in its repository for future processing exactly as it was designed to do. Nothing was lost other than some time.
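To illustrate the "nothing was lost" point, here is a simplified, hypothetical sketch of the general pattern (not our actual schema or code): every actionable event is written to durable storage first and only marked processed after it has been handled, so a stalled processor simply leaves rows waiting rather than dropping them.

    import json
    import sqlite3
    import time

    # Hypothetical illustration: events are stowed durably on arrival and
    # marked processed only after they are actually handled.
    db = sqlite3.connect("action_queue.db")
    db.execute("""CREATE TABLE IF NOT EXISTS events (
                      id INTEGER PRIMARY KEY AUTOINCREMENT,
                      kind TEXT NOT NULL,        -- e.g. 'lead-email', 'login'
                      payload TEXT NOT NULL,     -- JSON body of the event
                      created_at REAL NOT NULL,
                      processed_at REAL          -- NULL until handled
                  )""")

    def record_event(kind: str, payload: dict) -> None:
        """Stow an event; this succeeds even if downstream processing is stalled."""
        db.execute("INSERT INTO events (kind, payload, created_at) VALUES (?, ?, ?)",
                   (kind, json.dumps(payload), time.time()))
        db.commit()

    def drain_backlog() -> int:
        """Process everything still waiting; returns how many items were handled."""
        rows = db.execute("SELECT id, kind, payload FROM events "
                          "WHERE processed_at IS NULL").fetchall()
        for row_id, kind, payload in rows:
            # send_notification(kind, json.loads(payload)) would go here.
            db.execute("UPDATE events SET processed_at = ? WHERE id = ?",
                       (time.time(), row_id))
        db.commit()
        return len(rows)

    record_event("lead-email", {"lead_id": 42, "to": "dealer@example.com"})
    print(f"reprocessed {drain_backlog()} queued item(s)")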

What worked not so well?

Basically, the queue halted, didn't recover on its own from having stopped, and the mechanisms in place to WAKE PEOPLE UP never took any action at all.

Where are we right now?

ALL action items completed processing in approximately 53 minutes once the queue was restarted. You will have noticed leads and notices arriving in batches shortly after the 1:51 PM (ET) queue restart.

What have we done since?

  1. We did a full investigation into the causes of both the stall to the queue and the failure of our monitoring intelligence.
  2. We've developed a plan for refactoring the parts of the code that monitor our queue daemons and associated services (email, SMS, APIs, load balancers, email and phone verification, certificate validation, etc.). A simplified illustration of this kind of independent monitoring follows this list.
  3. We've scheduled that code to be rolled out to our testing servers by Friday, July 23rd, and to our production servers on Sunday, August 1st, 2021. (We do these types of updates during our lowest-traffic periods.)
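As a rough illustration only (not the code being rolled out, and all names here are hypothetical), one common way to keep monitoring independent of the thing it monitors is a heartbeat watchdog: the queue daemon records a timestamp on every cycle, and a separate process alerts if that timestamp goes stale, so a stalled queue cannot also silence its own alerting.

    import time
    from pathlib import Path

    # Illustrative heartbeat watchdog (hypothetical names and paths). The
    # queue daemon calls beat() each cycle; an independent process runs
    # watchdog_check() on a schedule and pages someone if the heartbeat
    # goes stale.
    HEARTBEAT = Path("/tmp/action_queue.heartbeat")
    MAX_SILENCE_SECONDS = 300           # alert after 5 minutes of silence

    def beat() -> None:
        """Called by the queue daemon after each successful processing cycle."""
        HEARTBEAT.write_text(str(time.time()))

    def page_on_call(message: str) -> None:
        # Stand-in for real paging (SMS/phone) sent over a channel that
        # does not depend on the queue or its host.
        print(f"ALERT: {message}")

    def watchdog_check() -> None:
        """Run from a separate process or host, e.g. via cron."""
        try:
            last = float(HEARTBEAT.read_text())
        except (FileNotFoundError, ValueError):
            last = 0.0
        if time.time() - last > MAX_SILENCE_SECONDS:
            page_on_call("action queue heartbeat is stale")

    if __name__ == "__main__":
        beat()              # simulate one daemon cycle
        watchdog_check()    # heartbeat is fresh, so no alert fires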

Thank you for your continued faith in our service, your trust and your business.

Chris Purser, CTO/CXO
& my entire team.



If you believe you have other issues related to this outage, please email the details to support@digitalpower.solutions.


Overall Event Stats



Queue Stall Start:                     7/15/21 - 12:33 am (ET)
Queue Restarted:                       7/16/21 - 1:51 pm (ET)
Queue Emptied (including new items):   7/16/21 - 2:44 pm (ET)
Duration of Anomaly:                   38 hours, 13 minutes
Percentage able to rerun:              100%
Percentage unable to rerun:            0%
Percentage rerun successfully:         100%
Percentage rerun unsuccessfully:       0%
Loss of data:                          0%


Stats of items delayed

Items                                              Held in Queue   Items resent   Loss %
Leads Converted - Autoresponder to Lead - EMAIL    534             534            0%
Lead Converted - Notice to Users - EMAIL           1633            1633           0%
Lead Converted - ADF emails sent - EMAIL           440             440            0%
Lead Converted - API Posts submitted - API         144             144            0%
Trade-In Text invitations - SMS                    506             506            0%
Trade-In Text confirmations - SMS                  217             217            0%
Users Updated                                      19              19             0%
Fields Updated                                     31              31             0%
Phone Numbers Validated                            506             506            0%
Email Addresses Validated                          534             534            0%