Action Queue Hiccup - July 2021

What happened?

At 1:55 PM (ET) on Thursday, July 15th, we were notified internally of a potential problem with outgoing messages.

We take these kinds of concerns seriously, and delivery problems are also the most common issue our dealers report. Since a large percentage of our notifications go out via email, and email has a host of its own deliverability problems, we spend a fair amount of time making sure this system works, works well, and works reliably, and above all that we get alerted immediately whenever it isn't doing what is expected.

What existed at the time?

We run our queue through two independent instances, with duplication prevention in place; monitoring for queue corruption, queue halts, and queue delays (or simply taking longer than normal to process events); and the standard load balancing, system redundancy, and backups that would normally be expected. When this event occurred, not only did both of our redundant queue processors stall at the same time, but the mechanisms that monitor them and generate and send alerts stalled along with them, leaving us entirely in the dark about what was happening until someone noticed it the traditional way and informed us.
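As a purely illustrative sketch (hypothetical names, not our production code), the duplication-prevention idea looks roughly like this: two independent processors walk the same backlog, and a shared record of claimed event IDs keeps any single event from being acted on twice.

    import threading

    # Hypothetical sketch of two redundant queue processors sharing one
    # backlog. A lock-protected set of claimed event IDs prevents the
    # same event from being handled by both instances.
    backlog = [
        {"id": 1, "type": "lead-email"},
        {"id": 2, "type": "sms-invite"},
        {"id": 3, "type": "api-post"},
    ]
    claimed = set()               # event IDs already taken by either instance
    claim_lock = threading.Lock()

    def process(instance: str) -> None:
        for event in backlog:
            with claim_lock:
                if event["id"] in claimed:
                    continue      # the other instance got there first
                claimed.add(event["id"])
            # Real work (send the email/SMS, post to the API) happens here.
            print(f'{instance} handled event {event["id"]} ({event["type"]})')

    threads = [threading.Thread(target=process, args=(name,))
               for name in ("queue-a", "queue-b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()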

Outside of regular maintenance, this was the very first time the queue had stalled since we rolled it out in 2019, and so it became our first real-world test, something usually (and ideally) left to the imagination.

What worked well?

Aside from stopping, the queue mechanism itself actually worked very well. It collected and logged every actionable event (not just notifications, but logins, field changes, email and cell phone validation tests, and more), and it stowed these events in its repository for future processing exactly as it was designed to do. Nothing was lost other than some time.
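To illustrate the "nothing was lost" point, here is a simplified, hypothetical sketch of the general pattern (not our actual schema or code): every actionable event is written to durable storage first and only marked processed after it has been handled, so a stalled processor simply leaves rows waiting rather than dropping them.

    import json
    import sqlite3
    import time

    # Hypothetical illustration: events are stowed durably on arrival and
    # marked processed only after they are actually handled.
    db = sqlite3.connect("action_queue.db")
    db.execute("""CREATE TABLE IF NOT EXISTS events (
                      id INTEGER PRIMARY KEY AUTOINCREMENT,
                      kind TEXT NOT NULL,        -- e.g. 'lead-email', 'login'
                      payload TEXT NOT NULL,     -- JSON body of the event
                      created_at REAL NOT NULL,
                      processed_at REAL          -- NULL until handled
                  )""")

    def record_event(kind: str, payload: dict) -> None:
        """Stow an event; this succeeds even if downstream processing is stalled."""
        db.execute("INSERT INTO events (kind, payload, created_at) VALUES (?, ?, ?)",
                   (kind, json.dumps(payload), time.time()))
        db.commit()

    def drain_backlog() -> int:
        """Process everything still waiting; returns how many items were handled."""
        rows = db.execute("SELECT id, kind, payload FROM events "
                          "WHERE processed_at IS NULL").fetchall()
        for row_id, kind, payload in rows:
            # send_notification(kind, json.loads(payload)) would go here.
            db.execute("UPDATE events SET processed_at = ? WHERE id = ?",
                       (time.time(), row_id))
        db.commit()
        return len(rows)

    record_event("lead-email", {"lead_id": 42, "to": "dealer@example.com"})
    print(f"reprocessed {drain_backlog()} queued item(s)")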

What worked not so well?

Basically, the queue halted, didn't recover on its own from having stopped, and the mechanisms in place to WAKE PEOPLE UP never took any action at all.

Where are we right now?

ALL action items completed processing in approximately 53 minutes once the queue was restarted. You will have noticed leads and notices arriving in batches shortly after the 1:51 PM (ET) queue restart.

What have we done since?

  1. We did a full investigation into the causes of both the stall to the queue and the failure of our monitoring intelligence.
  2. We've developed a plan for refactoring the parts of the code that monitor our queue daemons and associated services (email, SMS, APIs, load balancers, email and phone verification, certificate validation, etc.). A simplified illustration of this kind of independent monitoring follows this list.
  3. We've scheduled that code to be rolled out to our testing servers by Friday, July 23rd, and to our production servers on Sunday, August 1st, 2021. (We do these types of updates during our lowest-traffic periods.)
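As a rough illustration only (not the code being rolled out, and all names here are hypothetical), one common way to keep monitoring independent of the thing it monitors is a heartbeat watchdog: the queue daemon records a timestamp on every cycle, and a separate process alerts if that timestamp goes stale, so a stalled queue cannot also silence its own alerting.

    import time
    from pathlib import Path

    # Illustrative heartbeat watchdog (hypothetical names and paths). The
    # queue daemon calls beat() each cycle; an independent process runs
    # watchdog_check() on a schedule and pages someone if the heartbeat
    # goes stale.
    HEARTBEAT = Path("/tmp/action_queue.heartbeat")
    MAX_SILENCE_SECONDS = 300           # alert after 5 minutes of silence

    def beat() -> None:
        """Called by the queue daemon after each successful processing cycle."""
        HEARTBEAT.write_text(str(time.time()))

    def page_on_call(message: str) -> None:
        # Stand-in for real paging (SMS/phone) sent over a channel that
        # does not depend on the queue or its host.
        print(f"ALERT: {message}")

    def watchdog_check() -> None:
        """Run from a separate process or host, e.g. via cron."""
        try:
            last = float(HEARTBEAT.read_text())
        except (FileNotFoundError, ValueError):
            last = 0.0
        if time.time() - last > MAX_SILENCE_SECONDS:
            page_on_call("action queue heartbeat is stale")

    if __name__ == "__main__":
        beat()              # simulate one daemon cycle
        watchdog_check()    # heartbeat is fresh, so no alert fires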

Thank you for your continued faith in our service, your trust and your business.

Chris Purser, CTO/CXO
& my entire team.



If you believe you have other issues related to this outage, please email the details to support@digitalpower.solutions.


Overall Event Stats



Queue Stall Start:                     7/15/21 - 12:33 am (ET)
Queue Restarted:                       7/16/21 - 1:51 pm (ET)
Queue Emptied (including new items):   7/16/21 - 2:44 pm (ET)
Duration of Anomaly:                   38 hours, 13 minutes
Percentage able to rerun:              100%
Percentage unable to rerun:            0%
Percentage rerun successfully:         100%
Percentage rerun unsuccessfully:       0%
Loss of data:                          0%


Stats of items delayed

Items                                              Held in Queue   Items resent   Loss %
Leads Converted - Autoresponder to Lead - EMAIL    534             534            0%
Lead Converted - Notice to Users - EMAIL           1633            1633           0%
Lead Converted - ADF emails sent - EMAIL           440             440            0%
Lead Converted - API Posts submitted - API         144             144            0%
Trade-In Text invitations - SMS                    506             506            0%
Trade-In Text confirmations - SMS                  217             217            0%
Users Updated                                      19              19             0%
Fields Updated                                     31              31             0%
Phone Numbers Validated                            506             506            0%
Email Addresses Validated                          534             534            0%