There has been a... grumpy-bump... in one of my cooperatives' infrastructure.

The founder had rolled his own auto-scaler, which was SIGTERMing our workers every ~3 minutes, causing background jobs to get retried 4+ hours later.

Today, I removed it completely, and configured #JudoScale.

#Rails #Ruby #DelayedJob #Heroku

@mat We are using Postgres. We were on a mix of #DelayedJob and #Resque, but recently replaced Resque entirely with #Sidekiq. We're migrating DJ jobs over as we need to, or whenever we touch them in any significant way.

We've moved everything to latency-based queues - Sidekiq and DJ both - and it's been a great change!

This toot brought to you by me spelunking the #DelayedJob code trying to figure out why one of our workers was dying every 4 hours due to OOM’ing.

It's been years since I last used #DelayedJob. Like, the early 2010s or so? Back then it was a mix of DJ and #Resque. Then #Sidekiq came on the scene, and I moved over pretty quickly.

Anyhow, the point is, I was under the impression that DelayedJob doesn't have a mechanism to recover from jobs that crash or get SIGKILL’d (think OOM or something). And to be fair, DJ itself doesn't. But the ActiveRecord backend does, though it's not really advertised. https://github.com/collectiveidea/delayed_job_active_record/blob/97f26a3e1b82b338cd8270aad988c75b82ea5c86/lib/delayed/backend/active_record.rb#L57
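
For the curious, here's my rough reading of that check, rewritten as a plain-Ruby sketch. This is not the gem's code: the `Job` struct, `ready_to_run?`, and `MAX_RUN_TIME` are names I made up for illustration, and I'm assuming the 4-hour value mirrors DJ's default max run time. The idea, as far as I can tell, is that a lock older than the max run time is treated as abandoned, so another worker may pick the job up.

```ruby
# Plain-Ruby sketch of the "is this job available?" check; NOT the gem's code.
# Job, ready_to_run?, and MAX_RUN_TIME are hypothetical names for illustration.
Job = Struct.new(:run_at, :locked_at, :locked_by, :failed_at, keyword_init: true)

MAX_RUN_TIME = 4 * 60 * 60 # seconds; assumed to mirror DJ's 4-hour default

def ready_to_run?(job, worker_name, now: Time.now, max_run_time: MAX_RUN_TIME)
  # Permanently failed jobs are never picked up again.
  return false unless job.failed_at.nil?

  # A lock counts as stale once it is older than the max run time.
  lock_free_or_stale = job.locked_at.nil? || job.locked_at < now - max_run_time

  # Either the job is due and unlocked/stale, or this worker already holds the lock.
  (job.run_at <= now && lock_free_or_stale) || job.locked_by == worker_name
end

now = Time.now
# A job locked 10 minutes ago by a worker that has since been SIGKILL'd.
orphaned = Job.new(run_at: now - 900, locked_at: now - 600, locked_by: "worker-1")

# Right after the crash the lock still looks live to other workers...
puts ready_to_run?(orphaned, "worker-2", now: now)
# ...but once the max run time has passed, the lock is stale and the job is fair game.
puts ready_to_run?(orphaned, "worker-2", now: now + MAX_RUN_TIME)
```

If that reading is right, it would also explain why a killed job only comes back 4+ hours later: nothing clears the lock, it just ages out.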

#Ruby #Rails #OpenSource
