Job with concurrency key blocked when it shouldn't have, eventually was run after duration expired
#456
Comments
That's right. The first job should have unblocked the second job here 🤔 Would it be possible for you to enable Active Record logs at the debug level to see what's happening when the first job runs and tries to unblock it? |
That would be a lot of logging unfortunately. And this isn’t that common. |
I'll see if I can add it in our staging environment. My plan is to add an initializer which basically does:
module SolidQueue
  class Job
    def unblock_next_blocked_job
      ActiveRecord.verbose_query_logs = true
      super
    ensure
      ActiveRecord.verbose_query_logs = false
    end
  end
end |
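@rosa Are debug logs prefixed with
class FilteredSqLogger < ActiveSupport::Logger
  # Only pass through debug messages that start with a SolidQueue:: query name.
  def debug(progname = nil, &block)
    # to_s guards against a nil progname (e.g. when only a block is given).
    super if progname.to_s.match?(/^\s+SolidQueue::/)
  end
end |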
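To make the filtering idea above concrete, here is a minimal sketch that runs without Rails. It assumes the standard library `Logger` as a stand-in for `ActiveSupport::Logger` (which inherits from it); `ToyFilteredLogger` and the sample messages are hypothetical and only illustrate the prefix match, not what Active Record actually emits in this app.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for the FilteredSqLogger idea, built on stdlib Logger.
class ToyFilteredLogger < Logger
  SOLID_QUEUE_PREFIX = /^\s+SolidQueue::/

  def debug(progname = nil, &block)
    # Resolve the message whether it arrives as an argument or as a block.
    message = progname || (block && block.call)
    # Drop anything that does not look like a SolidQueue:: query line.
    # &nil stops the block from being forwarded and evaluated a second time.
    super(message, &nil) if message.to_s.match?(SOLID_QUEUE_PREFIX)
  end
end

io = StringIO.new
logger = ToyFilteredLogger.new(io)
logger.debug("  SolidQueue::Semaphore Update (made-up sample line)")
logger.debug("  User Load (made-up sample line)")
logger.debug { "  SolidQueue::BlockedExecution Load (made-up sample line)" }
puts io.string
```

Only the two `SolidQueue::` lines should reach the output. Whether real debug lines carry that prefix is exactly the open question in the comment above, so the regex would need checking against actual logs.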
I think you could try |
|
Happened again Records (
|
Ohhhh, this is super useful, @leondmello! I think I know what's going on, thanks to your logs. Looks like a tricky race condition. Basically, your first job finishes while the second one is still being enqueued: not early enough for the second job to see the semaphore already released, but early enough that the second job isn't yet recorded as blocked when the first job looks for something to unblock. Here's the first job finishing:
Then, when that job finishes, it unblocks the semaphore, here:
Here's the second job enqueued:
At that point the semaphore hasn't been updated yet, because it was updated at 11:54:57.872605, so it gets blocked. However, when this
So it can't get unblocked, because it's not completely blocked yet. Ahh, tricky! What isolation level are you running? |
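The interleaving described above can be replayed deterministically with a toy in-memory model. This is only a sketch of the ordering of events, not Solid Queue code: `replay_race`, the `available` counter, and the `blocked`/`ready` lists are hypothetical stand-ins for the semaphore row and the execution tables.

```ruby
# Toy replay of the race: job 2 checks the semaphore, job 1 releases it and
# finds nothing to unblock, then job 2 commits as blocked.
def replay_race
  available = 0   # 0 = the concurrency key is held by job 1
  blocked = []    # jobs recorded as waiting on the key
  ready = []      # jobs free to run

  # Step 1: job 2's enqueue transaction checks the semaphore and sees it held.
  job_2_must_block = available.zero?

  # Step 2: job 1 finishes, releases the semaphore and looks for a blocked job.
  # Job 2 is not recorded as blocked yet, so there is nothing to unblock.
  available += 1
  if (job = blocked.shift)
    available -= 1
    ready << job
  end

  # Step 3: job 2's transaction commits using the stale decision from step 1.
  if job_2_must_block
    blocked << :job_2
  else
    available -= 1
    ready << :job_2
  end

  { available: available, blocked: blocked, ready: ready }
end

state = replay_race
puts "semaphore free: #{state[:available].positive?}"
puts "job 2 still blocked: #{state[:blocked].include?(:job_2)}"
```

In this model the semaphore ends up free while job 2 sits in the blocked list with nothing left to wake it, which matches the report of the job only running once the concurrency duration expired.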
We have this in our
config.active_support.isolation_level = :fiber
But don't know the reason as to why that was added some time ago. Trying to find out. |
@leondmello, no, no worries about that. I meant the transaction isolation level in your DB (I imagined you're using PostgreSQL because of the logs), but thinking more about it, it shouldn't make any difference, so no worries about it. It's a tricky one, I'm not sure right now about how to fix it but I'll keep thinking about it. |
Thinking more about this and how to fix... Claiming the semaphore + enqueuing the job happen in a transaction, but if the semaphore already exists it's only checked, not locked, so it can be released while the transaction is ongoing, and you wouldn't know, regardless of whether I think, to fix this, we'd have to take a lock there, when checking the semaphore, and release it when the transaction is committed 🤔 However, this might introduce a new kind of deadlock that I can't see right now (I've fixed a couple of deadlocks here in the past), so I need to think carefully about it. |
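The locking idea in the comment above can be sketched with a toy model in which a `Mutex` plays the role of a lock taken when the semaphore is checked and held until the enqueue is recorded. Everything here is an assumption for illustration: `ToyConcurrencyKey` is hypothetical, it is not how Solid Queue implements semaphores, and it says nothing about the deadlock risk mentioned above.

```ruby
# Toy model: check-and-record on enqueue and release-and-unblock on finish
# both run under the same lock, so neither can slip in between the other's steps.
class ToyConcurrencyKey
  attr_reader :ready, :blocked

  def initialize
    @lock = Mutex.new
    @available = 0   # 0 = the key is held by a running job
    @ready = []
    @blocked = []
  end

  # Second job being enqueued: decide and record while holding the lock.
  def enqueue(job)
    @lock.synchronize do
      if @available.positive?
        @available -= 1
        @ready << job
      else
        @blocked << job
      end
    end
  end

  # First job finishing: it sees every job that has already decided to block.
  def release
    @lock.synchronize do
      if (job = @blocked.shift)
        @ready << job   # hand the key straight to the waiting job
      else
        @available += 1
      end
    end
  end
end

key = ToyConcurrencyKey.new
threads = [
  Thread.new { key.enqueue(:job_2) }, # second job being enqueued
  Thread.new { key.release }          # first job finishing
]
threads.each(&:join)
puts "ready: #{key.ready.inspect}, blocked: #{key.blocked.inspect}"
```

Whichever thread wins the lock, job 2 ends up ready: either the release comes first and the enqueue finds the key free, or the enqueue comes first and the release finds job 2 in the blocked list.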
We are observing that our jobs are sometimes unnecessarily blocked.
In the following example, for the same concurrency key, there are only 2 jobs. The first one completes almost immediately, in just over 100 milliseconds (finished_at value). The second one takes over 30 minutes; we are guessing it is blocked by the concurrency control duration. The scheduled_at time of the second job is well after the first job completes. As per our understanding, the second job should have started immediately.
We confirmed that none of the job processes bounced during that time.
"..." indicates redacted data.
Concurrency Config
Queue Config
Console queries
Logs about the second job execution