Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decrement server_job->function->job_total only if it > 0 #302

Closed
wants to merge 5 commits into from

Conversation

p-alik
Copy link
Collaborator

@p-alik p-alik commented Aug 18, 2020

The PR aims to fix #301. My tests are based on #301 (comment)

@esabol
Copy link
Member

esabol commented Aug 18, 2020

I was considering this same change, but I was worried it didn’t address the underlying problem. I think it’s worth a try though.

Sometime between last week and today, I seem to have lost the ability to restart gearman/gearmand builds in Travis CI. Anyone know why? Never mind. Signing out of Travis CI and signing back in again fixed this.

@infraweavers
Copy link

infraweavers commented Aug 19, 2020

This doesn't seem to have completely solved the problem we mentioned in #301 (comment) however like we said before, it's possible this a completely different thing from our original issue:

W0wsQHOriJ

We have had it take a few goes to crash out with the worker than doesn't timeout for some reason.

p-alik added 5 commits August 19, 2020 14:07
…ly if it > 0

this aims to prevent the value went out of the uint32_t range
See
gearman#301 (comment)
…ly if it > 0

this aims to prevent the value went out of the uint32_t range
See
gearman#301 (comment)
this aims to prevent the value went out of the uint32_t range
See
gearman#301 (comment)
… only if it > 0

this aims to prevent the value went out of the uint32_t range
See
gearman#301 (comment)
if it > 0

this aims to prevent the value went out of the uint32_t range
See
gearman#301 (comment)
@infraweavers
Copy link

Additionally, now we've completely isolated the gearmand, it doesn't crash when testing that; however if you run worker_that_doesnt_timeout.pl twice, it will still end up with -1 jobs running even with this change in, although now we end up with 1 in Jobs Waiting as well:
image

@esabol
Copy link
Member

esabol commented Aug 19, 2020

Well, you could add similar guards wherever job_running is decremented. There are only two locations:

server_job->function->job_running--;

job->function->job_running--;

But I think gearmand will still hang. Could you try that, @infraweavers ?

In order to get to -1, job_running had to have been decremented more than once, obviously. Is gearman_server_job_free( ) being called on the same job twice? I’m also kind of wondering what happens if you comment out

job->function->job_running--;

entirely.

@p-alik
Copy link
Collaborator Author

p-alik commented Aug 20, 2020

Well, you could add similar guards wherever job_running is decremented.

I did it yesterday, but the result wasn't much better:-(

There are only two locations:

server_job->function->job_running--;

a1873ea

job->function->job_running--;

968e283

Copy link
Member

@SpamapS SpamapS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may do more harm than good. I'd like a regression test before we act on the bug for every user.

@infraweavers
Copy link

THis PR does at least seem to stop the jobs_running column from going negative in mini-test

@SpamapS
Copy link
Member

SpamapS commented Oct 31, 2021

Going to close this, but feel free to ping / re-open if you can add a regression test. I am still thinking there's something deeper and this is just treating the symptom.

@SpamapS SpamapS closed this Oct 31, 2021
@infraweavers
Copy link

Yeah sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gearmand hangs intermittently when timeouts are used
4 participants