decrement server_job->function->job_total only if it > 0 #302

p-alik · 2020-08-18T16:51:17Z

The PR aims to fix #301. My tests are based on #301 (comment)

esabol · 2020-08-18T18:27:58Z

I was considering this same change, but I was worried it didn’t address the underlying problem. I think it’s worth a try though.

~~Sometime between last week and today, I seem to have lost the ability to restart gearman/gearmand builds in Travis CI. Anyone know why?~~ Never mind. Signing out of Travis CI and signing back in again fixed this.

infraweavers · 2020-08-19T08:42:03Z

This doesn't seem to have completely solved the problem we mentioned in #301 (comment). However, as we said before, it's possible this is a completely different thing from our original issue:

For some reason, it has taken a few goes to make it crash with the worker that doesn't time out.

infraweavers · 2020-08-19T15:00:23Z

Additionally, now that we've completely isolated gearmand, it doesn't crash when we test against it. However, if you run worker_that_doesnt_timeout.pl twice, it still ends up with -1 jobs running even with this change in, although we now end up with 1 in Jobs Waiting as well:

esabol · 2020-08-19T16:44:09Z

Well, you could add similar guards wherever job_running is decremented. There are only two locations:

gearmand/libgearman-server/job.cc

Line 301 in e2d76cf

server_job->function->job_running--;

gearmand/libgearman-server/job.cc

Line 385 in e2d76cf

job->function->job_running--;

But I think gearmand will still hang. Could you try that, @infraweavers ?

To get to -1, job_running must have been decremented more than once. Is gearman_server_job_free() being called on the same job twice? I’m also wondering what happens if you comment out

gearmand/libgearman-server/job.cc

Line 385 in e2d76cf

job->function->job_running--;

entirely.
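For readers following along, here is a minimal sketch of the kind of guard being discussed. It is not the code from this PR: `FunctionCounters` and `guarded_decrement` are hypothetical names, and the `uint32_t` field type is an assumption taken from the commit messages. Returning a flag when the decrement is skipped would give a caller somewhere to log a suspected double free instead of hiding it silently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-function counters discussed in this
// thread. The real gearmand structure has more fields; the uint32_t type
// is an assumption based on the commit messages.
struct FunctionCounters {
  uint32_t job_total = 0;
  uint32_t job_running = 0;
};

// Decrement only when the counter is positive.
// Returns false when the decrement was skipped, so a caller can log it.
inline bool guarded_decrement(uint32_t& counter) {
  if (counter == 0) {
    return false;
  }
  --counter;
  return true;
}

// Simulates the same running job being released twice: the first release
// decrements, the second is skipped and reported.
inline bool second_release_is_detected() {
  FunctionCounters fn;
  fn.job_running = 1;
  const bool first = guarded_decrement(fn.job_running);
  const bool second = guarded_decrement(fn.job_running);
  return first && !second && fn.job_running == 0;
}
```

A guard like this keeps the counter at zero, but it does not explain why the second decrement happens in the first place, which is the open question in this thread.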

p-alik · 2020-08-20T08:34:29Z

> Well, you could add similar guards wherever job_running is decremented.

I did it yesterday, but the result wasn't much better :-(

> There are only two locations:
>
> gearmand/libgearman-server/job.cc
>
> Line 301 in e2d76cf
>
> server_job->function->job_running--;

a1873ea

> gearmand/libgearman-server/job.cc
>
> Line 385 in e2d76cf
>
> job->function->job_running--;

968e283

SpamapS

This may do more harm than good. I'd like a regression test before we act on the bug for every user.

infraweavers · 2020-08-27T09:59:46Z

This PR does at least seem to stop the jobs_running column from going negative in mini-test.

SpamapS · 2021-10-31T20:21:53Z

Going to close this, but feel free to ping / re-open if you can add a regression test. I am still thinking there's something deeper and this is just treating the symptom.

infraweavers · 2021-11-01T08:55:45Z

Yeah sounds good

p-alik mentioned this pull request Aug 18, 2020

gearmand hangs intermittently when timeouts are used #301

Open

p-alik added 5 commits August 19, 2020 14:07

gearman_server_job_free decrements server_job->function->job_total on…

bd96756

…ly if it > 0 this aims to prevent the value went out of the uint32_t range See gearman#301 (comment)

gearman_server_job_take decrements server_job->function->job_count on…

b054eb7

…ly if it > 0 this aims to prevent the value went out of the uint32_t range See gearman#301 (comment)

GEARMAND_{HASH,LIST}DEL macros decrement count cautiously

902f7bb

this aims to prevent the value went out of the uint32_t range See gearman#301 (comment)

gearman_server_job_queue decrements server_job->function->job_running…

968e283

… only if it > 0 this aims to prevent the value went out of the uint32_t range See gearman#301 (comment)

gearman_server_job_free decrement server_job->function->job_running only

a1873ea

if it > 0 this aims to prevent the value went out of the uint32_t range See gearman#301 (comment)
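The commit messages above refer to the value going "out of the uint32_t range". As a small illustration of that effect, assuming the counters are plain `uint32_t` values (the function names below are hypothetical and do not come from gearmand), an unguarded decrement at zero wraps around to the largest 32-bit value, which may show up as -1 if it is printed as a signed number:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Unguarded decrement: unsigned arithmetic is modulo 2^32, so
// decrementing 0 wraps around to 4294967295.
inline uint32_t unguarded_decrement(uint32_t value) {
  return value - 1;
}

// Guarded decrement in the spirit of the commits above: the value
// stays at 0 instead of wrapping.
inline uint32_t guarded_decrement_value(uint32_t value) {
  return value > 0 ? value - 1 : 0;
}
```

With these definitions, `unguarded_decrement(0)` equals `std::numeric_limits<uint32_t>::max()`, while `guarded_decrement_value(0)` stays at 0.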

p-alik force-pushed the issue-301 branch from e69fdf2 to a1873ea Compare August 20, 2020 08:25

SpamapS requested changes Aug 27, 2020

SpamapS closed this Oct 31, 2021
