-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Implement force-kill option in verdi process kill #6575
base: main
Are you sure you want to change the base?
Conversation
This allows plumpy ignore the callback so it does not get stuck in it but directly transitions the process to killed. The disadvantage is that the job behind the process might not be killed as we do not wait for. But I think that can be clearly communicated to the user in the help message and it has a similar logic as `kill -9 <PID>`. TODO: As you mentioned @kh with this solution the worker process will not be killed. Some parts of plumpy in stepping are needed to kill the process, but I am not sure which ones.
for more information, see https://pre-commit.ci
Thanks @agoscinski
and says:
|
This part is key. If you want to kill a The question is what the code should do in this case. Maybe the EBM should be disabled when calling the scheduler's kill command. But then again, what if the |
Yes, exactly! I was thinking about this as well. Once done, this PR should address all possibilities:
|
2af67e7
to
8e080b0
Compare
8e080b0
to
75bb430
Compare
Don't worry about tests failing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @agoscinski some initial comments/suggestions
raise | ||
except Exception as exception: | ||
logger.warning(f'killing CalcJob<{node.pk}> failed') | ||
logger.warning(f'killing CalcJob<{node.pk}> excepted, the job might be orphaned.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think this warning is correct. When a TransportTaskException
is raised, the Waiting.execute
method catches it and pauses the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not wrong, what you are referring to is already in my test scenario
(see here test_process.py::test_process_kill
# 10)
And is being passed, meaning the job is being EXCEPTED
, not paused..
(maybe there is a bug somewhere else?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. The exception being caught here is thrown by do_kill
so it is the kill operation (i.e. getting the transport or using it to call the kill command of the scheduler) that excepted, not the process itself. The exception is then reraised as a TransportTaskException
which is caught in Waiting.execute
and rethrown as PauseInterruption
on line 571, which means the process is going into pause. So how does this mean the process is already excepted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
Just added one more test, and I'm now explicitly patching and returning TransportTaskException
from task_kill_job
.
We still receive EXCEPTED
regardless of whether the process was running normally or was stuck in EBM.
Haven't figured out why yet.
raise Exception('Mock open exception') | ||
|
||
# patch a faulty transport open, to make EBM go crazy | ||
inject_patch.patch('aiida.transports.plugins.local.LocalTransport.open', mock_open) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way to implement an inject_patch fixture is quite hacky. Why not defined a MockTransport with open
method implemented as mock_open?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the main problem I had here was, the daemon being instantiated from another run of the interpreter so it couldn't see any mocking function you defined here.. So that I made this hard patch fixture, which might not be an ideal way, as we discussed before.
Anyways, I remember @agoscinski had a suggestion that maybe he could instantiate the daemon on the same running interpreter so what could just mock those functions normally.
@agoscinski can you try your idea, to see if that works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We pass the monkey patched code to the daemon worker by creating the daemon worker with multiprocessing. I think in this case multiprocess is forked copying the monkeypatchey version of the class into the forked process memory. Here code snippet how it could work.
import multiprocessing as mp
# more imports ..
# monekypatch
def prepare_for_submission(self, folder: Folder) -> CalcInfo:
"""Prepare the calculation for submission.
...
aiida.calculations.arithmetic.add.ArithmeticAddCalculation.prepare_for_submission = prepare_for_submission
from aiida.engine.daemon.worker import start_daemon_worker
process = mp.Process(target=start_daemon_worker, args=(True,))
process.start()
# will run forever
process.join()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could insert this probably somehow in submit_and_await
. @unkcpz mentioned it should not affect running the tests in parallel because the daemon is started pre profile. One needs to ensure that a new profile is created for each test that is relying on a specific worker. I am not sure at the moment if each test creates a new profile, need to check it.
@khsrali @unkcpz we agreed in a discussion that we implement the solution mentioned in #6575 (comment) to avoid the inject. Maybe a separate PR would be good since this changes the infrastructure a bit. Probably this will not happen before the new year because @khsrali and me are busy with some other tasks. |
This allows plumpy ignore the callback so it does not get stuck in it but directly transitions the process to killed. The disadvantage is that the job behind the process might not be killed as we do not wait for. But I think that can be clearly communicated to the user in the help message and it has a similar logic as
kill -9 <PID>
.This PR goes together with the PR in plumpy aiidateam/plumpy#288
TODO:
As you feared @khsrali with this solution the worker process will not be killed, so the upload async function still continues, so maybe we still need to find a solution that actually sends the message back to aiida-core. It is really weird, I feel like
_stepping
part in plumpy is buggy, because it only the transition function actually returns back to the sender of the message (whereverdi process kill <PID>
was sent), while only call the callback docalcjobs.tasks.task_kill_job
in the _stepping part kills the process occupying the worker (when removing the exponential backof from the kill), but it should do both. I have this a bug on my system also when just killing local jobs that sleep. Maybe we can check on your machine.Co-author @khsrali