Python Backend's fail-over feature was not implemented #7862

zhuichao001 · 2024-12-09T08:08:04Z

Is your feature request related to a problem? Please describe.
In r23.10, we noticed that the triton_python_backend_stub subprocess was not restarted after it exited, which led us to suspect that the Python Backend's fail-over feature was either not implemented or broken. Reviewing the code, we found that fail-over is not implemented in decoupled mode, while in non-decoupled mode the health of the child process is checked via the ipc_queue's lock. We then tested this directly: after we killed the child process, the stub subprocess was not restarted.

Describe the solution you'd like
Use a thread in the parent process to wait for the stub process to exit, and restart the triton_python_backend_stub process once the exit is detected. The restart consists of three steps: first, wait for pending tasks to complete; second, reclaim and release threads, monitors, and message queues; third, reload and start the stub process again. Note that this requires adding a state to the ModelInstanceState class that indicates whether the model instance can currently serve requests. During the restart this state is disabled, so that front-end requests receive error responses.
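To make the proposal concrete, here is a minimal sketch of the watcher-thread idea. It is written in Python for brevity and is not the backend's actual code (the Python backend's parent side is C++). `StubSupervisor`, `handle`, `ready`, and `restarts` are hypothetical names chosen for illustration, and the resource cleanup in step two is left as a placeholder comment because it depends on backend internals.

```python
import subprocess
import sys
import threading


class StubSupervisor:
    """Hypothetical stand-in for the parent-side ModelInstanceState."""

    def __init__(self, cmd):
        self._cmd = cmd
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._pending = 0        # number of in-flight requests
        self._stopping = False
        self.restarts = 0
        self._proc = subprocess.Popen(cmd)
        self.ready = True        # the "can be serviced" state from the proposal
        self._watcher = threading.Thread(target=self._watch, daemon=True)
        self._watcher.start()

    def _watch(self):
        while True:
            self._proc.wait()    # blocks until the stub process exits
            with self._lock:
                if self._stopping:
                    return
                self.ready = False   # new requests now get an error
                # Step 1: wait for pending tasks to complete.
                while self._pending > 0 and not self._stopping:
                    self._idle.wait()
                if self._stopping:
                    return
                # Step 2: reclaim threads, monitors, and message queues
                # (placeholder; depends on backend internals).
                # Step 3: start the stub process again.
                self._proc = subprocess.Popen(self._cmd)
                self.restarts += 1
                self.ready = True

    def handle(self, work):
        """Run one request, or fail fast while a restart is in progress."""
        with self._lock:
            if not self.ready:
                raise RuntimeError("model instance is restarting")
            self._pending += 1
        try:
            return work()
        finally:
            with self._lock:
                self._pending -= 1
                self._idle.notify_all()

    def stop(self):
        """Shut down without triggering another restart."""
        with self._lock:
            self._stopping = True
            self.ready = False
            self._idle.notify_all()
            proc = self._proc
        proc.kill()
        self._watcher.join()
        proc.wait()
```

In this sketch the restart runs while holding the same lock that guards `ready` and the pending counter, so a request either sees `ready == True` and is counted before the restart proceeds, or is rejected. A real implementation would replace `Popen.wait()` with something like `waitpid` on the stub's PID and perform the shared-memory and queue cleanup in step two.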

Describe alternatives you've considered
One alternative is to modify the current logic so that the child process sends a restart message to the queue from a signal handler when it exits. However, this approach is complex and may have correctness issues if the message queue itself is blocked.

Additional context
We have already implemented this feature in our internal version and deployed it in production. We are interested in contributing this functionality to the community and look forward to your response.

D1-3105 · 2024-12-09T17:30:47Z

+1
