Is your feature request related to a problem? Please describe.
In r23.10, we noticed that the triton_python_backend_stub subprocess was not restarted after it exited, which led us to suspect that the Python backend's fail-over feature was either not implemented or not working. On reviewing the code, we found that fail-over is not implemented in decoupled mode, whereas in non-decoupled mode the ipc_queue's lock is used to check whether the child process is healthy. Our tests confirmed this: when the child process was killed, the stub subprocess was not restarted.
Describe the solution you'd like
Use a thread in the parent process to wait for the stub process to exit, and restart the triton_python_backend_stub process once it has exited. The restart consists of three steps: first, wait for pending tasks to complete; second, reclaim and release the threads, monitors, and message queues; third, reload and start the stub process again. This also requires adding a state to the ModelInstanceState class that indicates whether the model instance can currently serve requests. During the restart this state is disabled, so that front-end requests receive an error response instead of being dispatched to a stub that is not running.
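To make the proposal concrete, here is a minimal sketch of the monitor-thread idea in Python. It is not the Python backend's actual implementation (which is C++), and every name in it (`StubSupervisor`, `begin_request`, `end_request`, `_release_resources`, `restarts`) is hypothetical. The `_serving` flag only plays the role of the state proposed for ModelInstanceState, and resource cleanup is left as a placeholder.

```python
import subprocess
import sys
import threading
import time


class StubSupervisor:
    """Hypothetical supervisor illustrating the proposed restart flow."""

    def __init__(self, cmd):
        self._cmd = cmd
        self._cond = threading.Condition()
        self._pending = 0        # number of in-flight requests
        self._serving = False    # stands in for the proposed ModelInstanceState flag
        self._stopping = False
        self.restarts = 0
        with self._cond:
            self._launch_locked()
        self._monitor = threading.Thread(target=self._watch, daemon=True)
        self._monitor.start()

    def _launch_locked(self):
        # Caller must hold self._cond.
        self._proc = subprocess.Popen(self._cmd)
        self._serving = True

    def _release_resources(self):
        # Placeholder: a real backend would reclaim threads, monitors,
        # and message queues here.
        pass

    def _watch(self):
        while True:
            self._proc.wait()  # blocks until the stub process exits
            with self._cond:
                if self._stopping:
                    return
                self._serving = False  # new requests are rejected from here on
                # Step 1: wait for pending tasks to complete.
                while self._pending > 0 and not self._stopping:
                    self._cond.wait()
                if self._stopping:
                    return
                # Step 2: reclaim and release resources.
                self._release_resources()
                # Step 3: start the stub process again.
                self._launch_locked()
                self.restarts += 1

    def begin_request(self):
        """Return True if the request was accepted, False if not serving."""
        with self._cond:
            if not self._serving:
                return False
            self._pending += 1
            return True

    def end_request(self):
        with self._cond:
            self._pending -= 1
            self._cond.notify_all()

    def pid(self):
        with self._cond:
            return self._proc.pid

    def stop(self):
        with self._cond:
            self._stopping = True
            self._serving = False
            proc = self._proc
            self._cond.notify_all()  # wake the monitor if it waits on pending work
        proc.kill()
        proc.wait()
        self._monitor.join()
```

A caller would wrap each request in `begin_request()` / `end_request()` and return an error response whenever `begin_request()` returns `False`. Because the monitor thread blocks in `wait()` on the child itself, it does not depend on the child being able to send anything before it dies.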
Describe alternatives you've considered
We considered modifying the current logic so that the child process sends a restart message to the queue from a signal handler when it exits. However, this approach is complex and may have correctness issues if the message queue itself is blocked.
Additional context
We have already implemented this feature in our internal version and deployed it in production. We would like to contribute it to the community and look forward to your response.