Python Backend's fail-over feature was not implemented #7862

Open
zhuichao001 opened this issue Dec 9, 2024 · 1 comment

Comments

@zhuichao001

Is your feature request related to a problem? Please describe.
In r23.10, we noticed that the triton_python_backend_stub subprocess is not restarted after it exits, which led us to suspect that the Python Backend's fail-over feature is either not implemented or not working. Reviewing the code, we found that fail-over is not implemented in decoupled mode. In non-decoupled mode, the child process's health is checked via the ipc_queue's lock. Our tests confirmed the problem: after the child process was killed, the stub subprocess was not restarted.

Describe the solution you'd like
Use a thread in the parent process to wait for the stub process to exit, and restart the triton_python_backend_stub process once it has exited. The restart consists of three steps: first, wait for pending tasks to complete; second, reclaim and release the threads, monitors, and message queues; third, reload and start the stub process again. This also requires adding a state to the ModelInstanceState class that indicates whether the model can currently serve requests. The state is disabled during the restart, so that front-end requests receive error responses.
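
The pattern described above can be sketched in Python as follows. This is only an illustration of the idea, not Triton's code: the Python backend is implemented in C++, and the class `SupervisedStub`, its `handle` method, and the `ready` flag are hypothetical names that stand in for the watcher thread and the proposed ModelInstanceState flag.

```python
import subprocess
import threading


class SupervisedStub:
    """Hypothetical supervisor that restarts a child process when it exits."""

    def __init__(self, argv):
        self._argv = argv
        self._idle = threading.Condition()   # guards the fields below
        self._pending = 0                    # number of in-flight requests
        self._stopping = False
        self.ready = False                   # the "can serve requests" state
        self.restarts = 0
        self._start_child()
        self._watcher = threading.Thread(target=self._watch, daemon=True)
        self._watcher.start()

    def _start_child(self):
        self._proc = subprocess.Popen(self._argv)
        self.ready = True

    def _watch(self):
        while True:
            self._proc.wait()                # blocks until the child exits
            with self._idle:
                if self._stopping:
                    return
                self.ready = False           # new requests now fail fast
                # Step 1: wait for pending tasks to complete.
                while self._pending > 0:
                    self._idle.wait()
                # Step 2: a real implementation would release per-child
                # resources (threads, monitors, message queues) here.
                # Step 3: start the child again.
                self._start_child()
                self.restarts += 1

    def handle(self, work):
        """Run work(child) unless a restart is in progress."""
        with self._idle:
            if not self.ready:
                raise RuntimeError("model instance is restarting")
            self._pending += 1
        try:
            return work(self._proc)
        finally:
            with self._idle:
                self._pending -= 1
                self._idle.notify_all()

    def stop(self):
        with self._idle:
            self._stopping = True
            self.ready = False
            proc = self._proc
        proc.kill()
        proc.wait()
        self._watcher.join(timeout=5)
```

With this sketch, killing the child from outside causes the watcher thread to return from `wait()`, reject new requests while pending ones drain, and then start a fresh child. Because the watcher blocks on process exit rather than on a message from the child, it does not depend on the child being able to report its own failure.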

Describe alternatives you've considered
An alternative is to modify the current logic so that the child process sends a restart message to the queue from a signal handler when it exits. However, this approach is complex, and it may behave incorrectly if the message queue itself is blocked.

Additional context
We have already implemented this feature in our internal version and deployed it in production. We are interested in contributing this functionality to the community and look forward to your response.

@D1-3105

D1-3105 commented Dec 9, 2024

+1
