running triton as an inference service on host #7915

Open
sriram-dsl opened this issue Jan 3, 2025 · 0 comments
@sriram-dsl

Problem
I am trying to run inference on a Qualcomm QCM6490 device, which requires specific dependencies to utilize its NPU. To meet these requirements, I use the Qualcomm-provided SDK image and container that includes the necessary binaries for running inference on an aarch64 architecture.

However, running the standard Triton Inference Server container on the device would require installing these dependencies inside the Triton container. This is not feasible due to compatibility issues, dependency conflicts, and the additional overhead of customizing the Triton container.

Describe the solution you'd like
To address this issue, I propose enabling Triton Inference Server to run within the Qualcomm SDK container. Specifically:

  1. The Triton server should be able to run as a service inside the SDK container, leveraging the pre-installed binaries and dependencies provided by Qualcomm.
  2. This setup would allow users to clone the Triton server, deploy it within the SDK container, and run inference as a service.
  3. The inference service should support API calls where:
     - Users can send images and other necessary inputs.
     - The service returns the output tensors, enabling seamless integration (a client-side sketch follows this list).
     Ideally, inference should utilize the device's NPU for optimal performance.
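To illustrate the kind of API interaction described in item 3, here is a minimal client-side sketch using Triton's standard Python HTTP client (`tritonclient`). The model name (`qnn_model`), tensor names, shape, and datatype are placeholders for illustration only; a real deployment would use the names defined in the model configuration inside the SDK container.

```python
# Minimal sketch of a client calling a Triton service running inside the
# Qualcomm SDK container. Model name, tensor names, shape, and datatype
# are assumptions for illustration only.
import numpy as np
import tritonclient.http as httpclient

# Assumes Triton's HTTP endpoint is exposed on the default port 8000.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Example image tensor (batch of one 224x224 RGB image), preprocessed to FP32.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

# "input__0" / "output__0" are placeholder tensor names.
inputs = [httpclient.InferInput("input__0", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("output__0")]

# Send the request and read the output tensor back as a NumPy array.
response = client.infer(model_name="qnn_model", inputs=inputs, outputs=outputs)
result = response.as_numpy("output__0")
print(result.shape)
```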

Use Case
This feature would benefit users running Triton Inference Server on specialized devices like the Qualcomm QCM6490, where dependency management and NPU optimization are critical. It would streamline workflows and reduce the complexity of configuring Triton containers for such devices.

Expected Outcome
By enabling Triton Inference Server to operate within the SDK container:

- Users can leverage the Qualcomm binaries directly without modifying the Triton container.
- The solution becomes more scalable and user-friendly.
- Inference tasks can take full advantage of the NPU's capabilities on aarch64 devices (a minimal launch sketch is shown below).
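To make the expected workflow concrete, below is a minimal sketch of starting Triton as a service from inside the SDK container. It assumes an aarch64 `tritonserver` binary built against the Qualcomm dependencies is available at a hypothetical path, along with a local model repository; both paths are illustrative only.

```python
# Minimal sketch: launch tritonserver as a long-running service inside the
# Qualcomm SDK container. Both paths below are hypothetical.
import subprocess

TRITON_BIN = "/opt/tritonserver/bin/tritonserver"  # assumed aarch64 build location
MODEL_REPO = "/workspace/models"                   # assumed model repository

# --model-repository is Triton's standard flag for locating models; the
# server then exposes HTTP (8000) and gRPC (8001) endpoints by default.
server = subprocess.Popen([TRITON_BIN, f"--model-repository={MODEL_REPO}"])

try:
    server.wait()
except KeyboardInterrupt:
    server.terminate()
```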
