running triton as an inference service on host #7915

Open
sriram-dsl opened this issue Jan 3, 2025 · 0 comments
@sriram-dsl

Problem
I am trying to run inference on a Qualcomm QCM6490 device, which requires specific dependencies to utilize its NPU. To meet these requirements, I use the Qualcomm-provided SDK image and container that includes the necessary binaries for running inference on an aarch64 architecture.

However, running the standard Triton Inference Server container on the device would require installing these dependencies inside the Triton container. This is not feasible due to compatibility issues, dependency conflicts, and the additional overhead of customizing the Triton container.

Describe the solution you'd like
To address this issue, I propose enabling Triton Inference Server to run within the Qualcomm SDK container. Specifically:

  1. The Triton server should be able to run as a service inside the SDK container, leveraging the pre-installed binaries and dependencies provided by Qualcomm.
  2. This setup would allow users to clone the Triton server, deploy it within the SDK container, and run inference as a service.
  3. The inference service should support API calls where:
     - Users can send images and other necessary inputs.
     - The service returns the output tensors, enabling seamless integration (a client-side sketch follows this list).
     Ideally, inference should utilize the device's NPU for optimal performance.
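To illustrate the kind of API interaction described in item 3, here is a minimal client-side sketch using Triton's standard Python HTTP client (`tritonclient`). The model name (`qnn_model`), tensor names, shape, and datatype are placeholders for illustration only; a real deployment would use the names defined in the model configuration inside the SDK container.

```python
# Minimal sketch of a client calling a Triton service running inside the
# Qualcomm SDK container. Model name, tensor names, shape, and datatype
# are assumptions for illustration only.
import numpy as np
import tritonclient.http as httpclient

# Assumes Triton's HTTP endpoint is exposed on the default port 8000.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Example image tensor (batch of one 224x224 RGB image), preprocessed to FP32.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

# "input__0" / "output__0" are placeholder tensor names.
inputs = [httpclient.InferInput("input__0", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("output__0")]

# Send the request and read the output tensor back as a NumPy array.
response = client.infer(model_name="qnn_model", inputs=inputs, outputs=outputs)
result = response.as_numpy("output__0")
print(result.shape)
```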

Use Case
This feature would benefit users running Triton Inference Server on specialized devices like the Qualcomm QCM6490, where dependency management and NPU optimization are critical. It would streamline workflows and reduce the complexity of configuring Triton containers for such devices.

Expected Outcome
By enabling Triton Inference Server to operate within the SDK container:

- Users can leverage the Qualcomm binaries directly without modifying the Triton container.
- The solution becomes more scalable and user-friendly.
- Inference tasks can take full advantage of the NPU's capabilities on aarch64 devices (a minimal launch sketch is shown below).
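To make the expected workflow concrete, below is a minimal sketch of starting Triton as a service from inside the SDK container. It assumes an aarch64 `tritonserver` binary built against the Qualcomm dependencies is available at a hypothetical path, along with a local model repository; both paths are illustrative only.

```python
# Minimal sketch: launch tritonserver as a long-running service inside the
# Qualcomm SDK container. Both paths below are hypothetical.
import subprocess

TRITON_BIN = "/opt/tritonserver/bin/tritonserver"  # assumed aarch64 build location
MODEL_REPO = "/workspace/models"                   # assumed model repository

# --model-repository is Triton's standard flag for locating models; the
# server then exposes HTTP (8000) and gRPC (8001) endpoints by default.
server = subprocess.Popen([TRITON_BIN, f"--model-repository={MODEL_REPO}"])

try:
    server.wait()
except KeyboardInterrupt:
    server.terminate()
```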
