feat: ORCA Format KV Cache Utilization in Inference Response Header #7839
base: r24.10
Conversation
Force-pushed 43a1b18 to 713c8de:
… for use in HandleGenerate to add kv_utilization and max_token_capacity to the inference request response header.

Force-pushed 713c8de to 74492a8:
…nctionality to HTTPAPIServer::GenerateRequestClass::StartResponse() to extract metrics after inference request is processed for up-to-date metrics.
@nnshah1 @indrajit96 are we merging into the already frozen r24.10 branch?
Good catch, we should target main.
@jbkyang-nvi @nnshah1 Is Ubuntu 24.04 required to build the main branch with container version 24.11? I ported my changes over but am having trouble building the server image. I see in the NVIDIA container support matrix that Ubuntu 24.04 is required for the CUDA deep-learning base container for 24.11, but I also see in the Release 2.52.0 Known Issues that the TRT-LLM backend is built from TensorRT-LLM version 0.15.0 out of nvcr.io/nvidia/tritonserver:24.10-py3-min. To build, I'm using a base image of
For context, the full build command I'm using is:
Follow-up: I got the image built, but when trying to run it I get a version error.
From what I looked up, I don't think I can get
@nv-kmcgill53, @nvda-mesharma After the holiday, can we help with updated instructions on building? I think we are moving everything to 24.04, but I may be mistaken.
What does the PR do?
This PR adds code to HTTPAPIServer::GenerateRequestClass::StartResponse inside src/http_server.cc to add both kv_cache_utilization and max_token_capacity metrics, composed from the existing Prometheus metrics in the TensorRT-LLM Backend's nv_trt_llm_kv_cache_block_metrics metric family. This is accomplished by parsing the serialized Prometheus metrics text provided to the Triton Server frontend by the Triton Core libraries into a structured vector of metrics for a specific metric family.
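For concreteness, here is a rough, self-contained sketch of what parsing the serialized Prometheus text for one metric family could look like. The PromMetric struct and the MetricFamilyExtractor name follow the PR description, but the implementation, label names, and values below are illustrative assumptions, not the PR's actual code.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Structured view of one Prometheus sample: its labels and its value.
struct PromMetric {
  std::map<std::string, std::string> labels;
  double value = 0.0;
};

// Collect every sample of `family` from the Prometheus text exposition format.
std::vector<PromMetric> MetricFamilyExtractor(
    const std::string& serialized, const std::string& family)
{
  std::vector<PromMetric> metrics;
  std::istringstream stream(serialized);
  std::string line;
  while (std::getline(stream, line)) {
    // Skip "# HELP" / "# TYPE" comments and samples from other families.
    if (line.rfind(family, 0) != 0) {
      continue;
    }
    PromMetric m;
    const size_t lbrace = line.find('{');
    const size_t rbrace = line.find('}');
    if (lbrace != std::string::npos && rbrace != std::string::npos) {
      // The label block looks like: key="value",key="value",...
      std::istringstream labels(line.substr(lbrace + 1, rbrace - lbrace - 1));
      std::string kv;
      while (std::getline(labels, kv, ',')) {
        const size_t eq = kv.find('=');
        if (eq == std::string::npos) {
          continue;
        }
        std::string key = kv.substr(0, eq);
        std::string val = kv.substr(eq + 1);
        if (val.size() >= 2 && val.front() == '"' && val.back() == '"') {
          val = val.substr(1, val.size() - 2);  // strip surrounding quotes
        }
        m.labels[key] = val;
      }
    }
    // The numeric value follows the label block (or the name, if unlabeled).
    const size_t value_start =
        (rbrace != std::string::npos) ? rbrace + 1 : line.find(' ');
    m.value = std::stod(line.substr(value_start));
    metrics.push_back(m);
  }
  return metrics;
}

int main()
{
  // Label names and values here are made up for the example.
  const std::string text =
      "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"max\"} 1000\n"
      "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"used\"} 250\n";
  for (const auto& m :
       MetricFamilyExtractor(text, "nv_trt_llm_kv_cache_block_metrics")) {
    std::cout << m.labels.at("kv_cache_block_type") << " = " << m.value << "\n";
  }
  return 0;
}
```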
Checklist
Agreement
- PR title follows the format: <commit_type>: <Title>
- Ran pre-commit hooks (pre-commit install, pre-commit run --all)

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Where should the reviewer start?
Changes are contained to 2 files:
- src/http_server.cc
- src/http_server.h (the former's header file)

The changes start in HTTPAPIServer::GenerateRequestClass::StartResponse(), where the environment variable is checked and the header is written. There are 3 other functions below: MetricFamilyExtractor(), which parses serialized Prometheus metrics into a vector of PromMetric (each with a map of its metric labels); ExtractKVMetrics(), which pulls the values from the structured metrics and calculates the composite KV metrics; and finally OrcaKVMetricHeader(), which forms the metrics into an endpoint-load-metrics header in the ORCA format specified by ORCA_METRIC_FORMAT. If there are no TensorRT-LLM Backend metrics, no metrics found for the header, or an invalid format type, the header is simply not written. The valid values for ORCA_METRIC_FORMAT are documented in the feature request (related issue linked below) and in comments in StartResponse().
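Below is a hedged sketch of how the composite metrics and the header value could be produced from the parsed samples. The struct fields, label values, and the JSON/TEXT renderings are assumptions informed by the PR description and the ORCA load-reporting proposal; they are not copied from this PR.

```cpp
#include <map>
#include <string>
#include <vector>

struct PromMetric {  // see the parsing sketch above
  std::map<std::string, std::string> labels;
  double value = 0.0;
};

struct KVMetrics {                     // hypothetical composite result
  double kv_cache_utilization = 0.0;   // fraction of KV-cache blocks in use
  double max_token_capacity = 0.0;     // max blocks * tokens per block
  bool found = false;                  // false when TRT-LLM metrics are absent
};

// One plausible composition of the two metrics from the
// nv_trt_llm_kv_cache_block_metrics samples; label values are illustrative.
KVMetrics ExtractKVMetrics(const std::vector<PromMetric>& metrics)
{
  double max_blocks = 0.0, used_blocks = 0.0, tokens_per_block = 0.0;
  for (const auto& m : metrics) {
    const auto it = m.labels.find("kv_cache_block_type");
    if (it == m.labels.end()) continue;
    if (it->second == "max") max_blocks = m.value;
    else if (it->second == "used") used_blocks = m.value;
    else if (it->second == "tokens_per") tokens_per_block = m.value;
  }
  KVMetrics kv;
  if (max_blocks > 0.0) {
    kv.kv_cache_utilization = used_blocks / max_blocks;
    kv.max_token_capacity = max_blocks * tokens_per_block;
    kv.found = true;
  }
  return kv;
}

// Render the endpoint-load-metrics header value for the requested format.
// Returns an empty string for an unknown format, so no header is written.
std::string OrcaKVMetricHeader(const std::string& format, const KVMetrics& kv)
{
  const std::string util = std::to_string(kv.kv_cache_utilization);
  const std::string cap = std::to_string(kv.max_token_capacity);
  if (format == "json") {
    return "JSON {\"named_metrics\": {\"kv_cache_utilization\": " + util +
           ", \"max_token_capacity\": " + cap + "}}";
  }
  if (format == "http") {
    return "TEXT named_metrics.kv_cache_utilization=" + util +
           ", named_metrics.max_token_capacity=" + cap;
  }
  return "";  // invalid ORCA_METRIC_FORMAT: caller skips the header
}
```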
Test plan:
The feature is gated behind a feature flag in the form of the ORCA_METRIC_FORMAT environment variable. If unset, the feature is effectively disabled. Beyond that, the changes have been manually tested to not cause issues if either the queried metrics are not present (such as when TensorRT-LLM is not being used as the backend) or the ORCA header metric type is invalid. In either case, nothing is parsed and no header is written. All code changes are wrapped in an #ifdef and are only included if metrics are enabled during the Triton Server build.
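As a minimal illustration of that gating, assuming the build-time metrics macro is TRITON_ENABLE_METRICS (the macro name and the helper function below are illustrative, not the PR's code):

```cpp
#include <cstdlib>
#include <string>

// Returns true only when the feature is compiled in and the flag is set.
bool OrcaHeaderEnabled(std::string* format_out)
{
#ifdef TRITON_ENABLE_METRICS
  // Compiled in only when metrics are enabled in the Triton Server build.
  const char* format = std::getenv("ORCA_METRIC_FORMAT");
  if (format != nullptr && format[0] != '\0') {
    *format_out = format;
    return true;
  }
#endif
  // Env var unset (or metrics disabled at build time): the feature is off and
  // responses look exactly as they did before this change.
  return false;
}
```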
Caveats:
This feature only works on Triton Inference Server running with the TensorRT-LLM Backend, as otherwise the KV-cache metrics are not included in the server metrics.
This change only implements the KV-cache utilization metrics, but the functions it adds allow other metrics to be added easily (including metrics that don't require the TensorRT-LLM Backend).
Background
This doc captures the overall requirements for model servers to integrate with the LLM instance gateway. More details are in the Feature Request below.
Related Issues:
Screenshots
Response header before changes (or if the ORCA_METRIC_FORMAT environment variable is unset):
Response header with ORCA_METRIC_FORMAT="json":
Response header with ORCA_METRIC_FORMAT="http":
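The PR's screenshots are not reproduced here. Purely as an illustration (the values, key names, and renderings are guesses consistent with the header sketch above, not captures from the PR), the header might look like:

```
# ORCA_METRIC_FORMAT unset: no endpoint-load-metrics header is added.

# ORCA_METRIC_FORMAT="json" (illustrative):
endpoint-load-metrics: JSON {"named_metrics": {"kv_cache_utilization": 0.25, "max_token_capacity": 16384}}

# ORCA_METRIC_FORMAT="http" (illustrative):
endpoint-load-metrics: TEXT named_metrics.kv_cache_utilization=0.25, named_metrics.max_token_capacity=16384
```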
cc @yinggeh @krishung5 @jbkyang-nvi