Add initial ext proc implementation with LoRA affinity #14
This is a refactor of the POC implementation in ./examples/ext-proc with the following notable changes:
- Restructured the repo to make it more modular: "handlers" implements the ext proc API and handles requests/responses; "scheduling" implements the request scheduling algorithm; "backend" contains the logic to interact with backend pods.
- Introduced a "filter" concept in the scheduling algorithm to make it easier to write the flow-chart-style algorithm that's being proposed.
- Removed the metric update from response headers, since benchmarking didn't show benefits. The current implementation scrapes metrics every 200ms. Will add the response-based metrics update back if more benchmarking confirms the benefits.
- Simplified the POC code, for example: replaced the freecache package with a sync.Map; consolidated various metrics objects into a single PodMetric object.
- Various improvements, including adding leveled logging, handling errors, and simplifying code.
- The algorithm is simplified a bit from the POC: it finds pods with the least queuing first, then those with active LoRA adapters, then those with the least KV cache percent.

Initial benchmarking shows slightly better throughput than the POC.
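To illustrate the "filter" idea and the ordering described above (least queuing, then active LoRA adapter, then least KV cache percent), here is a minimal Go sketch. It is not the code from this PR: the `PodMetrics` fields, the `Filter` signature, and the function names are hypothetical, exact-minimum selection is a simplification, and falling back to all candidates when no pod has the adapter loaded is an assumption.

```go
package main

import "fmt"

// PodMetrics is a hypothetical, simplified stand-in for the consolidated
// per-pod metrics object mentioned in the description.
type PodMetrics struct {
	Name           string
	WaitingQueue   int             // requests currently queued on the pod
	ActiveAdapters map[string]bool // LoRA adapters currently loaded
	KVCachePercent float64         // KV cache utilization, 0.0 to 1.0
}

// Filter narrows the candidate pods for a request that targets one adapter.
type Filter func(adapter string, pods []PodMetrics) []PodMetrics

// leastBy keeps only the pods whose key equals the minimum key.
func leastBy(pods []PodMetrics, key func(PodMetrics) float64) []PodMetrics {
	if len(pods) == 0 {
		return pods
	}
	lowest := key(pods[0])
	for _, p := range pods[1:] {
		if k := key(p); k < lowest {
			lowest = k
		}
	}
	var out []PodMetrics
	for _, p := range pods {
		if key(p) == lowest {
			out = append(out, p)
		}
	}
	return out
}

// leastQueuing keeps the pods with the shortest waiting queue.
func leastQueuing(_ string, pods []PodMetrics) []PodMetrics {
	return leastBy(pods, func(p PodMetrics) float64 { return float64(p.WaitingQueue) })
}

// loraAffinity keeps pods that already have the adapter loaded.
// Assumption: if none do, all candidates pass through unchanged.
func loraAffinity(adapter string, pods []PodMetrics) []PodMetrics {
	var out []PodMetrics
	for _, p := range pods {
		if p.ActiveAdapters[adapter] {
			out = append(out, p)
		}
	}
	if len(out) == 0 {
		return pods
	}
	return out
}

// leastKVCache keeps the pods with the lowest KV cache utilization.
func leastKVCache(_ string, pods []PodMetrics) []PodMetrics {
	return leastBy(pods, func(p PodMetrics) float64 { return p.KVCachePercent })
}

// schedule applies the filters in order and picks the first survivor.
func schedule(adapter string, pods []PodMetrics, filters ...Filter) (PodMetrics, error) {
	candidates := pods
	for _, f := range filters {
		candidates = f(adapter, candidates)
	}
	if len(candidates) == 0 {
		return PodMetrics{}, fmt.Errorf("no pods available for adapter %q", adapter)
	}
	return candidates[0], nil
}

func main() {
	pods := []PodMetrics{
		{Name: "pod-a", WaitingQueue: 0, ActiveAdapters: map[string]bool{"sql": true}, KVCachePercent: 0.7},
		{Name: "pod-b", WaitingQueue: 0, ActiveAdapters: map[string]bool{"sql": true}, KVCachePercent: 0.2},
		{Name: "pod-c", WaitingQueue: 3, KVCachePercent: 0.1},
	}
	pod, err := schedule("sql", pods, leastQueuing, loraAffinity, leastKVCache)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pod.Name)
}
```

With the sample data, pod-c is dropped for its longer queue, both remaining pods have the "sql" adapter loaded, and pod-b is chosen for its lower KV cache usage. Because each step is an independent function, reordering or swapping steps only changes the argument list passed to `schedule`.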
I believe this is a modification of the benchmarking script I made.
Apologies, I lost track of the script. Please take a look at the updated description.
This PR doesn't have a lot of tests yet; I will follow up with more testing in a separate PR.
/assign /hold
/lgtm Just make sure we've squashed the commits we don't want history of before merge. And this is assuming we will have those tests in a PR shortly. Thanks! This is great!
Added squash merge label
Overall, this LGTM! Just a few minor notes that aren’t essential to address in this PR—just for the record.
Again, awesome work here! Congrats!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, Joffref, liu-cong.
/hold cancel
Initial benchmarking shows slightly better throughput than POC.
Benchmarking setup:
examples/poc/manifests with 6 vLLM replicas running in GCP g2-standard-24 instances (NVIDIA L4 GPU), with max-loras=4 and max-cpu-loras=12.
examples/poc/
This is the benchmark script I used (credits to @kfswain and @kaushikmitr).