
Add initial ext proc implementation with LoRA affinity #14

Merged · 4 commits merged into kubernetes-sigs:main from the poc branch on Oct 4, 2024

Conversation

@liu-cong (Contributor) commented Oct 1, 2024

This is a refactor of the POC implementation in ./examples/ext-proc with the following notable changes:

  • Re-structured the repo to make it more modular: "handlers" implements the ext proc API and handles requests/responses; "scheduling" implements the request scheduling algorithm; "backend" contains the logic for interacting with backend pods.
  • Introduced a "filter" concept in the scheduling algorithm to make it easier to write the flow-chart style algorithm being proposed (see the sketch after this list).
  • Removed the metric update from response headers, since benchmarking didn't show a benefit. The current implementation scrapes metrics every 200ms. Response-based metric updates will be added back if further benchmarking confirms a benefit.
  • Simplified the POC code, e.g. replaced the freecache package with a sync.Map and consolidated the various metrics objects into a single PodMetric object.
  • Various improvements, including leveled logging, error handling, and code simplification.
  • The algorithm is simplified a bit from the POC: it first finds the pods with the least queuing, then those with the requested LoRA adapter active, then those with the lowest KV cache utilization.
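
To make the filter chaining concrete, here is a minimal Go sketch of the flow described above: least queuing first, then LoRA affinity, then least KV cache. All identifiers here (the PodMetric fields, leastQueuing, loraAffinity, leastKVCache, schedule) are illustrative assumptions for this sketch, not the actual names in pkg/ext-proc/scheduling.

package main

import "fmt"

// PodMetric consolidates the per-pod metrics that the ext proc scrapes
// on a fixed interval (every 200ms in this PR).
type PodMetric struct {
    Name           string
    QueueSize      int             // requests waiting on the pod
    KVCachePercent float64         // KV cache utilization
    ActiveAdapters map[string]bool // LoRA adapters currently active
}

// A filter narrows the candidate pod set for one request; chaining
// filters yields the flow-chart style algorithm.
type filter func(adapter string, pods []PodMetric) []PodMetric

// leastQueuing keeps only the pods with the smallest queue.
func leastQueuing(_ string, pods []PodMetric) []PodMetric {
    best := pods[0].QueueSize
    for _, p := range pods[1:] {
        if p.QueueSize < best {
            best = p.QueueSize
        }
    }
    var out []PodMetric
    for _, p := range pods {
        if p.QueueSize == best {
            out = append(out, p)
        }
    }
    return out
}

// loraAffinity prefers pods that already have the requested adapter
// active; if none do, it falls through to the full candidate set.
func loraAffinity(adapter string, pods []PodMetric) []PodMetric {
    var out []PodMetric
    for _, p := range pods {
        if p.ActiveAdapters[adapter] {
            out = append(out, p)
        }
    }
    if len(out) == 0 {
        return pods
    }
    return out
}

// leastKVCache keeps the pods with the lowest KV cache utilization.
func leastKVCache(_ string, pods []PodMetric) []PodMetric {
    best := pods[0].KVCachePercent
    for _, p := range pods[1:] {
        if p.KVCachePercent < best {
            best = p.KVCachePercent
        }
    }
    var out []PodMetric
    for _, p := range pods {
        if p.KVCachePercent == best {
            out = append(out, p)
        }
    }
    return out
}

// schedule runs the chain and returns the chosen pod. It assumes a
// non-empty pod list; ties are broken arbitrarily here, which the real
// implementation may handle differently.
func schedule(adapter string, pods []PodMetric) PodMetric {
    for _, f := range []filter{leastQueuing, loraAffinity, leastKVCache} {
        pods = f(adapter, pods)
    }
    return pods[0]
}

func main() {
    pods := []PodMetric{
        {Name: "pod-a", QueueSize: 2, KVCachePercent: 40, ActiveAdapters: map[string]bool{"sql-lora": true}},
        {Name: "pod-b", QueueSize: 2, KVCachePercent: 25, ActiveAdapters: map[string]bool{"tweet-summary": true}},
        {Name: "pod-c", QueueSize: 5, KVCachePercent: 10, ActiveAdapters: map[string]bool{}},
    }
    // pod-a wins: tied for least queuing, and the adapter is already active.
    fmt.Println(schedule("sql-lora", pods).Name)
}

Each step either narrows the candidate set or, for LoRA affinity, falls through to the full set, so the chain always terminates with at least one pod.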

Initial benchmarking shows slightly better throughput than the POC.

Benchmarking setup:

  • vLLM deployment using the examples/poc/manifests, with 6 vLLM replicas running on GCP g2-standard-24 instances (NVIDIA L4 GPU), with max-loras=4 and max-cpu-loras=12.
  • The control group is a LoadBalancer k8s Service.
  • Ext proc is deployed the same way as in examples/poc/.
  • Load is driven by a simple benchmark script in `examples/poc/` (included at the end of this description).
  1. LoadBalancer service
Total time: 300.944089417
Total completed requests: 484
Requests per second: 1.608272157587885
Total generated tokens: 192596
Average output tokens per request: 397.92561983471074
Total prompt tokens: 8228
Average input tokens per request: 17.0
Output tokens per second: 639.9726951710668
Bad content type errors: 0
Timeout requests: 516
Dropped requests: 0
Server disconnections: 0
Unknown: 0
OS errors: 0
Total dropped requests: 516
  2. POC Ext Proc
Total time: 300.93758045799996
Total completed requests: 780
Requests per second: 2.591899618561796
Total generated tokens: 302094
Average output tokens per request: 387.3
Total prompt tokens: 13260
Average input tokens per request: 17.0
Output tokens per second: 1003.8427222689837
Bad content type errors: 0
Timeout requests: 220
Dropped requests: 0
Server disconnections: 0
Unknown: 0
OS errors: 0
Total dropped requests: 220
  3. Ext proc in this PR
Total time: 300.836648292
Total completed requests: 846
Requests per second: 2.812157377776826
Total generated tokens: 339110
Average output tokens per request: 400.83924349881795
Total prompt tokens: 14382
Average input tokens per request: 17.0
Output tokens per second: 1127.2230359076825
Bad content type errors: 0
Timeout requests: 154
Dropped requests: 0
Server disconnections: 0
Unknown: 0
OS errors: 0
Total dropped requests: 154

This is the benchmark script I used (credits to @kfswain and @kaushikmitr):

import time
import random
import asyncio
import aiohttp

# Global metrics
metrics = {
    "total_tokens_generated": 0,
    "total_prompt_tokens": 0,
    "dropped_requests": 0,
    "timeout_requests": 0,
    "bad_content_type": 0,
    "os_errors": 0,
    "unknown_error": 0,
    "server_disconnected": 0,
}

# Model and IP configurations
models = [
    "tweet-summary", "sql-lora", "tweet-summary-0", "sql-lora-0",
    "tweet-summary-1", "sql-lora-1", "tweet-summary-2", "sql-lora-2",
    "tweet-summary-3", "sql-lora-3", "tweet-summary-4", "sql-lora-4",
]

# IP="35.224.164.47" # IP of the vllm Service
# PORT="8000"
IP="34.69.218.45" # IP of the gateway
PORT="8081"

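# Maps each endpoint to the models it serves; create_json picks one of
# these at random when no model is specified.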
model_map = {
    IP: models,
}

def create_json(ip: str, model: str = None) -> dict:
    if model is None:
        model = random.choice(model_map[ip])
    return {"prompt": "Is the Necronomicon in the movie: Army of Darkness?", "max_tokens": 750, "model": model}

async def parallelized_benchmarking(session: aiohttp.ClientSession, ip: str, model: str = None, specify_target_pod: bool = False):
    try:
        json_data = create_json(ip, model)  # use the per-request ip rather than the global IP
        url = f"http://{ip}:{PORT}/v1/completions"
        headers = {'Content-Type': 'application/json'}
        if specify_target_pod:
            headers['target-pod'] = get_target_pods()
        async with session.post(url, json=json_data, headers=headers) as response:
            response_json = await response.json()
            metrics["total_tokens_generated"] += int(response_json['usage']['completion_tokens'])
            metrics["total_prompt_tokens"] += int(response_json['usage']['prompt_tokens'])
    except aiohttp.client_exceptions.ClientConnectorError as client_err:
        metrics["dropped_requests"] += 1
        print(client_err)
    except asyncio.TimeoutError as timeout_err:
        metrics["timeout_requests"] += 1
        print(timeout_err)
    except aiohttp.client_exceptions.ClientOSError:
        metrics["os_errors"] += 1
    except aiohttp.client_exceptions.ContentTypeError as e:
        # response.json() has already read the body, so the cached payload
        # is still available here even though the connection was released.
        text = await response.text()
        print(text)
        print(e)
        metrics["bad_content_type"] += 1
    except aiohttp.client_exceptions.ServerDisconnectedError:
        metrics["server_disconnected"] += 1
    except Exception:
        metrics["unknown_error"] += 1

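# Yields one target IP per request; written as a generator so the script
# can be extended to spread load across multiple endpoints.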
def ips(n_reqs: int):
    available_ips = [IP]
    for _ in range(n_reqs):
        yield random.choice(available_ips)

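# Hardcoded pod IPs from the test cluster; only consulted when
# specify_target_pod is True, to pin each request to a specific pod via
# the target-pod header.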
def get_target_pods():
    available_ips = ["10.8.2.194:8000","10.8.3.185:8000","10.8.0.94:8000" ]
    return random.choice(available_ips)

async def test_main(n_reqs: int, model: str = None, specify_target_pod: bool = False):
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[parallelized_benchmarking(session, ip, model, specify_target_pod) for ip in ips(n_reqs)]
        )

def clear_metrics():
    global metrics
    for key in metrics:
        metrics[key] = 0

if __name__ == "__main__":
    print(f"Starting benchmark")
    # Warm-up phase

    # Clear metrics after warm-up
    clear_metrics()

    # Main benchmarking phase
    n_reqs = 1000
    specify_target_pod = False
    start = time.perf_counter()
    asyncio.run(test_main(n_reqs=n_reqs, specify_target_pod=specify_target_pod))
    end = time.perf_counter()
    
    bad_requests = sum(metrics[key] for key in ["dropped_requests", "timeout_requests", "os_errors", "bad_content_type", "server_disconnected", "unknown_error"])
    total_complete_reqs = n_reqs - bad_requests

    # Results output
    print(f"Total time: {end-start}")
    print(f"Total completed requests: {total_complete_reqs}")
    print(f"Requests per second: {total_complete_reqs / (end-start)}")
    print(f"Total generated tokens: {metrics['total_tokens_generated']}")
    print(f"Average output tokens per request: {metrics['total_tokens_generated'] / total_complete_reqs}")
    print(f"Total prompt tokens: {metrics['total_prompt_tokens']}")
    print(f"Average input tokens per request: {metrics['total_prompt_tokens'] / total_complete_reqs}")
    print(f"Output tokens per second: {metrics['total_tokens_generated'] / (end-start)}")
    print(f"Bad content type errors: {metrics['bad_content_type']}")
    print(f"Timeout requests: {metrics['timeout_requests']}")
    print(f"Dropped requests: {metrics['dropped_requests']}")
    print(f"Server disconnections: {metrics['server_disconnected']}")
    print(f"Unknown: {metrics['unknown_error']}")
    print(f"OS errors: {metrics['os_errors']}")
    print(f"Total dropped requests: {bad_requests}")

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 1, 2024
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and kfswain October 1, 2024 05:03
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 1, 2024
@kfswain (Collaborator) commented Oct 1, 2024

> This is the benchmark script used which was created by @kaushikmitr

I believe this is a modification of the benchmarking script I made.

@liu-cong (Contributor, Author) commented Oct 1, 2024

> > This is the benchmark script used which was created by @kaushikmitr

> I believe this is a modification of the benchmarking script I made.

Apologies, I lost track of the script. Please take a look at the updated description.

@liu-cong liu-cong marked this pull request as ready for review October 1, 2024 16:59
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2024
@k8s-ci-robot k8s-ci-robot requested a review from robscott October 1, 2024 16:59
@liu-cong (Contributor, Author) commented Oct 1, 2024

This PR doesn't have a lot of tests yet; I will follow up with more testing in a separate PR.

@ahg-g (Contributor) commented Oct 1, 2024

/assign

/hold
just in case it gets merged by mistake

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2024
@kfswain (Collaborator) commented Oct 2, 2024

/lgtm. Just make sure we've squashed the commits whose history we don't want before merging. And this assumes we will have those tests in a PR shortly.

Thanks! This is great!

@terrytangyuan terrytangyuan added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 2, 2024
@terrytangyuan (Member) commented

Added the squash merge label.

@Joffref left a comment

Overall, this LGTM! Just a few minor notes that aren’t essential to address in this PR—just for the record.
Again, awesome work here! Congrats!

@ahg-g (Contributor) commented Oct 4, 2024

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2024
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Joffref, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2024
@ahg-g (Contributor) commented Oct 4, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 4, 2024
@k8s-ci-robot k8s-ci-robot merged commit 6f9869d into kubernetes-sigs:main Oct 4, 2024
2 checks passed
@liu-cong liu-cong deleted the poc branch October 28, 2024 18:51