
Add initial ext proc implementation with LoRA affinity #14

Merged · 4 commits merged into kubernetes-sigs:main from the poc branch on Oct 4, 2024

Conversation

@liu-cong (Contributor) commented Oct 1, 2024

This is a refactor of the POC implementation in ./examples/ext-proc with the following notable changes:

  • Re-structured the repo to make it more modular: "handlers" implements the ext proc API and handles requests/responses; "scheduling" implements the request scheduling algorithm; "backend" contains the logic for interacting with backend pods.
  • Introduced a "filter" concept in the scheduling algorithm to make it easier to write the flow-chart style algorithm being proposed (see the sketch after this list).
  • Removed the metric update from response headers, since benchmarking didn't show a benefit. The current implementation scrapes metrics every 200ms. Response-based metric updates will be added back if further benchmarking confirms a benefit.
  • Simplified the POC code, e.g. replaced the freecache package with a sync.Map and consolidated the various metrics objects into a single PodMetric object.
  • Various improvements, including leveled logging, error handling, and code simplification.
  • The algorithm is simplified a bit from the POC: it first finds the pods with the least queuing, then those with the requested LoRA adapter active, then those with the lowest KV cache utilization.
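
To make the filter chaining concrete, here is a minimal Go sketch of the flow described above: least queuing first, then LoRA affinity, then least KV cache. All identifiers here (the PodMetric fields, leastQueuing, loraAffinity, leastKVCache, schedule) are illustrative assumptions for this sketch, not the actual names in pkg/ext-proc/scheduling.

package main

import "fmt"

// PodMetric consolidates the per-pod metrics that the ext proc scrapes
// on a fixed interval (every 200ms in this PR).
type PodMetric struct {
    Name           string
    QueueSize      int             // requests waiting on the pod
    KVCachePercent float64         // KV cache utilization
    ActiveAdapters map[string]bool // LoRA adapters currently active
}

// A filter narrows the candidate pod set for one request; chaining
// filters yields the flow-chart style algorithm.
type filter func(adapter string, pods []PodMetric) []PodMetric

// leastQueuing keeps only the pods with the smallest queue.
func leastQueuing(_ string, pods []PodMetric) []PodMetric {
    best := pods[0].QueueSize
    for _, p := range pods[1:] {
        if p.QueueSize < best {
            best = p.QueueSize
        }
    }
    var out []PodMetric
    for _, p := range pods {
        if p.QueueSize == best {
            out = append(out, p)
        }
    }
    return out
}

// loraAffinity prefers pods that already have the requested adapter
// active; if none do, it falls through to the full candidate set.
func loraAffinity(adapter string, pods []PodMetric) []PodMetric {
    var out []PodMetric
    for _, p := range pods {
        if p.ActiveAdapters[adapter] {
            out = append(out, p)
        }
    }
    if len(out) == 0 {
        return pods
    }
    return out
}

// leastKVCache keeps the pods with the lowest KV cache utilization.
func leastKVCache(_ string, pods []PodMetric) []PodMetric {
    best := pods[0].KVCachePercent
    for _, p := range pods[1:] {
        if p.KVCachePercent < best {
            best = p.KVCachePercent
        }
    }
    var out []PodMetric
    for _, p := range pods {
        if p.KVCachePercent == best {
            out = append(out, p)
        }
    }
    return out
}

// schedule runs the chain and returns the chosen pod. It assumes a
// non-empty pod list; ties are broken arbitrarily here, which the real
// implementation may handle differently.
func schedule(adapter string, pods []PodMetric) PodMetric {
    for _, f := range []filter{leastQueuing, loraAffinity, leastKVCache} {
        pods = f(adapter, pods)
    }
    return pods[0]
}

func main() {
    pods := []PodMetric{
        {Name: "pod-a", QueueSize: 2, KVCachePercent: 40, ActiveAdapters: map[string]bool{"sql-lora": true}},
        {Name: "pod-b", QueueSize: 2, KVCachePercent: 25, ActiveAdapters: map[string]bool{"tweet-summary": true}},
        {Name: "pod-c", QueueSize: 5, KVCachePercent: 10, ActiveAdapters: map[string]bool{}},
    }
    // pod-a wins: tied for least queuing, and the adapter is already active.
    fmt.Println(schedule("sql-lora", pods).Name)
}

Each step either narrows the candidate set or, for LoRA affinity, falls through to the full set, so the chain always terminates with at least one pod.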

Initial benchmarking shows slightly better throughput than the POC.

Benchmarking setup:

  • vLLM deployment using the examples/poc/manifests, with 6 vLLM replicas running on GCP g2-standard-24 instances (NVIDIA L4 GPU), with max-loras=4 and max-cpu-loras=12.
  • The control group is a LoadBalancer k8s Service.
  • Ext proc is deployed the same way as in examples/poc/.
  • Load is driven by a simple benchmark script in `examples/poc/` (included at the end of this description).
  1. LoadBalancer service
Total time: 300.944089417
Total completed requests: 484
Requests per second: 1.608272157587885
Total generated tokens: 192596
Average output tokens per request: 397.92561983471074
Total prompt tokens: 8228
Average input tokens per request: 17.0
Output tokens per second: 639.9726951710668
Bad content type errors: 0
Timeout requests: 516
Dropped requests: 0
Server disconnections: 0
Unknown: 0
OS errors: 0
Total dropped requests: 516
  2. POC Ext Proc
Total time: 300.93758045799996
Total completed requests: 780
Requests per second: 2.591899618561796
Total generated tokens: 302094
Average output tokens per request: 387.3
Total prompt tokens: 13260
Average input tokens per request: 17.0
Output tokens per second: 1003.8427222689837
Bad content type errors: 0
Timeout requests: 220
Dropped requests: 0
Server disconnections: 0
Unknown: 0
OS errors: 0
Total dropped requests: 220
  3. Ext proc in this PR
Total time: 300.836648292
Total completed requests: 846
Requests per second: 2.812157377776826
Total generated tokens: 339110
Average output tokens per request: 400.83924349881795
Total prompt tokens: 14382
Average input tokens per request: 17.0
Output tokens per second: 1127.2230359076825
Bad content type errors: 0
Timeout requests: 154
Dropped requests: 0
Server disconnections: 0
Unknown: 0
OS errors: 0
Total dropped requests: 154

This is the benchmark script I used (credits to @kfswain and @kaushikmitr):

import time
import random
import asyncio
import aiohttp

# Global metrics
metrics = {
    "total_tokens_generated": 0,
    "total_prompt_tokens": 0,
    "dropped_requests": 0,
    "timeout_requests": 0,
    "bad_content_type": 0,
    "os_errors": 0,
    "unknown_error": 0,
    "server_disconnected": 0,
}

# Model and IP configurations
models = [
    "tweet-summary", "sql-lora", "tweet-summary-0", "sql-lora-0",
    "tweet-summary-1", "sql-lora-1", "tweet-summary-2", "sql-lora-2",
    "tweet-summary-3", "sql-lora-3", "tweet-summary-4", "sql-lora-4",
]

# IP="35.224.164.47" # IP of the vllm Service
# PORT="8000"
IP="34.69.218.45" # IP of the gateway
PORT="8081"

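# Maps each endpoint to the models it serves; create_json picks one of
# these at random when no model is specified.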
model_map = {
    IP: models,
}

def create_json(ip: str, model: str = None) -> dict:
    if model is None:
        model = random.choice(model_map[ip])
    return {"prompt": "Is the Necronomicon in the movie: Army of Darkness?", "max_tokens": 750, "model": model}

async def parallelized_benchmarking(session: aiohttp.ClientSession, ip: str, model: str = None, specify_target_pod: bool = False):
    try:
        json_data = create_json(ip, model)  # use the per-request ip rather than the global IP
        url = f"http://{ip}:{PORT}/v1/completions"
        headers = {'Content-Type': 'application/json'}
        if specify_target_pod:
            headers['target-pod'] = get_target_pods()
        async with session.post(url, json=json_data, headers=headers) as response:
            response_json = await response.json()
            metrics["total_tokens_generated"] += int(response_json['usage']['completion_tokens'])
            metrics["total_prompt_tokens"] += int(response_json['usage']['prompt_tokens'])
    except aiohttp.client_exceptions.ClientConnectorError as client_err:
        metrics["dropped_requests"] += 1
        print(client_err)
    except asyncio.TimeoutError as timeout_err:
        metrics["timeout_requests"] += 1
        print(timeout_err)
    except aiohttp.client_exceptions.ClientOSError:
        metrics["os_errors"] += 1
    except aiohttp.client_exceptions.ContentTypeError as e:
        # response.json() has already read the body, so the cached payload
        # is still available here even though the connection was released.
        text = await response.text()
        print(text)
        print(e)
        metrics["bad_content_type"] += 1
    except aiohttp.client_exceptions.ServerDisconnectedError:
        metrics["server_disconnected"] += 1
    except Exception:
        metrics["unknown_error"] += 1

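# Yields one target IP per request; written as a generator so the script
# can be extended to spread load across multiple endpoints.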
def ips(n_reqs: int):
    available_ips = [IP]
    for _ in range(n_reqs):
        yield random.choice(available_ips)

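# Hardcoded pod IPs from the test cluster; only consulted when
# specify_target_pod is True, to pin each request to a specific pod via
# the target-pod header.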
def get_target_pods():
    available_ips = ["10.8.2.194:8000","10.8.3.185:8000","10.8.0.94:8000" ]
    return random.choice(available_ips)

async def test_main(n_reqs: int, model: str = None, specify_target_pod: bool = False):
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[parallelized_benchmarking(session, ip, model, specify_target_pod) for ip in ips(n_reqs)]
        )

def clear_metrics():
    global metrics
    for key in metrics:
        metrics[key] = 0

if __name__ == "__main__":
    print(f"Starting benchmark")
    # Warm-up phase

    # Clear metrics after warm-up
    clear_metrics()

    # Main benchmarking phase
    n_reqs = 1000
    specify_target_pod = False
    start = time.perf_counter()
    asyncio.run(test_main(n_reqs=n_reqs, specify_target_pod=specify_target_pod))
    end = time.perf_counter()
    
    bad_requests = sum(metrics[key] for key in ["dropped_requests", "timeout_requests", "os_errors", "bad_content_type", "server_disconnected", "unknown_error"])
    total_complete_reqs = n_reqs - bad_requests

    # Results output
    print(f"Total time: {end-start}")
    print(f"Total completed requests: {total_complete_reqs}")
    print(f"Requests per second: {total_complete_reqs / (end-start)}")
    print(f"Total generated tokens: {metrics['total_tokens_generated']}")
    print(f"Average output tokens per request: {metrics['total_tokens_generated'] / total_complete_reqs}")
    print(f"Total prompt tokens: {metrics['total_prompt_tokens']}")
    print(f"Average input tokens per request: {metrics['total_prompt_tokens'] / total_complete_reqs}")
    print(f"Output tokens per second: {metrics['total_tokens_generated'] / (end-start)}")
    print(f"Bad content type errors: {metrics['bad_content_type']}")
    print(f"Timeout requests: {metrics['timeout_requests']}")
    print(f"Dropped requests: {metrics['dropped_requests']}")
    print(f"Server disconnections: {metrics['server_disconnected']}")
    print(f"Unknown: {metrics['unknown_error']}")
    print(f"OS errors: {metrics['os_errors']}")
    print(f"Total dropped requests: {bad_requests}")

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 1, 2024
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and kfswain October 1, 2024 05:03
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 1, 2024
@kfswain (Collaborator) commented Oct 1, 2024

> This is the benchmark script used which was created by @kaushikmitr

I believe this is a modification of the benchmarking script I made.

@liu-cong (Contributor, Author) commented Oct 1, 2024

> > This is the benchmark script used which was created by @kaushikmitr

> I believe this is a modification of the benchmarking script I made.

Apologies, I lost track of the script. Please take a look at the updated description.

@liu-cong liu-cong marked this pull request as ready for review October 1, 2024 16:59
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2024
@k8s-ci-robot k8s-ci-robot requested a review from robscott October 1, 2024 16:59
@liu-cong (Contributor, Author) commented Oct 1, 2024

This PR doesn't have a lot of tests yet; I will follow up with more testing in a separate PR.

@ahg-g (Contributor) commented Oct 1, 2024

/assign

/hold
just in case it gets merged by mistake

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2024
@kfswain (Collaborator) commented Oct 2, 2024

/lgtm. Just make sure we've squashed the commits whose history we don't want before merging. And this assumes we will have those tests in a PR shortly.

Thanks! This is great!

@terrytangyuan terrytangyuan added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 2, 2024
@terrytangyuan (Member) commented

Added the squash merge label.

@Joffref left a comment

Overall, this LGTM! Just a few minor notes that aren’t essential to address in this PR—just for the record.
Again, awesome work here! Congrats!

@ahg-g (Contributor) commented Oct 4, 2024

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2024
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Joffref, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2024
@ahg-g (Contributor) commented Oct 4, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 4, 2024
@k8s-ci-robot k8s-ci-robot merged commit 6f9869d into kubernetes-sigs:main Oct 4, 2024
2 checks passed
@liu-cong liu-cong deleted the poc branch October 28, 2024 18:51