Update setup.py #662

Open · wants to merge 425 commits into base: main
Conversation

anagnoko23

I removed an empty line.

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

song-william and others added 30 commits November 14, 2023 15:32
* Forward HTTP status code for sync requests

* don't return json response for celery forwarding results

* fix unit tests

* forward for all sync requests
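
A minimal sketch of what "forward HTTP status code for sync requests" could look like, assuming a FastAPI route that proxies to the model container with httpx; the route, URL, and payload handling here are illustrative, not the actual llm-engine forwarder code.

```python
# Illustrative only: propagate the upstream status code instead of always returning 200.
import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.post("/sync")
async def forward_sync(request: Request) -> Response:
    payload = await request.body()
    async with httpx.AsyncClient() as client:
        upstream = await client.post("http://model-container:5005/predict", content=payload)
    # Return the model container's status code and body verbatim,
    # rather than wrapping errors in a 200 JSON envelope.
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type=upstream.headers.get("content-type", "application/json"),
    )
```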
* TRT-LLM WIP

* Integrate TensorRT-LLM

* fix

* revert

* formatting

* fix

* comments
* make test work

* add status checking

* fix

* test

* wget fix

* final fixes

* move namespace
* Found a bug in the codellama vllm model_len logic.

Also, let's just avoid the vLLM error by making sure max_num_batched_tokens >= max_model_len

* nevermind I realized that if statement will never happen here.
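
A hedged sketch of the constraint described in this commit: when building vLLM engine arguments, make sure max_num_batched_tokens is at least max_model_len so vLLM does not error at startup. The function and defaults below are illustrative, not the actual codellama config path.

```python
from typing import Optional

def resolve_vllm_lengths(max_model_len: int, max_num_batched_tokens: Optional[int]) -> dict:
    """Sketch: clamp max_num_batched_tokens so it never falls below max_model_len."""
    if max_num_batched_tokens is None:
        max_num_batched_tokens = max_model_len
    # vLLM refuses to batch fewer tokens than a single full-length sequence
    # can contain, so enforce the invariant here.
    max_num_batched_tokens = max(max_num_batched_tokens, max_model_len)
    return {
        "max_model_len": max_model_len,
        "max_num_batched_tokens": max_num_batched_tokens,
    }
```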
* count prompt tokens, use tokenizer if needed

* docstrings

* fix tests and code cov

* add download files from s3 fn

* use same helpers and add docstring

* change to namedtuple

* add s3 repo locations

* fallback read from s3

* refactor tokenizer load

* edit tests

* refactor _SUPPORTED_MODELS_BY_FRAMEWORK

* updates for tests

* move to utils file

* move some fns over

* use lru cache

* move model info

* root to opt

* add log and adjust integration test

* refocus logs

* change empty string to optional

* mock count tokens for unit tests

* change 1 mock

* add unit tests

* config change

* comments pt 1

* move internal logic to plugins file

* replace usage of utils file

* rearrange test mock

* only return prompt tokens count on last token in stream

* fix mock

* reorganize imports

* inject in external interfaces

* make changes to tests

* fix tests

* adjust test

* oops test

* add more tests
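
The commits above describe counting prompt tokens with a tokenizer that is loaded lazily, cached via an LRU cache, and fetched from S3 as a fallback when the Hugging Face repo is unavailable, with the count only returned on the last token of a stream. A rough sketch under those assumptions; the S3 prefix and download helper are hypothetical:

```python
from functools import lru_cache

from transformers import AutoTokenizer  # assumed dependency

S3_TOKENIZER_PREFIX = "s3://my-bucket/tokenizers"  # hypothetical location

def download_tokenizer_from_s3(model_name: str, s3_prefix: str) -> str:
    """Hypothetical helper: download tokenizer files from S3, return the local dir."""
    raise NotImplementedError("sketch only")

@lru_cache(maxsize=32)
def load_tokenizer(model_name: str):
    """Load a tokenizer once per model; fall back to an S3 copy if the hub fails."""
    try:
        return AutoTokenizer.from_pretrained(model_name)
    except OSError:
        local_dir = download_tokenizer_from_s3(model_name, S3_TOKENIZER_PREFIX)
        return AutoTokenizer.from_pretrained(local_dir)

def count_prompt_tokens(prompt: str, model_name: str) -> int:
    """Count prompt tokens; callers emit this only on the final token of a stream."""
    tokenizer = load_tokenizer(model_name)
    return len(tokenizer.encode(prompt))
```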
* emit metrics on token counts

* remove print
* Some updates to integration tests

* fix

* comment

* better env var
* adding zephyr 7b

* update tokenizer repo
* update tensor-rt llm in enum

* fix to be the same as in the egp and spellbook-backend
* time use case

* name

* update fake
* update docs to show model len / context windows

* make title clearer

* make title clearer pt2
* change code-llama to codellama

* use both code-llama and codellama temporarily
* fix completions request id
* 4x sqlalchemy pool size

* don't update nullpool
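
A minimal sketch of quadrupling the SQLAlchemy pool size while leaving NullPool-based engines untouched; the 4x factor comes from the commit message, everything else (DSN, defaults) is illustrative.

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

DEFAULT_POOL_SIZE = 5  # illustrative previous default

# Pooled engine: 4x the pool size to absorb more concurrent requests.
pooled_engine = create_engine(
    "postgresql+psycopg2://user:pass@db:5432/app",  # placeholder DSN
    pool_size=DEFAULT_POOL_SIZE * 4,
    max_overflow=10,
)

# NullPool engine is intentionally left as-is: NullPool opens a fresh
# connection per checkout, so pool_size does not apply to it.
nullpool_engine = create_engine(
    "postgresql+psycopg2://user:pass@db:5432/app",
    poolclass=NullPool,
)
```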
seanshi-scale and others added 29 commits October 10, 2024 12:54
need to deduplicate some arguments
* Add hardware spec to client

* fix import and update version

* fix import and update version
…leapi#635)

* Add *.py files to model weights if trust_remote_code is provided

* Add to azure

* add test

* Add additional tests
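
A sketch of the "*.py files to model weights" idea: when trust_remote_code is set, include the repo's Python files (custom modeling code) alongside the weight files being uploaded. Paths, suffixes, and the helper name are assumptions, not the actual scaleapi#635 implementation.

```python
from pathlib import Path
from typing import List

WEIGHT_SUFFIXES = (".safetensors", ".bin", ".json", ".model")  # illustrative

def files_to_upload(model_dir: str, trust_remote_code: bool) -> List[Path]:
    """Collect weight files, plus *.py modeling code when trust_remote_code is enabled."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.suffix in WEIGHT_SUFFIXES]
    if trust_remote_code:
        # Custom architectures ship their modeling/config code as .py files in the repo;
        # without them, from_pretrained(..., trust_remote_code=True) cannot rebuild the model.
        files.extend(root.rglob("*.py"))
    return files
```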
* Refactor client data types + add vllm arg passthrough

* Bump client version

* fix dict assignment

* add test
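
A sketch of what a vLLM argument passthrough on the client data types could look like, where extra engine kwargs are collected in one field and forwarded rather than enumerated field by field; the field and class names are guesses, not the published client API.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel

class CreateCompletionRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128
    temperature: float = 0.2
    # Hypothetical passthrough: extra vLLM sampling kwargs (e.g. top_k,
    # presence_penalty) are forwarded to the engine untouched.
    vllm_extra_args: Optional[Dict[str, Any]] = None

def build_sampling_params(req: CreateCompletionRequest) -> Dict[str, Any]:
    params: Dict[str, Any] = {"max_tokens": req.max_new_tokens, "temperature": req.temperature}
    params.update(req.vllm_extra_args or {})
    return params
```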
* quick todo

* add prometheus metric for the 1-N scaling part

* todo prometheus server addr

* values sample

* autogen tpl

* add in a thing to the model endpoint service to say whether we can autoscale from zero

* pass through validation

* untested to get concurrency value

* fix some tests, add some tests

* clean up some things

* fix a few bugs

* autogen tpl

* cleanup

* comment dependency

* rename

* rename
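
A hedged sketch of the two pieces this commit series describes: a Prometheus metric feeding the 1-to-N scaling path, and an endpoint-level flag saying whether scale-from-zero is supported. Metric and function names are illustrative, not taken from the repo.

```python
from prometheus_client import Gauge

# Illustrative metric driving the 1-N autoscaling decision.
concurrency_per_endpoint = Gauge(
    "llm_engine_endpoint_concurrency",
    "Observed concurrent requests per endpoint, used to drive 1-N autoscaling.",
    ["endpoint_name"],
)

def can_scale_from_zero(min_workers: int, supports_zero: bool) -> bool:
    """Endpoint-level capability check: only endpoints that opt in and allow
    min_workers == 0 can be scaled to zero and back up on demand."""
    return supports_zero and min_workers == 0

def record_concurrency(endpoint_name: str, value: float) -> None:
    concurrency_per_endpoint.labels(endpoint_name=endpoint_name).set(value)
```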
* Revert "Remove model name override (scaleapi#641)"

This reverts commit 7cb43cd.

* Remove 'Annotated' type usage - python 3.8/pydantic doesn't like it

* Add request id on batch completions error
…pi#646)

* work on image cache gateway

* some service stuff

* rename

* upd test

* autogen tpl and add h100

* quick test

* quick test

* fix test

* black

* fix some config misnames

* rename

* another rename
* up storage limit + test

* actually bump it again

* bump recHardware also

* oops
* Bearer auth for oai compatibility

* fix test
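
A minimal sketch of accepting an "Authorization: Bearer <key>" header for OpenAI-client compatibility, alongside whatever auth flow already exists; this function is a stand-in, not the repo's actual auth dependency.

```python
from fastapi import HTTPException, Request

def get_api_key(request: Request) -> str:
    """Accept an OpenAI-style bearer token in addition to the existing auth flow."""
    auth = request.headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    # The pre-existing basic-auth path would be checked here (omitted in this sketch).
    raise HTTPException(status_code=401, detail="Missing or malformed credentials")
```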
* Updates to helm charts to sync with SGP

* bump version
…caleapi#655)

* balloons can take up more than 1 gpu

* setting to make balloons only for high priority

* values.yaml default

* bump helm chart version
…outs (scaleapi#660)

* add this new trace dimension

* try bumping ddtrace to the newest 1.x.y version

* reset reqs to main

* again

* remove thing that doesn't work (rip)

* emit sync call timeout metrics in monitoring metrics gateway

* initialize the sync/streaming inference gateways to use the monitoring metrics gateway

* Revert "initialize the sync/streaming inference gateways to use the monitoring metrics gateway"
Let's just emit in the use case instead

This reverts commit 0bf2a54.

* wip try emitting from use cases, will probably abandon it

* Revert "wip try emitting from use cases, will probably abandon it"

This reverts commit 6b599bd.

* Revert "Revert "initialize the sync/streaming inference gateways to use the monitoring metrics gateway""
ok let's actually just emit from the sync/streaming gateways

This reverts commit 432c0b5.

* small refactor

* thread the readable endpoint name through everywhere

* actually emit the metrics

* rename

* rename

* comment + small type thing
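
A sketch of the design these commits converge on: the sync/streaming inference gateways hold a reference to the monitoring metrics gateway and emit the timeout metric themselves (rather than the use cases doing it), with a readable endpoint name threaded through. Class and method names are illustrative.

```python
import asyncio
from abc import ABC, abstractmethod

class MonitoringMetricsGateway(ABC):
    @abstractmethod
    def emit_sync_call_timeout(self, endpoint_name: str) -> None:
        """Record that a synchronous inference call timed out for this endpoint."""

class SyncInferenceGateway:
    def __init__(self, monitoring_metrics_gateway: MonitoringMetricsGateway):
        # The gateway, not the use case, owns timeout emission (see the reverts above).
        self.monitoring_metrics_gateway = monitoring_metrics_gateway

    async def predict(self, endpoint_name: str, request: dict) -> dict:
        try:
            return await self._call_endpoint(endpoint_name, request)
        except asyncio.TimeoutError:
            self.monitoring_metrics_gateway.emit_sync_call_timeout(endpoint_name)
            raise

    async def _call_endpoint(self, endpoint_name: str, request: dict) -> dict:
        ...  # actual HTTP forwarding omitted in this sketch
```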
…#661)

Pass said parameter to vLLM engine if requested by user
I removed an empty line.