The paper mentions that 'this latency can be reduced to 22 seconds on 8x A100 GPUs'. How is this achieved, and does the current version already support it?
Hi @XiongxiaoL, thanks for your interest in our work.
We implement vLLM with tensor parallelism (TP) in the `hjiang/support_vllm_tp` branch. To use it:
1. Switch to the `hjiang/support_vllm_tp` branch.
2. Run `pip install -e .`
3. Copy `minference_patch_vllm_tp` and `minference_patch_vllm_executor` from `minference/patch.py` to the end of the `Worker` class in `vllm/worker/worker.py`. Make sure to indent `minference_patch_vllm_tp`.
4. When calling vLLM, ensure `enable_chunked_prefill=False` is set (see the sketch after this list).
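For reference, a rough sketch of what the call in step 4 could look like. It assumes the `MInference("vllm", model_name)` patch entry point from the repo README and a placeholder model name; neither is taken from this thread, and argument names may differ on the `hjiang/support_vllm_tp` branch.

```python
# Minimal sketch, assuming the MInference("vllm", ...) patch entry point from
# the README. The manual Worker patch from step 3 must already be applied to
# vllm/worker/worker.py on the hjiang/support_vllm_tp branch.
from vllm import LLM, SamplingParams
from minference import MInference

# Placeholder model name for illustration; substitute your long-context model.
model_name = "gradientai/Llama-3-8B-Instruct-262k"

llm = LLM(
    model_name,
    tensor_parallel_size=8,        # shard the model across 8x A100 GPUs
    enable_chunked_prefill=False,  # required, per step 4 above
    enforce_eager=True,
    max_model_len=128_000,
)

# Apply the MInference patch so vLLM's attention uses the sparse kernels.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

prompt = "..."  # your long-context prompt
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```

The 22-second figure in the paper comes from combining MInference's sparse attention with TP: `tensor_parallel_size=8` is what distributes the prefill computation across the 8 GPUs.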