The paper mentions that 'this latency can be reduced to 22 seconds on 8x A100 GPUs'. How is this achieved, and does the current version already support it?
Hi @XiongxiaoL, thanks for your interest in our work.
We implement vLLM with tensor parallelism (TP) in the `hjiang/support_vllm_tp` branch. To use it:
1. Switch to the `hjiang/support_vllm_tp` branch.
2. Run `pip install -e .`
3. Copy `minference_patch_vllm_tp` and `minference_patch_vllm_executor` from `minference/patch.py` to the end of the `Worker` class in `vllm/worker/worker.py`. Make sure to indent `minference_patch_vllm_tp`.
4. When calling vLLM, ensure `enable_chunked_prefill=False` is set (see the sketch after this list).
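For reference, a rough sketch of what the call in step 4 could look like. It assumes the `MInference("vllm", model_name)` patch entry point from the repo README and a placeholder model name; neither is taken from this thread, and argument names may differ on the `hjiang/support_vllm_tp` branch.

```python
# Minimal sketch, assuming the MInference("vllm", ...) patch entry point from
# the README. The manual Worker patch from step 3 must already be applied to
# vllm/worker/worker.py on the hjiang/support_vllm_tp branch.
from vllm import LLM, SamplingParams
from minference import MInference

# Placeholder model name for illustration; substitute your long-context model.
model_name = "gradientai/Llama-3-8B-Instruct-262k"

llm = LLM(
    model_name,
    tensor_parallel_size=8,        # shard the model across 8x A100 GPUs
    enable_chunked_prefill=False,  # required, per step 4 above
    enforce_eager=True,
    max_model_len=128_000,
)

# Apply the MInference patch so vLLM's attention uses the sparse kernels.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

prompt = "..."  # your long-context prompt
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```

The 22-second figure in the paper comes from combining MInference's sparse attention with TP: `tensor_parallel_size=8` is what distributes the prefill computation across the 8 GPUs.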