Hi! I want to use vllm==0.6.6 to accelerate inference. Everything works fine with Qwen2VL-2B, but when I switch to InternVL2.5-4B, I get this error:
[rank0]: NotImplementedError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250106-123649.pkl): No operator found for `memory_efficient_attention_forward` with inputs:
[rank0]: query : shape=(104, 1025, 16, 64) (torch.bfloat16)
[rank0]: key : shape=(104, 1025, 16, 64) (torch.bfloat16)
[rank0]: value : shape=(104, 1025, 16, 64) (torch.bfloat16)
[rank0]: attn_bias : <class 'NoneType'>
[rank0]: p : 0.0
[rank0]: `[email protected]` is not supported because:
[rank0]: xFormers wasn't build with CUDA support
[rank0]: `cutlassF-pt` is not supported because:
[rank0]: xFormers wasn't build with CUDA support
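For reference, this is roughly how I am invoking vLLM; the image path and prompt template below are placeholders rather than my exact script:

```python
# Rough sketch of the vLLM call that triggers the error above.
# "example.jpg" and the prompt template are placeholders.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="OpenGVLab/InternVL2_5-4B",  # swapping in a Qwen2-VL-2B checkpoint works fine
    trust_remote_code=True,
    dtype="bfloat16",
)

image = Image.open("example.jpg")
outputs = llm.generate(
    {
        "prompt": "<image>\nDescribe this image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```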
The output of `python -m xformers.info`:
xFormers 0.0.28.post3
memory_efficient_attention.ckF: unavailable
memory_efficient_attention.ckB: unavailable
memory_efficient_attention.ck_decoderF: unavailable
memory_efficient_attention.ck_splitKF: unavailable
memory_efficient_attention.cutlassF-pt: available
memory_efficient_attention.cutlassB-pt: available
[email protected]: available
[email protected]: available
[email protected]: unavailable
[email protected]: unavailable
memory_efficient_attention.triton_splitKF: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
sequence_parallel_fused.write_values: available
sequence_parallel_fused.wait_values: available
sequence_parallel_fused.cuda_memset_32b_async: available
sp24.sparse24_sparsify_both_ways: available
sp24.sparse24_apply: available
sp24.sparse24_apply_dense_output: available
sp24._sparse24_gemm: available
[email protected]: available
[email protected]: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
pytorch.version: 2.5.1+cu121
pytorch.cuda: available
gpu.compute_capability: 8.0
gpu.name: NVIDIA A100-SXM4-80GB
dcgm_profiler: unavailable
build.info: available
build.cuda_version: None
build.hip_version: None
build.python_version: 3.10.15
build.torch_version: 2.5.1+cu121
build.env.TORCH_CUDA_ARCH_LIST: None
build.env.PYTORCH_ROCM_ARCH: None
build.env.XFORMERS_BUILD_TYPE: None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: None
build.env.XFORMERS_PACKAGE_FROM: None
source.privacy: open source
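From the info above, `build.cuda_version: None` matches the "xFormers wasn't build with CUDA support" message, so the installed wheel seems to be missing the CUDA kernels. To check whether the problem is the xFormers install rather than vLLM, a direct call like the sketch below (shapes copied from the traceback, assuming a visible CUDA device) should raise the same "No operator found" error:

```python
# Minimal reproduction outside vLLM: dispatch memory_efficient_attention
# with the same dtype/shape family as in the traceback above. If the wheel
# has no CUDA kernels, this raises the same "No operator found" error.
import torch
import xformers.ops as xops

q = torch.randn(2, 1025, 16, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = xops.memory_efficient_attention(q, k, v)  # (batch, seq, heads, head_dim)
print(out.shape)
```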
How could I solve this?