
[Bug]: I started a qwen2vl-7b video processing service using vllm (0.6.6), but encountered an error during inference #11657

Open
hyyuananran opened this issue Dec 31, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@hyyuananran

Your current environment

Command:
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8088 --model /app/qwen2vl-7b --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --served-model-name qwen2vl-7b --trust-remote-code
GPU:
A800 80GB
Query:
query = {"model":"qwen2vl-7b",
"messages":[
{"role":"user",
"content":[
{"type":"text","text":"A prompt word of about 500 words"},
{"type":"video_url","video_url":{"url":“A downloadable URL, a video of about 5 seconds in mp4 format”}
}]
}]
}
Response:
{"object":"error","message":"The prompt (total length 43698) is too long to fit into the model (context length 32768). Make sure that `max
number of images, and pers than the r mber of text tokens plus multimodal tokens. For image inputs, the number of image tokens depends on the number of images, and possibly their aspect ratios as well.","type":"BadRequestError","param" :nuit, code :400}

Model Input Dumps

No response

🐛 Describe the bug

Same command, GPU, query, and error response as in "Your current environment" above.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
hyyuananran added the bug (Something isn't working) label on Dec 31, 2024
@DarkLight1337
Member

Your video is probably too long to fit inside the model. Try using a shorter video or sample fewer frames from it.

@hyyuananran
Author

My video is only 5 seconds long, which should count as a very short video.
How can I add parameters to the request body to lower the frame sampling rate you mentioned?

@hyyuananran
Author

Most importantly, the same video on the same GPU does not have this problem when I run it without vLLM.

@DarkLight1337
Copy link
Member

DarkLight1337 commented Jan 1, 2025

You need to sample the frames outside of vLLM, since we only apply HF's preprocessing to the data which doesn't include video sampling. Alternatively, if you want to keep the full video, you can try increasing max_model_len beyond 32768.
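For anyone hitting the same wall, a rough client-side sketch of the frame-sampling route (it assumes a local copy of the video, OpenCV for decoding, and that the server allows several images per request, e.g. by raising --limit-mm-per-prompt image=<N>) could look like the code below: keep every Nth frame, encode the kept frames as base64 JPEG data URLs, and send them as image_url items instead of a video_url. The stride, path, and endpoint are placeholders.

import base64

import cv2        # pip install opencv-python
import requests

VIDEO_PATH = "clip.mp4"   # local copy of the 5-second video (placeholder)
FRAME_STRIDE = 10         # keep every 10th frame; tune so the prompt fits in 32768 tokens
API_URL = "http://localhost:8088/v1/chat/completions"

# Decode the video and keep a subset of frames as base64 JPEG data URLs.
frames = []
cap = cv2.VideoCapture(VIDEO_PATH)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % FRAME_STRIDE == 0:
        ok_enc, buf = cv2.imencode(".jpg", frame)
        if ok_enc:
            b64 = base64.b64encode(buf.tobytes()).decode("utf-8")
            frames.append({"type": "image_url",
                           "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    idx += 1
cap.release()

# Send the sampled frames as images instead of a video_url.
payload = {
    "model": "qwen2vl-7b",
    "messages": [{
        "role": "user",
        "content": [{"type": "text", "text": "A prompt of about 500 words"}] + frames,
    }],
}
resp = requests.post(API_URL, json=payload, timeout=120)
print(resp.json())

Lowering FRAME_STRIDE sends more frames (and more multimodal tokens); raising it trades detail for context headroom.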

@hyyuananran
Author

I also tried setting max_model_len to a value greater than 32768, but got the following error:
ValueError: User-specified max_model_len (49000) is greater than the derived max_model_len (max_position_embeddings=32768.0 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
ERROR 12-31 09:13:45 engine.py:366] User-specified max_model_len (49000) is greater than the derived max_model_len (max_position_embeddings=32768.0 or model_max_length=None in model's config.json).
This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
ERROR 12-31 09:13:45 engine.py:366] Traceback (most recent call last):
My GPU has 80 GB of memory. In this situation, how can I force max_model_len to a value greater than 32768?
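For reference, the override that the error message itself points to would look roughly like the sketch below (untested; as the message warns, exceeding max_position_embeddings without RoPE scaling may produce incorrect outputs or CUDA errors):

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8088 \
    --model /app/qwen2vl-7b \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --served-model-name qwen2vl-7b \
    --trust-remote-code \
    --max-model-len 49000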

@DarkLight1337
Member

You can try overriding rope scaling: https://qwen.readthedocs.io/en/latest/deployment/vllm.html#extended-context-support

I'm not 100% sure whether this is applicable to Qwen2-VL though. @fyabc any idea about this?
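Following that documentation page, the override would look roughly like the sketch below. This is untested with Qwen2-VL (which is exactly the open question here); the YaRN factor of 1.5 is just 49152/32768, and depending on the vLLM/transformers version the JSON key may be "type" rather than "rope_type".

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8088 \
    --model /app/qwen2vl-7b \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --served-model-name qwen2vl-7b \
    --trust-remote-code \
    --max-model-len 49152 \
    --rope-scaling '{"rope_type":"yarn","factor":1.5,"original_max_position_embeddings":32768}'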
