
[Bug]: I started a qwen2vl-7b video processing service using vllm (0.6.6), but encountered an error during inference #11657

Open
hyyuananran opened this issue Dec 31, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@hyyuananran

Your current environment

Command:
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8088 --model /app/qwen2vl-7b --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --served-model-name qwen2vl-7b --trust-remote-code
GPU:
A800 80GB
Query:
query = {"model":"qwen2vl-7b",
"messages":[
{"role":"user",
"content":[
{"type":"text","text":"A prompt word of about 500 words"},
{"type":"video_url","video_url":{"url":“A downloadable URL, a video of about 5 seconds in mp4 format”}
}]
}]
}
Response:
{"object":"error","message":"The prompt (total length 43698) is too long to fit into the model (context length 32768). Make sure that `max
number of images, and pers than the r mber of text tokens plus multimodal tokens. For image inputs, the number of image tokens depends on the number of images, and possibly their aspect ratios as well.","type":"BadRequestError","param" :nuit, code :400}

Model Input Dumps

No response

🐛 Describe the bug

Same command, GPU, query, and error response as in "Your current environment" above.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
hyyuananran added the bug (Something isn't working) label on Dec 31, 2024
@DarkLight1337
Member

Your video is probably too long to fit inside the model. Try using a shorter video or sample fewer frames from it.

@hyyuananran
Author

My video is only 5 seconds long, which should count as a very short video.
How can I add parameters to the request body to lower the frame sampling rate you mentioned?

@hyyuananran
Author

Most importantly, the same video on the same GPU does not have this problem when I run it without vLLM.

@DarkLight1337
Copy link
Member

DarkLight1337 commented Jan 1, 2025

You need to sample the frames outside of vLLM, since we only apply HF's preprocessing to the data which doesn't include video sampling. Alternatively, if you want to keep the full video, you can try increasing max_model_len beyond 32768.
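For anyone hitting the same wall, a rough client-side sketch of the frame-sampling route (it assumes a local copy of the video, OpenCV for decoding, and that the server allows several images per request, e.g. by raising --limit-mm-per-prompt image=<N>) could look like the code below: keep every Nth frame, encode the kept frames as base64 JPEG data URLs, and send them as image_url items instead of a video_url. The stride, path, and endpoint are placeholders.

import base64

import cv2        # pip install opencv-python
import requests

VIDEO_PATH = "clip.mp4"   # local copy of the 5-second video (placeholder)
FRAME_STRIDE = 10         # keep every 10th frame; tune so the prompt fits in 32768 tokens
API_URL = "http://localhost:8088/v1/chat/completions"

# Decode the video and keep a subset of frames as base64 JPEG data URLs.
frames = []
cap = cv2.VideoCapture(VIDEO_PATH)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % FRAME_STRIDE == 0:
        ok_enc, buf = cv2.imencode(".jpg", frame)
        if ok_enc:
            b64 = base64.b64encode(buf.tobytes()).decode("utf-8")
            frames.append({"type": "image_url",
                           "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    idx += 1
cap.release()

# Send the sampled frames as images instead of a video_url.
payload = {
    "model": "qwen2vl-7b",
    "messages": [{
        "role": "user",
        "content": [{"type": "text", "text": "A prompt of about 500 words"}] + frames,
    }],
}
resp = requests.post(API_URL, json=payload, timeout=120)
print(resp.json())

Lowering FRAME_STRIDE sends more frames (and more multimodal tokens); raising it trades detail for context headroom.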

@hyyuananran
Author

I also tried setting max_model_len to a value greater than 32768, but got the following error:
ValueError: User-specified max_model_len (49000) is greater than the derived max_model_len (max_position_embeddings=32768.0 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
ERROR 12-31 09:13:45 engine.py:366] User-specified max_model_len (49000) is greater than the derived max_model_len (max_position_embeddings=32768.0 or model_max_length=None in model's config.json).
This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
ERROR 12-31 09:13:45 engine.py:366] Traceback (most recent call last):
My GPU has 80 GB of memory. In this situation, how can I force max_model_len to a value greater than 32768?
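For reference, the override that the error message itself points to would look roughly like the sketch below (untested; as the message warns, exceeding max_position_embeddings without RoPE scaling may produce incorrect outputs or CUDA errors):

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8088 \
    --model /app/qwen2vl-7b \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --served-model-name qwen2vl-7b \
    --trust-remote-code \
    --max-model-len 49000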

@DarkLight1337
Member

You can try overriding rope scaling: https://qwen.readthedocs.io/en/latest/deployment/vllm.html#extended-context-support

I'm not 100% sure whether this is applicable to Qwen2-VL though. @fyabc any idea about this?
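Following that documentation page, the override would look roughly like the sketch below. This is untested with Qwen2-VL (which is exactly the open question here); the YaRN factor of 1.5 is just 49152/32768, and depending on the vLLM/transformers version the JSON key may be "type" rather than "rope_type".

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8088 \
    --model /app/qwen2vl-7b \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --served-model-name qwen2vl-7b \
    --trust-remote-code \
    --max-model-len 49152 \
    --rope-scaling '{"rope_type":"yarn","factor":1.5,"original_max_position_embeddings":32768}'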
