Great work! I notice that LLaVA-NeXT-Qwen2 (an image model) achieves a surprising 49.5 on Video-MME. In contrast, LLaVA-NeXT-Video (Llama3) only reaches a 30+ Video-MME score (according to the reproduction in https://arxiv.org/pdf/2406.07476). LLaVA-NeXT-Video (Llama3) also follows the standard LLaVA recipe, and was even trained with more video data. I am curious what the key factor behind LLaVA-NeXT-Qwen2's strong performance is compared with LLaVA-NeXT-Video (Llama3). Does the main improvement come from the Qwen2 LLM?
Hi, I believe this blog post is
a good read on how a better base LM enables stronger multimodal capabilities. I believe Qwen2 is simply significantly better than Vicuna-1.5 (Llama2).