Dear Qwen2-VL Team,
First, I want to thank you for your remarkable contributions to the open-source community with Qwen2-VL. The model’s outstanding performance and innovative design are truly impressive.
While exploring the implementation, I noticed that the PatchMerger module appears to reduce the token count by merging patches that are adjacent in the flattened (linearized) patchify sequence.
However, this approach doesn’t seem to account for the 2D spatial arrangement of patches in the original image. For example, the last patch of one row might be merged with the first patch of the next row.
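To make the concern concrete, here is a minimal sketch (not the actual Qwen2-VL code; the grid size and merge factor are hypothetical) that flattens a patch grid in row-major order and groups consecutive tokens, showing one group straddling a row boundary:

```python
# Toy illustration of the concern: if patch tokens are flattened row-major
# and merged in fixed-size consecutive groups, a group can straddle a row
# boundary. Grid size and merge factor are hypothetical.
h, w = 3, 5      # hypothetical patch grid: 3 rows x 5 columns
merge = 2        # hypothetical merge factor: fuse every 2 consecutive tokens

# (row, col) coordinate of every patch, flattened in row-major order
coords = [(r, c) for r in range(h) for c in range(w)]

for i in range(0, len(coords) - merge + 1, merge):
    group = coords[i : i + merge]
    rows = {r for r, _ in group}
    if len(rows) > 1:
        print(f"tokens {i}..{i + merge - 1} merge patches {group}, "
              f"which lie on different rows")
# prints: tokens 4..5 merge patches [(0, 4), (1, 0)], which lie on different rows
```

In this toy setup, one merged group fuses the last patch of row 0 with the first patch of row 1. A locality-preserving variant would instead group 2×2 spatial neighbours before flattening, at the cost of an extra reshaping step.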
I’m curious whether this design could cause merged patches to represent entirely different semantics, given their spatial distance in the image. Would this affect the model’s ability to encode accurate image representations?
I’d appreciate it if you could share the reasoning behind this implementation and whether alternative approaches preserving 2D spatial locality were considered.
Thank you again for this incredible project, and I look forward to your insights!