
Inquiry About PatchMerger Implementation in Image Embedding #633

Open
HeimingX opened this issue Jan 5, 2025 · 0 comments

Dear Qwen2-VL Team,

First, I want to thank you for your remarkable contributions to the open-source community with Qwen2-VL. The model’s outstanding performance and innovative design are truly impressive.

While exploring the implementation, I noticed that the PatchMerger module seems to reduce token count by merging adjacent patches based on their order in the linear patchify sequence.

import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2) -> None:
        super().__init__()
        # each output token is built from spatial_merge_size**2 input patch tokens
        self.hidden_size = context_dim * (spatial_merge_size**2)
        self.ln_q = nn.LayerNorm(context_dim, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden_size, self.hidden_size),
            nn.GELU(),
            nn.Linear(self.hidden_size, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # view(-1, hidden_size) groups consecutive tokens in sequence order
        x = self.mlp(self.ln_q(x).view(-1, self.hidden_size))
        return x

https://github.com/huggingface/transformers/blob/e5fd865ebae062b7cf03a81b8c6affeb39f30bec/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L310

However, this approach doesn’t seem to account for the 2D spatial arrangement of patches in the original image. For example, the last patch of one row might be merged with the first patch of the next row.
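To make this concrete, here is a small index-level sketch (the 4×5 patch grid and the grouping logic are illustrative assumptions of mine, not taken from the actual model). It groups a row-major patch sequence into chunks of spatial_merge_size**2 = 4 consecutive tokens, the way a plain view(-1, hidden_size) would:

```python
# Hypothetical 4x5 patch grid; group every 4 consecutive row-major tokens.
H, W = 4, 5
merge_group = 4  # spatial_merge_size ** 2

# (row, col) coordinates in row-major (linear patchify) order
coords = [(r, c) for r in range(H) for c in range(W)]

# chunk consecutive tokens, as view(-1, hidden_size) would
groups = [coords[i:i + merge_group] for i in range(0, len(coords), merge_group)]
for g in groups:
    rows = sorted({r for r, _ in g})
    print(g, "-> rows", rows)
# the second group, [(0, 4), (1, 0), (1, 1), (1, 2)], mixes row 0 and row 1
```

Since W = 5 is not a multiple of 4, some groups straddle a row boundary, so the last patch of one row is merged with the first patches of the next.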

I’m curious if this design choice might lead to potential issues where merged patches could represent entirely different semantics, given their distant spatial relationship in the image. Would this have any impact on the model’s ability to encode accurate image representations?

I’d appreciate it if you could share the reasoning behind this implementation and whether alternative approaches preserving 2D spatial locality were considered.
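For reference, one spatially local alternative would be to regroup token indices so that each merged group covers an m×m spatial block rather than m² consecutive sequence positions. The sketch below is purely illustrative and is not a claim about what Qwen2-VL actually does:

```python
# Hypothetical index-level sketch of a 2D-local grouping: each merged group
# covers an m x m spatial block of the H x W patch grid.
def local_groups(H, W, m=2):
    groups = []
    for i in range(0, H, m):          # top-left row of each block
        for j in range(0, W, m):      # top-left column of each block
            groups.append([(i + a) * W + (j + b)
                           for a in range(m) for b in range(m)])
    return groups

print(local_groups(4, 4))
# the first group, [0, 1, 4, 5], is the 2x2 block in the top-left corner
```

Gathering tokens with such an index map before the view(-1, hidden_size) would keep every merged group spatially contiguous.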

Thank you again for this incredible project, and I look forward to your insights!
