Error in extracting the electronic signature from the PDF #4190

Open
kar9999 opened this issue Jan 3, 2025 · 7 comments

@kar9999

kar9999 commented Jan 3, 2025

Description of the bug

Some electronic signatures are missing from the extraction results.

How to reproduce the bug

Why can't this function extract the electronic signature on the first page, while it can extract those on other pages?
The signature on the first page:
a48b9bb3-9bde-4b47-aa6d-4c0044da3eb6

image
The contents of d:
image

The signature on another page (image information is present):
image

Moreover, the image information for the electronic signature returned by ima_info = page.get_image_info() looks incorrect: ima_info does not contain the image itself.
image
The entry has no binary stream for the picture and its cross-reference number (xref) equals 0, so the image cannot be extracted.

PyMuPDF version

1.24.11

Operating system

Linux

Python version

3.9

@JorjMcKie
Collaborator

We cannot deal with issues that we cannot reproduce. Please provide the respective file.

@JorjMcKie
Collaborator

Please be aware that we will close issues that we cannot reproduce within at most five days.

@kar9999
Author

kar9999 commented Jan 6, 2025

> We cannot deal with issues that we cannot reproduce. Please provide the respective file.

Sorry, my files contain confidential information and cannot be disclosed. I created a non-confidential PDF with an electronic signature myself, so it does not match the confidential files exactly. I found that

d = page.get_text("dict")
blocks = d["blocks"]
imgblocks = [b for b in blocks if b["type"] == 1]

could not find the missing images on the first page, but page.get_image_info(xrefs=True) can. What is the difference between these two functions?
dzqz2.pdf

@JorjMcKie
Collaborator

Method Page.get_text("dict") only reports images that are fully contained in the page. Even small portions outside will disqualify an image from showing up here. You can influence this by using clip=pymupdf.INFINITE_RECT(). Then all images (and all text) will appear.
Page.get_image_info() will always report all images.
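
The difference described above can be checked per page with a small sketch like the following. The helper names image_blocks, count_images and report are made up for illustration and are not part of PyMuPDF; the only library calls assumed are Page.get_text("dict", clip=...), Page.get_image_info(xrefs=True) and pymupdf.INFINITE_RECT(), as mentioned in this thread.

```python
def image_blocks(text_dict):
    """Keep only the image blocks (type == 1) of a get_text("dict") result."""
    return [b for b in text_dict["blocks"] if b["type"] == 1]


def count_images(page, clip=None):
    """Return (images in dict output, images in get_image_info) for one page.

    `page` is expected to behave like a pymupdf.Page. When `clip` is None,
    get_text() is called without a clip argument, i.e. with its default.
    """
    kwargs = {} if clip is None else {"clip": clip}
    in_dict = image_blocks(page.get_text("dict", **kwargs))
    in_info = page.get_image_info(xrefs=True)
    return len(in_dict), len(in_info)


def report(path):
    """Print both counts for every page of the PDF at `path` (hypothetical helper)."""
    import pymupdf  # imported here so the helpers above stay dependency-free

    doc = pymupdf.open(path)
    for page in doc:
        default = count_images(page)
        widened = count_images(page, clip=pymupdf.INFINITE_RECT())
        print(f"page {page.number}: default={default}, infinite clip={widened}")
```

Calling report("dzqz2.pdf") on the attached file should, if the explanation above applies, show a smaller first number for the default call on the affected page and equal numbers once the infinite clip is used.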

@JorjMcKie
Collaborator

Please close the issue if our above discussion explains your observations.

@kar9999
Author

kar9999 commented Jan 6, 2025

> Method Page.get_text("dict") only reports images that are fully contained in the page. Even small portions outside will disqualify an image for showing up here. You can influence this by using clip=pymupdf.INFINITE_RECT(). Then all images (and all text) will appear. Page.get_image_info() will always report all images.

Thank you for your answer. Is there a situation where Page.get_text("dict") returns an image, but Page.get_image_info() reports xref = 0 for the same image?

@JorjMcKie
Collaborator

Yes, this is very possible:
A PDF page can contain images inside its /Contents objects. These are not known anywhere else and hence have no xref.
PyMuPDF also does not include images contained in annotations in its get_images() method.
In both cases, get_image_info(xrefs=True) would contain xref=0.
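
For images reported with xref=0, one possible workaround is to take the binary data from the image blocks of the "dict" output instead. The following is only a sketch under assumptions: collect_images, bbox_close and page_images are hypothetical helpers, the bbox values of the two outputs are assumed to agree within a small tolerance, and image blocks are assumed to carry "ext" and "image" keys. Document.extract_image(xref) is used for images that do have an xref.

```python
def bbox_close(a, b, tol=1.0):
    """True if two (x0, y0, x1, y1) boxes differ by at most `tol` per coordinate."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def collect_images(info_items, img_blocks, extract_by_xref):
    """Return a list of (extension, image bytes) for the given image info items.

    info_items:      result of page.get_image_info(xrefs=True)
    img_blocks:      image blocks (type == 1) of page.get_text("dict", ...)
    extract_by_xref: callable such as doc.extract_image
    """
    results = []
    for item in info_items:
        xref = item.get("xref", 0)
        if xref > 0:
            # The image is a regular PDF object: extract it via its xref.
            img = extract_by_xref(xref)
            results.append((img["ext"], img["image"]))
            continue
        # xref == 0: fall back to the dict block at (almost) the same position.
        for block in img_blocks:
            if bbox_close(item["bbox"], block["bbox"]):
                results.append((block["ext"], block["image"]))
                break
    return results


def page_images(doc, page):
    """Apply collect_images to a real page (hypothetical wrapper)."""
    import pymupdf  # imported here so the helpers above stay dependency-free

    d = page.get_text("dict", clip=pymupdf.INFINITE_RECT())
    blocks = [b for b in d["blocks"] if b["type"] == 1]
    return collect_images(page.get_image_info(xrefs=True), blocks, doc.extract_image)
```

Items with xref=0 that have no matching block are skipped silently here; a real script would probably want to log them.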
