Error in extracting the electronic signature from the PDF #4190

kar9999 · 2025-01-03T08:51:35Z

Description of the bug

Some electronic signatures are missing from the extraction results.

How to reproduce the bug

Why can't this function extract the electronic signature on the first page, when it can extract those on the other pages?
The signature on the first page:

The d information:

The signature on another page (has image information):

Moreover, the image information for the electronic signature returned by ima_info = page.get_image_info() is incorrect: the entries in ima_info contain no binary stream of the picture, and the cross-reference number (xref) equals 0, so the image cannot be extracted.

PyMuPDF version

1.24.11

Operating system

Linux

Python version

3.9

JorjMcKie · 2025-01-03T08:58:16Z

We cannot deal with issues that we cannot reproduce. Please provide the respective file.

JorjMcKie · 2025-01-04T12:44:49Z

Please be aware that we will close issues that we cannot reproduce within at most five days.

kar9999 · 2025-01-06T06:54:31Z

> We cannot deal with issues that we cannot reproduce. Please provide the respective file.

Sorry, my files contain confidential information and cannot be shared. I created a non-confidential PDF with an electronic signature myself (it is not identical to the confidential files). With it, I found that the following code could not find the missing images on the first page, whereas page.get_image_info(xrefs=True) can:
d = page.get_text("dict")
blocks = d["blocks"]
imgblocks = [b for b in blocks if b["type"] == 1]
What is the difference between these two functions?
dzqz2.pdf

JorjMcKie · 2025-01-06T08:02:38Z

Method Page.get_text("dict") only reports images that are fully contained in the page. Even small portions outside will disqualify an image for showing up here. You can influence this by using clip=pymupdf.INFINITE_RECT(). Then all images (and all text) will appear.
Page.get_image_info() will always report all images.
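
The difference described above can be checked with a small sketch like the following. The helper names (image_blocks, count_images, report) are made up for illustration; the only PyMuPDF calls used are the ones mentioned in this thread: Page.get_text("dict") with and without clip=pymupdf.INFINITE_RECT(), and Page.get_image_info(xrefs=True).

```python
def image_blocks(text_dict):
    """Return the image blocks (type == 1) of a Page.get_text("dict") result."""
    return [b for b in text_dict["blocks"] if b["type"] == 1]


def count_images(page, infinite_rect):
    """Count the images each approach reports for one page.

    `page` is expected to be a pymupdf.Page and `infinite_rect` the value
    of pymupdf.INFINITE_RECT(); both are passed in by the caller.
    """
    return {
        # default clip: only images fully contained in the page
        "dict_default": len(image_blocks(page.get_text("dict"))),
        # infinite clip: images partly outside the page are included too
        "dict_infinite_clip": len(
            image_blocks(page.get_text("dict", clip=infinite_rect))
        ),
        # get_image_info() reports all images regardless of position
        "image_info": len(page.get_image_info(xrefs=True)),
    }


def report(path):
    """Print the three counts for every page of the PDF at `path`."""
    import pymupdf  # PyMuPDF; imported here so the helpers above need no library

    doc = pymupdf.open(path)
    for page in doc:
        print(page.number, count_images(page, pymupdf.INFINITE_RECT()))
    doc.close()
```

If a page shows a smaller "dict_default" count than the other two, at least one of its images probably extends beyond the page rectangle.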

JorjMcKie · 2025-01-06T09:50:04Z

Please close the issue if our above discussion explains your observations.

kar9999 · 2025-01-06T10:27:16Z

> Method Page.get_text("dict") only reports images that are fully contained in the page. Even small portions outside will disqualify an image for showing up here. You can influence this by using clip=pymupdf.INFINITE_RECT(). Then all images (and all text) will appear. Page.get_image_info() will always report all images.

Thank you for your answer. Is there a situation where Page.get_text("dict") returns an image, but Page.get_image_info() reports xref = 0 for the same image?

JorjMcKie · 2025-01-06T11:41:08Z

Yes, this is very possible:
A PDF page can contain images inside its /Contents objects. These are not known anywhere else and hence have no xref.
PyMuPDF also does not include images contained in annotations in its get_images() method.
In both cases, get_image_info(xrefs=True) would contain xref=0.
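
A possible way to cope with xref=0 entries is sketched below. It assumes, as far as I know from the PyMuPDF documentation, that the items of get_image_info() carry metadata only, while the image blocks of get_text("dict") hold the binary content under the "image" key and the file extension under "ext". The helper names (split_by_xref, block_images, page_images) are hypothetical.

```python
def split_by_xref(image_info):
    """Split Page.get_image_info(xrefs=True) items into (with_xref, without_xref)."""
    with_xref = [i for i in image_info if i.get("xref", 0) > 0]
    without_xref = [i for i in image_info if i.get("xref", 0) == 0]
    return with_xref, without_xref


def block_images(text_dict):
    """Yield (extension, image bytes) for each image block of a get_text("dict") result.

    Assumption: image blocks provide the keys "ext" and "image".
    """
    for block in text_dict["blocks"]:
        if block["type"] == 1:
            yield block["ext"], block["image"]


def page_images(path, page_number=0):
    """Return all (extension, bytes) pairs of one page, including xref=0 images."""
    import pymupdf  # PyMuPDF; imported here so the helpers above need no library

    doc = pymupdf.open(path)
    page = doc[page_number]
    # the infinite clip also includes images not fully contained in the page
    text_dict = page.get_text("dict", clip=pymupdf.INFINITE_RECT())
    images = list(block_images(text_dict))
    doc.close()
    return images
```

Because the bytes are taken from the image blocks, this route does not depend on an xref being available.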

JorjMcKie added the "example required" and "Waiting for information" labels on Jan 3, 2025
