Incorrect handling of JPEG with color space CMYK image extraction #4186

Closed
pdc1 opened this issue Dec 30, 2024 · 2 comments
Labels: fix developed (release schedule to be determined), Fixed in next release

pdc1 commented Dec 30, 2024

Description of the bug

I have a PDF with CMYK colorspace images. I want to convert the raw image bytes (e.g. from extract_image or get_text("dict")) to an RGB image.

For images with decode filter = 'DCTDecode', the colorspace conversion does not appear to work when the Pixmap is created from the raw image bytes. If the Pixmap is loaded directly from the xref, it works.

The document images look like this:

>>> doc.get_page_images(0)
[(44, 0, 1350, 1125, 8, 'DeviceCMYK', '', 'X10', 'DCTDecode'), (46, 45, 1221, 1357, 8, 'DeviceCMYK', '', 'X11', 'FlateDecode'), (52, 51, 500, 500, 8, 'DeviceCMYK', '', 'X7', 'FlateDecode'), (53, 0, 1650, 1275, 8, 'DeviceCMYK', '', 'X9', 'FlateDecode'), (48, 0, 1024, 683, 8, 'DeviceCMYK', '', 'X4', 'FlateDecode')]
>>> doc.get_page_images(1)
[(7, 0, 2848, 4288, 8, 'DeviceCMYK', '', 'X15', 'DCTDecode')]
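
To narrow such a listing down to the images this report is about, the tuples can be filtered on their colorspace and filter fields. The sketch below is a minimal illustration, not part of the original report: the helper name `cmyk_jpeg_xrefs` is hypothetical, and the field positions assume the documented default (`full=False`) layout of `get_page_images()` items: (xref, smask, width, height, bpc, colorspace, alt. colorspace, name, filter).

```python
# Items as printed above for pages 0 and 1 of the sample PDF.
PAGE0_IMAGES = [
    (44, 0, 1350, 1125, 8, "DeviceCMYK", "", "X10", "DCTDecode"),
    (46, 45, 1221, 1357, 8, "DeviceCMYK", "", "X11", "FlateDecode"),
    (52, 51, 500, 500, 8, "DeviceCMYK", "", "X7", "FlateDecode"),
    (53, 0, 1650, 1275, 8, "DeviceCMYK", "", "X9", "FlateDecode"),
    (48, 0, 1024, 683, 8, "DeviceCMYK", "", "X4", "FlateDecode"),
]
PAGE1_IMAGES = [(7, 0, 2848, 4288, 8, "DeviceCMYK", "", "X15", "DCTDecode")]


def cmyk_jpeg_xrefs(images):
    """Return the xrefs of images that are CMYK and stored as JPEG (DCTDecode).

    Hypothetical helper; index 0 is the xref, index 5 the colorspace name,
    index 8 the filter name (assumed default tuple layout).
    """
    return [
        item[0]
        for item in images
        if item[5] == "DeviceCMYK" and item[8] == "DCTDecode"
    ]


print(cmyk_jpeg_xrefs(PAGE0_IMAGES))  # [44]
print(cmyk_jpeg_xrefs(PAGE1_IMAGES))  # [7]
```

With a real document, the result of `doc.get_page_images(pno)` could be passed in place of the hard-coded lists.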

See sample code below.

Sample PDF is Seven Deadly Sins Program-1.pdf

Correct image (using Pixmap(doc, xref)):
[image: temp]

Incorrect image (using extract_image(xref)["image"] bytes):
[image: temp2]

How to reproduce the bug

Here is the code I used to generate the two images:

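import pymupdf

doc = pymupdf.open("Seven Deadly Sins Program-1.pdf")
images = doc.get_page_images(0)
xref = images[0][0]  # xref 44: DeviceCMYK image stored with the DCTDecode (JPEG) filter

# Path 1: build the Pixmap from the xref, then convert to RGB (correct colors)
pix = pymupdf.Pixmap(doc, xref)
pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
pix.save("temp.jpeg")

# Path 2: build the Pixmap from the extracted raw bytes, then convert to RGB (wrong colors)
img = doc.extract_image(xref)
pix2 = pymupdf.Pixmap(img["image"])
pix2 = pymupdf.Pixmap(pymupdf.csRGB, pix2)
pix2.save("temp2.jpeg")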

PyMuPDF version

1.25.1

Operating system

Windows

Python version

3.11

JorjMcKie changed the title from "Colorspace conversion using Pixmap(bytes) doesn't work for 'DCTDecode' filter images." to "Incorrect handling of JPEG with color space CMYK image extraction" on Jan 1, 2025
JorjMcKie (Collaborator) commented Jan 1, 2025

The problem only occurs for embedded JPEG images with CMYK color space. Internal conversion to PNG in these cases avoids the problem.
We will implement this in PyMuPDF.
When using MuPDF native functions, the problem can be reproduced by e.g. mutool extract input.pdf. When using mutool extract -r input.pdf, the problem will be avoided in a similar way.

Update:
The problem can be solved by simply inverting the color in this case. There is a fix underway that does this for both cases, Document.extract_image() and Page.get_text("dict").
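
For readers on a version without the fix, the inversion described in the update can be approximated in user code. The following is only a sketch under stated assumptions, not the actual change made in PyMuPDF: the helper names `invert_samples` and `cmyk_jpeg_to_rgb` are hypothetical, and it assumes that a Pixmap created from the extracted JPEG bytes is a 4-component pixmap without alpha whose samples are all inverted.

```python
# Lookup table mapping every byte value v to 255 - v.
_INVERT_TABLE = bytes(255 - v for v in range(256))


def invert_samples(samples: bytes) -> bytes:
    """Invert every 8-bit sample (v becomes 255 - v)."""
    return bytes(samples).translate(_INVERT_TABLE)


def cmyk_jpeg_to_rgb(image_bytes: bytes):
    """Hypothetical workaround: decode extracted JPEG bytes, invert CMYK, return RGB.

    Assumption: only meant for PyMuPDF versions that still show the problem;
    on a fixed version this would invert the colors a second time.
    """
    import pymupdf  # imported lazily so invert_samples() works without PyMuPDF

    pix = pymupdf.Pixmap(image_bytes)
    cs = pix.colorspace
    if cs is not None and cs.n == 4 and not pix.alpha:
        # Rebuild a CMYK pixmap of the same size from the inverted samples.
        pix = pymupdf.Pixmap(
            pymupdf.csCMYK, pix.width, pix.height, invert_samples(pix.samples), False
        )
    return pymupdf.Pixmap(pymupdf.csRGB, pix)
```

In the reproduction script, `cmyk_jpeg_to_rgb(img["image"]).save("temp2.jpeg")` would take the place of the two `pix2` lines. Whether the result matches the `Pixmap(doc, xref)` output exactly has not been verified here.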

JorjMcKie added the "fix developed" label (release schedule to be determined) on Jan 2, 2025
julian-smith-artifex-com (Collaborator) commented

Fixed in PyMuPDF-1.25.2.
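
Code that carries its own workaround for this issue may want to apply it only on versions older than the fixed release. A minimal sketch of such a check follows; the helper names `parse_version` and `needs_workaround` are hypothetical, and it assumes plain dotted version strings such as "1.25.1" (no release-candidate suffixes).

```python
FIXED_IN = (1, 25, 2)  # release named above as containing the fix


def parse_version(text: str) -> tuple:
    """Turn a dotted version string like '1.25.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def needs_workaround(version: str) -> bool:
    """True if the given PyMuPDF version predates the fixed release."""
    return parse_version(version) < FIXED_IN


print(needs_workaround("1.25.1"))  # True
print(needs_workaround("1.25.2"))  # False
```

The installed version string could be passed in, for example from `pymupdf.__version__`.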
