Incorrect handling of JPEG with color space CMYK image extraction #4186

pdc1 · 2024-12-30T19:55:43Z

Description of the bug

I have a PDF with CMYK colorspace images. I want to convert the raw image bytes (e.g. from extract_image or get_text("dict")) to an RGB image.

For images with decode filter 'DCTDecode', the colorspace conversion does not appear to work when the Pixmap is created from the raw image bytes. If the Pixmap is loaded directly from the xref, the conversion works.

The document images look like this:

>>> doc.get_page_images(0)
[(44, 0, 1350, 1125, 8, 'DeviceCMYK', '', 'X10', 'DCTDecode'), (46, 45, 1221, 1357, 8, 'DeviceCMYK', '', 'X11', 'FlateDecode'), (52, 51, 500, 500, 8, 'DeviceCMYK', '', 'X7', 'FlateDecode'), (53, 0, 1650, 1275, 8, 'DeviceCMYK', '', 'X9', 'FlateDecode'), (48, 0, 1024, 683, 8, 'DeviceCMYK', '', 'X4', 'FlateDecode')]
>>> doc.get_page_images(1)
[(7, 0, 2848, 4288, 8, 'DeviceCMYK', '', 'X15', 'DCTDecode')]

See sample code below.

Sample PDF is Seven Deadly Sins Program-1.pdf

Correct image (using Pixmap(xref))

Incorrect image (using extract_image(xref)["image"] bytes)

How to reproduce the bug

Here is the code I used to generate the two images:

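import pymupdf

doc = pymupdf.open("Seven Deadly Sins Program-1.pdf")
images = doc.get_page_images(0)
xref = images[0][0]  # first image on page 0: xref 44, DeviceCMYK, DCTDecode

# Path 1: build the Pixmap from the xref, then convert to RGB (correct colors).
pix = pymupdf.Pixmap(doc, xref)
pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
pix.save("temp.jpeg")

# Path 2: build the Pixmap from the extracted raw bytes, then convert to RGB (wrong colors).
img = doc.extract_image(xref)
pix2 = pymupdf.Pixmap(img["image"])
pix2 = pymupdf.Pixmap(pymupdf.csRGB, pix2)
pix2.save("temp2.jpeg")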

PyMuPDF version

1.25.1

Operating system

Windows

Python version

3.11


JorjMcKie · 2025-01-01T18:53:51Z

The problem only occurs for embedded JPEG images with CMYK color space. Internal conversion to PNG in these cases avoids the problem.
We will implement this in PyMuPDF.
When using MuPDF native functions, the problem can be reproduced by e.g. mutool extract input.pdf. When using mutool extract -r input.pdf, the problem will be avoided in a similar way.
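
To check which images on a page fall into the affected category, the tuples returned by get_page_images can be filtered on their colorspace and filter fields. The following is only a minimal sketch: the helper name is_cmyk_jpeg is hypothetical (not part of PyMuPDF), and the field positions assume the default tuple layout seen in the report, i.e. (xref, smask, width, height, bpc, colorspace, alt. colorspace, name, filter). The data is copied from the page 0 listing in the report so that the snippet runs without the PDF.

```python
# Image tuples copied from the doc.get_page_images(0) output in the report.
PAGE0_IMAGES = [
    (44, 0, 1350, 1125, 8, "DeviceCMYK", "", "X10", "DCTDecode"),
    (46, 45, 1221, 1357, 8, "DeviceCMYK", "", "X11", "FlateDecode"),
    (52, 51, 500, 500, 8, "DeviceCMYK", "", "X7", "FlateDecode"),
    (53, 0, 1650, 1275, 8, "DeviceCMYK", "", "X9", "FlateDecode"),
    (48, 0, 1024, 683, 8, "DeviceCMYK", "", "X4", "FlateDecode"),
]


def is_cmyk_jpeg(item):
    """Return True for an embedded JPEG (DCTDecode) stored as DeviceCMYK.

    Hypothetical helper; assumes index 5 is the colorspace name and
    index 8 is the decode filter, as in the listing above.
    """
    colorspace, image_filter = item[5], item[8]
    return colorspace == "DeviceCMYK" and image_filter == "DCTDecode"


# Collect the xrefs of the images that match the affected combination.
affected_xrefs = [item[0] for item in PAGE0_IMAGES if is_cmyk_jpeg(item)]
print(affected_xrefs)  # [44]
```

On page 0 only xref 44 matches, which is the image used in the reproduction script. The FlateDecode images share the CMYK colorspace but are not JPEGs.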

Update:
The problem can be solved by simply inverting the color in this case. There is a fix underway that does this for both cases, Document.extract_image() and Page.get_text("dict").
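
Why an inversion can repair the colors can be illustrated without PyMuPDF. The sketch below rests on an assumption that is not stated in this thread: that the JPEG stores its CMYK samples inverted (255 meaning "no ink"), a convention often seen in Adobe-written CMYK JPEGs. The function names and the naive conversion formula are illustrative only; they are not MuPDF's actual conversion and not the code of the fix.

```python
def invert_samples(samples: bytes) -> bytes:
    """Invert every 8-bit sample (255 - value)."""
    return bytes(255 - s for s in samples)


def naive_cmyk_to_rgb(c, m, y, k):
    """Naive CMYK -> RGB on 0..255 samples, where 0 means no ink.

    Illustrative formula only (no ICC profile, no color management).
    """
    r = (255 - c) * (255 - k) // 255
    g = (255 - m) * (255 - k) // 255
    b = (255 - y) * (255 - k) // 255
    return (r, g, b)


# One pixel as stored in a file that uses the (assumed) inverted convention:
# all channels 255 means "no ink", i.e. the pixel should be white.
stored = bytes([255, 255, 255, 255])

# Reading the stored samples as-is treats them as full ink on every channel.
as_is = naive_cmyk_to_rgb(*stored)

# Inverting first recovers the intended "no ink" pixel.
fixed = naive_cmyk_to_rgb(*invert_samples(stored))

print(as_is, fixed)  # (0, 0, 0) (255, 255, 255)
```

Under that assumption, skipping the inversion turns white into black and distorts every other color in the same way, which would be consistent with the incorrect image in the report.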

julian-smith-artifex-com · 2025-01-17T16:38:18Z

Fixed in PyMuPDF-1.25.2.

JorjMcKie changed the title ~~Colorspace conversion using Pixmap(bytes) doesn't work for 'DCTDecode' filter images.~~ Incorrect handling of JPEG with color space CMYK image extraction Jan 1, 2025

JorjMcKie added the fix developed release schedule to be determined label Jan 2, 2025

julian-smith-artifex-com added the Fixed in next release label Jan 17, 2025

julian-smith-artifex-com closed this as completed Jan 17, 2025
