Text color numbers change between 1.24.14 and 1.25.0 #4139

Open
stevesimmons opened this issue Dec 11, 2024 · 1 comment
Labels: bug, fix developed, release schedule to be determined

Comments

@stevesimmons

Description of the bug

When upgrading PyMuPDF from 1.24.14 to 1.25.0, the reported text color codes have changed.

I tested this with this code:

import pymupdf

print(pymupdf.__version__)
flags = pymupdf.TEXT_PRESERVE_IMAGES | pymupdf.TEXT_PRESERVE_WHITESPACE | pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE
doc = pymupdf.open("0d4cb925de9d383e.pdf")
page = doc[0]
page_dict = page.get_text("dict", flags=flags, sort=True)

# Print each distinct span color once, tagged with the block/line/span
# index where it first appears.
seen = set()
for b_ctr, block in enumerate(page_dict["blocks"]):
    # Image blocks have no "lines" key, hence the default.
    for l_ctr, line in enumerate(block.get("lines", [])):
        for s_ctr, span in enumerate(line["spans"]):
            color = span.get("color")
            if color is not None and color not in seen:
                seen.add(color)
                print(f"B{b_ctr}.L{l_ctr}.S{s_ctr}: {color:8}, hex {hex(color):6}")

Output for PyMuPDF version 1.24.14, with positive colour numbers:

1.24.14
B0.L0.S0:    44526, hex 0xadee
B2.L0.S0:        0, hex 0x0   
B6.L1.S0: 16777215, hex 0xffffff

Output for PyMuPDF version 1.25.0, with negative colour numbers:

1.25.0
B0.L0.S0: -16732433, hex -0xff5111
B2.L0.S0: -16777216, hex -0x1000000
B6.L0.S0:       -1, hex -0x1  

This will break any code using PyMuPDF to find text based on predetermined color codes!

As to what caused this, the MuPDF release notes (https://mupdf.com/releases/history) for 1.25.0 RC2 do say that color changed to rgba with the addition of an alpha channel. I can't find anything that seems related in the PyMuPDF GitHub repository.
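
As a stopgap until a fix ships, the two sets of values can be compared after dropping the top byte. This is only a sketch: it assumes the negative numbers are signed 32-bit ARGB values whose extra byte is the alpha channel, which fits the release note but is not confirmed in this thread. The helper names (`to_rgb24`, `split_channels`, `close_enough`) are made up for illustration. Masking alone does not reproduce the first 1.24.14 value: -16732433 masks to 0xaeef, while 1.24.14 reported 0xadee, so each channel is off by one and a small tolerance is needed.

```python
def to_rgb24(color: int) -> int:
    """Keep only the low 24 bits, leaving 0xRRGGBB."""
    return color & 0xFFFFFF


def split_channels(color: int) -> tuple[int, int, int, int]:
    """Return (alpha, red, green, blue), ASSUMING a 32-bit ARGB layout."""
    value = color & 0xFFFFFFFF  # reinterpret a negative int as unsigned 32-bit
    return (
        (value >> 24) & 0xFF,
        (value >> 16) & 0xFF,
        (value >> 8) & 0xFF,
        value & 0xFF,
    )


def close_enough(a: int, b: int, tol: int = 1) -> bool:
    """Compare two colors per RGB channel, ignoring the assumed alpha byte."""
    rgb_a = split_channels(a)[1:]
    rgb_b = split_channels(b)[1:]
    return all(abs(x - y) <= tol for x, y in zip(rgb_a, rgb_b))


# Value pairs copied from the 1.24.14 and 1.25.0 outputs in this report.
for old, new in [(44526, -16732433), (0, -16777216), (16777215, -1)]:
    alpha = split_channels(new)[0]
    print(
        f"{old:#08x} vs {to_rgb24(new):#08x}  "
        f"alpha={alpha:#04x}  close={close_enough(old, new)}"
    )
```

With these inputs the assumed alpha byte comes out as 0xff for all three 1.25.0 values, which is what an opaque color would look like under this reading.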

(Note also that grouping of text into blocks/lines/spans changed between 1.24.14 and 1.25.0.)
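
Because the grouping changed, the B/L/S labels in the two outputs are not directly comparable (the white text shows up as B6.L1.S0 in one and B6.L0.S0 in the other). A rough way to compare versions without relying on those indices is to total the characters per color. The sketch below works on a plain dict shaped like the result of `page.get_text("dict")`; `summarize_colors` and the sample data are hypothetical.

```python
from collections import Counter


def summarize_colors(page_dict: dict) -> Counter:
    """Count non-whitespace characters per span color, ignoring grouping."""
    totals = Counter()
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                color = span.get("color")
                if color is None:
                    continue
                text = span.get("text", "")
                totals[color] += sum(1 for ch in text if not ch.isspace())
    return totals


# Hypothetical data: the same text grouped two different ways.
one_span = {"blocks": [{"lines": [{"spans": [{"color": 0, "text": "Hello world"}]}]}]}
two_spans = {"blocks": [{"lines": [
    {"spans": [{"color": 0, "text": "Hello "}]},
    {"spans": [{"color": 0, "text": "world"}]},
]}]}
print(summarize_colors(one_span) == summarize_colors(two_spans))
```

This only removes the dependence on grouping. The color keys themselves still differ between 1.24.14 and 1.25.0 and have to be reconciled separately.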

How to reproduce the bug

See description above.

PyMuPDF version

1.25.0

Operating system

Linux

Python version

3.11

@JorjMcKie added the bug, fix developed, and release schedule to be determined labels on Dec 11, 2024
@JorjMcKie
Collaborator

Will be corrected in next version.
