Read page numbers from outline "action" #310

Open
oicerid opened this issue Feb 21, 2022 · 7 comments

Comments

@oicerid

oicerid commented Feb 21, 2022

I'm trying to figure out how to extract the page number of an OutlineItem when <Action> is returned, since this feature doesn't seem to be implemented yet (?).

Is there a workaround until it's implemented? Any idea of when that might be?

Is it possible to access the "raw" Pdf outline somehow to look for the /Page entry?

Take outlines.pdf as an example:

from pikepdf import Pdf
reader = Pdf.open("outlines.pdf")

with reader.open_outline() as outlines:
    for outline in outlines.root:
        print(outline)

returns:

[+] One -> <Action>
[ ] Two -> <Action>
[+] Three -> <Action>
@mara004
Contributor

mara004 commented Feb 22, 2022

Well, each object that you get when iterating over the outline root is an OutlineItem, and you can access its action dictionary directly (item.action). It usually has a /D key containing a destination, which can be direct or indirect. Assuming it is direct, you'll get an array consisting of a page object, a page location type, and between 0 and 4 coordinates. The page index can then be determined using pikepdf.Page(direct_dest[0]).index. (If it's an indirect destination, things become more complicated.)
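
As a rough illustration of the array layout described above, the sketch below splits a direct destination into its three parts. The names `split_destination` and `FIT_ARGS` are made up for this example and are not part of pikepdf; the operand counts follow the destination types listed in the PDF specification. The helper only uses `len()` and indexing, so it should accept a pikepdf.Array as well as a plain list, and it assumes that `str()` of a PDF name yields text such as "/XYZ".

```python
# Maximum number of coordinate operands per destination type
# (PDF specification, "Explicit destinations").
FIT_ARGS = {
    "/XYZ": 3,    # left, top, zoom
    "/Fit": 0,
    "/FitH": 1,   # top
    "/FitV": 1,   # left
    "/FitR": 4,   # left, bottom, right, top
    "/FitB": 0,
    "/FitBH": 1,  # top
    "/FitBV": 1,  # left
}


def split_destination(dest):
    """Split a direct destination into (page, fit_type, coords).

    `dest` is anything indexable with a length, for example a
    pikepdf.Array or a list. Coordinates are returned as found.
    """
    if len(dest) < 2:
        raise ValueError("a destination needs a page and a location type")
    page = dest[0]
    fit_type = str(dest[1])  # assumption: str() of a PDF name gives "/XYZ" etc.
    coords = [dest[i] for i in range(2, len(dest))]
    expected = FIT_ARGS.get(fit_type)
    if expected is None:
        raise ValueError(f"unknown location type {fit_type!r}")
    # Some files omit trailing operands, so only reject surplus ones.
    if len(coords) > expected:
        raise ValueError(f"{fit_type} takes at most {expected} coordinates")
    return page, fit_type, coords
```

With pikepdf, the first element of the result would be the page object that can be passed to pikepdf.Page(...) to get the index.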

I'm trying to figure out how to extract the page number of an OutlineItem when <Action> is returned, since this feature doesn't seem to be implemented yet (?).

I believe that libqpdf recently added QPDFOutlineObjectHelper::getDestPage() and some other useful methods related to parsing the PDF table of contents, but (as far as I can see) pikepdf doesn't have bindings for them yet.
Technically, you could of course implement a bookmark page resolver manually using the means pikepdf currently provides, but depending on your needs this may be rather cumbersome.

(In the meantime, I can also suggest using pymupdf.Document.get_toc() if you don't mind the AGPL3.)

@oicerid
Author

oicerid commented Feb 22, 2022

Thanks for the answer.

In the above example item.action returns NotImplementedError: don't know how to __str__ this object - so it doesn't seem to be possible to access anything in that case. Is that because it's an indirect destination?

I'm currently using PdfFileReader.getDestinationPageNumber() from PyPDF2 to get this info, but since it's not maintained I felt it was time to try and convert to something else.

Will have a look at qpdf and pymupdf and see if they may be an option.

@jbarlow83
Member

Yes, the outline code unfortunately doesn't handle actions at this time, only outline entries explicitly defined with a page destination. Actions can be a lot of things other than going to a page.
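
To make the point about action types concrete: an action dictionary carries its type in the /S entry, and only some types point at a page in the same document. The following is a minimal sketch, assuming the action behaves like a mapping with string keys and a `.get()` method (which is how a pikepdf.Dictionary can be used). The function name `action_kind` is hypothetical, and the set of types is a partial selection from the PDF specification, not a complete list.

```python
# Action types that jump to a destination inside the same document.
PAGE_ACTIONS = {"/GoTo"}

# A few of the other action types defined by the PDF specification.
OTHER_ACTIONS = {"/GoToR", "/GoToE", "/URI", "/Launch", "/Named", "/JavaScript"}


def action_kind(action):
    """Classify an action dictionary as 'page', 'other' or 'unknown'."""
    subtype = action.get("/S")
    if subtype is None:
        return "unknown"
    subtype = str(subtype)  # assumption: str() of a PDF name gives "/GoTo" etc.
    if subtype in PAGE_ACTIONS:
        return "page"
    if subtype in OTHER_ACTIONS:
        return "other"
    return "unknown"
```

Only items classified as "page" are worth passing on to a page-number lookup; the rest have no page in this document to report.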

@mara004
Contributor

mara004 commented Feb 23, 2022

In the above example item.action returns NotImplementedError: don't know how to __str__ this object - so it doesn't seem to be possible to access anything in that case. Is that because it's an indirect destination?

That the object doesn't implement __str__ does not mean it can't be accessed. If you wish to print the action, I think you need to use print(repr(item.action)). That said, it should be possible to work with the action as with any other PDF dictionary. For example, you could do something like this:

if '/D' in item.action:
    dest = item.action.D
    # assuming a direct destination
    assert isinstance(dest, pikepdf.Array)
    page_obj = dest[0]
    page_index = pikepdf.Page(page_obj).index
    print(page_index)

@oicerid
Author

oicerid commented Feb 23, 2022

That said, it should be possible to work with the action as with any other PDF dictionary. For example, you could do something like this:

if '/D' in item.action:
    dest = item.action.D
    # assuming a direct destination
    assert isinstance(dest, pikepdf.Array)
    page_obj = dest[0]
    page_index = pikepdf.Page(page_obj).index
    print(page_index)

Thanks, I haven't fully understood how to work with pikepdf yet, but this is at least one step closer :)

After trying your code snippet I can conclude that it's not a direct destination but an indirect one. When printing repr(item.action) I get:

pikepdf.Dictionary({
  "/D": "0",
  "/S": "/GoTo"
})

Is it possible to look up the "/D" value somewhere within pikepdf in this case? Cause I'm guessing it can be resolved to a "/Page" entry somewhere.

@mara004
Contributor

mara004 commented Feb 24, 2022

Is it possible to look up the "/D" value somewhere within pikepdf in this case? Cause I'm guessing it can be resolved to a "/Page" entry somewhere.

Yes, it should be possible to resolve the indirect/named destination to a direct one. I suppose the document has a name tree at pdf.Root.Names.Dests which can basically be used like a dictionary to map from indirect to direct destinations, thanks to the NameTree support model of pikepdf/qpdf:

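import pikepdf

named_dest = item.action.D
# a named destination is a string (here "0"), not a dictionary
assert isinstance(named_dest, pikepdf.String)
name_mapping = pikepdf.NameTree(pdf.Root.Names.Dests)
# the name tree is looked up by str; the entry is assumed to be a direct
# destination array (it may also be a dictionary holding the array under /D)
direct_dest = name_mapping[str(named_dest)]
page = pikepdf.Page(direct_dest[0])
print(page.index)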
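
The snippets so far look at a single `item`. To cover nested bookmarks as well, the outline can be walked recursively through each item's `children` list. The sketch below is an illustration, not part of pikepdf: `walk_outline` and `raw_target` are made-up names, and the code assumes only that items expose `title`, `destination`, `action` and `children`, as pikepdf.OutlineItem is understood to do.

```python
def walk_outline(items, depth=0):
    """Yield (depth, item) for every outline item, parents before children."""
    for item in items:
        yield depth, item
        yield from walk_outline(item.children, depth + 1)


def raw_target(item):
    """Return what the item points at, still unresolved.

    Prefers the item's own destination, then the /D entry of its
    action, and returns None when neither is present.
    """
    if item.destination is not None:
        return item.destination
    if item.action is not None and "/D" in item.action:
        return item.action["/D"]
    return None


# Possible use with pikepdf (assumes `pdf` is an open pikepdf.Pdf):
#
# with pdf.open_outline() as outline:
#     for depth, item in walk_outline(outline.root):
#         print("  " * depth + item.title, repr(raw_target(item)))
```

The value returned by `raw_target` is either a direct destination array or a name that still has to be looked up, so the two cases discussed above apply to it unchanged.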

@oicerid
Author

oicerid commented Feb 26, 2022

Thank you so much for your help! I will test your code when I get a chance; it looks like it supports indirect destinations? If it doesn't work, I at least now know where to look :)
