Certain epubs only read the first page #146

Open
luky92 opened this issue Dec 28, 2024 · 22 comments

Comments

@luky92

luky92 commented Dec 28, 2024

For some epubs only the first page is read. You can download a sample epub that does this here:
https://wolnelektury.pl/katalog/lektura/brzydkie-kaczatko/
The language is Polish.
Let me know if you need any help or debug logs; I'm happy to provide them.
(Note: this ebook is in the public domain, in case that's important.)

@ROBERT-MCDOWELL
Collaborator

ROBERT-MCDOWELL commented Dec 28, 2024

The link you provided is a sample of the book, right? I see only one page in this sample.
Does your issue come from the full epub?

@luky92
Author

luky92 commented Dec 28, 2024

No, it is a full "book". I assure you it's not only one page.
Screenshot of an epub reader view:
image

@luky92
Author

luky92 commented Dec 28, 2024

Just for convenience, here is a zip with the file:
brzydkie-kaczatko.zip

@ROBERT-MCDOWELL
Collaborator

When I edit the book with calibre, I can see only part1.xhtml as a part of the epub with some real text, right?
The other pages are fund1, fund2, etc., and fund1 contains
"Książka, którą czytasz, pochodzi z biblioteki fundacji Wolne Lektury. Naszą misją jest wspieranie dzieciaków w dostępie do lektur szkolnych oraz zachęcanie ich do czytania. Miło Cię poznać!"
Is this the only page that gets converted?

@luky92
Author

luky92 commented Dec 28, 2024

Yes, you are right: part 1 is the whole book. The rest of it is the title page and some fundraising material.

@ROBERT-MCDOWELL
Collaborator

Here is the issue: in the ebook world there are absolutely no rules for how to name or separate the real book text from legal, annotation, or technical text. That makes it very difficult to filter the ebook pages so as to keep only the real text of the book and stop the A.I. voice from reading out all the technical material, which is very annoying and tedious to hear.
For now eb2ab selects the pages that share a repeating name. However, this ebook has 1 valid page (part1.xhtml) but 5 pages named fundX.html which are not real text, causing a false positive. I'm working on finding another way to filter it.

@luky92
Author

luky92 commented Dec 28, 2024

Sure, and I don't have a problem with hearing the whole book including this type of metadata. The problem is that it SKIPS the actual story.

@luky92
Author

luky92 commented Dec 28, 2024

So for some reason part1 is totally skipped.

@ROBERT-MCDOWELL
Collaborator

ROBERT-MCDOWELL commented Dec 28, 2024

The script cannot filter the text itself, only the pages. Since part1 is only one page and fundX is several pages, the script thinks the real text is in fundX, because a book usually has more than one page. That's the issue to solve.
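
To make the false positive concrete, here is a minimal sketch of a "largest group of similarly named pages wins" rule. It is not the project's actual code: the function names and the exact page list are hypothetical, chosen only to mirror the one part1 page and five fundX pages described above.

```python
import re
from collections import defaultdict


def group_by_name_stem(page_names):
    """Group page file names by their stem with trailing digits removed."""
    groups = defaultdict(list)
    for name in page_names:
        base = name.rsplit(".", 1)[0]        # drop the extension
        stem = re.sub(r"\d+$", "", base)     # "fund3" -> "fund"
        groups[stem].append(name)
    return dict(groups)


def pick_largest_group(page_names):
    """Hypothetical selection rule: keep the stem that has the most pages."""
    groups = group_by_name_stem(page_names)
    return max(groups.values(), key=len)


# Illustrative file names only; the real archive may use other names.
pages = ["part1.xhtml", "fund1.xhtml", "fund2.xhtml",
         "fund3.xhtml", "fund4.xhtml", "fund5.xhtml"]
selected = pick_largest_group(pages)
print(selected)  # the five fund pages are chosen; part1.xhtml is dropped
```

With a rule like this, a book whose whole story sits in a single file always loses to any boilerplate that is split across several numbered files.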

@ROBERT-MCDOWELL ROBERT-MCDOWELL self-assigned this Dec 28, 2024
@ROBERT-MCDOWELL ROBERT-MCDOWELL added the bug Something isn't working label Dec 28, 2024
@danieltomasz

@luky92 the workaround would be to use another format provided by the portal directly (pdf or txt, for example).

@luky92
Author

luky92 commented Dec 28, 2024

Yeah, I already figured that out. On the plus side, this file was only for testing, so no biggie, and I assume normal full books will not have this problem.

@danieltomasz

danieltomasz commented Dec 28, 2024

The actual content uses `<p class="paragraph">`, while the fundraising pages use either `<div class="fundraising">` or `<p class="info">`. Maybe some sort of heuristic for filtering is possible, but it would work only for this particular portal.
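
A rough sketch of what such a portal-specific check could look like, using only Python's standard `html.parser`. The class names come from the comment above; the `ClassCounter` and `looks_like_story` names and the "more content tags than boilerplate tags" rule are assumptions for illustration, and none of this would carry over to other publishers.

```python
from html.parser import HTMLParser

# Class names observed in this one portal's files (per the comment above).
CONTENT_CLASSES = {"paragraph"}
BOILERPLATE_CLASSES = {"fundraising", "info"}


class ClassCounter(HTMLParser):
    """Count start tags carrying content vs. boilerplate class names."""

    def __init__(self):
        super().__init__()
        self.content = 0
        self.boilerplate = 0

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if classes & CONTENT_CLASSES:
            self.content += 1
        if classes & BOILERPLATE_CLASSES:
            self.boilerplate += 1


def looks_like_story(xhtml):
    """Assumed rule: a page is story text if content tags outnumber the rest."""
    counter = ClassCounter()
    counter.feed(xhtml)
    return counter.content > counter.boilerplate
```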

@ROBERT-MCDOWELL
Collaborator

ROBERT-MCDOWELL commented Dec 28, 2024

@danieltomasz OK, but understand that the millions of ebooks out there each have their own classes, their own pages, their own xhtml markup, etc.
There are absolutely no standards followed by any kind of epub consortium (except epub3, which started on a few).
It's a complete mess from 16 years of ebooks. You can imagine we cannot code a filter per ebook or per website; that would be madness for long-term software. The only way for now is to compute the average character length for each group of incrementally named xhtml pages, since the real text usually has more characters per page. It's not ideal, though: a poetry book can have very little text on each page and still include a full page of technical text, a preface, or something else.
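
As a hedged illustration of the average-length idea, the sketch below groups pages by name stem and keeps the group with the highest mean character count. The helper names and the sample texts are made up for the example; the real update may work differently.

```python
import re
from collections import defaultdict
from statistics import mean


def stem_of(name):
    """'fund3.xhtml' -> 'fund' (extension and trailing digits removed)."""
    base = name.rsplit(".", 1)[0]
    return re.sub(r"\d+$", "", base)


def pick_group_by_avg_length(page_texts):
    """page_texts maps file name -> extracted plain text of that page."""
    lengths = defaultdict(list)
    for name, text in page_texts.items():
        lengths[stem_of(name)].append(len(text))
    best = max(lengths, key=lambda stem: mean(lengths[stem]))
    return [name for name in page_texts if stem_of(name) == best]


# The case from this issue: one long story page, several short boilerplate pages.
book = {"part1.xhtml": "x" * 20000, "fund1.xhtml": "y" * 300, "fund2.xhtml": "y" * 250}
# The poetry counterexample: short poems, one long legal page.
poetry = {"poem1.xhtml": "a" * 80, "poem2.xhtml": "b" * 90, "legal1.xhtml": "c" * 1200}

print(pick_group_by_avg_length(book))    # keeps part1.xhtml
print(pick_group_by_avg_length(poetry))  # wrongly keeps legal1.xhtml
```

The second call shows the weakness mentioned above: a page of legal text easily outweighs a book of short poems.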

@luky92
Author

luky92 commented Dec 28, 2024

Now that I've read up on it, I'm telling you from experience parsing PDFs: don't try to do any kind of fancy filtering or heuristics here. The best you can do, if you want, is display the pages to the user as a list with a preview and let them choose which ones should be read. Otherwise you are going to get bug reports about this sort of thing all the time.
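
A minimal sketch of the "list with preview" idea, assuming the EPUB is read as a plain zip archive and tags are stripped with a crude regular expression. The `page_previews` name is hypothetical, and a real implementation would more likely follow the reading order in the package document than the archive order.

```python
import re
import zipfile
from html import unescape


def page_previews(epub_file, preview_chars=80):
    """Return (name, preview) for each HTML/XHTML file in the EPUB archive."""
    previews = []
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            markup = archive.read(name).decode("utf-8", errors="replace")
            text = unescape(re.sub(r"<[^>]+>", " ", markup))  # crude tag strip
            text = " ".join(text.split())                     # collapse whitespace
            previews.append((name, text[:preview_chars]))
    return previews
```

Each `(name, preview)` pair could then be shown in a checklist so the user decides which pages are sent to the TTS step.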

@ROBERT-MCDOWELL
Collaborator

You forgot that this is TTS (text to speech)! Tell me how to let the user choose which text to listen to. By creating hundreds of indexes in the audio file, with the text at each index? How would a blind person choose from that?

@luky92
Author

luky92 commented Dec 28, 2024 via email

@ROBERT-MCDOWELL
Collaborator

You are talking about something that has nothing to do with the issue we have here.
An audiobook is AUDIO. Some listeners will use it with an AUDIO player, with no screen and no keyboard. That's why the challenge is much more complex than filtering or parsing a pdf.

@luky92
Author

luky92 commented Dec 28, 2024 via email

@ROBERT-MCDOWELL
Collaborator

There is always a way, but none is 100% bulletproof. The most complex would be to add another A.I. model that distinguishes book text from legal and technical text, but in +1124 languages... forget it.

@luky92
Author

luky92 commented Dec 29, 2024

Thinking about it, maybe you could use a text LLM to help with that, something small like one of the Microsoft phi models. Just give it a snippet of each page in the file and ask whether it's an attribution text or part of the main book. Granted, you would have to make sure you give it enough text to assess, but it could be something.

@ROBERT-MCDOWELL
Collaborator

You would gain maybe 3 to 5% in accuracy over text length. Text length is still not a solution: I have seen poetry books with 2 sentences per page and legal printer notices with more than 1000 chars. You can search by keyword indeed, but with so many languages that is another project within the project.

@ROBERT-MCDOWELL
Collaborator

OK, I ran a test on your ebook with my latest update and it seems to work now.
Please confirm once you have tested on your side, then close this issue.

@ROBERT-MCDOWELL ROBERT-MCDOWELL added fixed in next update (pending) and removed bug Something isn't working labels Jan 10, 2025