Certain epubs only read the first page #146

luky92 · 2024-12-28T14:27:30Z

For some EPUBs, only the first page is read. You can download a sample EPUB that does this here:
https://wolnelektury.pl/katalog/lektura/brzydkie-kaczatko/
The language is Polish.
Let me know if you need any help or debug logs; I'm happy to provide them.
(Note: this ebook is in the public domain, in case that's important.)

ROBERT-MCDOWELL · 2024-12-28T14:35:27Z

The link you provided is a sample of the book, right? I see only one page in this sample.
Does your issue come from the full EPUB?

luky92 · 2024-12-28T14:44:39Z

No, it is a full "book"; I assure you it's not only one page.
(Screenshot of an EPUB reader view attached.)

luky92 · 2024-12-28T14:47:25Z

Just for convenience, here is a zip with the file:
brzydkie-kaczatko.zip

ROBERT-MCDOWELL · 2024-12-28T15:08:29Z

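When I edit the book with calibre, I can see only part1.xhtml as a part of the EPUB with some real text, right?
The other pages are fund1, fund2, etc., and fund1 contains:
"Książka, którą czytasz, pochodzi z biblioteki fundacji Wolne Lektury. Naszą misją jest wspieranie dzieciaków w dostępie do lektur szkolnych oraz zachęcanie ich do czytania. Miło Cię poznać!"
(In English: "The book you are reading comes from the library of the Wolne Lektury foundation. Our mission is to support kids in accessing school reading and to encourage them to read. Nice to meet you!")
Is it only this page that gets converted?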

luky92 · 2024-12-28T15:23:08Z

Yes, you are right: part1 is the whole book. The rest of it is things like the title page and some foundation stuff.

ROBERT-MCDOWELL · 2024-12-28T15:31:38Z

Here is the issue: in the ebook world there are absolutely no rules for how to name or separate the real book text from legal notices, annotations, and technical text. That makes it very difficult to filter the ebook pages so as to keep just the real text of the book and avoid having the A.I. voice read out all the technical stuff, which is very annoying and a burden to hear.
For now eb2ab selects the pages with a repetitive name. However, this ebook has 1 valid page (part1.xhtml) but 5 pages named fundX.html which are not real text, causing a false positive. I'm working on finding another way to filter it.
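
A minimal sketch of how a name-based selection like the one described above can go wrong on this book. It is not taken from the eb2ab source: it assumes pages are grouped by file name with the trailing number removed and that the largest group is kept. The helper names `group_by_stem` and `pick_largest_group` are hypothetical.

```python
import re
from collections import defaultdict


def group_by_stem(names):
    """Group page file names by their name with the trailing digits removed."""
    groups = defaultdict(list)
    for name in names:
        base = name.rsplit(".", 1)[0]      # drop the extension
        stem = re.sub(r"\d+$", "", base)   # "fund3" -> "fund", "part1" -> "part"
        groups[stem].append(name)
    return dict(groups)


def pick_largest_group(names):
    """Assumed heuristic: the most repeated name pattern is the book text."""
    groups = group_by_stem(names)
    return max(groups.values(), key=len)


# Page names as reported in this thread for brzydkie-kaczatko.
pages = ["part1.xhtml", "fund1.html", "fund2.html",
         "fund3.html", "fund4.html", "fund5.html"]
selected = pick_largest_group(pages)
print(selected)
```

With these six names the five fundX pages form the largest group, so part1.xhtml, the only page with story text, is left out.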

luky92 · 2024-12-28T15:32:56Z

Sure, and I don't have a problem with having the whole book including this type of metadata. The problem is that it SKIPS the actual story.

luky92 · 2024-12-28T15:33:29Z

So for some reason part1 is totally skipped.

ROBERT-MCDOWELL · 2024-12-28T15:48:22Z

The script cannot filter the text itself, only the pages. part1 is only one page and fundX is several pages, so the script thinks the real text is in fundX, since a book usually has more than one page. That's the issue to solve.

danieltomasz · 2024-12-28T16:31:25Z

@luky92 the workaround would be to directly use another format provided by the portal (PDF or TXT, for example).

luky92 · 2024-12-28T16:34:59Z

Yeah, I already figured that out. On the plus side, this file was only for testing, so no biggie, and I assume normal full books will not have this problem.

danieltomasz · 2024-12-28T16:37:27Z

The actual content has `<p class="paragraph">`; the fundraising pages have either `<div class="fundraising">` or `<p class="info">`. Maybe some sort of heuristic for filtering is possible, but it would work only for this particular portal.
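
A rough sketch of the class-based heuristic mentioned above, using only Python's standard `html.parser`. The class names come from this comment; everything else (the `ClassCounter` and `looks_like_story` names, the "more content tags than boilerplate tags" rule) is an assumption made for illustration, and as noted it would only fit this one portal.

```python
from html.parser import HTMLParser

# Class names observed in this particular portal's EPUBs (see comment above).
CONTENT_CLASSES = {"paragraph"}
BOILERPLATE_CLASSES = {"fundraising", "info"}


class ClassCounter(HTMLParser):
    """Counts start tags carrying a content class or a boilerplate class."""

    def __init__(self):
        super().__init__()
        self.content = 0
        self.boilerplate = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = set((dict(attrs).get("class") or "").split())
        if classes & CONTENT_CLASSES:
            self.content += 1
        if classes & BOILERPLATE_CLASSES:
            self.boilerplate += 1


def looks_like_story(xhtml):
    """Hypothetical rule: a page is story text if content tags dominate."""
    counter = ClassCounter()
    counter.feed(xhtml)
    return counter.content > counter.boilerplate


story_page = '<body><p class="paragraph">Text.</p><p class="paragraph">More.</p></body>'
fund_page = '<body><div class="fundraising"><p class="info">Text.</p></div></body>'
print(looks_like_story(story_page), looks_like_story(fund_page))
```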

ROBERT-MCDOWELL · 2024-12-28T18:49:38Z

@danieltomasz OK, but understand that the millions of ebooks out there each have their own classes, own pages, own XHTML markup, etc.
No standard from any kind of EPUB consortium is actually followed (except in EPUB3, which started to introduce a few).
It's a complete mess after 16 years of ebooks. You can imagine that we cannot code a filter per ebook or per website; that would be madness for long-term software. The only way for now is to compute the average character length for each group of incrementing XHTML page names, as the real text usually has more characters per page. But it's not ideal, since you can have poetry books with little text on each page next to a full page of technical text, a preface, or something else.
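
A small sketch of the average-length idea described above, under stated assumptions: pages are grouped by file name with the trailing number removed, and the group with the highest mean character count is taken as the book text. The function name `pick_by_average_length` and the sample lengths are made up for illustration; this is not the eb2ab implementation.

```python
import re
from collections import defaultdict
from statistics import mean


def pick_by_average_length(pages):
    """pages maps file name -> extracted plain text.

    Returns the names in the group with the highest mean text length.
    """
    groups = defaultdict(list)
    for name in pages:
        stem = re.sub(r"\d+$", "", name.rsplit(".", 1)[0])
        groups[stem].append(name)
    best = max(groups.values(),
               key=lambda names: mean(len(pages[n]) for n in names))
    return sorted(best)


# Case from this issue: one long story page, several short notice pages.
book = {"part1.xhtml": "x" * 20000, "fund1.html": "y" * 200, "fund2.html": "z" * 300}
print(pick_by_average_length(book))

# The counterexample from the comment: short poems, one long legal page.
poetry = {"poem1.xhtml": "a" * 80, "poem2.xhtml": "b" * 90, "legal1.xhtml": "c" * 1200}
print(pick_by_average_length(poetry))
```

The first call keeps part1.xhtml, but the second keeps legal1.xhtml and drops the poems, which is exactly the weakness pointed out above.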

luky92 · 2024-12-28T23:22:47Z

Now that I've read up on it, I'm telling you from experience with parsing PDFs: don't try to do any kind of fancy filtering or heuristics here. The best you can do, if you want, is display the pages to the user as a list with previews and let them choose which ones should be read. Otherwise you are going to get bug reports about this sort of thing all the time.

ROBERT-MCDOWELL · 2024-12-28T23:28:03Z

You forgot that this is TTS (text to speech)! Tell me how to let the user choose which text to listen to: by creating hundreds of indexes in the audio file, with the text on each index? How will a blind person choose from that?

luky92 · 2024-12-28T23:32:05Z

That's the easy part, at least for PDFs and, from what I understand, EPUBs as well: you just display to the user the list of pages you can read and let them choose. You can let them choose the order too, and by default just read everything.

ROBERT-MCDOWELL · 2024-12-28T23:42:38Z

You are talking about something that has nothing to do with the issue we have here.
An audiobook is AUDIO; some listeners will use it with an AUDIO player, with no screen and no keyboard. That's why the challenge is much more complex than filtering or parsing a PDF.

luky92 · 2024-12-28T23:47:16Z

Oh, different goals, yep. I assumed we were talking about something that is just an ebook-to-audiobook converter, so that I can listen to the result later in another player. But if we are talking about staying in the same app, I don't see a way to skip the legalese or other parts of the book. You could do it automatically using the length of the page, but that's not foolproof at all.

ROBERT-MCDOWELL · 2024-12-29T00:00:01Z

There is always a way, but none is 100% bulletproof. The most complex would be to add another A.I. with a model that distinguishes book text from legal and technical text, but in 1124+ languages... forget it.

luky92 · 2024-12-29T00:00:25Z

Thinking about it, maybe you could use a text LLM to help with that, something small like one of the Microsoft Phi models. Just give it a snippet of each page in the file and ask it whether it's an attribution text or part of the main book. Granted, you would have to make sure you give it enough to assess, but it could be something.

ROBERT-MCDOWELL · 2024-12-29T00:43:43Z

You would gain maybe 3 to 5% accuracy over text length. Text length is still not a solution: I have seen poetry books with 2 sentences per page, and legal printer notices with more than 1000 characters. You could search by keyword, indeed, but with so many languages that's another project within the project.

ROBERT-MCDOWELL · 2025-01-05T18:36:23Z

OK, I tested your ebook with my latest update and it seems to work now.
Please confirm once you have tested on your side, and then close this issue.

ROBERT-MCDOWELL self-assigned this Dec 28, 2024

ROBERT-MCDOWELL added the "bug" label Dec 28, 2024

ROBERT-MCDOWELL added the "fixed in next update (pending)" label and removed the "bug" label Jan 10, 2025
