Certain epubs only read the first page #146
Comments
The link you provided is a sample of the book, right? I see only one page in this sample.
Just for convenience, here is a zip with the file.
When I edit the book with calibre, I can see only part1.xhtml as a part of the epub with some real text, right?
Yes, you are right: part 1 is the whole book. The rest of it is the title page and some founding stuff.
Here is the issue: in the ebook world there are absolutely no rules for how to name or separate the real book text from legal, annotation, and technical text. That makes it very difficult to filter the ebook pages so that only the real text of the book is kept and the A.I. speech does not read out all the technical text, which is very annoying and a burden to hear.
Sure, and I don't have a problem with having the whole book including this type of metadata. The problem is that it SKIPS the actual story.
So for some reason part1 is totally skipped.
The script cannot filter the text itself, only the pages. part1 is only one page and fundX is several pages, so the script thinks the real text is in fundX, since a book usually has more than one page. That's the issue to solve.
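The failure mode described above can be sketched in a few lines. This is only an illustration, not the project's actual code: the grouping rule, the function names, the file names `fund1.xhtml` to `fund3.xhtml`, and the character counts are all assumptions made up for the example.

```python
import re
from collections import defaultdict


def group_pages(names):
    """Group page file names by their stem with trailing digits removed."""
    groups = defaultdict(list)
    for name in names:
        stem = name.rsplit(".", 1)[0]      # "part1.xhtml" -> "part1"
        key = re.sub(r"\d+$", "", stem)    # "part1" -> "part"
        groups[key].append(name)
    return dict(groups)


def pick_by_page_count(groups):
    """Assumed heuristic: the group with the most pages is the book."""
    return max(groups, key=lambda k: len(groups[k]))


def pick_by_text_length(groups, lengths):
    """Alternative: the group with the most characters is the book."""
    return max(groups, key=lambda k: sum(lengths[n] for n in groups[k]))


# Hypothetical layout: one long story page, several short "fund" pages.
lengths = {
    "title.xhtml": 120,
    "part1.xhtml": 25000,
    "fund1.xhtml": 400,
    "fund2.xhtml": 350,
    "fund3.xhtml": 500,
}
groups = group_pages(lengths)
by_count = pick_by_page_count(groups)
by_length = pick_by_text_length(groups, lengths)
```

With this made-up layout the page-count rule picks the `fund` group and skips the story, while the text-length rule picks `part`. Text length has its own failure cases, so neither rule is reliable on its own.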
@luky92 the workaround would be to use another format provided by the portal directly (pdf or txt, for example).
Yeah, I already figured that out. On the plus side, this file was only for testing, so no biggie, and I assume normal full books will not have this problem.
the actual content has
@danieltomasz OK, but understand that the millions of ebooks out there each have their own classes, their own pages, their own xhtml markup, etc.
Now that I've read up on it, I'm telling you from experience with parsing pdfs: don't try to do any kind of fancy filtering or heuristics here. The best you can do, if you want, is display the pages to the user as a list with previews and let them choose which ones should be read. Otherwise you are going to have bugs about this sort of thing all the time.
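A minimal sketch of the "list with preview" idea, using only the Python standard library. It assumes the pages are already available as `(name, xhtml)` pairs; the function names and the dictionary layout are hypothetical and not taken from the project.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an (X)HTML page and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_text(xhtml):
    """Return the page text with whitespace collapsed (rough, preview only)."""
    parser = _TextExtractor()
    parser.feed(xhtml)
    parser.close()
    # Joining with a space may split words around inline tags; fine for previews.
    return " ".join(" ".join(parser.chunks).split())


def build_previews(pages, width=60):
    """Build one preview entry per page: index, name, length, first characters."""
    previews = []
    for index, (name, xhtml) in enumerate(pages):
        text = page_text(xhtml)
        previews.append(
            {"index": index, "name": name, "chars": len(text), "preview": text[:width]}
        )
    return previews


def select_pages(pages, chosen=None):
    """Return pages in the user's chosen order; by default read everything."""
    if chosen is None:
        return list(pages)
    return [pages[i] for i in chosen]


# Hypothetical two-page book.
pages = [
    ("title.xhtml", "<html><body><h1>Title</h1></body></html>"),
    ("part1.xhtml", "<html><body><p>Once upon a time</p></body></html>"),
]
previews = build_previews(pages)
```

The default of reading everything matches the suggestion above. Whether such a list is usable in an audio-only setting is a separate question that this sketch does not address.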
You forgot that this is TTS (text to speech)! Tell me how to let the user choose which text to listen to: by creating hundreds of indexes on the audio file, with the text on each index? How will a blind person choose from it?
That's the easy part, at least for pdfs and, from what I understand, epubs as well: you just display to the user the list of pages you can read and let them choose. You can let them choose the order too, and by default just always read everything.
Sun, 29 Dec 2024, 00:28, ROBERT MCDOWELL <***@***.***> wrote:
… you forgot that this is TTS (text to speech)! Tell me how to let the user choose which text to listen to: by creating hundreds of indexes on the audio file, with the text on each index? How will a blind person choose from it?
You are talking about something that has nothing to do with the issue we have here...
Oh, different goals, yep. I assumed we were talking about something that is just an ebook-to-audiobook converter, so that I can listen to the books later in another player. But if we are talking about staying in the same app, I don't see a way to skip the legalese or other parts of the book. You could do it automatically based on the length of the page, but that's not foolproof at all.
Sun, 29 Dec 2024, 00:43, ROBERT MCDOWELL <***@***.***> wrote:
… you are talking about something that has nothing to do with the issue we have here... An audiobook is AUDIO. Some listeners will use it with an AUDIO player, with no screen and no keyboard. That's why the challenge is much more complex than filtering or parsing a pdf.
There is always a way, but none is 100% bulletproof. The most complex would be to add another A.I. with a model that distinguishes book text from legal and technical text, but in 1124+ languages... forget it.
Thinking about it, maybe you could use a text LLM to help with that, something small like one of the Microsoft phi models. Just give it a snippet of each page in the file and ask whether it's an attribution text or part of the main book. Granted, you would have to make sure you are giving it enough text to assess, but it could be something.
You will gain maybe 3 to 5% accuracy over text length. Text length is still not a solution: I saw poetry books with 2 sentences per page, and legal printer stuff with more than 1000 chars. You can search by keyword indeed, but with so many languages it's another project within the project.
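The two counterexamples above can be made concrete with a small sketch. The threshold of 500 characters, the function name, and both sample texts are invented for illustration only.

```python
def looks_like_story(text, min_chars=500):
    """Naive length heuristic: long pages are story, short pages are not."""
    return len(text) >= min_chars


# Hypothetical pages: a two-sentence poem page and a long legal notice.
poem_page = "The pond was still. The duckling slept."
legal_page = "Printed by an example press. All rights reserved. " * 25

poem_is_story = looks_like_story(poem_page)
legal_is_story = looks_like_story(legal_page)
```

Here the poem page is rejected and the legal notice is accepted, so both are misclassified. Moving the threshold only trades one error for the other.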
OK, I ran a test on your ebook with my latest update and it seems to work now.
For some epubs only the first page is read. You can download a sample epub that does this here:
https://wolnelektury.pl/katalog/lektura/brzydkie-kaczatko/
The language is Polish.
Let me know if you need any help or debug logs; I'm happy to provide them.
(Note: this ebook is in the public domain, just in case that's important.)