XTTS can only generate text with a maximum of 400 tokens. #140

Open
Code4SAFrankie opened this issue Dec 28, 2024 · 36 comments

@Code4SAFrankie

Processing 9.85%: : 798/8093 Traceback (most recent call last):
File "G:\ebook2audiobook-main\lib\functions.py", line 625, in convert_sentence_to_audio
output = params['tts'].inference(
^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\ebook2audiobook-main\python_env\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "G:\ebook2audiobook-main\python_env\Lib\site-packages\TTS\tts\models\xtts.py", line 528, in inference
text_tokens.shape[-1] < self.args.gpt_max_text_tokens
AssertionError: ❗ XTTS can only generate text with a maximum of 400 tokens.
Caught DependencyError: ❗ XTTS can only generate text with a maximum of 400 tokens.
Processing 9.85%: : 798/8093
Traceback (most recent call last):
File "G:\ebook2audiobook-main\lib\functions.py", line 625, in convert_sentence_to_audio
output = params['tts'].inference(
^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\ebook2audiobook-main\python_env\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "G:\ebook2audiobook-main\python_env\Lib\site-packages\TTS\tts\models\xtts.py", line 528, in inference
text_tokens.shape[-1] < self.args.gpt_max_text_tokens
AssertionError: ❗ XTTS can only generate text with a maximum of 400 tokens.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "G:\ebook2audiobook-main\lib\functions.py", line 584, in convert_chapters_to_audio
if convert_sentence_to_audio(params, session):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\ebook2audiobook-main\lib\functions.py", line 657, in convert_sentence_to_audio
raise DependencyError(e)
lib.functions.DependencyError: ❗ XTTS can only generate text with a maximum of 400 tokens.
Caught DependencyError: ❗ XTTS can only generate text with a maximum of 400 tokens.
convert_ebook() Exception: ❗ XTTS can only generate text with a maximum of 400 tokens.

@ROBERT-MCDOWELL
Collaborator

ROBERT-MCDOWELL commented Dec 28, 2024

There is currently a bug in how sentences are split. The next git update will solve your issue.
Which language are you using?

@Code4SAFrankie
Author

English

@ROBERT-MCDOWELL
Collaborator

Could you provide the part of the text that generates the error? I want to be sure my patch fixes it.

@Code4SAFrankie
Author

How would I know that?

@DrewThomasson
Owner

Just send a full log of the terminal as a txt file dude...

@Code4SAFrankie
Author

tts.log
This is the log

@Code4SAFrankie
Author

tts
This is an image of the page where it crashed.

@ROBERT-MCDOWELL
Collaborator

ROBERT-MCDOWELL commented Dec 31, 2024

Well, do you really think the A.I. speaker is going to read example code and a technical table without issue?
The day an A.I. voices an entire math book without glitches is not today, I tell you.

@majormer

Bleh. Hit me, too. Seems to hit even non-technical books. It just happened to have a really crappily formatted number as a gag...

Processing 57.16%: : 23464/41046 Sentence: Why? Why…well, the rules said so;
Like Kevin;
Like Tom,
Processing 57.17%: : 23465/41046 Sentence: the darling [Clown];
Like…
Multiply them? By what? It was just…why did the rules look different here? As if they had been written differently? Just a little word;
Processing 57.17%: : 23466/41046 Sentence: What…what did you multiply them by?
Every time you tried to figure it out,
Processing 57.17%: : 23467/41046 Sentence: it slipped away;
Which was why self-analysis never caught it;
S-strange;
Was this wrong? Why was this rule here?
It was so deep down it would be bad…to change it;
But why? Why—
Processing 57.17%: : 23468/41046 Sentence: did it look like something had been changed?
Focus;
What did you multiply Erin’s achievements by? Oh,
Processing 57.17%: : 23469/41046 Sentence: it was simple if you looked;
Though it was so long and precise; no wonder it was always skipped;
You had to round it because it was like counting…multiply Erin’s deeds by…
3;
Processing 57.17%: : 23469/41046 Traceback (most recent call last):
File "W:\Personal\Repos\ebook2audiobook\lib\functions.py", line 625, in convert_sentence_to_audio
output = params['tts'].inference(
^^^^^^^^^^^^^^^^^^^^^^^^
File "W:\Personal\Repos\ebook2audiobook\python_env\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "W:\Personal\Repos\ebook2audiobook\python_env\Lib\site-packages\TTS\tts\models\xtts.py", line 528, in inference
text_tokens.shape[-1] < self.args.gpt_max_text_tokens
AssertionError: ❗ XTTS can only generate text with a maximum of 400 tokens.
Caught DependencyError: ❗ XTTS can only generate text with a maximum of 400 tokens.
Processing 57.17%: : 23469/41046
Traceback (most recent call last):
File "W:\Personal\Repos\ebook2audiobook\lib\functions.py", line 625, in convert_sentence_to_audio
output = params['tts'].inference(
^^^^^^^^^^^^^^^^^^^^^^^^
File "W:\Personal\Repos\ebook2audiobook\python_env\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "W:\Personal\Repos\ebook2audiobook\python_env\Lib\site-packages\TTS\tts\models\xtts.py", line 528, in inference
text_tokens.shape[-1] < self.args.gpt_max_text_tokens
AssertionError: ❗ XTTS can only generate text with a maximum of 400 tokens.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "W:\Personal\Repos\ebook2audiobook\lib\functions.py", line 584, in convert_chapters_to_audio
if convert_sentence_to_audio(params, session):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "W:\Personal\Repos\ebook2audiobook\lib\functions.py", line 657, in convert_sentence_to_audio
raise DependencyError(e)
lib.functions.DependencyError: ❗ XTTS can only generate text with a maximum of 400 tokens.
Caught DependencyError: ❗ XTTS can only generate text with a maximum of 400 tokens.
convert_ebook() Exception: ❗ XTTS can only generate text with a maximum of 400 tokens.

=====================================================================

It was so deep down it would be bad…to change it. But why? Why—did it look like something had been changed?

Focus. What did you multiply Erin’s achievements by? Oh, it was simple if you looked. Though it was so long and precise; no wonder it was always skipped. You had to round it because it was like counting…multiply Erin’s deeds by…

3.1415926535 8979323846 2643383279 5028841971 6939937510 5820974944 5923078164 0628620899 8628034825 3421170679 8214808651 3282306647 0938446095 5058223172 5359408128 4811174502 8410270193 8521105559 6446229489 5493038196 4428810975 6659334461 2847564823 3786783165 2712019091 4564856692 3460348610 4543266482 1339360726 0249141273 7245870066 0631558817 4881520920 9628292540 9171536436 7892590360 0113305305 4882046652 1384146951 9415116094 3305727036 5759591953 0921861173 8193261179 3105118548 0744623799 6274956735 1885752724 8912279381 8301194912 9833673362 4406566430 8602139494 6395224737 1907021798 6094370277 0539217176 2931767523 8467481846 7669405132 0005681271 4526356082 7785771342 7577896091 7363717872 1468440901 2249534301 4654958537 1050792279 6892589235 4201995611 2129021960 8640344181 5981362977 4771309960 5187072113 4999999837 2978049951 0597317328 1609631859 5024459455 3469083026 4252230825 3344685035 2619311881 7101000313 7838752886 5875332083 8142061717 7669147303 5982534904 2875546873 1159562863 8823537875 9375195778 1857780532 1712268066 1300192787 6611195909…

It was counting infinity each time. But just a symbol. What was it? Oh yes. Multiply by π. What the—

<Red|Who wrote that?|Red>

@majormer

For what it is worth, I am using an RTX 4090 with 24GB of VRAM. Would it be possible to modify the value so I can do bigger than 400 tokens, or is it a function of how the voice stuff is generated?

@majormer

And I did just end up editing the ebook to modify that line to truncate the number, since it didn't add anything to the story... but wanted to provide the context of my issue. I wish there was a resume...
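
The manual workaround described here (shortening the long run of digits in the ebook) could be approximated with a small preprocessing step. The sketch below is only an illustration: `truncate_digit_runs` and `keep_groups` are hypothetical names, not part of ebook2audiobook, and the pattern assumes the digits are written as space-separated groups like the π passage quoted above. It changes the book text, so it would only be appropriate where the dropped digits do not matter.

```python
import re

# A number (optionally with a decimal point) followed by one or more
# space-separated groups of digits, e.g. "3.1415926535 8979323846 2643383279".
_DIGIT_RUN = re.compile(r"\d[\d.]*(?: \d+)+")


def truncate_digit_runs(text: str, keep_groups: int = 2) -> str:
    """Shorten long runs of space-separated digit groups.

    Hypothetical helper: keeps only the first `keep_groups` groups of
    each run and leaves shorter runs untouched.
    """
    def shorten(match: re.Match) -> str:
        groups = match.group(0).split(" ")
        if len(groups) <= keep_groups:
            return match.group(0)
        return " ".join(groups[:keep_groups])

    return _DIGIT_RUN.sub(shorten, text)


sample = "deeds by 3.1415926535 8979323846 2643383279 5028841971… It was"
print(truncate_digit_runs(sample))
```

Any text that legitimately contains several space-separated numbers in a row would also be shortened, so the threshold would need tuning per book.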

@ROBERT-MCDOWELL
Collaborator

It's a TTS limitation. A.I. models today cannot, or are designed not to, accept more than a certain number of tokens, for accuracy. Maybe in 5 years we will be able to feed it the entire book at once... but that's not the case today.
This should be fixed in the next update; I don't know when, as I need to fix other things.
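
Since the limit applies per inference call, the usual way around it is to break an overlong sentence into smaller pieces before synthesis. The following is a minimal sketch under stated assumptions, not the project's actual splitter: `split_by_word_budget` is a hypothetical name, and it uses whitespace-separated words as a rough proxy for tokens, which is not reliable for every language and does not match the XTTS tokenizer exactly.

```python
from typing import List


def split_by_word_budget(text: str, max_words: int) -> List[str]:
    """Greedily pack whitespace-separated words into chunks.

    Hypothetical helper: each chunk holds at most `max_words` words.
    Word count is only a proxy for the model's token count, so
    `max_words` should sit well below the real token limit.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


print(split_by_word_budget("a b c d e", 2))
```

Cutting on a fixed word budget ignores punctuation and prosody, so a real splitter would prefer to break at sentence or clause boundaries first and fall back to this only for runs that have none.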

@majormer

An interesting consideration is to add the ability to optionally scan the file before committing to the conversion. Check tokens per the settings selected and validate that everything is compliant.

@majormer

I think text splitting might be the only factor I can think of for my above suggestion.

@ROBERT-MCDOWELL
Collaborator

It's not as easy as you think to split text when you have 1124 languages to manage....

@Digital-Yeti

Digital-Yeti commented Jan 2, 2025

I also cannot convert my book because of this splitting error.

53;
 Other ways in which machinery affects the production of raw material will be mentioned in Volume 03;
*
*As it turned out, Volume 03 of Capital, when published, 
                               Sentence: contained nothing on this subject, 
Processing 58.34%: : 2769/4745 Sentence: although Chapters 40–44 (on the second form of differential rent) did deal with the related topic of the impact of extra amounts of capital directly invested in land;

54; 
                               Sentence: Export of cotton from India to Great Britain: 34,540,143 lb;
 in 1846; 204,141,168 lb;/4745 
 in 1860; 445,947,600 lb;
 in 1865;
 Export of wool from India to Great Britain: 4,570, 
                               Traceback (most recent call last):
  File "/home/user/app/lib/functions.py", line 583, in convert_sentence_to_audio
    output = params['tts'].inference(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 528, in inference
    text_tokens.shape[-1] < self.args.gpt_max_text_tokens
AssertionError:  ❗ XTTS can only generate text with a maximum of 400 tokens.
Caught DependencyError:  ❗ XTTS can only generate text with a maximum of 400 tokens.

@majormer

majormer commented Jan 2, 2025

It's not as easy as you think to split text when you have 1124 languages to manage....

I think I understand. However, you have a current process for doing that, correct? My thought is that you do a dry-run of the current text split without the actual audio conversion just to see where the text-splits land, then check the tokens for them to make sure they are compliant. I'm not sure that it would require a rewrite of the text-splitting system. Just a thought, as it might save days of time and energy for users... especially those without GPUs or slower GPUs.

I guess this also depends on if you are using GenAI to actually come up with the text splits vs. an algorithm done purely on CPU. I suppose I could look at the code myself and see, but don't know that I have the time to come up with a solution and submit a pull request.
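
A dry run of the kind described above could be sketched roughly as follows. This is only an illustration: `find_oversized`, `count_tokens_whitespace`, and `MAX_TEXT_TOKENS` are hypothetical names, not part of ebook2audiobook or TTS, and the whitespace counter is a stand-in for the real XTTS tokenizer, which will generally report different counts. The only details taken from the traceback are the limit of 400 and the strict `<` comparison.

```python
from typing import Callable, List, Tuple

MAX_TEXT_TOKENS = 400  # limit named in the XTTS assertion message


def count_tokens_whitespace(sentence: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words.

    The real XTTS tokenizer would have to be plugged in here to get
    counts that match what the model actually sees.
    """
    return len(sentence.split())


def find_oversized(
    sentences: List[str],
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
    limit: int = MAX_TEXT_TOKENS,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every sentence that would fail.

    The assertion in the traceback is `tokens < limit`, so a sentence
    with exactly `limit` tokens is also reported.
    """
    return [
        (i, n)
        for i, s in enumerate(sentences)
        if (n := count_tokens(s)) >= limit
    ]


already_split = ["A short sentence.", " ".join(["digit"] * 450)]
for index, tokens in find_oversized(already_split):
    print(f"sentence {index} has {tokens} tokens")
```

Because the check runs on the output of whatever splitter is already in place, it needs no GPU and no changes to the splitting logic; it only reports where the existing splits would exceed the limit.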

@ROBERT-MCDOWELL
Collaborator

ROBERT-MCDOWELL commented Jan 2, 2025

Text splitting should be fixed in the next update.
"just to see where the text-splits land, then check the tokens for them to make sure they are compliant."
Who will make sure? A script? The user? As I said, each language has its own exceptions, punctuation, etc. Cutting all languages the way English is cut would make a total mess of how the A.I. reacts.

@ROBERT-MCDOWELL
Collaborator

@majormer could you please provide your original text so I can run my last test of the fix for this issue?

@majormer

majormer commented Jan 8, 2025

@majormer could you please provide your original text so I can run my last test of the fix for this issue?

It was posted above, but this is the original text that caused the crash:

=================================

It was so deep down it would be bad…to change it. But why? Why—did it look like something had been changed?

Focus. What did you multiply Erin’s achievements by? Oh, it was simple if you looked. Though it was so long and precise; no wonder it was always skipped. You had to round it because it was like counting…multiply Erin’s deeds by…

3.1415926535 8979323846 2643383279 5028841971 6939937510 5820974944 5923078164 0628620899 8628034825 3421170679 8214808651 3282306647 0938446095 5058223172 5359408128 4811174502 8410270193 8521105559 6446229489 5493038196 4428810975 6659334461 2847564823 3786783165 2712019091 4564856692 3460348610 4543266482 1339360726 0249141273 7245870066 0631558817 4881520920 9628292540 9171536436 7892590360 0113305305 4882046652 1384146951 9415116094 3305727036 5759591953 0921861173 8193261179 3105118548 0744623799 6274956735 1885752724 8912279381 8301194912 9833673362 4406566430 8602139494 6395224737 1907021798 6094370277 0539217176 2931767523 8467481846 7669405132 0005681271 4526356082 7785771342 7577896091 7363717872 1468440901 2249534301 4654958537 1050792279 6892589235 4201995611 2129021960 8640344181 5981362977 4771309960 5187072113 4999999837 2978049951 0597317328 1609631859 5024459455 3469083026 4252230825 3344685035 2619311881 7101000313 7838752886 5875332083 8142061717 7669147303 5982534904 2875546873 1159562863 8823537875 9375195778 1857780532 1712268066 1300192787 6611195909…

It was counting infinity each time. But just a symbol. What was it? Oh yes. Multiply by π. What the—

<Red|Who wrote that?|Red>

===================================

@ROBERT-MCDOWELL
Collaborator

Btw, were special mathematical chars like "π" OK or not?

@majormer

majormer commented Jan 8, 2025 via email

@ROBERT-MCDOWELL
Collaborator

I mean, you don't know whether the voice pronounced it well, since it crashed......

@majormer

majormer commented Jan 8, 2025 via email

@ROBERT-MCDOWELL
Collaborator

OK, so do you confirm the voice pronounces special chars like Pi well?

@majormer

majormer commented Jan 8, 2025

Well, I just went back to verify, and there is a gap. I only spot-checked the generated file.

This is the original text:

image

And it actually SAID:

Three fourteen fifteen (fade-out)
GAP
getting mad about the unfairness...

Although in finding this sample, I found other gaps in the speech as well. When I processed the e-book, I unchecked the "Enable text splitting" option, so I'm not sure if that led to gaps or not. I'm now going through the sentences that are still in the temp directory to see if it rendered, but it might take a bit as there are 41038 unlabeled wav files to pick through, and I haven't read the book to know the general location of a given sentence. I'll follow up on that soon, to see if it processed the voice at the sentence level.

@majormer

majormer commented Jan 8, 2025

I guess I'm saying that it didn't crash, but it also didn't bother with even the entire shortened number.

@ROBERT-MCDOWELL
Collaborator

OK, setting aside your issue, which is already solved in the next update:
I'm talking about special mathematical characters like "π". Does the voice pronounce it or not?

@majormer

majormer commented Jan 8, 2025

Yeah. That is what I'm trying to figure out. In the chapter audio file, the entire sentence is skipped. I'm looking for the actual sentence files which I still have in my temp directory and trying to figure out if it was pronounced there, or if that was just skipped entirely in the processing. Unfortunately, I am trying to figure out which of thousands of sentence files it might be in, and I am working on that now.

@ROBERT-MCDOWELL
Collaborator

Create a simple text with a sentence with "π" in it, and tell me whether "π" is said or not... I need tests from others, not only my own.

@majormer

majormer commented Jan 8, 2025

Well, I did find the .wav file, or rather not:

image

There should be a wav file with "π" in it, but it is missing. All the text between the equal signs (markdown didn't like the bolding I tried to do) was skipped between those two files:

3.1415=========926535…

It was counting infinity each time. But just a symbol. What was it? Oh yes. Multiply by π. What the—

<Red|Who wrote that?|Red>

 

——

 

If you were—========etting mad about the unfairness of the world, of iniquitous sides,

I'll work on getting a file created with pi in it and follow up. Gimme a few.

@majormer

majormer commented Jan 8, 2025

It does not say the pi character. I used the sentence "Wideacre Hall faces due π and the π shines all day on the yellow stone until it is warm and powdery to the touch"

It skipped the first π entirely, and the second one almost sounds like it was pronounced like you would say the long U sound.

I would attach the .wav file here, but it won't let me.

@majormer

majormer commented Jan 8, 2025

3.zip

@majormer

majormer commented Jan 8, 2025

There. I attached the generated .wav file with the sentence above.

@majormer

majormer commented Jan 8, 2025

I have not pulled a newer version, however, so if I need to do that to validate testing, or run it against a different branch, please let me know. I will keep it handy.

@ROBERT-MCDOWELL
Collaborator

OK, so that confirms that mathematical signs are not spoken.... We must find a solution for this issue, but for 1100+ languages it's not possible. The main languages will have a fix.
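
One possible shape for a per-language fix is to replace symbols with their spoken form before the text reaches the TTS model. The sketch below is an assumption, not the project's planned implementation: `SPOKEN_SYMBOLS_EN` and `verbalize_symbols` are hypothetical names, the table covers English only, and the three entries are just examples.

```python
import re

# Hypothetical English-only table; other languages would need their own.
SPOKEN_SYMBOLS_EN = {
    "π": "pi",
    "±": "plus or minus",
    "≤": "less than or equal to",
}


def verbalize_symbols(text: str, table: dict = SPOKEN_SYMBOLS_EN) -> str:
    """Replace each known symbol with its spoken word(s).

    Words are padded with spaces so "2π" becomes "2 pi"; whitespace is
    then collapsed and any space left before punctuation is removed.
    """
    for symbol, spoken in table.items():
        text = text.replace(symbol, f" {spoken} ")
    text = " ".join(text.split())
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


print(verbalize_symbols("Oh yes. Multiply by π. What the"))
```

The punctuation cleanup assumes English spacing rules; languages that put a space before `?` or `!` would need that step adjusted or dropped.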

@ROBERT-MCDOWELL removed the duplicate label Jan 10, 2025