Performance Regression in Whisper models when timestamp generation is enabled #1783
Comments
FYI, when using "without_timestamps=True" on faster-whisper 1.0.3 I get a faster speed but with a lot of skipped sentences.
I tested this completely independently of faster-whisper to make sure it's purely related to CT2. faster-whisper v1.0.3 uses a batch size of 1, so you shouldn't notice slowdowns related to this option, as they only start to appear at larger batch sizes. The missing sentences are probably caused by something else.
What's the verdict on this? Can you provide a WAV file and outputs which manifest this issue?
This is the code to reproduce the problem with this file.
from faster_whisper import WhisperModel, decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps, collect_chunks, merge_segments
from faster_whisper.transcribe import pad_or_trim, get_ctranslate2_storage
import torch
audio = decode_audio("tests/data/physicsworks.wav")
vad_options = VadOptions(min_silence_duration_ms=160, max_speech_duration_s=30)
vad_chunks = get_speech_timestamps(audio, vad_options=vad_options)
clip_timestamps = merge_segments(vad_chunks, vad_options)
audio_chunks, chunk_metadata = collect_chunks(audio, clip_timestamps)
model = WhisperModel("large-v2")
features = torch.stack(
    [
        pad_or_trim(
            model.feature_extractor(chunk)[
                ...,
                : chunk.shape[0] // model.feature_extractor.hop_length,
            ]
        )
        for chunk in audio_chunks
    ]
)
prompt_text = "<|startoftranscript|><|en|><|transcribe|>"
no_ts_token_text = "<|notimestamps|>"
prompt_tokens = model.hf_tokenizer.encode(prompt_text, add_special_tokens=False).ids
no_ts_token = model.hf_tokenizer.encode(no_ts_token_text, add_special_tokens=False).ids
eot_token = model.hf_tokenizer.encode('<|endoftext|>', add_special_tokens=False).ids[0]
encoder_output = model.encode(features)
generation_results_with_ts = model.model.generate(encoder_output, [prompt_tokens] * len(features))
generation_results_without_ts = model.model.generate(encoder_output, [prompt_tokens + no_ts_token] * len(features))
for result in generation_results_with_ts:
    tokens = [token for token in result.sequences_ids[0] if token < eot_token]  # remove timestamp tokens
    print(model.hf_tokenizer.decode(tokens))
# Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms
# that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you.
# You let it go, you swing it, thereby converting gravitational potential energy into kinetic energy and that way you can demolish a building. You just let it hit and it breaks a building
# I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height
# and it swings, then when it reaches here it could not be higher. There is a conversion from gravitational potential energy to kinetic energy back to gravitational potential energy
# For 100%, I may not trust myself. I'm going to release this object and I hope I will be able to do it at zero speed so that when it comes back, it may touch my chin
# I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero.
for result in generation_results_without_ts:
    print(model.hf_tokenizer.decode(result.sequences_ids[0]))
# Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms and I can lift it up one meter, which I have done now. That means I've done work. Mgh is the work I have done, believe me. I've increased the potential energy of this object. 15 times 10 is about 150 joules.
# that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you. They use these devices called a wrecker ball. They use them to demolish buildings. You lift up a very heavy object, even heavier than this
# You let it go, you swing it, thereby converting gravitational potential energy into kinetic energy and that way you can demolish a building. You just let it hit... and it breaks a building. And that's the whole idea of wrecking. So you're using, then, the conversion of gravitational potential energy to kinetic energy.
# I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height then that bulb can never come back to a point where the height is any larger.
# and it swings, then when it reaches here, it could not be higher. There is a conversion from gravitational potential energy to kinetic energy, back to gravitational potential energy, and it will come to a stop here. And when it swings back, it should not be able to reach any higher, provided that I do not give this object an initial speed when I stand here.
# For 100%, I may not trust myself. I'm going to release this object and I hope I will be able to do it at zero speed, so that when it comes back it may touch my chin, but it may not crush my chin. I want you to be extremely quiet, because this is no joke. If I don't succeed in giving it zero speed, then this will be my last lecture.
# I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero. Physics works, and I'm still alive.
import timeit
time_ts = timeit.timeit(
    "model.model.generate(encoder_output, [prompt_tokens] * len(features))",
    globals=globals(),
    number=10,
)
num_tokens_ts = sum(
    [len(result.sequences_ids[0]) for result in generation_results_with_ts]
)
time_no_ts = timeit.timeit(
    "model.model.generate(encoder_output, [prompt_tokens + no_ts_token] * len(features))",
    globals=globals(),
    number=10,
)
num_tokens_no_ts = sum(
    [len(result.sequences_ids[0]) for result in generation_results_without_ts]
)
print(f"Speed with timestamps: {(num_tokens_ts/time_ts)*10:.2f} tokens/s")
# Speed with timestamps: 196.96 tokens/s
print(f"Speed without timestamps: {(num_tokens_no_ts/time_no_ts)*10:.2f} tokens/s")
# Speed without timestamps: 302.72 tokens/s
VAD is used to segment the audio into 30 s segments. For an accurate reproduction we cannot use Whisper's sequential algorithm, because its chunking relies directly on the generated timestamps; if we disabled them, the encoded segments would be different.
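The tokens/s figures above multiply by 10 because `timeit.timeit` returns the total time over `number=10` runs, while the token count covers a single run. A minimal sketch of the same calculation as a reusable helper follows; `tokens_per_second` is a hypothetical name, not part of faster-whisper or CTranslate2.

```python
import timeit


def tokens_per_second(generate, tokens_per_run, runs=10):
    """Return decoding throughput in tokens/s.

    generate: zero-argument callable that performs one full generation pass
    tokens_per_run: number of tokens produced by a single pass
    runs: how many times the pass is repeated for timing
    """
    # timeit.timeit returns the total wall time for all `runs` calls.
    total_seconds = timeit.timeit(generate, number=runs)
    # Total tokens produced across all runs divided by total time.
    return tokens_per_run * runs / total_seconds


# Hypothetical usage with the names from the snippet above:
# tokens_per_second(
#     lambda: model.model.generate(encoder_output, [prompt_tokens] * len(features)),
#     num_tokens_ts,
# )
```

Passing a callable instead of a string avoids the `globals=globals()` argument, at the cost of a small per-call overhead that should be negligible next to a generation pass.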
Thanks. What do you mean by the sequential algorithm of Whisper?
The sequential long-form transcription algorithm: it uses the last generated timestamp token to shift the window; if no timestamps were generated, it adds 30 s to the current window.
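The window-shifting rule described here can be sketched roughly as below. This is a simplified illustration, not the actual faster-whisper or OpenAI Whisper code: `next_seek` is a hypothetical helper, the 0.02 s timestamp step is assumed, and the real implementations handle extra cases (for example consecutive timestamp pairs) that are left out.

```python
WINDOW_SECONDS = 30.0  # length of one Whisper window
TIME_PRECISION = 0.02  # assumed seconds per timestamp-token step


def next_seek(seek, tokens, timestamp_begin):
    """Return the start time (seconds) of the next window.

    seek: start time of the current window in seconds
    tokens: token ids generated for the current window
    timestamp_begin: id of the first timestamp token; ids at or above it
        are treated as timestamps
    """
    timestamps = [t for t in tokens if t >= timestamp_begin]
    if not timestamps:
        # No timestamps were generated: skip a full 30 s window.
        return seek + WINDOW_SECONDS
    # Otherwise shift the window to the last predicted timestamp.
    last_offset = (timestamps[-1] - timestamp_begin) * TIME_PRECISION
    return seek + last_offset
```

Because the next window start depends on the generated timestamps, runs with and without timestamps see different audio segments, which is why the reproduction above uses VAD chunks instead.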
Hello
Several reports mention that WER improves greatly when adding <|notimestamps|> to the initial prompt in Whisper decoding, i.e. when disabling timestamp generation. I tested this using This and This. You can check mobiusml/faster-whisper#18 (comment) for an example of the decoding difference using the same encoder output. There are several other reports, including but not limited to:
SYSTRAN/faster-whisper#1010
SYSTRAN/faster-whisper#985
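For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and lowercasing only (real evaluations usually apply a text normalizer first; `word_error_rate` is a hypothetical helper, not the tool used in the reports):

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    cur[j - 1] + 1,      # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = cur
    return prev[-1] / len(ref)
```

Under this definition, sentences dropped from a transcript count as deletions, so truncated output raises WER even if every emitted word is correct.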
Also, generation with timestamps has a lower tokens/s throughput, and the slowdown grows as the batch size increases.
On the side, we have several PRs waiting for @trungkienbkhn's review, but he seems to be out of office. It would be great if one of his colleagues had any information on when he might return.