
At high transcription speed, high input-fidelity statements are not transcribed #2696

Open
resolutecake opened this issue Jan 2, 2025 · 5 comments



resolutecake commented Jan 2, 2025

Q1. When whisper.cpp transcribes a 5-hour audio file that is mostly noise, it enters a fast mode, transcribing at 90× real time.
BUG: in this mode, conversations of 1–3 minutes are missed and leave no trace in the output

>>> How can I avoid whisper missing short conversations?

For example, should the context be periodically destroyed and recreated?

Q2. When the audio is 30 h or longer, the file size exceeds 4 GiB, which the 32-bit size fields of the .wav format cannot represent, producing empty files.

>>> How can I transcode large audio files or infinite streams?

ffmpeg has an -rf64 option for the RF64 format: https://en.wikipedia.org/wiki/RF64
Is there a better input format than wav?
I would prefer feeding raw samples, float or a specific PCM format
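The 4 GiB ceiling comes from the .wav header itself: the RIFF size fields are unsigned 32-bit integers. A back-of-the-envelope check in Go (assuming mono 16 kHz input, as fed to whisper) shows how many hours fit before the counter overflows:

```go
package main

import "fmt"

func main() {
	// A classic .wav header stores sizes in unsigned 32-bit fields,
	// so a file can describe at most ~4 GiB of audio data.
	const maxRIFF = uint64(1)<<32 - 1
	const sampleRate = 16000 // Hz, mono

	for _, bytesPerSample := range []uint64{2, 4} { // pcm_s16le, pcm_f32le
		bytesPerHour := uint64(3600) * sampleRate * bytesPerSample
		fmt.Printf("%d-byte samples: .wav limit ≈ %.1f hours\n",
			bytesPerSample, float64(maxRIFF)/float64(bytesPerHour))
	}
	// ≈37.3 h for pcm_s16le, ≈18.6 h for pcm_f32le: long recordings
	// need RF64 or headerless raw samples.
}
```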

Q3. >>> Is there some other way of improving the transcription word yield, given the commands below?

The hardware is a 2021 Apple M-series machine with 8 GiB RAM running macOS, with 2 whisper instances in parallel.
A custom Go binding is being considered.
This is batch execution, so slow transcription is not a problem.

Creating the audio stream:

ffmpeg -hide_banner -i i.mp4 -nostdin -vn -ac 1 -ar 16000 -f wav -

whisper command:

./whisper.cpp/build/bin/main --model whisper.cpp/models/ggml-small.en.bin
--file - --output-lrc --output-file o.lrc

resolutecake commented Jan 4, 2025

Does anybody agree with the below?

.

Action 1: use Go package
github.com/ggerganov/whisper.cpp/bindings/go/pkg/whisper

sample Go main function:
https://github.com/ggerganov/whisper.cpp/blob/ece3ff88f66cdc867d334ff4b8e8c00d9ffcebf7/bindings/go/examples/go-whisper/main.go

The Go bindings provide access to the method:
SetTokenThreshold
https://github.com/ggerganov/whisper.cpp/blob/ece3ff88f66c/bindings/go/params.go#L94

  • it is unclear whether Whisper’s voice activity detection can be controlled at all
  • possibly, by lowering this value, more speech is emitted and short conversations are no longer missed
  • it is possible this won’t work; whisper proper has a plethora of questionable workarounds for Whisper’s voice activity detection deficiency, with no supported solution
  • older 2023 whisper produced repeats on encountering no-voice segments
  • newer 2024 whisper outputs “[BLANK_AUDIO]”
  • empirically, mostly-noise input audio has never worked with whisper

Go allows using float PCM directly:

  • feed ffmpeg the audio options -ac 1 -ar 16000 -c:a pcm_f32le (PCM 32-bit floating point, little-endian)
  • that may solve the 32-bit-header .wav problem
  • possibly ffmpeg can then output samples before it has read all of the audio, using raw output to stdout: -f f32le -
  • raw files have no header, so what they contain, how long they are, and what sample rate was used cannot be detected from the file itself
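Assuming the Go-bindings route is taken, the raw pcm_f32le bytes that ffmpeg writes to stdout decode into the []float32 slice the bindings consume using only the standard library. A minimal sketch (decodeF32LE is my own name, not part of the bindings):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeF32LE converts raw pcm_f32le bytes (as produced by
// `ffmpeg … -f f32le -`) into float32 samples.
// A trailing partial sample, if any, is dropped.
func decodeF32LE(raw []byte) []float32 {
	n := len(raw) / 4
	samples := make([]float32, n)
	for i := 0; i < n; i++ {
		bits := binary.LittleEndian.Uint32(raw[4*i:])
		samples[i] = math.Float32frombits(bits)
	}
	return samples
}

func main() {
	// Two samples, 0.5 and -1.0, encoded little-endian.
	raw := []byte{0x00, 0x00, 0x00, 0x3f, 0x00, 0x00, 0x80, 0xbf}
	fmt.Println(decodeF32LE(raw)) // [0.5 -1]
}
```

In a real pipeline the raw bytes would be read from ffmpeg's stdout in chunks (sized to a multiple of 4) rather than held in one slice.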

.

Action 2: move to the turbo model of October 1, 2024
openai/whisper#2363


resolutecake commented Jan 5, 2025

just noticed there is an option implemented for whisper.cpp/build/bin/main:
--no-speech-thold
— default value: 0.6 — data type: 32-bit float

— if the audio quality is better or the model is better, the model decides with higher probability that a segment is not speech
— such a high probability leads to no transcription being output
— if the no-speech probability is low, nonsense text may be output
— if it’s silence, voice activity detection will output “[BLANK_AUDIO]”
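The decision the flag controls, as described above, can be modeled like this (my own sketch of the logic, not whisper.cpp source; keepSegment is a hypothetical name):

```go
package main

import "fmt"

// keepSegment models the --no-speech-thold decision as described:
// a segment is dropped when the model's no-speech probability
// exceeds the threshold (default 0.6).
func keepSegment(noSpeechProb, threshold float32) bool {
	return noSpeechProb <= threshold
}

func main() {
	const defaultThold = 0.6
	fmt.Println(keepSegment(0.9, defaultThold)) // false: dropped as non-speech
	fmt.Println(keepSegment(0.3, defaultThold)) // true: transcribed (possibly nonsense)
	fmt.Println(keepSegment(0.9, 0.95))         // true: raising the threshold keeps it
}
```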

the option was added on 241223, hash 153757a
my checkout was from 241214, therefore lacking it

will be trying this first against my audio…


resolutecake commented Jan 5, 2025

--no-speech-thold Doesn’t work

IT’S GENERALLY BROKEN:

.

as of whisper.cpp 250104 hash ece3ff8:

.

Whisper upstream:
Apparently whisper upstream has broken handling of
non-speech silence or noise lasting 60–100 seconds
causing
the transcription to leave out subsequent conversations of 1–5 minutes
instead
outputting [BLANK_AUDIO] via VAD
and
voice activity detection cannot be disabled

  • never worked well

.

chunk-length tampering is discouraged:
reducing the chunk length from 30 s to, say, 5 s
apparently leads to
more duplicated lines or
a worse word error rate

.

No serious solution:
people are using various fantastical detection preprocessors producing

  • timestamped segments that are fed into whisper
  • because whisper only works reliably if there is actual speech in the audio,
    as opposed to silence or noise

.

identifying noise as non-speech is difficult
and if whisper doesn’t feel like doing it, it’s even more difficult

.

sample complaint:
https://community.openai.com/t/whisper-leaves-out-chunks-of-speech-in-longer-transcript/715999


resolutecake commented Jan 5, 2025

I tested different implementations of large-v3-turbo

on Apple: Mac mini (M1, Late 2020), 8 GB RAM, 256 GB, Gigabit Ethernet, macOS 15.2
with the jfk sample, then
with a 4 h audio file, mostly noise

  • regular large-v3-turbo: 7.133 s
  • quantized large-v3-turbo: 3.202 s (55% faster)
  • Core ML large-v3-turbo: 2.148 s (70% faster)
  • Core ML on an M1 Max: 1.849 s

noise handling is no better

where transcription works, it is
significantly improved and
statements are grouped much better

the bad result is obtained much faster: 4 h is transcribed in 54 seconds, at 263×
with small.en that was 90×, so large-v3-turbo is 3 times faster, with a 36% better word error rate
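Those throughput figures can be sanity-checked with simple arithmetic:

```go
package main

import "fmt"

func main() {
	audioSeconds := 4.0 * 3600 // 4 h of audio
	wallSeconds := 54.0        // measured transcription time

	speedup := audioSeconds / wallSeconds
	fmt.Printf("large-v3-turbo: %.0fx real time\n", speedup) // ≈267x, close to the reported 263x
	fmt.Printf("vs small.en at 90x: %.1fx faster\n", speedup/90)
}
```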


resolutecake commented Jan 5, 2025

How to get to Core ML large-v3-turbo on macOS 15.2 250104 hash ece3ff8

— as brew
brew install anaconda
— as root
xcode-select --switch /Applications/Xcode.app/Contents/Developer
xcodebuild -license
— as user
export PATH=$(brew --prefix)/anaconda3/bin:$PATH
conda create -n py310-whisper python=3.10 -y
conda init zsh
— For changes to take effect, close and re-open your current shell.
zsh
export PATH=$(brew --prefix)/anaconda3/bin:$PATH
conda activate py310-whisper
pip install ane_transformers
pip install "numpy<2"
pip install torch==2.1.0
pip install openai-whisper
pip install coremltools
cd …whisper.cpp
./models/download-ggml-model.sh large-v3-turbo
./models/generate-coreml-model.sh large-v3-turbo
git pull
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

— you’re in Core ML! The land of 263×
./build/bin/whisper-cli --model models/ggml-large-v3-turbo.bin --file samples/jfk.wav

whisper_init_state: loading Core ML model from '/opt/oth/whisper.cpp/models/ggml-large-v3-turbo-encoder.mlmodelc'
whisper_init_state: Core ML model loaded
system_info: … COREML = 1
