
At high transcription speed, high input-fidelity statements are not transcribed #2696

Open
resolutecake opened this issue Jan 2, 2025 · 5 comments



resolutecake commented Jan 2, 2025

Q1. When whisper.cpp transcribes a 5-hour audio file that is mostly noise, it enters a fast mode, transcribing at 90× real time.
BUG: in this mode, conversations of 1–3 minutes are missed and leave no trace in the output

>>> How can I avoid whisper missing short conversations?

For example, should the context be periodically destroyed and recreated?

Q2. When the audio is 30 h or longer, the file size exceeds 4 GiB, which the 32-bit size fields of the .wav format cannot represent, producing empty files.

>>> How can I transcode large audio files or infinite streams?

ffmpeg has an -rf64 option for the RF64 format: https://en.wikipedia.org/wiki/RF64
Is there a better input format than wav?
I would prefer feeding raw samples, float or a specific PCM format
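The 4 GiB ceiling comes from the .wav header itself: the RIFF size fields are unsigned 32-bit integers. A back-of-the-envelope check in Go (assuming mono 16 kHz input, as fed to whisper) shows how many hours fit before the counter overflows:

```go
package main

import "fmt"

func main() {
	// A classic .wav header stores sizes in unsigned 32-bit fields,
	// so a file can describe at most ~4 GiB of audio data.
	const maxRIFF = uint64(1)<<32 - 1
	const sampleRate = 16000 // Hz, mono

	for _, bytesPerSample := range []uint64{2, 4} { // pcm_s16le, pcm_f32le
		bytesPerHour := uint64(3600) * sampleRate * bytesPerSample
		fmt.Printf("%d-byte samples: .wav limit ≈ %.1f hours\n",
			bytesPerSample, float64(maxRIFF)/float64(bytesPerHour))
	}
	// ≈37.3 h for pcm_s16le, ≈18.6 h for pcm_f32le: long recordings
	// need RF64 or headerless raw samples.
}
```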

Q3. >>> Is there some other way of improving the transcription word yield, given the commands below?

The hardware is a 2021 Apple M-series machine with 8 GiB RAM running macOS, with 2 whisper instances in parallel.
A custom Go binding is being considered.
This is batch execution, so slow transcription is not a problem.

Creating the audio stream:

ffmpeg -hide_banner -i i.mp4 -nostdin -vn -ac 1 -ar 16000 -f wav -

whisper command:

./whisper.cpp/build/bin/main --model whisper.cpp/models/ggml-small.en.bin
--file - --output-lrc --output-file o.lrc

resolutecake commented Jan 4, 2025

Does anybody agree with the below?

.

Action 1: use Go package
github.com/ggerganov/whisper.cpp/bindings/go/pkg/whisper

sample Go main function:
https://github.com/ggerganov/whisper.cpp/blob/ece3ff88f66cdc867d334ff4b8e8c00d9ffcebf7/bindings/go/examples/go-whisper/main.go

The Go bindings provide access to the method:
SetTokenThreshold
https://github.com/ggerganov/whisper.cpp/blob/ece3ff88f66c/bindings/go/params.go#L94

  • it is unclear whether Whisper’s voice activity detection can be controlled at all
  • possibly, by lowering this value, more speech is emitted and short conversations are no longer missed
  • it is possible this won’t work; whisper proper has a plethora of questionable workarounds for Whisper’s voice activity detection deficiency, with no supported solution
  • older 2023 whisper produced repeats on encountering no-voice segments
  • newer 2024 whisper outputs “[BLANK_AUDIO]”
  • empirically, mostly-noise input audio has never worked with whisper

Go allows using float PCM directly:

  • feed ffmpeg the audio options -ac 1 -ar 16000 -c:a pcm_f32le (PCM 32-bit floating point, little-endian)
  • that may solve the 32-bit-header .wav problem
  • possibly ffmpeg can then output samples before it has read all of the audio, using raw output to stdout: -f f32le -
  • raw files have no header, so what they contain, how long they are, and what sample rate was used cannot be detected from the file itself
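Assuming the Go-bindings route is taken, the raw pcm_f32le bytes that ffmpeg writes to stdout decode into the []float32 slice the bindings consume using only the standard library. A minimal sketch (decodeF32LE is my own name, not part of the bindings):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeF32LE converts raw pcm_f32le bytes (as produced by
// `ffmpeg … -f f32le -`) into float32 samples.
// A trailing partial sample, if any, is dropped.
func decodeF32LE(raw []byte) []float32 {
	n := len(raw) / 4
	samples := make([]float32, n)
	for i := 0; i < n; i++ {
		bits := binary.LittleEndian.Uint32(raw[4*i:])
		samples[i] = math.Float32frombits(bits)
	}
	return samples
}

func main() {
	// Two samples, 0.5 and -1.0, encoded little-endian.
	raw := []byte{0x00, 0x00, 0x00, 0x3f, 0x00, 0x00, 0x80, 0xbf}
	fmt.Println(decodeF32LE(raw)) // [0.5 -1]
}
```

In a real pipeline the raw bytes would be read from ffmpeg's stdout in chunks (sized to a multiple of 4) rather than held in one slice.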

.

Action 2: move to the turbo model of October 1, 2024
openai/whisper#2363


resolutecake commented Jan 5, 2025

just noticed there is an option implemented for whisper.cpp/build/bin/main:
--no-speech-thold
— default value: 0.6 — data type: 32-bit float

— if the audio quality is better or the model is better, the model decides with higher probability that a segment is not speech
— such a high probability leads to no transcription being output
— if the no-speech probability is low, nonsense text may be output
— if it’s silence, voice activity detection will output “[BLANK_AUDIO]”
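The decision the flag controls, as described above, can be modeled like this (my own sketch of the logic, not whisper.cpp source; keepSegment is a hypothetical name):

```go
package main

import "fmt"

// keepSegment models the --no-speech-thold decision as described:
// a segment is dropped when the model's no-speech probability
// exceeds the threshold (default 0.6).
func keepSegment(noSpeechProb, threshold float32) bool {
	return noSpeechProb <= threshold
}

func main() {
	const defaultThold = 0.6
	fmt.Println(keepSegment(0.9, defaultThold)) // false: dropped as non-speech
	fmt.Println(keepSegment(0.3, defaultThold)) // true: transcribed (possibly nonsense)
	fmt.Println(keepSegment(0.9, 0.95))         // true: raising the threshold keeps it
}
```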

the option was added on 241223, hash 153757a
my checkout was from 241214, therefore lacking it

will be trying this first against my audio…


resolutecake commented Jan 5, 2025

--no-speech-thold Doesn’t work

IT’S GENERALLY BROKEN:

.

as of whisper.cpp 250104 hash ece3ff8:

.

Whisper upstream:
Apparently whisper upstream has broken handling of
non-speech silence or noise lasting 60–100 seconds
causing
the transcription to leave out subsequent conversations of 1–5 minutes
instead
outputting [BLANK_AUDIO] via VAD
and
voice activity detection cannot be disabled

  • never worked well

.

chunk-length tampering is discouraged:
reducing the chunk length from 30 s to, say, 5 s
apparently leads to
more duplicated lines or
a worse word error rate

.

No serious solution:
people are using various fantastical detection preprocessors producing

  • timestamped segments that are fed into whisper
  • because whisper only works reliably if there is actual speech in the audio,
    as opposed to silence or noise

.

identifying noise as non-speech is difficult
and if whisper doesn’t feel like doing it, it’s even more difficult

.

sample complaint:
https://community.openai.com/t/whisper-leaves-out-chunks-of-speech-in-longer-transcript/715999


resolutecake commented Jan 5, 2025

I tested different implementations of large-v3-turbo

on Apple: Mac mini (M1, Late 2020), 8 GB RAM, 256 GB, Gigabit Ethernet, macOS 15.2
with the jfk sample, then
with a 4 h audio file, mostly noise

  • regular large-v3-turbo: 7.133 s
  • quantized large-v3-turbo: 3.202 s (55% faster)
  • Core ML large-v3-turbo: 2.148 s (70% faster)
  • Core ML on an M1 Max: 1.849 s

noise handling is no better

where transcription works, it is
significantly improved and
statements are grouped much better

the bad result is obtained much faster: 4 h is transcribed in 54 seconds, at 263×
with small.en that was 90×, so large-v3-turbo is 3 times faster, with a 36% better word error rate
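Those throughput figures can be sanity-checked with simple arithmetic:

```go
package main

import "fmt"

func main() {
	audioSeconds := 4.0 * 3600 // 4 h of audio
	wallSeconds := 54.0        // measured transcription time

	speedup := audioSeconds / wallSeconds
	fmt.Printf("large-v3-turbo: %.0fx real time\n", speedup) // ≈267x, close to the reported 263x
	fmt.Printf("vs small.en at 90x: %.1fx faster\n", speedup/90)
}
```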


resolutecake commented Jan 5, 2025

How to get to Core ML large-v3-turbo on macOS 15.2 250104 hash ece3ff8

— as brew
brew install anaconda
— as root
xcode-select --switch /Applications/Xcode.app/Contents/Developer
xcodebuild -license
— as user
export PATH=$(brew --prefix)/anaconda3/bin:$PATH
conda create -n py310-whisper python=3.10 -y
conda init zsh
— For changes to take effect, close and re-open your current shell.
zsh
export PATH=$(brew --prefix)/anaconda3/bin:$PATH
conda activate py310-whisper
pip install ane_transformers
pip install "numpy<2"
pip install torch==2.1.0
pip install openai-whisper
pip install coremltools
cd …whisper.cpp
./models/download-ggml-model.sh large-v3-turbo
./models/generate-coreml-model.sh large-v3-turbo
git pull
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

— you’re in Core ML! The land of 263×
./build/bin/whisper-cli --model models/ggml-large-v3-turbo.bin --file samples/jfk.wav

whisper_init_state: loading Core ML model from '/opt/oth/whisper.cpp/models/ggml-large-v3-turbo-encoder.mlmodelc'
whisper_init_state: Core ML model loaded
system_info: … COREML = 1
