Accept pipe to stream #2694
Open: tamo wants to merge 11 commits into ggerganov:master from tamo:pipe-stream
+338 −128
Without this change, `stream --save-audio` produces a somewhat choppy WAV file: `stream` calculates `t_diff` in milliseconds and combines audio pieces that are each about `step_ms` long. One millisecond is only `WHISPER_SAMPLE_RATE / 1000 == 16` samples, but surprisingly human ears seem to be able to hear such a gap as noise.
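As a rough illustration of that arithmetic (a sketch, not the actual `stream` code): assuming a piece's length is derived from a duration truncated to whole milliseconds, up to 15 samples per piece can go missing at each join. The helper names below are hypothetical.

```python
WHISPER_SAMPLE_RATE = 16000  # samples per second, the rate whisper expects


def samples_for_ms(t_ms: int) -> int:
    """Number of samples covered by a whole number of milliseconds."""
    return t_ms * WHISPER_SAMPLE_RATE // 1000


def dropped_samples(n_captured: int) -> int:
    """Samples lost if the piece length is first truncated to whole ms.

    Assumption: the duration is stored as an integer millisecond count
    and converted back to a sample count afterwards.
    """
    t_ms = n_captured * 1000 // WHISPER_SAMPLE_RATE  # truncation happens here
    return n_captured - samples_for_ms(t_ms)
```

With these definitions, a piece of 32015 captured samples is treated as 2000 ms, so 15 samples are lost at the join; a piece of exactly 32000 samples loses none.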
Use one deque instead of two vectors (old and new). `old` and `new` are now length variables.

Basically:
- Get `step - new` samples every time, then set `new = (around) step;`
- The new audio data is simply appended to the deque. (The deque size is limited to 30 seconds.)
- Pass `old + new` samples to whisper inference.
- If the data has been consumed, set `old = 0; new = 0;`
- If some of the data should be kept for the next iteration, set `old = keep;`
- If you want to get only N samples next time, set `new = step - N;`

In VAD mode, `stream --interim --step -3000` will:
- Get 3000 ms of audio.
- Run `vad_simple(step_ms)`.
- If nothing is detected, get 100 ms more audio and retry.
- If nothing is detected and 3000 ms have passed, go into interim mode, where `n_segments - 1` segments are confirmed (`old -= confirmed_t1`). If `n_segments == 1`, only the first half of the result is shown.

Misc:
- Increase the default `max_tokens`, because 32 is too small for 10 seconds. (Some Japanese speech was garbled.)
- Write the WAV file as soon as the data is available.
- `no_timestamps` is the default even for VAD, because it is more useful to show to the hard of hearing.
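The single-deque bookkeeping described in this commit message can be sketched roughly as follows. This is a minimal Python illustration, not the PR's C++ code: the class and method names are hypothetical, and resetting `new` to 0 when data is kept is an assumption.

```python
from collections import deque

WHISPER_SAMPLE_RATE = 16000
MAX_SAMPLES = 30 * WHISPER_SAMPLE_RATE  # deque is limited to 30 seconds


class StreamBuffer:
    """One deque of samples plus two length counters (old and new)."""

    def __init__(self, n_step: int):
        self.n_step = n_step
        self.n_old = 0  # samples kept from the previous iteration
        self.n_new = 0  # samples received since the last inference
        self.samples = deque(maxlen=MAX_SAMPLES)  # oldest samples fall off

    def n_wanted(self) -> int:
        """How many samples to fetch next: step - new."""
        return self.n_step - self.n_new

    def append(self, chunk) -> None:
        """New audio is simply appended to the deque."""
        self.samples.extend(chunk)
        self.n_new += len(chunk)

    def window(self) -> list:
        """The last old + new samples, i.e. what is passed to inference."""
        n = min(self.n_old + self.n_new, len(self.samples))
        return list(self.samples)[len(self.samples) - n:]

    def consumed(self) -> None:
        """All data was used: old = 0; new = 0."""
        self.n_old = 0
        self.n_new = 0

    def keep(self, n_keep: int) -> None:
        """Carry n_keep samples into the next iteration: old = keep."""
        self.n_old = n_keep
        self.n_new = 0  # assumption: the next iteration fetches a full step
```

Because the counters only describe the tail of the deque, keeping audio for the next iteration needs no copying: `keep(2)` followed by a four-sample `append` yields a six-sample window.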
Now it is easy to test with raw PCM data. Try `cat pcmf32.raw | stream` (or `pv -qL 64000 pcmf32.raw | stream` for real-time playback: 16000 samples per second × 4 bytes per float = 64000 bytes per second).

Note: I haven't tested the WIN32 ifdefs.

You can make such data with `ffmpeg -i jfk.wav -f f32le -acodec pcm_f32le jfk.raw`, because the WAV header length (44) is a multiple of `sizeof float` (4).

I decided to ignore the data before `[Start speaking]` because such premature data are not good for remote-transcription systems like:

```
mic2pcm | ssh -C remote "stream | lines2googledocs"
```

or

```
mic2some | ssh -C remote "ffmpeg -loglevel fatal -i pipe:0 -tune zerolatency -af atempo=1.1 -f f32le -ar 16000 -acodec pcm_f32le pipe:1 | stream"
```

So if you want to do a strict test, remove the "ignore" part. Otherwise quite a number of bytes will be ignored.
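For readers unfamiliar with the `f32le` format used in these pipes, here is a minimal sketch of decoding it from arbitrarily sized reads. It is an illustration only, not the PR's implementation; `decode_f32le` is a hypothetical helper that carries leftover bytes between reads, since a pipe may deliver a chunk that ends in the middle of a sample.

```python
import struct

SAMPLE_SIZE = 4  # sizeof(float): one little-endian 32-bit float per sample


def decode_f32le(data: bytes, carry: bytes = b""):
    """Decode as many whole samples as possible from carry + data.

    Returns (samples, leftover), where leftover holds the 0 to 3 trailing
    bytes that do not yet form a complete sample.
    """
    buf = carry + data
    n = len(buf) // SAMPLE_SIZE
    samples = list(struct.unpack(f"<{n}f", buf[: n * SAMPLE_SIZE]))
    return samples, buf[n * SAMPLE_SIZE:]
```

A caller would feed each chunk read from standard input together with the leftover returned by the previous call.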
`windows.h` defines a `min` macro unless `NOMINMAX` is defined.
Run `stream --test-pipe --no-vt100 2>/dev/null < pcmf32.raw` to get nearly reproducible results. For strict testing, use `--no-timestamps` as well.

```
cat jfk.raw | ./build/bin/stream -m models/ggml-large-v2.bin --step 2000 --test-pipe -no-vt100 2>/dev/null
( And so my fellow Americans...)
( And so my fellow Americans, ask...)
( And so my fellow Americans, ask not what your country will give you, but what your country will give you.)
[00:00:00.000 --> 00:00:30.000] And so my fellow Americans, ask not what your country can do for you.
( Ask what you can do for your)
[00:00:02.360 --> 00:00:32.360] Ask what you can do for your country.
```

VAD:

```
cat jfk.raw | ./build/bin/stream -m models/ggml-large-v2.bin --step -2000 --test-pipe -no-vt100 2>/dev/null
[00:00:00.000 --> 00:00:03.000] And so, my fellow Americans.
[00:00:00.000 --> 00:00:07.920] Ask not what your country can do for you, ask what you can do for your country.
```