Accept pipe to stream #2694

Open: wants to merge 11 commits into base: master

@tamo (Contributor) commented Jan 2, 2025

Now it is easy to test with raw PCM data.
Try `cat pcmf32.raw | stream`
(or `pv -qL 64000 pcmf32.raw | stream` to feed it in real time;
64000 bytes/s is 16000 samples/s at 4 bytes per float).
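
For readers who want to consume the same kind of pipe from a script, here is a minimal Python sketch of reading raw f32le samples from a binary stream. It is not the code in this PR (which is C++); the helper name `read_f32_chunks` and the chunk size are assumptions made for illustration. The point it shows is that a read from a pipe may end in the middle of a 4-byte float, so leftover bytes have to be carried over to the next read.

```python
import io
import struct

BYTES_PER_SAMPLE = 4  # 32-bit little-endian float (f32le)

def read_f32_chunks(stream, chunk_bytes=4096):
    """Yield lists of float samples from a binary stream of raw f32le PCM.

    Hypothetical helper: bytes that do not yet form a whole float are
    kept in `pending` and completed by the next read.
    """
    pending = b""
    while True:
        data = stream.read(chunk_bytes)
        if not data:
            break  # EOF; an incomplete trailing float (if any) is dropped
        pending += data
        usable = len(pending) - len(pending) % BYTES_PER_SAMPLE
        if usable:
            count = usable // BYTES_PER_SAMPLE
            yield list(struct.unpack(f"<{count}f", pending[:usable]))
            pending = pending[usable:]

# Demo: 3 floats plus one stray byte, read 5 bytes at a time.
demo = io.BytesIO(struct.pack("<3f", 0.0, 0.5, -0.5) + b"\x00")
samples = [s for chunk in read_f32_chunks(demo, chunk_bytes=5) for s in chunk]
print(samples)  # [0.0, 0.5, -0.5]
```

To read from a real pipe, pass `sys.stdin.buffer` instead of the `io.BytesIO` object.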

Note: I haven't tested the WIN32 `#ifdef` branches.

You can create such data with
`ffmpeg -i jfk.wav -f f32le -acodec pcm_f32le jfk.raw`
because the WAV header length (44) is a multiple of `sizeof(float)` (4).
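
If ffmpeg is not at hand, roughly the same conversion can be sketched in Python with the standard `wave` module. This is an illustration only, under loud assumptions: the input is already 16-bit PCM at the sample rate and channel count that `stream` expects (no resampling or downmixing is done), and the function name `wav_s16_to_f32le` is made up here. The two constants just restate the header arithmetic quoted above.

```python
import array
import struct
import wave

WAV_HEADER_BYTES = 44  # header length quoted above; 44 == 11 * 4
FLOAT_BYTES = 4        # sizeof(float)

def wav_s16_to_f32le(wav_file):
    """Return raw f32le bytes for a 16-bit PCM WAV (path or file object).

    Hypothetical helper: samples are scaled from the int16 range to
    [-1.0, 1.0) and packed as little-endian 32-bit floats, header removed.
    """
    with wave.open(wav_file, "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = w.readframes(w.getnframes())
    pcm16 = array.array("h", frames)  # signed 16-bit samples
    return struct.pack(f"<{len(pcm16)}f", *(s / 32768.0 for s in pcm16))
```

Writing the returned bytes to a file, or to `sys.stdout.buffer`, gives data in the same layout as the `jfk.raw` produced by the ffmpeg command.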

I decided to ignore the data received before `[Start speaking]`
because such premature data is undesirable
for remote-transcription systems like:

mic2pcm | ssh -C remote "stream | lines2googledocs"

or

mic2some | ssh -C remote "ffmpeg -loglevel fatal -i pipe:0 -tune zerolatency -af atempo=1.1 -f f32le -ar 16000 -acodec pcm_f32le pipe:1 | stream"

So for a strict test, run
`stream --test-pipe --no-vt100 2>/dev/null < pcmf32.raw`
to get nearly reproducible results.
For even stricter testing, add `--no-timestamps` as well.

cat jfk.raw | ./build/bin/stream -m models/ggml-large-v2.bin --step 2000 --test-pipe --no-vt100 2>/dev/null
( And so my fellow Americans...)
( And so my fellow Americans, ask...)
( And so my fellow Americans, ask not what your country will give you, but what your country will give you.)
[00:00:00.000 --> 00:00:30.000]   And so my fellow Americans, ask not what your country can do for you.

( Ask what you can do for your)
[00:00:02.360 --> 00:00:32.360]   Ask what you can do for your country.

VAD:

cat jfk.raw | ./build/bin/stream -m models/ggml-large-v2.bin --step -2000 --test-pipe --no-vt100 2>/dev/null

[00:00:00.000 --> 00:00:03.000]   And so, my fellow Americans.

[00:00:00.000 --> 00:00:07.920]   Ask not what your country can do for you, ask what you can do for your country.

tamo added 10 commits January 2, 2025 00:18
Without this change, `stream --save-audio` produces a somewhat choppy WAV:
`stream` calculates `t_diff` in milliseconds
and combines audio pieces that are each about `step_ms` long.

The rounding error is small (`WHISPER_SAMPLE_RATE / 1000` is only 16 samples per millisecond),

but surprisingly, human ears seem to be able to hear the gap
as noise.
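
To get a feel for the size of that gap, here is a back-of-the-envelope sketch in Python. It is not taken from the patch: it assumes that each piece's length is derived from an elapsed time truncated to whole milliseconds, and the function names and piece lengths are hypothetical.

```python
WHISPER_SAMPLE_RATE = 16000
SAMPLES_PER_MS = WHISPER_SAMPLE_RATE // 1000  # 16

def samples_from_ms(elapsed_samples):
    """Samples kept if the elapsed time is first truncated to whole ms."""
    t_diff_ms = elapsed_samples * 1000 // WHISPER_SAMPLE_RATE
    return t_diff_ms * SAMPLES_PER_MS

def dropped_samples(piece_lengths):
    """Total samples lost across pieces under the truncation above."""
    return sum(n - samples_from_ms(n) for n in piece_lengths)

# Hypothetical piece lengths of roughly 2000 ms each.
pieces = [32015, 31999, 32008]
print(dropped_samples(pieces))  # 38
```

Under this assumption at most 15 samples (just under 1 ms) go missing per piece, which matches the observation that the gap is tiny yet audible.
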
Use one deque instead of two vectors (old and new).
`old` and `new` are now length variables.

Basically: get `step - new` samples each time,
then set `new` to (roughly) `step`.
The new audio data is simply appended to the deque.
(The deque size is limited to 30 seconds.)
`old + new` samples are passed to whisper inference.

If the data has been consumed, set `old = 0; new = 0;`
If some of the data should be kept for the next iteration, set `old = keep;`
If you want to get only N samples next time, set `new = step - N;`
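
The bookkeeping above can be modeled in a few lines of Python. This is a sketch of one reading of the description, not the C++ in the patch: the class name `StreamBuffer`, its method names, and the choice to reset `new` to 0 when data is kept are all assumptions.

```python
from collections import deque

SAMPLE_RATE = 16000
MAX_SAMPLES = 30 * SAMPLE_RATE  # the deque holds at most 30 seconds

class StreamBuffer:
    """Hypothetical model of the single-deque scheme described above."""

    def __init__(self, step):
        self.step = step  # target number of samples per iteration
        self.buf = deque(maxlen=MAX_SAMPLES)
        self.old = 0      # samples carried over from earlier iterations
        self.new = 0      # samples already counted toward this step

    def wanted(self):
        """How many fresh samples to fetch now: `step - new`."""
        return max(self.step - self.new, 0)

    def append(self, samples):
        """Append fresh audio; afterwards `new` is around `step`."""
        self.buf.extend(samples)  # the deque drops the oldest beyond 30 s
        self.new += len(samples)

    def window(self):
        """The `old + new` most recent samples, to pass to inference."""
        n = min(self.old + self.new, len(self.buf))
        return list(self.buf)[len(self.buf) - n:]

    def consumed(self):
        """All data was used: `old = 0; new = 0`."""
        self.old = 0
        self.new = 0

    def keep(self, keep):
        """Carry `keep` samples into the next iteration as `old`."""
        self.old = keep
        self.new = 0  # assumption: the next iteration fetches a full step
```

Because nothing is copied between an "old" and a "new" vector, keeping audio for the next iteration is just a matter of adjusting two integers.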

In VAD mode, `stream --interim --step -3000` will:
Get 3000 ms of audio.
Run `vad_simple(step_ms)`.
If nothing is detected, get 100 ms more audio and retry.
If nothing is detected and 3000 ms have passed,
go into interim mode,
where `n_segments - 1` segments are confirmed
(`old -= confirmed_t1`).
If `n_segments == 1`, show only the first half of the result.
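
The retry loop in the steps above can be sketched as follows. This is only an approximation in Python: `vad_simple_sketch` is a simplified energy comparison loosely modeled on whisper.cpp's `vad_simple` helper (no high-pass filter), and `collect_until_pause`, `get_audio`, the 1000 ms trailing window, and the 0.6 threshold are assumptions for illustration. The segment confirmation step is not modeled.

```python
def vad_simple_sketch(pcm, sample_rate, last_ms, vad_thold=0.6):
    """True if the last `last_ms` of audio is quiet relative to the whole."""
    n = len(pcm)
    n_last = sample_rate * last_ms // 1000
    if n_last >= n:
        return False  # not enough audio to compare yet
    energy_all = sum(abs(s) for s in pcm) / n
    energy_last = sum(abs(s) for s in pcm[n - n_last:]) / n_last
    return energy_last <= vad_thold * energy_all

def collect_until_pause(get_audio, step_ms=3000, retry_ms=100, sample_rate=16000):
    """Fetch audio until a pause is detected or `step_ms` more has passed.

    `get_audio(n)` is a hypothetical callable returning `n` samples.
    Returns ("final", pcm) on a detected pause, else ("interim", pcm).
    """
    pcm = list(get_audio(step_ms * sample_rate // 1000))
    waited_ms = 0
    while not vad_simple_sketch(pcm, sample_rate, last_ms=1000):
        if waited_ms >= step_ms:
            return "interim", pcm  # no pause found: fall back to interim mode
        pcm.extend(get_audio(retry_ms * sample_rate // 1000))
        waited_ms += retry_ms
    return "final", pcm
```

With a source that never goes quiet, the loop gives up after 3000 ms of retries and reports `"interim"`, which is the point where the description says partial results are shown.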

Misc:
Increase the default `max_tokens` because 32 is too small for 10 seconds of audio.
(Some Japanese speech was garbled.)
Write the WAV file as soon as the data is available.

`no_timestamps` is now the default even in VAD mode,
because that output is more useful to show to hard-of-hearing users.
Now it is easy to test with raw PCM data.
Try `cat pcmf32.raw | stream`
(or `pv -qL 64000 pcmf32.raw | stream` to feed it in real time;
64000 bytes/s is 16000 samples/s at 4 bytes per float).

Note: I haven't tested the WIN32 `#ifdef` branches.

You can create such data with
`ffmpeg -i jfk.wav -f f32le -acodec pcm_f32le jfk.raw`
because the WAV header length (44) is a multiple of `sizeof(float)` (4).

I decided to ignore the data received before `[Start speaking]`
because such premature data is undesirable
for remote-transcription systems like:

```
mic2pcm | ssh -C remote "stream | lines2googledocs"
```

or

```
mic2some | ssh -C remote "ffmpeg -loglevel fatal -i pipe:0 -tune zerolatency -af atempo=1.1 -f f32le -ar 16000 -acodec pcm_f32le pipe:1 | stream"
```

So for a strict test, remove the "ignore" part.
Otherwise, a considerable number of bytes will be ignored.

`windows.h` defines a `min` macro unless `NOMINMAX` is defined.

Run `stream --test-pipe --no-vt100 2>/dev/null < pcmf32.raw`
to get nearly reproducible results.
For stricter testing, add `--no-timestamps` as well.

```
cat jfk.raw | ./build/bin/stream -m models/ggml-large-v2.bin --step 2000 --test-pipe --no-vt100 2>/dev/null
( And so my fellow Americans...)
( And so my fellow Americans, ask...)
( And so my fellow Americans, ask not what your country will give you, but what your country will give you.)
[00:00:00.000 --> 00:00:30.000]   And so my fellow Americans, ask not what your country can do for you.

( Ask what you can do for your)
[00:00:02.360 --> 00:00:32.360]   Ask what you can do for your country.
```

VAD:

```
cat jfk.raw | ./build/bin/stream -m models/ggml-large-v2.bin --step -2000 --test-pipe --no-vt100 2>/dev/null

[00:00:00.000 --> 00:00:03.000]   And so, my fellow Americans.

[00:00:00.000 --> 00:00:07.920]   Ask not what your country can do for you, ask what you can do for your country.

```