How to achieve live transcription #19
Comments
Definitely would be interesting to see a live transcription demo! |
Thanks for the request! We are working on a live transcription demo that we will add to the repo soon. |
@evmaki I tried to run a benchmark and the results were very disappointing. I used the specifications below:
In the benchmark, only Moonshine was loaded and used to transcribe the audio segments; the rest were preprocessing steps. The result was 48 minutes to transcribe a 1 hour 28 minute video on CPU. I also ran a benchmark using the faster-whisper tiny model on CPU with the specifications below:
It took under 8 minutes to transcribe the same 1 hour 28 minute audio file. I also liked the transcription quality of faster-whisper's tiny model better. Given these results, how are you promising live transcription? |
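For readers comparing these numbers against a live-transcription requirement, a rough sketch of the real-time factor (processing time divided by audio duration) may help. The function name is hypothetical, and the 8 minute figure is only the upper bound quoted above ("under 8 minutes"). A factor below 1 means the audio is processed faster than it plays, which is necessary for live use but does not by itself say anything about per-chunk latency.

```python
def real_time_factor(processing_minutes, audio_minutes):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_minutes / audio_minutes


# Figures quoted in the comment above: a 1 hour 28 minute recording.
audio_minutes = 60 + 28

# Keras Moonshine run reported as 48 minutes.
moonshine_rtf = real_time_factor(48, audio_minutes)

# faster-whisper tiny reported as "under 8 minutes", so 8 is an upper bound.
faster_whisper_rtf = real_time_factor(8, audio_minutes)

print(f"Moonshine (Keras) RTF: {moonshine_rtf:.2f}")
print(f"faster-whisper tiny RTF (upper bound): {faster_whisper_rtf:.2f}")
```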
@sleepingcat4 thanks for benchmarking and sharing your results. The Keras implementation currently has some speed issues, which is what's causing this. We've added ONNX models that run much faster. I encourage people to try those out. Re: live transcriptions. We'll soon be merging a demo (using the ONNX models) that shows live captioning in action. The branch is already public if you want to check it out. Demo script is located here. PS: What CPU are you running on? And are you willing to share your script so we can reproduce your results? |
Thank you for sharing the demo! By the way, for the following two lines: moonshine/moonshine/demo/live_captions.py Line 153 in 5689bdf
moonshine/moonshine/demo/live_captions.py Line 179 in 5689bdf
they result in redundant computation, which is not efficient. Is there a plan to release a streaming model? |
@evmaki below is the function I used to transcribe one 10 s segment at a time.
import os
import re
import time

import moonshine


def transcribe_folder(audio_folder, output_file, model='moonshine/tiny'):
    initial_time = time.time()
    transcription = ""
    # Sort the segment files numerically by the index in "part<N>.wav".
    audio_files = sorted(
        [f for f in os.listdir(audio_folder) if f.endswith('.wav')],
        key=lambda x: int(re.search(r'part(\d+)\.wav', x).group(1))
    )
    for i, file_name in enumerate(audio_files):
        file_path = os.path.join(audio_folder, file_name)
        transcript = moonshine.transcribe(file_path, model)
        transcript = " ".join(transcript)
        # Each segment is assumed to be exactly 10 seconds long.
        seg_st_time = i * 10
        seg_en_time = seg_st_time + 10
        start_h = seg_st_time // 3600
        start_m = (seg_st_time % 3600) // 60
        start_s = seg_st_time % 60
        end_h = seg_en_time // 3600
        end_m = (seg_en_time % 3600) // 60
        end_s = seg_en_time % 60
        transcription += f"{start_h:02}:{start_m:02}:{start_s:02}-{end_h:02}:{end_m:02}:{end_s:02}s: {transcript}\n"
    with open(output_file, 'w') as f:
        f.write(transcription)
    end_time = time.time()
    execution_time = (end_time - initial_time) / 60
    print(f"Execution Time: {execution_time:.2f} minutes")
    print(f"Transcription saved to: {output_file}")


transcribe_folder("Yj7ZDcHGtK", output_file="bench_script.txt") |
I suggest that you test it again with sherpa-onnx, which now supports Moonshine models. The following colab notebook walks through how to do that step by step. |
Thanks so much @csukuangfj for providing the notebook and the comparison with Whisper. Just to summarize, Moonshine tiny can transcribe 335.2 seconds of audio in 19.6 seconds, whereas whisper tiny.en needs 64.1 seconds. That's a 3.3x speedup, and the transcription quality looks identical. @sleepingcat4 I hope this helps alleviate some of your disappointment. The provided Keras implementation is indeed non-optimal; we released it as a reference, with future enhancements in mind. Most deployments we had in mind (such as on SBCs) would never have Torch, TF, JAX, or other such large frameworks installed. ONNX, TFLite, or some other homegrown runtime is best suited for seeing the benefits of Moonshine. |
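As a quick check of the arithmetic in the summary above, the speedup and the throughput relative to real time can be recomputed from the three quoted timings. The variable names are illustrative only; the numbers are the ones stated in the comment.

```python
# Timings quoted in the comment above (seconds).
audio_seconds = 335.2
moonshine_seconds = 19.6
whisper_seconds = 64.1

# Speedup of Moonshine tiny over whisper tiny.en on the same audio.
speedup = whisper_seconds / moonshine_seconds

# How many seconds of audio each model processes per second of compute.
moonshine_x_realtime = audio_seconds / moonshine_seconds
whisper_x_realtime = audio_seconds / whisper_seconds

print(f"Speedup: {speedup:.1f}x")
print(f"Moonshine tiny: {moonshine_x_realtime:.1f}x real time")
print(f"whisper tiny.en: {whisper_x_realtime:.1f}x real time")
```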
@csukuangfj thanks for the notebook. I was writing an audio transcription pipeline, so I needed benchmarks on long audio before getting started. The results in your notebook look interesting (my benchmark avoided the ONNX models), but I still don't think Moonshine is a good fit for building my dataset or pipeline. @keveman Moonshine's transcription quality is on par with the OpenAI Whisper model, but it is not quite at the same level as the faster-whisper tiny int8 model. Faster-whisper can capture more nuanced information. If Moonshine can match the speed of tiny Whisper on the faster-whisper library, I would definitely integrate it into my pipeline. In the meantime, I will look into the ONNX models and a few other models as well, because at LAION AI we are planning to maintain an open-source repo to share our pipeline and model benchmarks with everyone. Meanwhile, thanks very much for the responses and help. |
@sleepingcat4 faster-whisper and OpenAI whisper are the same underlying models, so I am not sure what you mean by getting better quality with faster-whisper. In any case, we do have a PR out for running moonshine with CTranslate2, which is what faster-whisper is based on. |
@keveman Yes, the models are the same, but faster-whisper allows int8 even for tiny Whisper, and I think CTranslate2 does the rest of the speed-up. I meant that Whisper tiny does a better transcription job than Moonshine. I went through both transcriptions from my benchmark, and whisper-tiny int8 captured the last one or two lines that Moonshine missed. For me those lines are important. I am also going to run a test with sensevoice; maybe that can beat Whisper. But right now Moonshine is third on my list. |
@sleepingcat4 thank you for checking the branch code! I'm the author of the live captions script. In its current form the script provides a console user interface that emulates streaming-style live captions with very frequent console refreshes, so comparing it with faster-whisper's speech chunking method requires script changes. To my knowledge both use the silero-vad iterator class. I'll come back to this issue thread with a summary of suggestions after checking.
|
@sleepingcat4 One way to directly compare faster-whisper and Moonshine models on audio file examples is to adapt faster-whisper code for use with Moonshine models, e.g.: adapt code from For our open-source Here are suggestions how to adapt
I hope this information helps and thank you for your comments. |
@guynich what do you mean by chunking in this context? Even if I consider 10 s segment transcriptions for both Moonshine and faster-whisper, the faster-whisper tiny int8 model does a better job than Moonshine tiny. Also, in my experiments Moonshine does worse if I feed the original audio file rather than the percussive component. Even when the harmonic component (background noise) is removed, Moonshine's transcriptions are slightly worse than Whisper tiny int8 (beam_size=5). So while I am interested in the Moonshine model, it doesn't make sense to use it for production or a dataset generation pipeline. (I haven't used ONNX yet.) |
By chunking I mean speech segmentation as done in faster-whisper transcribe() method here. Thanks for your comments. |
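To illustrate what speech segmentation means here, the following is a minimal sketch that splits audio into speech spans using a plain frame-energy threshold. This is a deliberately simpler stand-in, not the silero-vad based approach mentioned in the thread and not faster-whisper's actual code; the function name, the threshold, and the frame sizes are all assumptions chosen for illustration.

```python
def segment_speech(samples, sample_rate=16000, frame_ms=20,
                   threshold=0.01, min_silence_frames=10):
    """Return (start_sec, end_sec) spans whose frame RMS energy exceeds threshold.

    A span is closed once min_silence_frames consecutive quiet frames are seen.
    Hypothetical helper for illustration; real VADs use a trained model.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None   # index of the first frame in the current speech span
    silence = 0    # count of consecutive quiet frames inside a span

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = (sum(s * s for s in frame) / frame_len) ** 0.5
        if rms >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                end = i - silence + 1  # first quiet frame after the speech
                segments.append((start * frame_len / sample_rate,
                                 end * frame_len / sample_rate))
                start = None
                silence = 0

    # Close a span that is still open when the audio ends.
    if start is not None:
        end = n_frames - silence
        segments.append((start * frame_len / sample_rate,
                         end * frame_len / sample_rate))
    return segments


# One second of silence, one second of constant "speech", one second of silence.
samples = [0.0] * 16000 + [0.5] * 16000 + [0.0] * 16000
segments = segment_speech(samples)
print(segments)
```

Each returned span could then be passed to a transcription model separately, which is the sense in which both tools "chunk" audio before transcribing it.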
I've merged the |
Thanks for this @guynich! Do I need to do anything to adjust my system sample rate (48kHz) for this to work? I'm getting: (env_moonshine) $ python3 moonshine/moonshine/demo/live_captions.py
Loading Moonshine model 'moonshine/base' (using ONNX runtime) ...
Expression 'paInvalidSampleRate' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 2048
Expression 'PaAlsaStreamComponent_InitialConfigure( &self->capture, inParams, self->primeBuffers, hwParamsCapture, &realSr )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 2718
Expression 'PaAlsaStream_Configure( stream, inputParameters, outputParameters, sampleRate, framesPerBuffer, &inputLatency, &outputLatency, &hostBufferSizeMode )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 2842
Traceback (most recent call last):
File "/path/to/moonshine/moonshine/moonshine/demo/live_captions.py", line 126, in <module>
stream = InputStream(
^^^^^^^^^^^^
File "/path/to/moonshine/env_moonshine/lib/python3.12/site-packages/sounddevice.py", line 1440, in __init__
_StreamBase.__init__(self, kind='input', wrap_callback='array',
File "/path/to/moonshine/env_moonshine/lib/python3.12/site-packages/sounddevice.py", line 909, in __init__
_check(_lib.Pa_OpenStream(self._ptr, iparameters, oparameters,
File "/path/to/moonshine/env_moonshine/lib/python3.12/site-packages/sounddevice.py", line 2796, in _check
raise PortAudioError(errormsg, err)
sounddevice.PortAudioError: Error opening InputStream: Invalid sample rate [PaErrorCode -9997] Thanks! |
@curiositry
a) Change moonshine/moonshine/demo/live_captions.py Line 127 in 0aaaaa0
b) Change moonshine/moonshine/demo/live_captions.py Line 129 in 0aaaaa0
c) Add a line to downsample the chunk samples from 48000 to 16000 using numpy index slicing moonshine/moonshine/demo/live_captions.py Line 150 in 0aaaaa0 |
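Step (c) above can be sketched as follows, assuming the captured chunk is a one-dimensional NumPy array recorded at 48000 Hz. The function and constant names are hypothetical and are not taken from live_captions.py. Note that plain index slicing keeps every third sample without any low-pass filtering, so it may introduce some aliasing; it is shown here because it is the approach suggested in the thread.

```python
import numpy as np

CAPTURE_RATE = 48000  # system sample rate reported in the thread
MODEL_RATE = 16000    # target rate named in the suggestion above


def downsample_by_slicing(chunk, capture_rate=CAPTURE_RATE, model_rate=MODEL_RATE):
    """Keep every Nth sample, where N = capture_rate / model_rate.

    Only valid when the capture rate is an integer multiple of the model rate.
    """
    if capture_rate % model_rate != 0:
        raise ValueError("capture_rate must be an integer multiple of model_rate")
    step = capture_rate // model_rate
    return chunk[::step]


# 10 ms of fake audio at 48 kHz: 480 samples become 160 samples at 16 kHz.
chunk = np.arange(480, dtype=np.float32)
downsampled = downsample_by_slicing(chunk)
print(chunk.shape, "->", downsampled.shape)
```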
@guynich I am so grateful for the speedy and helpful reply! That worked like a charm. I have the latest version of With |
With modifications, is it possible to use the encoder in a transducer fashion? |
@curiositry moonshine/moonshine/demo/live_captions.py Line 70 in 0aaaaa0
You could try removing the flatten() in the callback and add a new line chunk = chunk.flatten() before your downsampling code line. If that doesn’t work I’m thinking your hardware may not be suitable for sounddevice package stream methods. Hope it works for you.
|
@curiositry moonshine/moonshine/demo/live_captions.py Line 126 in 0aaaaa0
|
@guynich I tried all those approaches and more on my development machine (Ryzen 7 processor, plenty of RAM, Nvidia GPU, latest version of Linux Mint, PipeWire audio), and still got input overflow errors (interspersed with failures to find the audio device). Whisper.cpp stream works fine, so my audio is working in general for similar tasks, just not with Python sounddevice. However, the live_captions.py script works great on the RasPi 5, no audio config or resampling needed. And that's my target device, so getting it running on my laptop isn't crucial. Thanks for all your help! I'm sure I'll have more dumb questions soon :) |
@curiosity |
Closing this issue as the feature request has been implemented and the discussion seems to have concluded. Please open a discussion thread if you'd like to discuss this further! |
The title of the paper https://arxiv.org/pdf/2410.15608 is
However, the model is a non-streaming model. Could you describe how to achieve live transcription?
The demo in this repo is for decoding files;
it would be nice if you could provide a demo for live transcription.