Add support for Useful Sensors Moonshine model. #1808

Open
wants to merge 11 commits into
base: master

njeffrie

For context on the Moonshine model, please see the Useful Sensors Moonshine repo.

Adds the following:

  • C++ Moonshine model
  • pybind bindings for the Python Moonshine model
  • Moonshine model spec
  • support for multi-dimensional layernorm on CPU
  • support for broadcasting layernorm weights for multi-dimensional layernorm on CPU
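
To make the last two items concrete, here is a rough NumPy sketch of what "multi-dimensional layernorm with broadcast weights" means numerically. It is a reference illustration only, not the CTranslate2 CPU kernel; the function name `layer_norm_multi_axis`, the `num_norm_axes` parameter, and the `(channels, 1)` weight shape are assumptions chosen for the example.

```python
import numpy as np

def layer_norm_multi_axis(x, gamma, beta, num_norm_axes=1, eps=1e-5):
    """Normalize x over its last `num_norm_axes` axes, then scale and shift.

    Hypothetical helper for illustration; not part of CTranslate2.
    """
    axes = tuple(range(x.ndim - num_norm_axes, x.ndim))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # gamma and beta may be smaller than the normalized region, for example
    # one value per channel; NumPy broadcasting expands them across the rest.
    return x_hat * gamma + beta

# Example: batch of 2, 4 channels, 6 time steps, normalized jointly over
# (channels, time) with one scale and one shift per channel.
x = np.random.default_rng(0).normal(size=(2, 4, 6)).astype(np.float32)
gamma = np.full((4, 1), 2.0, dtype=np.float32)
beta = np.zeros((4, 1), dtype=np.float32)
y = layer_norm_multi_axis(x, gamma, beta, num_norm_axes=2)
```

With `num_norm_axes=1` and weights of shape `(6,)` this reduces to ordinary layernorm over the last axis.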

For now the moonshine converter (safetensor -> ctranslate2 binary) will live in the moonshine repo. Planning to add a transformers converter once Moonshine is part of the transformers library.

@BBC-Esq

BBC-Esq commented Oct 26, 2024

I checked out your repo but didn't see anywhere to actually download the Moonshine models. How is CTranslate2 supposed to evaluate whether to incorporate this pull request if the models can't be tested?

@njeffrie
Author

njeffrie commented Oct 26, 2024

Thanks for taking a look - I've uploaded CTranslate2 models for Moonshine base and tiny to the UsefulSensors Hugging Face hub. In case it's helpful for testing, the following is a minimal Python script to transcribe a wav file with CTranslate2 Moonshine base (assuming the model was downloaded to ./ctranslate2/base):

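from ctranslate2.models import Moonshine
from ctranslate2 import StorageView
import torchaudio
import tokenizers

# Load the tokenizer and the converted model from the downloaded directory.
tokenizer = tokenizers.Tokenizer.from_file("ctranslate2/base/tokenizer.json")
model = Moonshine('ctranslate2/base', device='cpu')

# torchaudio.load returns a (channels, samples) tensor and the sample rate.
audio, sr = torchaudio.load('foo.wav')
# The script resamples anything that is not already 16 kHz.
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio_sv = StorageView.from_array(audio.numpy())

# [[1]] is the decoder prompt: one list of token ids per batch item.
result = model.generate(audio_sv, [[1]], beam_size=5)[0]
# Take the first returned hypothesis and decode it to text.
tokens = result.sequences_ids[0]
text = tokenizer.decode(tokens).strip()
print(text)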
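
One caveat with the script above: `torchaudio.load` returns a `(channels, samples)` tensor, so a stereo file would reach the model as two rows. The sketch below downmixes to a single row first. It assumes the model treats the first axis as the batch dimension, which is inferred from the script and not confirmed here; `to_mono_batch` is a hypothetical helper, not part of CTranslate2 or torchaudio.

```python
import numpy as np

def to_mono_batch(waveform):
    """Return a C-contiguous float32 array of shape (1, samples).

    Accepts a 1-D (samples,) array or a 2-D (channels, samples) array.
    Hypothetical helper for illustration only.
    """
    wav = np.asarray(waveform, dtype=np.float32)
    if wav.ndim == 1:
        wav = wav[np.newaxis, :]
    elif wav.ndim == 2:
        # Average the channels so that stereo input becomes one mono row.
        wav = wav.mean(axis=0, keepdims=True)
    else:
        raise ValueError(f"expected 1-D or 2-D audio, got {wav.ndim}-D")
    return np.ascontiguousarray(wav)
```

In the script this would replace the direct conversion, for example `StorageView.from_array(to_mono_batch(audio.numpy()))`.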

@BBC-Esq

BBC-Esq commented Oct 26, 2024

Thanks for the info, but unfortunately I'm not knowledgeable enough to know how to use .h5 files, but I think this was what I was asking about that you did link to...

[image]

Also, unfortunately, I have no decision-making power regarding CTranslate2 either, so... But I would recommend that if you can't get a response from the CTranslate2 people relatively quickly, you reach out to a guy named @MahmoudAshraf97 because, although he's not officially with "Systran," he's also interested in all things CTranslate2/TTS, is pretty good about responding, and has a good rapport with them.

As I said, I'm just one annoying fan of this technology, so... Good luck!

@njeffrie
Author

Just uploaded CT2 models for moonshine tiny and base: https://huggingface.co/UsefulSensors/moonshine/tree/main/ctranslate2

@njeffrie
Author

@minhthuc2502 perhaps you could take a look or assign somebody to review? Landing this in CTranslate2 is currently blocking us from releasing a faster-whisper style model as part of usefulsensors/moonshine.

Thanks!

@minhthuc2502
Collaborator

minhthuc2502 commented Nov 25, 2024

Could you add CUDA support by implementing it in layer_norm_gpu.cu? Additionally, I noticed there isn't a converter to transform the original model into CTranslate2's format, apart from the added spec.

And try to fix the pipeline please.

Thank you.

@njeffrie
Author

Thanks for taking a look. I'll add CUDA support for the layernorm changes, add our safetensors -> CTranslate2 converter, and look into what's going on with the presubmit pipeline.

Additionally, I've added a fix to support batching to address this issue.

Adds the following:
- c++ moonshine model
- pybind for python moonshine model
- moonshine model spec
- safetensor moonshine model converter
- support for GroupNorm-style weights for LayerNorm
- support for multi-axis cuda layernorm
- Add a define to prevent quantizing the first conv layers in the Moonshine preprocessor
- Add options to enable rotary positional embeddings in the Transformer Encoder spec.

Fixes bug when batch size > 1.

Converts safetensor model def + tokenizer_config.json to ctranslate2 model spec for Moonshine.

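For readers unfamiliar with the rotary positional embeddings mentioned in the commit list above, the following NumPy sketch shows the general technique. It uses the "split halves" pairing convention and a base of 10000, both of which are assumptions for illustration; the convention and parameters actually used by the Moonshine encoder spec are not stated in this thread. `rotary_embed` is a hypothetical name.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    Illustrative sketch only: pairs element i with element i + dim/2 and
    rotates each pair by an angle proportional to the token position.
    """
    seq_len, dim = x.shape
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # One rotation frequency per pair, decreasing geometrically.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2-D rotation applied to each (x1, x2) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
```

Because each pair is only rotated, vector norms are unchanged, and the dot product between a rotated query at position m and a rotated key at position n depends only on the offset n - m.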
@BBC-Esq

BBC-Esq commented Dec 4, 2024

Thanks for taking a look. I'll add CUDA support for the layernorm changes, add our safetensors -> CTranslate2 converter, and look into what's going on with the presubmit pipeline.

Additionally, I've added a fix to support batching to address this issue.

Can you please post when it's ready to review because I'm actually kind of curious to test out these models. I won't do it until all the multiple changes are near final or what not. Thanks!

@njeffrie
Author

njeffrie commented Dec 5, 2024

Should be ready to go @minhthuc2502, @BBC-Esq.

@broke-end-dev

@guillaumekln @minhthuc2502 could you make some time for this? really wanna try this out on my realtime transcriber project.

4 participants