transcribe output number #1190

Open
ZhikangNiu opened this issue Dec 6, 2024 · 1 comment
Comments

@ZhikangNiu

Thanks for this meaningful repo. I have a question.
My transcription results are listed below. How can I make the transcription spell out numbers as English words instead of outputting digits? The digits give a higher WER.

{"truth": "ninety five lines and no more thats it", "hypo": " 95 lines and no more thats it", "wer": 0.25}

{"truth": "my grandmother has type one diabetes", "hypo": " my grandmother has type 1 diabetes", "wer": 0.16666666666666666}

{"truth": "ford is approximately two hundred years old as supported by the books", "hypo": " ford is approximately 200 years old as supported by the books", "wer": 0.16666666666666666}
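To see why the digits cost so much, here is a minimal sketch of a word-level WER calculation. The issue does not say which scoring tool was used, so this assumes the usual definition (word-level edit distance divided by the number of reference words); the function name `wer` is only illustrative. Under that assumption it reproduces the values reported above: "95" counts as one substitution for "ninety" plus one deletion for "five", so 2 errors over 8 reference words gives 0.25.

```python
def wer(truth: str, hypo: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = truth.split()
    hyp = hypo.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("ninety five lines and no more thats it",
              " 95 lines and no more thats it"))
    print(wer("my grandmother has type one diabetes",
              " my grandmother has type 1 diabetes"))
```

Every number written as digits costs at least one substitution, and multi-word numbers such as "two hundred" add a deletion on top.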

@nonnoxer
Contributor

nonnoxer commented Dec 27, 2024

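from faster_whisper.tokenizer import Tokenizer

def get_suppress_tokens() -> list[int]:
    """Get the list of all tokens made up only of numeric characters.

    Store this list in the `suppress_tokens` field in whisper parameters.

    Returns:
        list[int]
            List of all tokens made up only of numeric characters, preceded by -1.
    """
    # `model` is a faster_whisper.WhisperModel defined in the enclosing scope.
    tokenizer = Tokenizer(
        tokenizer=model.hf_tokenizer,
        task="transcribe",
        language="en",
        multilingual=True,
    )
    number_tokens = [
        i
        for i in range(tokenizer.eot)
        # Require non-empty text: all() over an empty string is True.
        if (text := tokenizer.decode([i]).strip())
        and all(c in "0123456789" for c in text)
    ]
    # -1 keeps faster-whisper's default set of suppressed symbols.
    suppress_tokens = [-1] + number_tokens
    return suppress_tokens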

Here, model is an instance of faster_whisper.WhisperModel.

Pass the returned list as the suppress_tokens argument of model.transcribe.
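As a rough end-to-end sketch of that last step, the snippet below builds the digit-token list and passes it to `transcribe`. The helper names `is_digit_only` and `transcribe_without_digits`, the `"small"` model size, and the audio path are placeholders for illustration, not part of the comment above. The sketch assumes a multilingual model (as in the snippet above) and adds a guard so that tokens decoding to an empty string are not matched.

```python
def is_digit_only(text: str) -> bool:
    """True when the decoded token is non-empty and made only of ASCII digits."""
    stripped = text.strip()
    return bool(stripped) and all(c in "0123456789" for c in stripped)


def transcribe_without_digits(audio_path: str, model_size: str = "small") -> str:
    """Transcribe `audio_path` while suppressing digit-only tokens (sketch)."""
    # Imported lazily so is_digit_only can be used without faster-whisper installed.
    from faster_whisper import WhisperModel
    from faster_whisper.tokenizer import Tokenizer

    model = WhisperModel(model_size)
    tokenizer = Tokenizer(
        tokenizer=model.hf_tokenizer,
        multilingual=True,  # assumption: the loaded model is multilingual
        task="transcribe",
        language="en",
    )
    number_tokens = [
        i for i in range(tokenizer.eot) if is_digit_only(tokenizer.decode([i]))
    ]
    segments, _info = model.transcribe(
        audio_path,
        language="en",
        suppress_tokens=[-1] + number_tokens,  # -1 keeps the default suppressed set
    )
    return "".join(segment.text for segment in segments)


if __name__ == "__main__":
    # "audio.wav" is a hypothetical file name.
    print(transcribe_without_digits("audio.wav"))
```

Because only digit-only tokens are suppressed, the decoder has to fall back to spelled-out words for numbers; how well this matches the reference transcripts still depends on the model.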
