Responses with unusual content. #2871

Open
1 of 4 tasks
ncthanhcs opened this issue Dec 30, 2024 · 2 comments

Comments

ncthanhcs commented Dec 30, 2024

System Info

text generation inference api

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I'm using the Inference API at https://api-inference.huggingface.co/v1/chat/completions with the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF model.

I send the same message with the "user" role, and the model produces different results: most of the time it gives normal answers, but occasionally it generates responses with strange content.
[screenshot of one of the unusual responses]

I stopped calling the API for a short period. After that, I called it again with the same message as before, and the model returned a normal response.
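
Roughly how I'm calling the endpoint, as a minimal sketch (the token, prompt text, and sampling parameters below are placeholders, not my exact values; only the endpoint and model id are the ones mentioned above):

```python
# Minimal sketch of the failing request. Token, prompt, and sampling values
# are placeholders; only the endpoint and model id come from this report.
import requests

API_URL = "https://api-inference.huggingface.co/v1/chat/completions"
HEADERS = {"Authorization": "Bearer hf_xxx"}  # placeholder HF token

payload = {
    "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    "messages": [
        {"role": "user", "content": "<the same user message every time>"},
    ],
    "max_tokens": 512,   # illustrative value
    "temperature": 0.7,  # illustrative value
}

resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Repeating this call with an identical payload is what occasionally produces the strange output.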

Expected behavior

Is this issue caused by the model? Is there any way to prevent the model from generating such strange responses?

@maiiabocharova

I experienced the same behaviour with the Inference API: when there are many parallel requests, the model starts generating complete rubbish. After a restart it works normally again. For me, 32 parallel requests is the most it can handle before it starts spitting out rubbish. This should not happen, of course.
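
A sketch of how the overload case can be reproduced (endpoint, model, and token are placeholders as in the original post, and the exact concurrency threshold will vary per deployment):

```python
# Fire N identical chat-completion requests concurrently and inspect the
# outputs for garbled text. ~32 concurrent requests is roughly where it broke
# for me, but that number is deployment-specific.
from concurrent.futures import ThreadPoolExecutor
import requests

API_URL = "https://api-inference.huggingface.co/v1/chat/completions"
HEADERS = {"Authorization": "Bearer hf_xxx"}  # placeholder HF token
PAYLOAD = {
    "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    "messages": [{"role": "user", "content": "<identical prompt every time>"}],
}
NUM_PARALLEL = 32  # concurrency level around which rubbish output appeared


def one_request(_):
    resp = requests.post(API_URL, headers=HEADERS, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


with ThreadPoolExecutor(max_workers=NUM_PARALLEL) as pool:
    outputs = list(pool.map(one_request, range(NUM_PARALLEL)))

# Print a prefix of each completion so garbled ones stand out.
for i, text in enumerate(outputs):
    print(f"{i:2d}: {text[:80]!r}")
```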

luonist commented Jan 9, 2025

I experienced the same issue with standard Llama models from Meta as well (3.1 70B Instruct and 3.3 70B Instruct).
These models are hosted in my corporate infrastructure and usually receive 3-4k requests (and 2-3M input tokens) per hour, which doesn't seem like much; in fact, I've never seen more than 5 running requests at any given second for each model.
I'm using TGI 3.0.1 with H100 and H100 NVL GPUs.
