[Bug]: Continuous batching (OpenAI Server) with greedy search return different results #11658
Open
Labels: bug
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
I am using greedy decoding (temperature == 0.0) on the same GPU, yet every time I run inference on the same data, the results differ substantially between runs.
To reproduce, first start the OpenAI-compatible API server.
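(The exact launch command is not reproduced here; a typical invocation, with the model name and flags as placeholders, is `vllm serve <model-name> --port 8000` or the equivalent `python -m vllm.entrypoints.openai.api_server --model <model-name>`.)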
Then send batched requests from a multithreaded client, as in the sketch below.
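A minimal sketch of what I mean by multithreaded batching; the model name, prompts, and concurrency level are placeholders rather than the exact values from my setup:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Points at the locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def run_one(prompt: str) -> str:
    # Greedy decoding: temperature == 0.0, so repeated runs should match.
    resp = client.chat.completions.create(
        model="<model-name>",  # placeholder
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=512,
    )
    return resp.choices[0].message.content

prompts = ["<prompt 1>", "<prompt 2>", "<prompt 3>"]  # placeholder dataset

# Many requests in flight at once, so the server batches them continuously.
with ThreadPoolExecutor(max_workers=16) as pool:
    outputs = list(pool.map(run_one, prompts))

for out in outputs:
    print(out)
```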
The results are consistent across runs if (i) I use a plain for loop (no batching, one request at a time) or (ii) I use offline inference, i.e. model.chat(...) — see the sketch below.
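For reference, a sketch of the offline path (ii), which does give stable outputs; again, the model name and prompt are placeholders:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="<model-name>")  # placeholder; same model served above
params = SamplingParams(temperature=0.0, max_tokens=512)

# Offline inference: repeated runs on the same input give identical outputs.
outputs = llm.chat(
    [{"role": "user", "content": "<prompt 1>"}],  # placeholder conversation
    params,
)
print(outputs[0].outputs[0].text)
```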
I believe there is currently a critical bug in continuous batching, since the offline path (ii) works correctly.