fix bug when preparing quant files, starcoder model does not support flash attn #1672
base: main
Conversation
@regisss, please help review and merge.
Commit: fix bug when preparing quant files, starcoder model does not support flash attn (Signed-off-by: kaixuanliu <[email protected]>)
"baichuan", | ||
"gpt_bigcode", | ||
]: | ||
if self.model.config.model_type not in ["falcon", "gpt_bigcode"]: |
@kaixuanliu, thanks for your PR.
gpt_bigcode also supports flash_attention_fast_softmax: https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py#L806
Should that option also be covered in this script? (cc: @mgonchar)
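For reference, a rough, simplified sketch of how such a kwarg typically ends up selecting the softmax variant used by the fused attention kernel. This is not the actual code from the linked modeling file; the function name and the mode strings are assumptions.

```python
# Simplified, hypothetical sketch: a flash_attention_fast_softmax kwarg
# mapping to a softmax mode for a fused SDPA kernel. Names and mode strings
# are illustrative, not the exact optimum-habana implementation.
def pick_softmax_mode(use_flash_attention: bool, flash_attention_fast_softmax: bool) -> str:
    if not use_flash_attention:
        return "default"  # regular attention path, the kwarg has no effect
    return "fast" if flash_attention_fast_softmax else "none"


print(pick_softmax_mode(True, True))   # fast
print(pick_softmax_mode(True, False))  # none
```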
Hi, I have added this option.
This looks like a temporary workaround (WA) to me. The correct approach would be to figure out the original code logic around the flag self.attention_softmax_in_fp32 (see here) and align its behavior with attn_softmax_bf16.
As you rightly noticed, those two controls do similar things but don't know about each other, which would be better to fix.
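To illustrate the point, aligning the two controls could mean resolving them in one place. This is a hypothetical sketch only; the helper function and the dtype strings are assumptions, not code from the repository.

```python
from typing import Optional

# Hypothetical sketch: resolve the upstream config flag
# `attention_softmax_in_fp32` and the Gaudi-specific kwarg `attn_softmax_bf16`
# in a single place, so the two controls cannot silently disagree.
def resolve_softmax_dtype(attention_softmax_in_fp32: bool,
                          attn_softmax_bf16: Optional[bool]) -> str:
    # An explicitly passed kwarg wins over the config default.
    if attn_softmax_bf16 is not None:
        return "bf16" if attn_softmax_bf16 else "fp32"
    # Otherwise fall back to the original config flag.
    return "fp32" if attention_softmax_in_fp32 else "bf16"


print(resolve_softmax_dtype(attention_softmax_in_fp32=True, attn_softmax_bf16=None))  # fp32
print(resolve_softmax_dtype(attention_softmax_in_fp32=True, attn_softmax_bf16=True))  # bf16
```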
Signed-off-by: kaixuanliu <[email protected]>
@@ -128,6 +137,7 @@ def __init__(self, tokenizer, model, args, options):
        self.model_inputs.update(
            {
                "use_flash_attention": self.options.use_flash_attention,
                "flash_attention_fast_softmax": self.options.flash_attention_fast_softmax,
That's actually a change touching not only gpt_bigcode but all the other models in the list on line 121. Do all of them support this option? Did you try running each of those models with this flag?
In any case, it looks like a change for a separate PR.
Oh, sorry, I have fixed it. Since there is inconsistency across the different modeling code, I have to write this kind of workaround to cover the different cases: some models do not have the flash_attention_fast_softmax option, some use the attn_softmax_bf16 option, while others use attention_softmax_in_fp32.
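As an illustration of that inconsistency, here is a hedged sketch. The model lists and option names come from this thread; the helper itself is hypothetical and not the script's actual code.

```python
from types import SimpleNamespace

# Hypothetical illustration of the workaround described above: different model
# implementations expose different softmax-related options, so the script has
# to branch on model type when building the kwargs passed to the model.
def build_attention_kwargs(model_type: str, options) -> dict:
    kwargs = {"use_flash_attention": options.use_flash_attention}
    if model_type not in ["falcon", "gpt_bigcode"]:
        # Models assumed here to take the bf16-softmax option directly.
        kwargs["attn_softmax_bf16"] = options.attn_softmax_bf16
    if model_type in ["falcon", "gpt_bigcode"]:
        # Models assumed here to expose the fast-softmax flash-attention variant.
        kwargs["flash_attention_fast_softmax"] = options.flash_attention_fast_softmax
    return kwargs


options = SimpleNamespace(use_flash_attention=True,
                          flash_attention_fast_softmax=True,
                          attn_softmax_bf16=True)
print(build_attention_kwargs("gpt_bigcode", options))
print(build_attention_kwargs("llama", options))
```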
"baichuan", | ||
"gpt_bigcode", | ||
]: | ||
if self.model.config.model_type not in ["falcon", "gpt_bigcode"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks as temporary WA for me. Correct approach would be to figure out original code logic around flag self.attention_softmax_in_fp32
(see here) and align it's behavior with attn_softmax_bf16
Signed-off-by: kaixuanliu <[email protected]>
@mgonchar, is the review now ready for merge in your opinion?
@vidyasiv looks fine to me
@kaixuanliu, thanks for your patience. Could you paste the commands and results of your testing of this script for the following models:
under the following conditions:
Then I can approve it.
Hi @vidyasiv, the command line for the starcoder model is: After this, it will generate a file
I think you misunderstood me.
@vidyasiv, well, do you mean adding a related CI test in test_text_generation_example.py? It would cost a lot of time, as we need to go through 51506 items in the dataset; even a single model may need more than 1 hour. Can we skip this? This PR is very simple and it only affects the example code.
No need to add a test, just run 4 commands with the script (bf16 is also OK as it runs through the same code), like:
It is simple and should not take long for you to kick off; you can even limit the samples/dataset (you don't have to watch it run). If this is not clear, I can set up time to explain further. @regisss, can you also chime in on whether the regression testing I am asking for makes sense? Otherwise, if you feel alright merging this in, please go ahead.
I tested the following command line:
LGTM, @regisss please review.
Hi @mgonchar, what is the status now? For the
LGTM! Let's just wait for @mgonchar's answer to the last comment.
When generating FP8 quant files for the model bigcode/starcoder, even if we add the --use_flash_attention flag, it cannot be passed to the modeling part. This PR fixes it.
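As a side note, one quick, generic way to check whether a given generation kwarg actually reaches a model's forward(). This helper is hypothetical and not part of this PR; DummyModel is a stand-in for the real model class.

```python
import inspect

# Hypothetical helper: check whether a kwarg such as "use_flash_attention"
# would be accepted by a model's forward(), i.e. whether it can reach the
# modeling part at all. DummyModel stands in for a real model class.
class DummyModel:
    def forward(self, input_ids, use_flash_attention=False, **kwargs):
        return input_ids


def forward_accepts(model, kwarg: str) -> bool:
    params = inspect.signature(model.forward).parameters
    return kwarg in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )


print(forward_accepts(DummyModel(), "use_flash_attention"))  # True
```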