fix bug when preparing quant files: starcoder model does not support flash attention #1672

Open · wants to merge 3 commits into base: main

Conversation

kaixuanliu (Contributor):

When generating FP8 quant files for the bigcode/starcoder model, the --use_flash_attention flag is not passed to the modeling part even when it is specified. This PR fixes that.
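
For context, here is a minimal, hypothetical sketch (not the actual run_lm_eval.py code) of what "passing the flag to the modeling part" means: a CLI option only takes effect if it is copied into the kwargs that are eventually forwarded to the model, which is the step that was missing for starcoder.

```python
import argparse

# Hypothetical sketch: parse the flag and forward it to the model kwargs.
parser = argparse.ArgumentParser()
parser.add_argument("--use_flash_attention", action="store_true")
args = parser.parse_args(["--use_flash_attention"])

model_inputs = {}
# Without an update like this, the model never sees the flag,
# even though it was given on the command line.
model_inputs.update({"use_flash_attention": args.use_flash_attention})
print(model_inputs)  # {'use_flash_attention': True}
```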

@kaixuanliu kaixuanliu requested a review from regisss as a code owner December 27, 2024 08:04
@kaixuanliu (Contributor, Author):

@regisss, please help review and merge.

"baichuan",
"gpt_bigcode",
]:
if self.model.config.model_type not in ["falcon", "gpt_bigcode"]:
Contributor:

@kaixuanliu, thanks for your PR.

gpt_bigcode also supports flash_attention_fast_softmax: https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py#L806
Should that option also be covered in this script? (cc: @mgonchar)

Contributor Author:

Hi, I have added this option.

Contributor:

This looks like a temporary workaround to me. The correct approach would be to figure out the original code logic around the flag self.attention_softmax_in_fp32 (see here) and align its behavior with attn_softmax_bf16.

Contributor Author:

self.attention_softmax_in_fp32 is an option used in stock transformers (L139-L140) to control whether the softmax is first upcast to fp32 and then converted back to the original dtype, while attn_softmax_bf16 is a new option introduced by optimum-habana (L751-L757) that provides similar control.
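
To illustrate the overlap being discussed, here is a minimal sketch (assumed, simplified logic, not the actual transformers or optimum-habana implementations) of the two controls side by side:

```python
import torch
import torch.nn.functional as F

def softmax_stock(attn_weights: torch.Tensor, attention_softmax_in_fp32: bool) -> torch.Tensor:
    # Stock transformers pattern: optionally upcast to fp32 for the softmax,
    # then cast the result back to the original dtype.
    dtype = attn_weights.dtype
    if attention_softmax_in_fp32 and dtype != torch.float32:
        attn_weights = attn_weights.float()
    return F.softmax(attn_weights, dim=-1).to(dtype)

def softmax_habana(attn_weights: torch.Tensor, attn_softmax_bf16: bool) -> torch.Tensor:
    # optimum-habana-style pattern: optionally keep the softmax in bf16
    # and skip the fp32 round-trip entirely.
    if attn_softmax_bf16:
        return F.softmax(attn_weights, dim=-1)
    return F.softmax(attn_weights.float(), dim=-1).to(attn_weights.dtype)
```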

Contributor:

As you rightly noticed, those two controls do similar things but don't know about each other, which would be better to fix.

@@ -128,6 +137,7 @@ def __init__(self, tokenizer, model, args, options):
self.model_inputs.update(
{
"use_flash_attention": self.options.use_flash_attention,
"flash_attention_fast_softmax": self.options.flash_attention_fast_softmax,
Contributor:

That's actually a change touching not only gpt_bigcode but also all the other models in the list on line 121. Do all of them support this option? Did you try running each of those models with this flag?

In any case, it looks like a change for a separate PR.

Contributor Author:

Oh, sorry, I have fixed it. Since there is inconsistency across the different modeling code, I have to write this kind of workaround to cover the different cases: some models do not have a flash_attention_fast_softmax option, some models use attn_softmax_bf16, and some use attention_softmax_in_fp32.
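
A hedged sketch of the kind of per-model-type dispatch described here (the model lists and option names are illustrative, not the exact code in this PR):

```python
def build_extra_model_inputs(model_type: str, options) -> dict:
    # Illustrative only: collect attention-related kwargs per model type,
    # since not every model's forward() accepts the same flags.
    extra = {"use_flash_attention": options.use_flash_attention}

    # Hypothetical list of models whose modeling code accepts the fast-softmax flag.
    if model_type in ("llama", "gpt_bigcode"):
        extra["flash_attention_fast_softmax"] = options.flash_attention_fast_softmax

    # Hypothetical split: some models take attn_softmax_bf16 as a kwarg, while
    # others (e.g. falcon, gpt_bigcode here) handle the softmax dtype via config.
    if model_type not in ("falcon", "gpt_bigcode"):
        extra["attn_softmax_bf16"] = options.attn_softmax_bf16

    return extra
```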

"baichuan",
"gpt_bigcode",
]:
if self.model.config.model_type not in ["falcon", "gpt_bigcode"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks as temporary WA for me. Correct approach would be to figure out original code logic around flag self.attention_softmax_in_fp32 (see here) and align it's behavior with attn_softmax_bf16

Signed-off-by: kaixuanliu <[email protected]>
@vidyasiv (Contributor):

@mgonchar, in your opinion, is the review now ready for merge?

@mgonchar (Contributor):

@vidyasiv, looks fine to me.

@vidyasiv (Contributor):

@kaixuanliu, thanks for your patience. Could you paste the commands and results of your testing of this script for the following models:

  • llama
  • starcoder (gpt_bigcode)

under the following conditions:

  • With flash attention options enabled/set (include the softmax one for starcoder)
  • Without flash attention options

Then I can approve it.

@kaixuanliu (Contributor, Author):

kaixuanliu commented Jan 14, 2025

Hi @vidyasiv, the command line for the starcoder model is:
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path bigcode/starcoder --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --use_flash_attention --flash_attention_recompute --trust_remote_code

After this, it generates a file hqt_output/measure_hooks_maxabs.json in which there is no layer called transformer.h.0.attn.fused_scaled_dot_product_attention, whereas after applying this patch this layer appears in the generated measurement file.
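
A quick way to spot-check this (a hedged sketch; the exact layout of the measurement JSON is assumed, not taken from this PR) is to search the generated file for the fused SDPA entry:

```python
import json

# Assumed path and layer name from the comment above.
PATH = "hqt_output/measure_hooks_maxabs.json"
LAYER = "transformer.h.0.attn.fused_scaled_dot_product_attention"

def contains_key(obj, name: str) -> bool:
    # Recursively search nested dicts/lists for the layer name used as a key.
    if isinstance(obj, dict):
        return name in obj or any(contains_key(v, name) for v in obj.values())
    if isinstance(obj, list):
        return any(contains_key(v, name) for v in obj)
    return False

with open(PATH) as f:
    measurements = json.load(f)

print(f"{LAYER} measured: {contains_key(measurements, LAYER)}")
```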

@vidyasiv (Contributor):

vidyasiv commented Jan 14, 2025

> Hi @vidyasiv, the command line for the starcoder model is: QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path bigcode/starcoder --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --use_flash_attention --flash_attention_recompute --trust_remote_code
>
> After this, it generates a file hqt_output/measure_hooks_maxabs.json in which there is no layer called transformer.h.0.attn.fused_scaled_dot_product_attention, whereas after applying this patch this layer appears in the generated measurement file.

I think you misunderstood me.
Typically we ask for regression test results, but I don't see existing tests for run_lm_eval in the tests/ directory, so I am asking you to manually run the run_lm_eval script you modified with the models (pick one small llama model and one starcoder) and the options affected by this PR, and paste the results.
It seems the same code path affects bf16, so you can run with that.
Please do the regression testing on your end to ensure the PR doesn't break anything, and paste the results so we have a record of it.

@kaixuanliu (Contributor, Author):

@vidyasiv, well, do you mean adding a related CI test in test_text_generation_example.py? That would cost a lot of time, as we need to go through 51506 items in the dataset; even a single model may need more than 1 hour. Can we skip this? This PR is very simple and only affects the example code.

@vidyasiv (Contributor):

vidyasiv commented Jan 15, 2025

> @vidyasiv, well, do you mean adding a related CI test in test_text_generation_example.py? That would cost a lot of time, as we need to go through 51506 items in the dataset; even a single model may need more than 1 hour. Can we skip this? This PR is very simple and only affects the example code.

No need to add a test, just run 4 commands with the script (bf16 is also fine, as it runs through the same code), like python run_lm_eval.py --model_name_or_path bigcode/starcoder <other options>:

  • llama + flash attention
  • llama without flash attention
  • starcoder with flash attention
  • starcoder without flash attention

It is simple and should not take long for you to kick off; you can even limit the samples/dataset (you don't have to watch it run). If this is not clear, I can set up time to explain further.
I think the request for regression testing (not even adding a test, just running commands to test) is very reasonable in software development. I have even constrained it to the cases your code affects instead of running all tests.

@regisss, can you also chime in on whether you think the regression testing I am asking for makes sense? Otherwise, if you feel alright merging this in, please go ahead.


@kaixuanliu (Contributor, Author):

I tested the following command lines:

  1. python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path bigcode/starcoder --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --use_flash_attention --flash_attention_recompute --trust_remote_code
  2. python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path bigcode/starcoder --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --trust_remote_code
  3. python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path meta-llama/Llama-2-7b-chat-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16
  4. python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path meta-llama/Llama-2-7b-chat-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --use_flash_attention --flash_attention_recompute

They all ran successfully, and the output in each case looks like:
Max memory allocated                = 13.87 GB
Total memory available              = 94.62 GB
{
  "results": {
    "hellaswag": {
      "acc": 0.579964150567616,
      "acc_stderr": 0.004925556104679425,
      "acc_norm": 0.7600079665405298,
      "acc_norm_stderr": 0.004262054526577102
    },
    "lambada_openai": {
      "ppl": 3.3759228770476937,
      "ppl_stderr": 0.09107585451331207,
      "acc": 0.6986221618474675,
      "acc_stderr": 0.00639276748297851
    },
    "piqa": {
      "acc": 0.7633297062023939,
      "acc_stderr": 0.009916841655042807,
      "acc_norm": 0.7704026115342764,
      "acc_norm_stderr": 0.00981268295081518
    },
    "winogrande": {
      "acc": 0.6827150749802684,
      "acc_stderr": 0.013080598411332118
    }
  },
  "versions": {
    "hellaswag": 0,
    "lambada_openai": 0,
    "piqa": 0,
    "winogrande": 0
  },
  "args": {
    "buckets": [
      16,
      32,
      64,
      128,
      189,
      284,
      384
    ],
    "output_file": "acc_starcoder_measure.txt",
    "tasks": [
      "hellaswag",
      "lambada_openai",
      "piqa",
      "winogrande"
    ],
    "limit_iters": null,
    "device": "hpu",
    "model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
    "bf16": true,
    "max_new_tokens": 100,
    "max_input_tokens": 0,
    "batch_size": 1,
    "warmup": 3,
    "n_iterations": 5,
    "local_rank": 0,
    "use_kv_cache": true,
    "use_hpu_graphs": true,
    "dataset_name": null,
    "column_name": null,
    "do_sample": false,
    "num_beams": 1,
    "top_k": null,
    "penalty_alpha": null,
    "trim_logits": true,
    "seed": 27,
    "profiling_warmup_steps": 0,
    "profiling_steps": 0,
    "profiling_record_shapes": false,
    "prompt": null,
    "bad_words": null,
    "force_words": null,
    "assistant_model": null,
    "peft_model": null,
    "num_return_sequences": 1,
    "token": null,
    "model_revision": "main",
    "attn_softmax_bf16": false,
    "output_dir": null,
    "bucket_size": -1,
    "bucket_internal": false,
    "dataset_max_samples": -1,
    "limit_hpu_graphs": false,
    "show_graphs_count": false,
    "reuse_cache": false,
    "verbose_workers": false,
    "simulate_dyn_prompt": null,
    "reduce_recompile": false,
    "use_chat_template": false,
    "use_flash_attention": true,
    "flash_attention_recompute": true,
    "flash_attention_causal_mask": false,
    "flash_attention_fast_softmax": true,
    "book_source": false,
    "torch_compile": false,
    "ignore_eos": true,
    "temperature": 1.0,
    "top_p": 1.0,
    "const_serialization_path": null,
    "trust_remote_code": false,
    "parallel_strategy": "none",
    "input_embeds": false,
    "run_partial_dataset": false,
    "sdp_on_bf16": false,
    "load_quantized_model_with_autogptq": false,
    "disk_offload": false,
    "load_quantized_model_with_inc": false,
    "local_quantized_inc_model_path": null,
    "quant_config": "",
    "world_size": 0,
    "global_rank": 0
  },
  "duration": 613.695127427578
}

@vidyasiv (Contributor) left a comment:

LGTM, @regisss please review.

@kaixuanliu (Contributor, Author):

@mgonchar, hi, what is the status now? Regarding the difference between the attn_softmax_bf16 and attention_softmax_in_fp32 parameters that we discussed, I think it would be better for you to submit another PR to solve it thoroughly if you think it is necessary.

@regisss (Collaborator):

regisss commented Jan 20, 2025

LGTM! Let's just wait for @mgonchar's answer to the last comment.
