fix bug when preparing quant files: starcoder model does not support flash attention #1672

Open · wants to merge 3 commits into base: main

Conversation

kaixuanliu (Contributor):

When generating FP8 quant files for the bigcode/starcoder model, the --use_flash_attention flag is not passed to the modeling part even when it is specified. This PR fixes that.
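
For context, here is a minimal, hypothetical sketch (not the actual run_lm_eval.py code) of what "passing the flag to the modeling part" means: a CLI option only takes effect if it is copied into the kwargs that are eventually forwarded to the model, which is the step that was missing for starcoder.

```python
import argparse

# Hypothetical sketch: parse the flag and forward it to the model kwargs.
parser = argparse.ArgumentParser()
parser.add_argument("--use_flash_attention", action="store_true")
args = parser.parse_args(["--use_flash_attention"])

model_inputs = {}
# Without an update like this, the model never sees the flag,
# even though it was given on the command line.
model_inputs.update({"use_flash_attention": args.use_flash_attention})
print(model_inputs)  # {'use_flash_attention': True}
```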

@kaixuanliu kaixuanliu requested a review from regisss as a code owner December 27, 2024 08:04
@kaixuanliu (Contributor, Author):

@regisss, please help review and merge.

"baichuan",
"gpt_bigcode",
]:
if self.model.config.model_type not in ["falcon", "gpt_bigcode"]:
Contributor:

@kaixuanliu, thanks for your PR.

gpt_bigcode also supports flash_attention_fast_softmax: https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py#L806
Should that option also be covered in this script? (cc: @mgonchar)

Contributor Author:

Hi, I have added this option.

Contributor:

This looks like a temporary workaround to me. The correct approach would be to figure out the original code logic around the flag self.attention_softmax_in_fp32 (see here) and align its behavior with attn_softmax_bf16.

Contributor Author:

self.attention_softmax_in_fp32 is an option used in stock transformers (L139-L140) to control whether the softmax is first upcast to fp32 and then converted back to the original dtype, while attn_softmax_bf16 is a new option introduced by optimum-habana (L751-L757) that provides similar control.
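
To illustrate the overlap being discussed, here is a minimal sketch (assumed, simplified logic, not the actual transformers or optimum-habana implementations) of the two controls side by side:

```python
import torch
import torch.nn.functional as F

def softmax_stock(attn_weights: torch.Tensor, attention_softmax_in_fp32: bool) -> torch.Tensor:
    # Stock transformers pattern: optionally upcast to fp32 for the softmax,
    # then cast the result back to the original dtype.
    dtype = attn_weights.dtype
    if attention_softmax_in_fp32 and dtype != torch.float32:
        attn_weights = attn_weights.float()
    return F.softmax(attn_weights, dim=-1).to(dtype)

def softmax_habana(attn_weights: torch.Tensor, attn_softmax_bf16: bool) -> torch.Tensor:
    # optimum-habana-style pattern: optionally keep the softmax in bf16
    # and skip the fp32 round-trip entirely.
    if attn_softmax_bf16:
        return F.softmax(attn_weights, dim=-1)
    return F.softmax(attn_weights.float(), dim=-1).to(attn_weights.dtype)
```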

Contributor:

As you rightly noticed, those two controls do similar things but don't know about each other, which would be better to fix.

@@ -128,6 +137,7 @@ def __init__(self, tokenizer, model, args, options):
self.model_inputs.update(
{
"use_flash_attention": self.options.use_flash_attention,
"flash_attention_fast_softmax": self.options.flash_attention_fast_softmax,
Contributor:

That's actually a change touching not only gpt_bigcode but also all the other models in the list on line 121. Do all of them support this option? Did you try running each of those models with this flag?

In any case, it looks like a change for a separate PR.

Contributor Author:

Oh, sorry, I have fixed it. Since there is inconsistency across the different modeling code, I have to write this kind of workaround to cover the different cases: some models do not have a flash_attention_fast_softmax option, some models use attn_softmax_bf16, and some use attention_softmax_in_fp32.
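
A hedged sketch of the kind of per-model-type dispatch described here (the model lists and option names are illustrative, not the exact code in this PR):

```python
def build_extra_model_inputs(model_type: str, options) -> dict:
    # Illustrative only: collect attention-related kwargs per model type,
    # since not every model's forward() accepts the same flags.
    extra = {"use_flash_attention": options.use_flash_attention}

    # Hypothetical list of models whose modeling code accepts the fast-softmax flag.
    if model_type in ("llama", "gpt_bigcode"):
        extra["flash_attention_fast_softmax"] = options.flash_attention_fast_softmax

    # Hypothetical split: some models take attn_softmax_bf16 as a kwarg, while
    # others (e.g. falcon, gpt_bigcode here) handle the softmax dtype via config.
    if model_type not in ("falcon", "gpt_bigcode"):
        extra["attn_softmax_bf16"] = options.attn_softmax_bf16

    return extra
```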

"baichuan",
"gpt_bigcode",
]:
if self.model.config.model_type not in ["falcon", "gpt_bigcode"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks as temporary WA for me. Correct approach would be to figure out original code logic around flag self.attention_softmax_in_fp32 (see here) and align it's behavior with attn_softmax_bf16

Signed-off-by: kaixuanliu <[email protected]>
@vidyasiv (Contributor):

@mgonchar, in your opinion, is the review now ready for merge?

@mgonchar (Contributor):

@vidyasiv, looks fine to me.

@vidyasiv (Contributor):

@kaixuanliu, thanks for your patience. Could you paste the commands and results of your testing of this script for the following models:

  • llama
  • starcoder (gpt_bigcode)

under the following conditions:

  • With flash attention options enabled/set (include the softmax one for starcoder)
  • Without flash attention options

Then I can approve it.

@kaixuanliu (Contributor, Author):

kaixuanliu commented Jan 14, 2025

Hi @vidyasiv, the command line for the starcoder model is:
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path bigcode/starcoder --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --use_flash_attention --flash_attention_recompute --trust_remote_code

After this, it generates a file hqt_output/measure_hooks_maxabs.json in which there is no layer called transformer.h.0.attn.fused_scaled_dot_product_attention, whereas after applying this patch this layer appears in the generated measurement file.
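
A quick way to spot-check this (a hedged sketch; the exact layout of the measurement JSON is assumed, not taken from this PR) is to search the generated file for the fused SDPA entry:

```python
import json

# Assumed path and layer name from the comment above.
PATH = "hqt_output/measure_hooks_maxabs.json"
LAYER = "transformer.h.0.attn.fused_scaled_dot_product_attention"

def contains_key(obj, name: str) -> bool:
    # Recursively search nested dicts/lists for the layer name used as a key.
    if isinstance(obj, dict):
        return name in obj or any(contains_key(v, name) for v in obj.values())
    if isinstance(obj, list):
        return any(contains_key(v, name) for v in obj)
    return False

with open(PATH) as f:
    measurements = json.load(f)

print(f"{LAYER} measured: {contains_key(measurements, LAYER)}")
```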

@vidyasiv (Contributor):

vidyasiv commented Jan 14, 2025

> Hi @vidyasiv, the command line for the starcoder model is: QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path bigcode/starcoder --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --use_flash_attention --flash_attention_recompute --trust_remote_code
>
> After this, it generates a file hqt_output/measure_hooks_maxabs.json in which there is no layer called transformer.h.0.attn.fused_scaled_dot_product_attention, whereas after applying this patch this layer appears in the generated measurement file.

I think you misunderstood me.
Typically we ask for regression test results, but I don't see existing tests for run_lm_eval in the tests/ directory, so I am asking you to manually run the run_lm_eval script you modified with the models (pick one small llama model and one starcoder) and the options affected by this PR, and paste the results.
It seems the same code path affects bf16, so you can run with that.
Please do the regression testing on your end to ensure the PR doesn't break anything, and paste the results so we have a record of it.

@kaixuanliu (Contributor, Author):

@vidyasiv, well, do you mean adding a related CI test in test_text_generation_example.py? That would cost a lot of time, as we need to go through 51506 items in the dataset; even a single model may need more than 1 hour. Can we skip this? This PR is very simple and only affects the example code.

@vidyasiv (Contributor):

vidyasiv commented Jan 15, 2025

> @vidyasiv, well, do you mean adding a related CI test in test_text_generation_example.py? That would cost a lot of time, as we need to go through 51506 items in the dataset; even a single model may need more than 1 hour. Can we skip this? This PR is very simple and only affects the example code.

No need to add a test, just run 4 commands with the script (bf16 is also fine, as it runs through the same code), like python run_lm_eval.py --model_name_or_path bigcode/starcoder <other options>:

  • llama + flash attention
  • llama without flash attention
  • starcoder with flash attention
  • starcoder without flash attention

It is simple and should not take long for you to kick off; you can even limit the samples/dataset (you don't have to watch it run). If this is not clear, I can set up time to explain further.
I think the request for regression testing (not even adding a test, just running commands to test) is very reasonable in software development. I have even constrained it to the cases your code affects instead of running all tests.

@regisss, can you also chime in on whether you think the regression testing I am asking for makes sense? Otherwise, if you feel alright merging this in, please go ahead.


@kaixuanliu (Contributor, Author):

I tested the following command lines:

  1. python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path bigcode/starcoder --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --use_flash_attention --flash_attention_recompute --trust_remote_code
  2. python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path bigcode/starcoder --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --trust_remote_code
  3. python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path meta-llama/Llama-2-7b-chat-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16
  4. python run_lm_eval.py -o acc_starcoder_measure.txt --model_name_or_path meta-llama/Llama-2-7b-chat-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --trim_logits --bf16 --use_flash_attention --flash_attention_recompute

They all ran successfully, and the output in each case looks like:
Max memory allocated                = 13.87 GB
Total memory available              = 94.62 GB
{
  "results": {
    "hellaswag": {
      "acc": 0.579964150567616,
      "acc_stderr": 0.004925556104679425,
      "acc_norm": 0.7600079665405298,
      "acc_norm_stderr": 0.004262054526577102
    },
    "lambada_openai": {
      "ppl": 3.3759228770476937,
      "ppl_stderr": 0.09107585451331207,
      "acc": 0.6986221618474675,
      "acc_stderr": 0.00639276748297851
    },
    "piqa": {
      "acc": 0.7633297062023939,
      "acc_stderr": 0.009916841655042807,
      "acc_norm": 0.7704026115342764,
      "acc_norm_stderr": 0.00981268295081518
    },
    "winogrande": {
      "acc": 0.6827150749802684,
      "acc_stderr": 0.013080598411332118
    }
  },
  "versions": {
    "hellaswag": 0,
    "lambada_openai": 0,
    "piqa": 0,
    "winogrande": 0
  },
  "args": {
    "buckets": [
      16,
      32,
      64,
      128,
      189,
      284,
      384
    ],
    "output_file": "acc_starcoder_measure.txt",
    "tasks": [
      "hellaswag",
      "lambada_openai",
      "piqa",
      "winogrande"
    ],
    "limit_iters": null,
    "device": "hpu",
    "model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
    "bf16": true,
    "max_new_tokens": 100,
    "max_input_tokens": 0,
    "batch_size": 1,
    "warmup": 3,
    "n_iterations": 5,
    "local_rank": 0,
    "use_kv_cache": true,
    "use_hpu_graphs": true,
    "dataset_name": null,
    "column_name": null,
    "do_sample": false,
    "num_beams": 1,
    "top_k": null,
    "penalty_alpha": null,
    "trim_logits": true,
    "seed": 27,
    "profiling_warmup_steps": 0,
    "profiling_steps": 0,
    "profiling_record_shapes": false,
    "prompt": null,
    "bad_words": null,
    "force_words": null,
    "assistant_model": null,
    "peft_model": null,
    "num_return_sequences": 1,
    "token": null,
    "model_revision": "main",
    "attn_softmax_bf16": false,
    "output_dir": null,
    "bucket_size": -1,
    "bucket_internal": false,
    "dataset_max_samples": -1,
    "limit_hpu_graphs": false,
    "show_graphs_count": false,
    "reuse_cache": false,
    "verbose_workers": false,
    "simulate_dyn_prompt": null,
    "reduce_recompile": false,
    "use_chat_template": false,
    "use_flash_attention": true,
    "flash_attention_recompute": true,
    "flash_attention_causal_mask": false,
    "flash_attention_fast_softmax": true,
    "book_source": false,
    "torch_compile": false,
    "ignore_eos": true,
    "temperature": 1.0,
    "top_p": 1.0,
    "const_serialization_path": null,
    "trust_remote_code": false,
    "parallel_strategy": "none",
    "input_embeds": false,
    "run_partial_dataset": false,
    "sdp_on_bf16": false,
    "load_quantized_model_with_autogptq": false,
    "disk_offload": false,
    "load_quantized_model_with_inc": false,
    "local_quantized_inc_model_path": null,
    "quant_config": "",
    "world_size": 0,
    "global_rank": 0
  },
  "duration": 613.695127427578
}

@vidyasiv (Contributor) left a comment:

LGTM, @regisss please review.

@kaixuanliu (Contributor, Author):

@mgonchar, hi, what is the status now? Regarding the difference between the attn_softmax_bf16 and attention_softmax_in_fp32 parameters that we discussed, I think it would be better for you to submit another PR to solve it thoroughly if you think it is necessary.

@regisss (Collaborator):

regisss commented Jan 20, 2025

LGTM! Let's just wait for @mgonchar's answer to the last comment.
