
[Question]: Problem running a fine-tuning job on P800 #9727

Open
wwwqllll opened this issue Jan 2, 2025 · 0 comments
Labels
question Further information is requested

Comments


wwwqllll commented Jan 2, 2025

Please describe your question

The error below appears after a couple of training steps:

[2025-01-02 16:22:14,907] [    INFO] - loss: 1.48597932, learning_rate: 4e-06, global_step: 4, current_memory_allocated: 24.7507706284523, current_memory_reserved: 0.0, max_memory_allocated: 27.094865918159485, max_memory_reserved: 0.0, interval_runtime: 3.4259, interval_samples_per_second: 37.3621, interval_steps_per_second: 0.2919, ppl: 4.419291194523183, progress_or_epoch: 1.0
[2025-01-02 16:22:14,908] [    INFO] - ***** Running Evaluation *****
[2025-01-02 16:22:14,909] [    INFO] -   Num examples = 100
[2025-01-02 16:22:14,909] [    INFO] -   Total prediction steps = 2
[2025-01-02 16:22:14,909] [    INFO] -   Pre device batch size = 8
[2025-01-02 16:22:14,909] [    INFO] -   Total Batch size = 64
Traceback (most recent call last):
  File "/workspace/PaddleNLP/llm/run_finetune.py", line 658, in <module>
    main()
  File "/workspace/PaddleNLP/llm/run_finetune.py", line 533, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 854, in train
    return self._inner_training_loop(
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 1193, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 1430, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2741, in evaluate
    output = self.evaluation_loop(
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2870, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/workspace/PaddleNLP/llm/utils/utils.py", line 280, in prediction_step
    return (loss, logits.argmax(axis=-1, keepdim=True), labels)
  File "/usr/local/lib/python3.10/dist-packages/paddle/tensor/search.py", line 239, in argmax
    return _C_ops.argmax(x, axis, keepdim, flatten, var_dtype)
RuntimeError: (NotFound) The kernel with key (XPU, Undefined(AnyLayout), bfloat16) of kernel `argmax` is not registered and fail to fallback to CPU one. Selected wrong DataType `bfloat16`. Paddle support following DataTypes: float32, int32, float16.
  [Hint: Expected kernel_iter != iter->second.end(), but received kernel_iter == iter->second.end().] (at /host/Paddle/paddle/phi/core/kernel_factory.cc:347)
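The traceback shows `prediction_step` in `llm/utils/utils.py` calling `logits.argmax(axis=-1, keepdim=True)` while the logits are still bfloat16, and the error message says the XPU `argmax` kernel is only registered for float32, int32, and float16. A common workaround (an assumption, not a confirmed PaddleNLP fix) is to cast the logits to float32 immediately before the argmax. The sketch below uses NumPy to stand in for paddle so it runs anywhere; note paddle's `Tensor.argmax` spells the flag `keepdim` rather than NumPy's `keepdims`, but the cast is the same idea.

```python
import numpy as np

# Dtypes the XPU argmax kernel registers, per the error message above.
ARGMAX_SUPPORTED = ("float32", "int32", "float16")

def argmax_with_cast(logits, axis=-1):
    """Return argmax indices, casting unsupported dtypes (e.g. bfloat16)
    to float32 first so the kernel lookup cannot fail."""
    if str(logits.dtype) not in ARGMAX_SUPPORTED:
        logits = logits.astype("float32")
    return logits.argmax(axis=axis, keepdims=True)
```

Applied to the failing line in `prediction_step`, this would amount to roughly `logits.astype("float32").argmax(axis=-1, keepdim=True)` — untested on P800. Alternatively, if the rest of the run tolerates it, switching the script from `--bf16 true` to float16 mixed precision would avoid producing bfloat16 logits in the first place, since float16 is among the registered dtypes.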



I'm launching it with the following script:
#!/bin/bash
cd llm
task_name_or_path="llama2-7b-4k"

runtime_location=/workspace/so-runtime
bkcl_location=/workspace/so-bkcl
export LD_LIBRARY_PATH=${bkcl_location}/:${runtime_location}/:$LD_LIBRARY_PATH

export XBLAS_FC_HBM_VERSION=40
export XPU_PADDLE_L3_SIZE=43554432
export XPUAPI_DEFAULT_SIZE=1610612800

unset PADDLE_ELASTIC_JOB_ID
unset PADDLE_TRAINER_ENDPOINTS
unset DISTRIBUTED_TRAINER_ENDPOINTS
unset FLAGS_START_PORT
unset PADDLE_ELASTIC_TIMEOUT
unset PADDLE_TRAINERS_NUM

echo "bkcl version:"
strings ${bkcl_location}/libbkcl.so | grep COM

timestamp=$(date +%Y%m%d%H%M%S)
echo $timestamp

PYTHONPATH=../:$PYTHONPATH \
python -u -m paddle.distributed.launch \
    --xpus "0,1,2,3,4,5,6,7" \
    --log_dir "output/${task_name_or_path}/${timestamp}_log" \
    run_finetune.py \
    --model_name_or_path "meta-llama/Llama-2-7b" \
    --dataset_name_or_path "./data" \
    --output_dir "./checkpoints/lora_ckpts" \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 8 \
    --eval_accumulation_steps 16 \
    --num_train_epochs 1 \
    --learning_rate 0.00003 \
    --warmup_steps 30 \
    --logging_steps 1 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --src_length 1024 \
    --max_length 2048 \
    --bf16 true \
    --fp16_opt_level "O2" \
    --do_train true \
    --do_eval true \
    --disable_tqdm true \
    --load_best_model_at_end true \
    --eval_with_do_generation false \
    --metric_for_best_model "accuracy" \
    --recompute true \
    --save_total_limit 1 \
    --tensor_parallel_degree 1 \
    --pipeline_parallel_degree 1 \
    --sharding "stage1" \
    --lora true \
    --zero_padding false \
    --unified_checkpoint true \
    --pissa false \
    --device "xpu" \
    --max_steps 50
wwwqllll added the question label Jan 2, 2025
paddle-bot assigned ZHUI Jan 6, 2025