
[Question]: Problem running a fine-tuning job on P800 #9727

Open
wwwqllll opened this issue Jan 2, 2025 · 0 comments
Labels
question Further information is requested

Comments


wwwqllll commented Jan 2, 2025

Please describe your question

The error below appears after a couple of training steps:

[2025-01-02 16:22:14,907] [    INFO] - loss: 1.48597932, learning_rate: 4e-06, global_step: 4, current_memory_allocated: 24.7507706284523, current_memory_reserved: 0.0, max_memory_allocated: 27.094865918159485, max_memory_reserved: 0.0, interval_runtime: 3.4259, interval_samples_per_second: 37.3621, interval_steps_per_second: 0.2919, ppl: 4.419291194523183, progress_or_epoch: 1.0
[2025-01-02 16:22:14,908] [    INFO] - ***** Running Evaluation *****
[2025-01-02 16:22:14,909] [    INFO] -   Num examples = 100
[2025-01-02 16:22:14,909] [    INFO] -   Total prediction steps = 2
[2025-01-02 16:22:14,909] [    INFO] -   Pre device batch size = 8
[2025-01-02 16:22:14,909] [    INFO] -   Total Batch size = 64
Traceback (most recent call last):
  File "/workspace/PaddleNLP/llm/run_finetune.py", line 658, in <module>
    main()
  File "/workspace/PaddleNLP/llm/run_finetune.py", line 533, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 854, in train
    return self._inner_training_loop(
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 1193, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 1430, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2741, in evaluate
    output = self.evaluation_loop(
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2870, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/workspace/PaddleNLP/llm/utils/utils.py", line 280, in prediction_step
    return (loss, logits.argmax(axis=-1, keepdim=True), labels)
  File "/usr/local/lib/python3.10/dist-packages/paddle/tensor/search.py", line 239, in argmax
    return _C_ops.argmax(x, axis, keepdim, flatten, var_dtype)
RuntimeError: (NotFound) The kernel with key (XPU, Undefined(AnyLayout), bfloat16) of kernel `argmax` is not registered and fail to fallback to CPU one. Selected wrong DataType `bfloat16`. Paddle support following DataTypes: float32, int32, float16.
  [Hint: Expected kernel_iter != iter->second.end(), but received kernel_iter == iter->second.end().] (at /host/Paddle/paddle/phi/core/kernel_factory.cc:347)
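The traceback shows `prediction_step` in `llm/utils/utils.py` calling `logits.argmax(axis=-1, keepdim=True)` while the logits are still bfloat16, and the error message says the XPU `argmax` kernel is only registered for float32, int32, and float16. A common workaround (an assumption, not a confirmed PaddleNLP fix) is to cast the logits to float32 immediately before the argmax. The sketch below uses NumPy to stand in for paddle so it runs anywhere; note paddle's `Tensor.argmax` spells the flag `keepdim` rather than NumPy's `keepdims`, but the cast is the same idea.

```python
import numpy as np

# Dtypes the XPU argmax kernel registers, per the error message above.
ARGMAX_SUPPORTED = ("float32", "int32", "float16")

def argmax_with_cast(logits, axis=-1):
    """Return argmax indices, casting unsupported dtypes (e.g. bfloat16)
    to float32 first so the kernel lookup cannot fail."""
    if str(logits.dtype) not in ARGMAX_SUPPORTED:
        logits = logits.astype("float32")
    return logits.argmax(axis=axis, keepdims=True)
```

Applied to the failing line in `prediction_step`, this would amount to roughly `logits.astype("float32").argmax(axis=-1, keepdim=True)` — untested on P800. Alternatively, if the rest of the run tolerates it, switching the script from `--bf16 true` to float16 mixed precision would avoid producing bfloat16 logits in the first place, since float16 is among the registered dtypes.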



I'm launching it with the following script:
#!/bin/bash
cd llm
task_name_or_path="llama2-7b-4k"

runtime_location=/workspace/so-runtime
bkcl_location=/workspace/so-bkcl
export LD_LIBRARY_PATH=${bkcl_location}/:${runtime_location}/:$LD_LIBRARY_PATH

export XBLAS_FC_HBM_VERSION=40
export XPU_PADDLE_L3_SIZE=43554432
export XPUAPI_DEFAULT_SIZE=1610612800

unset PADDLE_ELASTIC_JOB_ID
unset PADDLE_TRAINER_ENDPOINTS
unset DISTRIBUTED_TRAINER_ENDPOINTS
unset FLAGS_START_PORT
unset PADDLE_ELASTIC_TIMEOUT
unset PADDLE_TRAINERS_NUM

echo "bkcl version:"
strings ${bkcl_location}/libbkcl.so | grep COM

timestamp=$(date +%Y%m%d%H%M%S)
echo $timestamp

PYTHONPATH=../:$PYTHONPATH \
python -u -m paddle.distributed.launch \
    --xpus "0,1,2,3,4,5,6,7" \
    --log_dir "output/${task_name_or_path}/${timestamp}_log" \
    run_finetune.py \
    --model_name_or_path "meta-llama/Llama-2-7b" \
    --dataset_name_or_path "./data" \
    --output_dir "./checkpoints/lora_ckpts" \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 8 \
    --eval_accumulation_steps 16 \
    --num_train_epochs 1 \
    --learning_rate 0.00003 \
    --warmup_steps 30 \
    --logging_steps 1 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --src_length 1024 \
    --max_length 2048 \
    --bf16 true \
    --fp16_opt_level "O2" \
    --do_train true \
    --do_eval true \
    --disable_tqdm true \
    --load_best_model_at_end true \
    --eval_with_do_generation false \
    --metric_for_best_model "accuracy" \
    --recompute true \
    --save_total_limit 1 \
    --tensor_parallel_degree 1 \
    --pipeline_parallel_degree 1 \
    --sharding "stage1" \
    --lora true \
    --zero_padding false \
    --unified_checkpoint true \
    --pissa false \
    --device "xpu" \
    --max_steps 50
wwwqllll added the question label Jan 2, 2025
paddle-bot assigned ZHUI Jan 6, 2025