The following error occurs after running a couple of training steps:
```
[2025-01-02 16:22:14,907] [    INFO] - loss: 1.48597932, learning_rate: 4e-06, global_step: 4, current_memory_allocated: 24.7507706284523, current_memory_reserved: 0.0, max_memory_allocated: 27.094865918159485, max_memory_reserved: 0.0, interval_runtime: 3.4259, interval_samples_per_second: 37.3621, interval_steps_per_second: 0.2919, ppl: 4.419291194523183, progress_or_epoch: 1.0
[2025-01-02 16:22:14,908] [    INFO] - ***** Running Evaluation *****
[2025-01-02 16:22:14,909] [    INFO] -   Num examples = 100
[2025-01-02 16:22:14,909] [    INFO] -   Total prediction steps = 2
[2025-01-02 16:22:14,909] [    INFO] -   Pre device batch size = 8
[2025-01-02 16:22:14,909] [    INFO] -   Total Batch size = 64
Traceback (most recent call last):
  File "/workspace/PaddleNLP/llm/run_finetune.py", line 658, in <module>
    main()
  File "/workspace/PaddleNLP/llm/run_finetune.py", line 533, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 854, in train
    return self._inner_training_loop(
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 1193, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 1430, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2741, in evaluate
    output = self.evaluation_loop(
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2870, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/workspace/PaddleNLP/llm/utils/utils.py", line 280, in prediction_step
    return (loss, logits.argmax(axis=-1, keepdim=True), labels)
  File "/usr/local/lib/python3.10/dist-packages/paddle/tensor/search.py", line 239, in argmax
    return _C_ops.argmax(x, axis, keepdim, flatten, var_dtype)
RuntimeError: (NotFound) The kernel with key (XPU, Undefined(AnyLayout), bfloat16) of kernel `argmax` is not registered and fail to fallback to CPU one. Selected wrong DataType `bfloat16`. Paddle support following DataTypes: float32, int32, float16.
  [Hint: Expected kernel_iter != iter->second.end(), but received kernel_iter == iter->second.end().] (at /host/Paddle/paddle/phi/core/kernel_factory.cc:347)
```

This is the launch script used:

```bash
#!/bin/bash
cd llm

task_name_or_path="llama2-7b-4k"
runtime_location=/workspace/so-runtime
bkcl_location=/workspace/so-bkcl

export LD_LIBRARY_PATH=${bkcl_location}/:${runtime_location}/:$LD_LIBRARY_PATH
export XBLAS_FC_HBM_VERSION=40
export XPU_PADDLE_L3_SIZE=43554432
export XPUAPI_DEFAULT_SIZE=1610612800

unset PADDLE_ELASTIC_JOB_ID
unset PADDLE_TRAINER_ENDPOINTS
unset DISTRIBUTED_TRAINER_ENDPOINTS
unset FLAGS_START_PORT
unset PADDLE_ELASTIC_TIMEOUT
unset PADDLE_TRAINERS_NUM

echo "bkcl version:"
strings ${bkcl_location}/libbkcl.so | grep COM

timestamp=$(date +%Y%m%d%H%M%S)
echo $timestamp

PYTHONPATH=../:$PYTHONPATH \
python -u -m paddle.distributed.launch \
    --xpus "0,1,2,3,4,5,6,7" \
    --log_dir "output/$task_name_or_path/$timestamp""_log" \
    run_finetune.py \
    --model_name_or_path "meta-llama/Llama-2-7b" \
    --dataset_name_or_path "./data" \
    --output_dir "./checkpoints/lora_ckpts" \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 8 \
    --eval_accumulation_steps 16 \
    --num_train_epochs 1 \
    --learning_rate 0.00003 \
    --warmup_steps 30 \
    --logging_steps 1 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --src_length 1024 \
    --max_length 2048 \
    --bf16 true \
    --fp16_opt_level "O2" \
    --do_train true \
    --do_eval true \
    --disable_tqdm true \
    --load_best_model_at_end true \
    --eval_with_do_generation false \
    --metric_for_best_model "accuracy" \
    --recompute true \
    --save_total_limit 1 \
    --tensor_parallel_degree 1 \
    --pipeline_parallel_degree 1 \
    --sharding "stage1" \
    --lora true \
    --zero_padding false \
    --unified_checkpoint true \
    --pissa false \
    --device "xpu" \
    --max_steps 50
```
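The traceback shows evaluation failing at `logits.argmax(axis=-1, keepdim=True)` because the XPU build has no bfloat16 `argmax` kernel. One possible workaround (a sketch only, not a verified PaddleNLP patch) is to cast the logits to a dtype the kernel does support, such as float32, before the argmax. NumPy stands in for Paddle tensors here, and `safe_argmax` is a hypothetical helper name:

```python
import numpy as np


def safe_argmax(logits: np.ndarray) -> np.ndarray:
    # Cast to float32 first so argmax runs on a supported dtype;
    # the cast does not change which index holds the maximum.
    return logits.astype(np.float32).argmax(axis=-1, keepdims=True)


# Example: bfloat16 is unavailable in NumPy, so float16 illustrates the idea.
logits = np.array([[0.1, 0.9, 0.3],
                   [0.7, 0.2, 0.5]], dtype=np.float16)
print(safe_argmax(logits))  # per-row index of the max logit, shape (2, 1)
```

Applied to `llm/utils/utils.py` this would correspond to something like `logits.astype("float32").argmax(axis=-1, keepdim=True)` in `prediction_step`, though whether the project prefers that over registering the missing kernel is for the maintainers to say.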