Fine-tuning ch_PP-OCRv4_det_server_train: evaluating the model during training reports out of memory #13759
Comments
Not enough GPU memory; reduce the batch size.
The training batch size is 8 and those steps run fine. The eval batch_size is already 1, yet eval still fails. Training evaluates every 1000 steps, and that is when it throws "out of GPU memory". The first 1000 training steps are all normal.
Have you tried changing the evaluation step interval? Try making it smaller.
Check whether it is host RAM or GPU memory that is exhausted, and try setting the batch size to 4. I'm not sure it will help; I haven't run into this problem before.
It is GPU memory. Lowering the training batch size didn't help either. After training, running tools/infer_det.py on an image also reports out of GPU memory, which is really confusing...
Paddle sometimes has odd bugs; maybe try reinstalling the training environment and see (doge)
Which paddlepaddle version should I install? I have set everything to 1 and it still runs out of GPU memory.
Hitting the same problem; no idea how to solve it.
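A rough back-of-the-envelope (with assumed sizes, not figures from the log) helps explain why training at batch size 8 can fit while eval at batch size 1 does not: DB-style detection training typically crops images to a fixed small size, while eval resizes the full image, so a single large eval image can need more activation memory than an entire training batch.

```python
def fmap_bytes(batch, channels, height, width, dtype_bytes=4):
    """Memory of one float32 feature map with shape (N, C, H, W)."""
    return batch * channels * height * width * dtype_bytes

GB = 1024 ** 3

# Training: 8 crops at an assumed 640x640, with a hypothetical
# 96-channel feature map kept at input resolution.
train = fmap_bytes(8, 96, 640, 640)

# Eval: 1 image resized to an assumed 2560x1920 (the actual size
# depends on the eval resize transform in the config).
evald = fmap_bytes(1, 96, 2560, 1920)

print(f"train batch: {train / GB:.2f} GB, single eval image: {evald / GB:.2f} GB")
```

Under these assumptions a single eval image already costs more per feature map than the whole training batch, which matches the symptom that eval (and later full-resolution inference with tools/infer_det.py) fails while training runs.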
🔎 Search before asking
🐛 Bug (description)
[2024/08/27 19:14:23] ppocr INFO: epoch: [5/500], global_step: 10, lr: 0.001000, loss: 2.168079, loss_shrink_maps: 1.022120, loss_threshold_maps: 0.760488, loss_binary_maps: 0.204714, loss_cbn: 0.204714, avg_reader_cost: 0.03694 s, avg_batch_cost: 0.04500 s, avg_samples: 0.12, ips: 2.66682 samples/s, eta: 0:41:51, max_mem_reserved: 13909 MB, max_mem_allocated: 11894 MB
eval model:   0%|          | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 257, in <module>
main(config, device, logger, vdl_writer, seed)
File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 209, in main
program.train(
File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 452, in train
cur_metric = eval(
File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 622, in eval
preds = model(images)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/architectures/base_model.py", line 99, in forward
x = self.head(x, targets=data)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 145, in forward
cbn_maps = self.cbn_layer(self.up_conv(f), shrink_maps, None)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 127, in forward
out = self.last_1(self.last_3(outf))
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
return self.forward(*inputs, **kwargs)
File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/backbones/det_mobilenet_v3.py", line 186, in forward
x = self.conv(x)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
return self.forward(*inputs, **kwargs)
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/conv.py", line 715, in forward
out = F.conv._conv_nd(
File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/functional/conv.py", line 128, in _conv_nd
pre_bias = _C_ops.conv2d(
MemoryError:
C++ Traceback (most recent call last):
0 paddle::pybind::eager_api_conv2d(_object*, _object*, _object*)
1 conv2d_ad_func(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int>>, std::vector<int, std::allocator<int>>, std::string, std::vector<int, std::allocator<int>>, int, std::string)
2 paddle::experimental::conv2d(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std::string const&, std::vector<int, std::allocator<int>> const&, int, std::string const&)
3 void phi::ConvCudnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std::string const&, std::vector<int, std::allocator<int>> const&, int, std::string const&, phi::DenseTensor*)
4 float* phi::DeviceContext::Alloc(phi::TensorBase*, unsigned long, bool) const
5 phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
6 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7 paddle::memory::allocation::Allocator::Allocate(unsigned long)
8 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::Allocator::Allocate(unsigned long)
12 paddle::memory::allocation::Allocator::Allocate(unsigned long)
13 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
14 std::string phi::enforce::GetCompleteTraceBackString<std::string>(std::string&&, char const*, int)
15 phi::enforce::GetCurrentTraceBackString[abi:cxx11]
Error Message Summary:
ResourceExhaustedError:
Out of memory error on GPU 1. Cannot allocate 3.158203GB memory on GPU 1, 13.315369GB memory has been allocated and available memory is only 2.386902GB.
Please check whether there is any other process using GPU 1.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:86)
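The figures in the summary are internally consistent: the conv2d allocation asks for ~3.16 GB, but only ~2.39 GB is still free because the allocator already holds ~13.3 GB; the last two sum to roughly a 16 GB device. A trivial check of that arithmetic:

```python
# Figures copied from the error message (GB)
requested = 3.158203   # size of the failed conv2d allocation
allocated = 13.315369  # memory already held on GPU 1
available = 2.386902   # memory still free on GPU 1

# The request exceeds the free memory, hence ResourceExhaustedError.
assert requested > available

print(f"allocated + available = {allocated + available:.2f} GB")
```

So either something else is holding memory on GPU 1 (as the message suggests), or a single eval-time allocation genuinely needs more than the card has left.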
🏃‍♂️ Environment
PaddlePaddle-gpu: 2.6, PaddleOCR: 2.8, RAM: 16 GB
🌰 Minimal Reproducible Example
python tools/train.py -c configs/det/ch_PP-OCRv4/ch_PP-OCRv4_det_teacher.yml
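A config-side workaround that often resolves eval-time OOM for DB detectors is capping the eval-time resize and keeping the eval batch at 1. The sketch below follows the usual PaddleOCR det config layout; the exact keys and current values in ch_PP-OCRv4_det_teacher.yml may differ, so treat the numbers as assumptions to tune:

```yaml
Eval:
  dataset:
    transforms:
      # ... other transforms unchanged ...
      - DetResizeForTest:
          limit_side_len: 960   # cap the longer side at eval time (assumed starting value)
          limit_type: max
  loader:
    batch_size_per_card: 1
```

Individual keys can also be overridden on the training command line via `-o`, e.g. `-o Eval.loader.batch_size_per_card=1`, without editing the yml.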