
A random bug #9

Open · liumarcus70s opened this issue Feb 22, 2019 · 10 comments

liumarcus70s commented Feb 22, 2019

Hi everyone,

When I train the network, I get a random bug. The error occurs at a random batch:


Processing |########################## | (50860/61225) Data: 2.597300s | Batch: 3.278s | Total: 0:56:45 |
Processing |########################## | (50880/61225) Data: 0.000299s | Batch: 0.681s | Total: 0:56:46 |
Processing |########################## | (50900/61225) Data: 0.000489s | Batch: 0.691s | Total: 0:56:47 |
Processing |########################## | (50920/61225) Data: 0.000502s | Batch: 0.683s | Total: 0:56:47 |
Processing |########################## | (50940/61225) Data: 2.483688s | Batch: 3.165s | Total: 0:56:50 | ETA: 0:10:09 | LOSS vox: 0.0337; coord: 0.0034 | NME: 0.3116
Traceback (most recent call last):
File "train.py", line 281, in
main(parser.parse_args())
File "train.py", line 90, in main
run(model, train_loader, mode, criterion_vox, criterion_coord, optimizer_G, optimizer_P)
File "train.py", line 144, in run
for i, (inputs, target, meta) in enumerate(data_loader):
File "/home/jliu9/anaconda3/envs/jvcr/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 623, in next
return self._process_next_batch(batch)
File "/home/jliu9/anaconda3/envs/jvcr/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
File "/home/jliu9/anaconda3/envs/jvcr/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/jliu9/Codes/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in getitem
target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
File "/home/jliu9/Codes/JVCR-3Dlandmark/utils/imutils.py", line 123, in draw_labelvolume
img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (7,8)
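
If it helps, the error itself looks like a plain NumPy shape mismatch when the Gaussian patch is pasted into the label map. A minimal standalone reproduction (just NumPy, with the shapes taken from the error above, not the project code):

import numpy as np

img = np.zeros((64, 64))   # label map slice
g = np.ones((7, 7))        # 7x7 Gaussian patch

# the destination window is 7 rows x 8 columns, so the 7x7 patch cannot fit
img[15:22, 47:55] = g      # ValueError: could not broadcast input array
                           # from shape (7,7) into shape (7,8)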


So, what's the problem?

@HongwenZhang (Owner)

Replacing int() with np.int() in utils/imutils.py#L94-L95 may solve this problem.
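
In other words, something like this (a sketch only; the exact code at utils/imutils.py#L94-L95 may differ, and the pt/sigma values are just taken from the later comments):

import numpy as np
import torch

pt, sigma = torch.tensor([48.4674, 5.6901]), 1

# assumed original: plain Python int() on the tensor values
# ul = [int(pt[0] - 3 * sigma), int(pt[1] - 3 * sigma)]
# br = [int(pt[0] + 3 * sigma + 1), int(pt[1] + 3 * sigma + 1)]

# suggested change: convert through np.int instead
ul = [np.int(pt[0] - 3 * sigma), np.int(pt[1] - 3 * sigma)]
br = [np.int(pt[0] + 3 * sigma + 1), np.int(pt[1] + 3 * sigma + 1)]
print(ul, br)  # ul = [45, 2], br = [52, 9] for these example values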

JackLongKing commented Jul 3, 2019

I met this problem too, and after modifying int to np.int, the error still happens.
I'm using PyTorch 0.4.0. Hope you can help! @HongwenZhang

@JackLongKing

Did you solve this problem? @liumarcus70s

@HongwenZhang (Owner)

Hi @JackLongKing, could you print the values of ul, br, and pt when the bug occurs?

JackLongKing commented Jul 3, 2019

The information flow is as follows:
//============================================================================
('pt: \n', tensor([ 48.4674, 5.6901, -0.0979]))
('ul: \n', [45, 0])
('br: \n', [52, 7])
Traceback (most recent call last):
File "train.py", line 278, in
main(parser.parse_args())
File "train.py", line 90, in main
run(model, train_loader, mode, criterion_vox, criterion_coord, optimizer_G, optimizer_P)
File "train.py", line 144, in run
for i, (inputs, target, meta) in enumerate(data_loader):
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 272, in next
return self._process_next_batch(batch)
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in getitem
target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/utils/imutils.py", line 124, in draw_labelvolume
img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (8,7)
//============================================================================
@HongwenZhang I greatly appreciate your help!

@HongwenZhang (Owner)

These values seem inconsistent with utils/imutils.py#L94-L95.
sigma is 1, so int(5.6901 - 3 * 1) should give 2 for ul[1], shouldn't it?
Could you carefully check and provide the values at utils/imutils.py#L94, as well as img_x, img_y, g_x, g_y at utils/imutils.py#L119?
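
To spell out the arithmetic:

sigma = 1
pt_y = 5.6901                     # pt[1] from the values above
print(int(pt_y - 3 * sigma))      # -> 2, the expected ul[1]
print(int(pt_y + 3 * sigma + 1))  # -> 9, the expected br[1]

which does not match the ul = [45, 0] / br = [52, 7] printed above.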

JackLongKing commented Jul 4, 2019

The print code is as follows:
//==================================================================
print("pt: {}\n".format(pt))
print("ul: {}\n".format(ul))
print("br: {}\n".format(br))
print('g_x[0]: {},g_x[1]: {}\n'.format(g_x[0],g_x[1]))
print('g_y[0]: {},g_y[1]: {}\n'.format(g_y[0],g_y[1]))
print('img_x[0]: {},img_x[1]: {}\n'.format(img_x[0],img_x[1]))
print('img_y[0]: {},img_y[1]: {}\n'.format(img_y[0],img_y[1]))
//==================================================================
And the output information is as follows:
//==================================================================
pt: tensor([ 50.2262, 18.8357, -0.0273])
ul: [47, 15]
br: [54, 22]
g_x[0]: 0,g_x[1]: 7
g_y[0]: 0,g_y[1]: 7
img_x[0]: 47,img_x[1]: 54
img_y[0]: 15,img_y[1]: 22

pt: tensor([ 49.
ValueError: Traceback (most recent call last):
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in getitem
target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/utils/imutils.py", line 130, in draw_labelvolume
img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (7,8)
//==================================================================
From the output, maybe this is caused by pt? @HongwenZhang

@HongwenZhang (Owner)

These values are so weird. Given these values, both img[15:22, 47:54] and g[0:7, 0:7] should have the same shape of (7,7).
So, I think it's better to replace utils/imutils.py#L119 with the following code for debugging.

try:
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
except:
    print('something wrong happened.\n')
    print('pt: {}\n'.format(pt))
    print('ul: {}\n'.format(ul))
    print('br: {}\n'.format(br))
    print('sigma: {}\n'.format(sigma))
    print('g_x[0]: {},g_x[1]: {}\n'.format(g_x[0],g_x[1]))
    print('g_y[0]: {},g_y[1]: {}\n'.format(g_y[0],g_y[1]))
    print('img_x[0]: {},img_x[1]: {}\n'.format(img_x[0],img_x[1]))
    print('img_y[0]: {},img_y[1]: {}\n'.format(img_y[0],img_y[1]))
    print('img shape: {}\n'.format(img.shape))
    print('g shape: {}\n'.format(g.shape))
    raise

JackLongKing commented Jul 4, 2019

Yes, I added the try...except in utils/imutils.py and then ran into another problem, out of memory, which needs another run. My device is a Titan X (12GB). My log is as follows; thank you for your help! @HongwenZhang
//=================================================================
==> creating model: stacks=4, blocks=1, z-res=[1, 2, 4, 64]
coarse to fine mode: True
p2v params: 13.01M
v2c params: 19.46M
using ADAM optimizer.

Epoch: 1 | LR: 0.00025000
pre_training...
train.py:201: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses_vox.update(loss_vox.data[0], inputs.size(0))
train.py:202: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses_coord.update(loss_coord.data[0], inputs.size(0))
train.py:217: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
loss='vox: {:.4f}; coord: {:.4f}'.format(loss_vox.data[0], loss_coord.data[0]),
train.py:122: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
input_var = torch.autograd.Variable(inputs.cuda(), volatile=True)
train.py:124: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
range(len(target))]
train.py:125: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
coord_var = torch.autograd.Variable(meta['tpts_inp'].cuda(async=True), volatile=True)
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f57d4ad7fd0>> ignored
Traceback (most recent call last):
File "train.py", line 278, in
main(parser.parse_args())
File "train.py", line 95, in main
optimizer_P)
File "train.py", line 151, in run
pred_vox, _, pred_coord = model.forward(input_var)
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/models/pix2vox2coord.py", line 55, in forward
vox_list = self.pix2vox(x)
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
raise output
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
//=================================================================

@HongwenZhang (Owner)

The 'out of memory' error is out of the scope of this issue.
To reproduce the bug that occurred in the dataloader, we can bypass the forward pass of the network by adding continue at train.py#L145, as in the sketch below.
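
Concretely, something like this inside run() (the loop names are taken from the tracebacks above; for debugging only):

for i, (inputs, target, meta) in enumerate(data_loader):
    continue  # skip the forward/backward pass so only the dataloader is exercised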
