
A random bug #9

Open · liumarcus70s opened this issue Feb 22, 2019 · 10 comments

liumarcus70s commented Feb 22, 2019

Hi everyone,

When I train the network, I get a random bug. The error occurs at a random batch:


Processing |########################## | (50860/61225) Data: 2.597300s | Batch: 3.278s | Total: 0:56:45 |
Processing |########################## | (50880/61225) Data: 0.000299s | Batch: 0.681s | Total: 0:56:46 |
Processing |########################## | (50900/61225) Data: 0.000489s | Batch: 0.691s | Total: 0:56:47 |
Processing |########################## | (50920/61225) Data: 0.000502s | Batch: 0.683s | Total: 0:56:47 |
Processing |########################## | (50940/61225) Data: 2.483688s | Batch: 3.165s | Total: 0:56:50 | ETA: 0:10:09 | LOSS vox: 0.0337; coord: 0.0034 | NME: 0.3116
Traceback (most recent call last):
File "train.py", line 281, in
main(parser.parse_args())
File "train.py", line 90, in main
run(model, train_loader, mode, criterion_vox, criterion_coord, optimizer_G, optimizer_P)
File "train.py", line 144, in run
for i, (inputs, target, meta) in enumerate(data_loader):
File "/home/jliu9/anaconda3/envs/jvcr/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 623, in next
return self._process_next_batch(batch)
File "/home/jliu9/anaconda3/envs/jvcr/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
File "/home/jliu9/anaconda3/envs/jvcr/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/jliu9/Codes/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in getitem
target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
File "/home/jliu9/Codes/JVCR-3Dlandmark/utils/imutils.py", line 123, in draw_labelvolume
img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (7,8)
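
If it helps, the error itself looks like a plain NumPy shape mismatch when the Gaussian patch is pasted into the label map. A minimal standalone reproduction (just NumPy, with the shapes taken from the error above, not the project code):

import numpy as np

img = np.zeros((64, 64))   # label map slice
g = np.ones((7, 7))        # 7x7 Gaussian patch

# the destination window is 7 rows x 8 columns, so the 7x7 patch cannot fit
img[15:22, 47:55] = g      # ValueError: could not broadcast input array
                           # from shape (7,7) into shape (7,8)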


So, what's the problem?

@HongwenZhang (Owner)

Replacing int() with np.int() in utils/imutils.py#L94-L95 may solve this problem.
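
In other words, something like this (a sketch only; the exact code at utils/imutils.py#L94-L95 may differ, and the pt/sigma values are just taken from the later comments):

import numpy as np
import torch

pt, sigma = torch.tensor([48.4674, 5.6901]), 1

# assumed original: plain Python int() on the tensor values
# ul = [int(pt[0] - 3 * sigma), int(pt[1] - 3 * sigma)]
# br = [int(pt[0] + 3 * sigma + 1), int(pt[1] + 3 * sigma + 1)]

# suggested change: convert through np.int instead
ul = [np.int(pt[0] - 3 * sigma), np.int(pt[1] - 3 * sigma)]
br = [np.int(pt[0] + 3 * sigma + 1), np.int(pt[1] + 3 * sigma + 1)]
print(ul, br)  # ul = [45, 2], br = [52, 9] for these example values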

JackLongKing commented Jul 3, 2019

I met this problem too, and after modifying int to np.int, the error still happens.
I'm using PyTorch 0.4.0. Hope you can help! @HongwenZhang

@JackLongKing

Did you solve this problem? @liumarcus70s

@HongwenZhang (Owner)

Hi @JackLongKing, could you print the values of ul, br, and pt when the bug occurs?

JackLongKing commented Jul 3, 2019

The information flow is as follows:
//============================================================================
('pt: \n', tensor([ 48.4674, 5.6901, -0.0979]))
('ul: \n', [45, 0])
('br: \n', [52, 7])
Traceback (most recent call last):
File "train.py", line 278, in
main(parser.parse_args())
File "train.py", line 90, in main
run(model, train_loader, mode, criterion_vox, criterion_coord, optimizer_G, optimizer_P)
File "train.py", line 144, in run
for i, (inputs, target, meta) in enumerate(data_loader):
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 272, in next
return self._process_next_batch(batch)
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in getitem
target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/utils/imutils.py", line 124, in draw_labelvolume
img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (8,7)
//============================================================================
@HongwenZhang I greatly appreciate your help!

@HongwenZhang (Owner)

These values seem inconsistent with utils/imutils.py#L94-L95.
sigma is 1, so int(5.6901 - 3 * 1) should give 2 for ul[1], shouldn't it?
Could you carefully check and provide the values at utils/imutils.py#L94, as well as img_x, img_y, g_x, g_y at utils/imutils.py#L119?
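
To spell out the arithmetic:

sigma = 1
pt_y = 5.6901                     # pt[1] from the values above
print(int(pt_y - 3 * sigma))      # -> 2, the expected ul[1]
print(int(pt_y + 3 * sigma + 1))  # -> 9, the expected br[1]

which does not match the ul = [45, 0] / br = [52, 7] printed above.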

JackLongKing commented Jul 4, 2019

The print code is as follows:
//==================================================================
print("pt: {}\n".format(pt))
print("ul: {}\n".format(ul))
print("br: {}\n".format(br))
print('g_x[0]: {},g_x[1]: {}\n'.format(g_x[0],g_x[1]))
print('g_y[0]: {},g_y[1]: {}\n'.format(g_y[0],g_y[1]))
print('img_x[0]: {},img_x[1]: {}\n'.format(img_x[0],img_x[1]))
print('img_y[0]: {},img_y[1]: {}\n'.format(img_y[0],img_y[1]))
//==================================================================
And the output information is as follows:
//==================================================================
pt: tensor([ 50.2262, 18.8357, -0.0273])
ul: [47, 15]
br: [54, 22]
g_x[0]: 0,g_x[1]: 7
g_y[0]: 0,g_y[1]: 7
img_x[0]: 47,img_x[1]: 54
img_y[0]: 15,img_y[1]: 22

pt: tensor([ 49.
ValueError: Traceback (most recent call last):
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in getitem
target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/utils/imutils.py", line 130, in draw_labelvolume
img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (7,8)
//==================================================================
From the output, maybe this is caused by pt? @HongwenZhang

@HongwenZhang (Owner)

These values are so weird. Given these values, both img[15:22, 47:54] and g[0:7, 0:7] should have the same shape of (7,7).
So, I think it's better to replace utils/imutils.py#L119 with the following code for debugging.

try:
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
except:
    print('something wrong happened.\n')
    print('pt: {}\n'.format(pt))
    print('ul: {}\n'.format(ul))
    print('br: {}\n'.format(br))
    print('sigma: {}\n'.format(sigma))
    print('g_x[0]: {},g_x[1]: {}\n'.format(g_x[0],g_x[1]))
    print('g_y[0]: {},g_y[1]: {}\n'.format(g_y[0],g_y[1]))
    print('img_x[0]: {},img_x[1]: {}\n'.format(img_x[0],img_x[1]))
    print('img_y[0]: {},img_y[1]: {}\n'.format(img_y[0],img_y[1]))
    print('img shape: {}\n'.format(img.shape))
    print('g shape: {}\n'.format(g.shape))
    raise

JackLongKing commented Jul 4, 2019

Yes, I added the try...except in utils/imutils.py and then ran into another problem, out of memory, which needs another run. My device is a Titan X (12GB). My log is as follows; thank you for your help! @HongwenZhang
//=================================================================
==> creating model: stacks=4, blocks=1, z-res=[1, 2, 4, 64]
coarse to fine mode: True
p2v params: 13.01M
v2c params: 19.46M
using ADAM optimizer.

Epoch: 1 | LR: 0.00025000
pre_training...
train.py:201: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses_vox.update(loss_vox.data[0], inputs.size(0))
train.py:202: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses_coord.update(loss_coord.data[0], inputs.size(0))
train.py:217: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
loss='vox: {:.4f}; coord: {:.4f}'.format(loss_vox.data[0], loss_coord.data[0]),
train.py:122: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
input_var = torch.autograd.Variable(inputs.cuda(), volatile=True)
train.py:124: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
range(len(target))]
train.py:125: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
coord_var = torch.autograd.Variable(meta['tpts_inp'].cuda(async=True), volatile=True)
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f57d4ad7fd0>> ignored
Traceback (most recent call last):
File "train.py", line 278, in
main(parser.parse_args())
File "train.py", line 95, in main
optimizer_P)
File "train.py", line 151, in run
pred_vox, _, pred_coord = model.forward(input_var)
File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/models/pix2vox2coord.py", line 55, in forward
vox_list = self.pix2vox(x)
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
raise output
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
//=================================================================

@HongwenZhang (Owner)

The 'out of memory' error is out of the scope of this issue.
To reproduce the bug that occurred in the dataloader, we can bypass the forward pass of the network by adding continue at train.py#L145, as in the sketch below.
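
Concretely, something like this inside run() (the loop names are taken from the tracebacks above; for debugging only):

for i, (inputs, target, meta) in enumerate(data_loader):
    continue  # skip the forward/backward pass so only the dataloader is exercised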
