Chunk too large for Blosc codec #98

Open
aulemahal opened this issue Sep 3, 2021 · 1 comment

Comments

@aulemahal

Hi! Thanks for the very useful package! I think I found a bug in the chunk-choice mechanism:

My input dataset has shape (176, 226, 55115) with chunks (20, 20, 55115). The requested output chunks are (80, 60, 365). I allowed 3GB of max_mem, and there is a temp store.
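
Roughly, the call looks like this (a simplified sketch, not my exact code; `source` is the zarr array described above and the store paths are placeholders):

from rechunker import rechunk

rechunked = rechunk(
    source,            # shape (176, 226, 55115), chunks (20, 20, 55115), float32
    (80, 60, 365),     # requested target chunks
    '3GB',             # max_mem
    'target.zarr',     # target store (path elided)
    temp_store='temp.zarr',
)
rechunked.execute()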

Rechunking fails with: (elided traceback)

  File "/path/to/.conda/x38/lib/python3.9/site-packages/distributed/client.py", line 1813, in _gather
    raise exception.with_traceback(traceback)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/rechunker/pipeline.py", line 47, in _copy_chunk
    target[chunk_key] = data
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1213, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1308, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1599, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1651, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1888, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1893, in _chunk_setitem_nosync
    cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1952, in _process_for_setitem
    return self._encode_chunk(chunk)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 2009, in _encode_chunk
    cdata = self._compressor.encode(chunk)
  File "numcodecs/blosc.pyx", line 557, in numcodecs.blosc.Blosc.encode
  File "/path/to/.conda/x38/lib/python3.9/site-packages/numcodecs/compat.py", line 102, in ensure_contiguous_ndarray
    raise ValueError(msg)
ValueError: Codec does not support buffers of > 2147483647 bytes

Turns out 55115 * 176 * 226 = 2192254240, i.e. about 8.8 GB as float32, and that element count alone is already slightly over (by ~2%) the 2147483647-byte limit in the error message. So I'm guessing rechunker is trying to put everything in a single chunk, even though this is way above max_mem? Also, I never asked for Blosc encoding, so I guess it is applied automatically? Not a problem in itself, but it seems a smaller chunk should be chosen in that case.
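
For reference, the cap in the error message is numcodecs' hard buffer limit of 2**31 - 1 bytes, which can be hit in isolation (a minimal sketch, not my actual code; note it allocates ~2 GiB):

import numpy as np
import numcodecs

codec = numcodecs.Blosc()
# 2**29 float32 values = 2147483648 bytes, one byte over the 2147483647-byte cap
big = np.ones(2**29, dtype='f4')
codec.encode(big)  # ValueError: Codec does not support buffers of > 2147483647 bytes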

@rabernat (Member) commented Sep 5, 2021

Thanks for the bug report.

So I'm guessing rechunker is trying to put everything in a single chunk?

This should definitely not happen unless your total dataset size is < max_mem, which is not the case here.

I tried to reproduce your issue but could not:

import zarr
from dask.diagnostics import ProgressBar
from rechunker import rechunk

# Same shape and source chunking as reported above
shape = (176, 226, 55115)
source = zarr.ones(shape, chunks=(20, 20, 55115), dtype='f8', store='tmp-data/source.zarr', overwrite=True)
rechunked = rechunk(source, (80, 60, 365), '3GB', 'tmp-data/target.zarr',
                    target_options=dict(overwrite=True))
assert rechunked._intermediate is None  # no intermediate store was needed

with ProgressBar():
    rechunked.execute()
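
To double-check what actually lands on disk, you can inspect the target array's chunk shape after execution (a quick sketch, assuming the store path above):

import zarr

target = zarr.open('tmp-data/target.zarr', mode='r')
print(target.chunks)  # expected: (80, 60, 365)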

Could you share a bit more detail about your input data and the exact code you are using to call rechunker?
