Hi! Thanks for the very useful package! I think I found a bug in the chunk choice mechanism:
My input dataset has shape (176, 226, 55115) with chunks (20, 20, 55115). The requested output chunks are (80, 60, 365). I allowed 3GB of max_mem, and there is a temp store.
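For context, a minimal sketch of what the call looks like. The store paths and the way the source is opened are placeholders, not copied from my actual setup:

```python
# Minimal reproduction sketch; "source.zarr", "target.zarr" and "temp.zarr"
# are placeholder paths standing in for my real stores.
import zarr
from rechunker import rechunk

# shape (176, 226, 55115), chunks (20, 20, 55115), dtype float32
source = zarr.open_array("source.zarr", mode="r")

plan = rechunk(
    source,
    target_chunks=(80, 60, 365),
    max_mem="3GB",
    target_store="target.zarr",
    temp_store="temp.zarr",
)
plan.execute()  # raises the ValueError below
```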
Rechunking fails with the following (partially elided) traceback:
File "/path/to/.conda/x38/lib/python3.9/site-packages/distributed/client.py", line 1813, in _gather
raise exception.with_traceback(traceback)
File "/path/to/.conda/x38/lib/python3.9/site-packages/rechunker/pipeline.py", line 47, in _copy_chunk
target[chunk_key] = data
File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1213, in __setitem__
self.set_basic_selection(selection, value, fields=fields)
File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1308, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1599, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1651, in _set_selection
self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1888, in _chunk_setitem
self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1893, in _chunk_setitem_nosync
cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1952, in _process_for_setitem
return self._encode_chunk(chunk)
File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 2009, in _encode_chunk
cdata = self._compressor.encode(chunk)
File "numcodecs/blosc.pyx", line 557, in numcodecs.blosc.Blosc.encode
File "/path/to/.conda/x38/lib/python3.9/site-packages/numcodecs/compat.py", line 102, in ensure_contiguous_ndarray
raise ValueError(msg)
ValueError: Codec does not support buffers of > 2147483647 bytes
Turns out 55115 * 176 * 226 = 2192254240 elements, i.e. about 8.8 GB as float32, and even the raw element count is slightly over the number in the error message (by about 2%). So I'm guessing rechunker is trying to put everything into a single chunk, even though this is way above max_mem? Also, I never asked for Blosc encoding, so I guess it is applied automatically? Not a problem in itself, but it seems a smaller chunk should be chosen in that case.
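For what it's worth, a quick sanity check of the arithmetic (the 2147483647 constant is taken straight from the error message):

```python
# The raw element count of the full array already exceeds Blosc's 2 GiB
# buffer limit, before even multiplying by the 4-byte float32 itemsize.
shape = (176, 226, 55115)

n_elements = shape[0] * shape[1] * shape[2]   # 2192254240
blosc_limit = 2_147_483_647                   # 2**31 - 1, from the error message

print(n_elements / blosc_limit)               # ~1.02 -> ~2% over, even at 1 byte/element
print(n_elements * 4 / 1e9)                   # ~8.77 -> ~8.8 GB as float32
```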