feat: Add ZSTD decompression support to IPC reader #693

Merged
merged 45 commits into from
Dec 27, 2024

@paleolimbot (Member) commented Dec 16, 2024

This PR implements ZSTD buffer decompression in the ArrowIpcDecoder and in the ArrowArrayStreamReader when built with -DNANOARROW_IPC_WITH_ZSTD=ON. It also allows a user to inject a decompression implementation into the ArrowIpcDecoder if, for whatever reason, they don't have control over the build flags (or want to use a ZSTD that has been made available to them in a different way).

This doesn't implement multithreaded decompression but does allow a user to implement it by not using the default ArrowIpcSerialDecompressor(). This could be included in header-only C++ if there is some interest.
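As a rough illustration of what "decompressing a buffer" involves, the sketch below follows the Arrow IPC framing for compressed buffers: an 8-byte little-endian signed length prefix, where -1 means the payload was stored uncompressed. It is not nanoarrow's API. The function names are hypothetical, and zlib stands in for ZSTD only because it ships with the Python standard library. The codec is passed in as a callable, loosely mirroring the idea of injecting a decompression function.

```python
import struct
import zlib


def frame_ipc_buffer(raw: bytes, compress=None) -> bytes:
    """Hypothetical helper: frame a buffer the way Arrow IPC frames compressed buffers."""
    if compress is None:
        # A length prefix of -1 signals that the payload is stored uncompressed
        return struct.pack("<q", -1) + raw
    return struct.pack("<q", len(raw)) + compress(raw)


def decompress_ipc_buffer(buf: bytes, decompress) -> bytes:
    """Decode one framed buffer using an injected decompression callable."""
    # First 8 bytes: uncompressed length as a little-endian signed 64-bit integer
    (uncompressed_size,) = struct.unpack_from("<q", buf, 0)
    payload = buf[8:]
    if uncompressed_size == -1:
        return payload
    out = decompress(payload)
    if len(out) != uncompressed_size:
        raise ValueError(
            f"expected {uncompressed_size} decompressed bytes but got {len(out)}"
        )
    return out


def decompress_serial(buffers, decompress):
    """Hypothetical 'serial' strategy: decompress buffers one at a time, in order."""
    return [decompress_ipc_buffer(b, decompress) for b in buffers]


# zlib is a stand-in codec here; a real reader would use a ZSTD function instead
raw = b"nanoarrow" * 100
framed = [frame_ipc_buffer(raw, zlib.compress), frame_ipc_buffer(b"tiny")]
decoded = decompress_serial(framed, zlib.decompress)
```

Because the strategy only sees a list of buffers and a callable, a multithreaded variant could replace `decompress_serial()` without touching the per-buffer logic.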

A non-trivial example using the Python bindings (which were also wired up to support it):

import urllib.request
import nanoarrow as na
url = "https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-basin_point-wkb.arrow"

# Work around the lack of Arrow file support by skipping the 8-byte file header
with urllib.request.urlopen(url) as f:
    f.read(8)
    print(na.ArrayStream.from_readable(f).read_all())
#> nanoarrow.Array<non-nullable struct<OBJECTID: int64, FEAT_CODE: string, ...>[46]
#> {'OBJECTID': 1, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01EB000', 'RIVER': 'BAR...
#> {'OBJECTID': 2, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01EC000', 'RIVER': 'ROS...
#> {'OBJECTID': 3, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01EA000', 'RIVER': 'TUS...
#> {'OBJECTID': 4, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01DA000', 'RIVER': 'MET...
#> {'OBJECTID': 5, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01ED000', 'RIVER': 'MER...
#> {'OBJECTID': 6, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01EE000', 'RIVER': 'HER...
#> {'OBJECTID': 7, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01EG000', 'RIVER': 'GOL...
#> {'OBJECTID': 8, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01EF000', 'RIVER': 'LAH...
#> {'OBJECTID': 9, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01EJ000', 'RIVER': 'SAC...
#> {'OBJECTID': 10, 'FEAT_CODE': 'WABA30', 'BASIN_NAME': '01EH000', 'RIVER': 'EA...
#> ...and 36 more items
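The `f.read(8)` workaround relies on the Arrow file format starting with the magic bytes `ARROW1` padded to 8 bytes, after which the content is laid out like an IPC stream. A slightly more defensive sketch, using only the standard library, checks the magic before handing the file object on. The helper name `skip_arrow_file_header` is hypothetical and not part of nanoarrow.

```python
import io

ARROW_FILE_MAGIC = b"ARROW1"


def skip_arrow_file_header(f):
    """Consume the 8-byte Arrow file header and return the same file object.

    Raises ValueError if the input does not start with the ARROW1 magic.
    """
    header = f.read(8)
    if len(header) != 8 or not header.startswith(ARROW_FILE_MAGIC):
        raise ValueError("not an Arrow file: missing ARROW1 magic")
    return f


# Demonstration with an in-memory stand-in; the payload is a placeholder,
# not a real IPC stream
fake_file = io.BytesIO(b"ARROW1\x00\x00" + b"<ipc stream bytes>")
stream = skip_arrow_file_header(fake_file)
remaining = stream.read()  # everything after the 8-byte header
```

In the example above, the returned object could be passed to `na.ArrayStream.from_readable()` in place of the bare `f.read(8)` call.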

This PR doesn't implement the R bindings (because adding a zstd dependency there is a can of worms better suited to another PR).

@paleolimbot paleolimbot force-pushed the maybe-compression-ipc branch from dfd3d4a to 6452dc6 Compare December 17, 2024 03:17
@codecov-commenter commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 68.57143% with 33 lines in your changes missing coverage. Please review.

Project coverage is 87.43%. Comparing base (9945d33) to head (0689761).
Report is 7 commits behind head on main.

Files with missing lines        Patch %   Lines
src/nanoarrow/ipc/decoder.c     63.49%    23 Missing ⚠️
src/nanoarrow/ipc/codecs.c      76.19%    10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #693      +/-   ##
==========================================
- Coverage   87.63%   87.43%   -0.20%     
==========================================
  Files         101      104       +3     
  Lines       14537    14851     +314     
==========================================
+ Hits        12739    12985     +246     
- Misses       1798     1866      +68     


@@ -45,6 +45,11 @@ requires = [
build-backend = "mesonpy"

[tool.meson-python.args]
# Consistent version for zstd, rather than attempting to delocate whatever

Contributor

When you say delocate, are you referring just to macOS, or does this affect all platforms?

Member Author

I should check the logs, but I believe that on both macOS and Linux the wheel build attempts to copy shared library dependencies into the build. In the macOS case, it was attempting to copy a version of zstd that would have required updating the minimum supported macOS version. I'm not sure if this would have happened on Linux (because I'm not sure if that library was installed on manylinux2014), but it seems better to have the same version of zstd sitting in the wheel build? (Maybe handling this via build arguments in the CI job that builds the wheels is better?)

Contributor

Ah ok. I can take a look too.

Thinking out loud, we might need to be careful about whether it's the wheel build itself installing the libraries (because the subproject meson configurations instruct the build system to do so) or whether it's the wheel repair jobs that are bundling improper libraries. I'm guessing it's the former.

I think we could add the following to pyproject.toml to prevent the former issue:

[tool.meson-python.args]
install = ['--skip-subprojects']

Or possibly even statically link the zstd subproject (?)

Member Author

I think we have the --skip-subprojects in pyproject.toml already... I think the issue was that meson actively tries to avoid building and statically linking if it can (but this is exactly what we want for the wheel, I think!). I'm happy with this at the moment, but we should definitely take a closer look at exactly what's packaged before we release the first version with the meson-python build.
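One low-tech way to take that closer look is to list the shared libraries inside a built wheel, since a wheel is a zip archive. The sketch below uses only the standard library. The helper names and the file names in the demonstration are hypothetical, chosen to show what a separately bundled zstd might look like next to an extension module.

```python
import io
import zipfile


def is_shared_library(name: str) -> bool:
    """Heuristic: does this archive member look like a shared library?"""
    base = name.rsplit("/", 1)[-1]
    # Covers Linux (.so, versioned .so.N), macOS (.dylib), and Windows (.dll, .pyd)
    return base.endswith((".so", ".dylib", ".dll", ".pyd")) or ".so." in base


def list_shared_libraries(wheel) -> list:
    """Return the sorted shared-library members of a wheel (path or file object)."""
    with zipfile.ZipFile(wheel) as zf:
        return sorted(n for n in zf.namelist() if is_shared_library(n))


# Build a fake wheel in memory; all member names below are made up
fake_wheel = io.BytesIO()
with zipfile.ZipFile(fake_wheel, "w") as zf:
    zf.writestr("nanoarrow/__init__.py", "")
    zf.writestr("nanoarrow/_lib.cpython-312-x86_64-linux-gnu.so", b"")
    zf.writestr("nanoarrow.libs/libzstd.so.1", b"")
fake_wheel.seek(0)

found = list_shared_libraries(fake_wheel)
```

If zstd were statically linked into the extension module, only the extension module itself would be expected to show up in such a listing.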

@paleolimbot paleolimbot marked this pull request as ready for review December 21, 2024 21:15
@WillAyd (Contributor) left a comment

Just a few more minor comments but this lgtm. Nice work - this was a good deal of effort

@paleolimbot (Member Author)

@WillAyd Thank you for the review!

@paleolimbot paleolimbot merged commit dd437a9 into apache:main Dec 27, 2024
45 checks passed
@paleolimbot paleolimbot deleted the maybe-compression-ipc branch December 27, 2024 02:00