Adding new rewrite manifest spark action to accept custom partition o… #11881
Note: this is a fresh PR replacing #9731. That PR had accumulated too many conflicts and changes, and I broke it while rebasing. This is a clean start with all previous feedback incorporated.
What
This adds a simple `sort` method to the `RewriteManifests` Spark action, which lets users specify the partition column order to consider when grouping manifests. Illustration:
Closes #9615
Why
Iceberg's metadata is organized into a forest of manifest_files that point to data files sharing common partitions. By default, and during `RewriteManifests`, the partition grouping is determined by the default `Spec` partition order. If the primary query pattern is more aligned with the last partition in the table's spec, manifests are poorly suited to quickly plan and prune around those partitions. E.g.:
This will create manifests that first group by `region`, whose `manifest_file` contents may span a wide range of `event_time` values. For a primary query pattern that doesn't care about `region`, `storeId`, etc., this leads to inefficient queries.
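To make the effect concrete, here is a rough, self-contained sketch of the idea. It does not use the Iceberg API: `FilePartition`, `ManifestGroupingSketch`, the sample data, and the fixed manifest size are all hypothetical stand-ins chosen for illustration. It sorts the same set of files by two different partition column orders, cuts each sorted run into equally sized "manifests", and counts how many manifests a query on a single `event_time` day would have to open.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one data file's partition tuple (not an Iceberg class).
final class FilePartition {
  final String region;
  final int storeId;
  final int eventDay;

  FilePartition(String region, int storeId, int eventDay) {
    this.region = region;
    this.storeId = storeId;
    this.eventDay = eventDay;
  }
}

final class ManifestGroupingSketch {
  // Assumed spec order: region, storeId, event_time.
  static final Comparator<FilePartition> SPEC_ORDER =
      Comparator.comparing((FilePartition f) -> f.region)
          .thenComparingInt(f -> f.storeId)
          .thenComparingInt(f -> f.eventDay);

  // Custom order that puts event_time first.
  static final Comparator<FilePartition> EVENT_TIME_FIRST =
      Comparator.comparingInt((FilePartition f) -> f.eventDay)
          .thenComparing(f -> f.region)
          .thenComparingInt(f -> f.storeId);

  // 4 regions x 2 stores x 4 days = 32 files.
  static List<FilePartition> sampleFiles() {
    List<FilePartition> files = new ArrayList<>();
    for (String region : new String[] {"apac", "emea", "latam", "us"}) {
      for (int store = 1; store <= 2; store++) {
        for (int day = 1; day <= 4; day++) {
          files.add(new FilePartition(region, store, day));
        }
      }
    }
    return files;
  }

  // Sort the files, then cut the sorted run into fixed-size groups ("manifests").
  static List<List<FilePartition>> groupIntoManifests(
      List<FilePartition> files, Comparator<FilePartition> order, int filesPerManifest) {
    List<FilePartition> sorted = new ArrayList<>(files);
    sorted.sort(order);
    List<List<FilePartition>> manifests = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i += filesPerManifest) {
      manifests.add(sorted.subList(i, Math.min(i + filesPerManifest, sorted.size())));
    }
    return manifests;
  }

  // A manifest must be opened if its [min, max] eventDay range contains the queried day.
  static long manifestsToScan(List<List<FilePartition>> manifests, int day) {
    return manifests.stream()
        .filter(m -> {
          int min = m.stream().mapToInt(f -> f.eventDay).min().getAsInt();
          int max = m.stream().mapToInt(f -> f.eventDay).max().getAsInt();
          return min <= day && day <= max;
        })
        .count();
  }

  public static void main(String[] args) {
    List<FilePartition> files = sampleFiles();
    System.out.println(manifestsToScan(groupIntoManifests(files, SPEC_ORDER, 8), 3));       // prints 4
    System.out.println(manifestsToScan(groupIntoManifests(files, EVENT_TIME_FIRST, 8), 3)); // prints 1
  }
}
```

With the assumed spec order, every manifest covers one region and all four days, so a filter on a single day cannot prune any of them. With `event_time` first, each manifest covers exactly one day and the same filter touches a single manifest. Real manifests are sized by bytes rather than file count, so the numbers here are only indicative.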