RewriteManifest with more options #9615

Open
jackye1995 opened this issue Feb 1, 2024 · 6 comments · May be fixed by #9731 or #11881
@jackye1995
Contributor

Feature Request / Improvement

From the dev list discussion: https://lists.apache.org/thread/x2rqfck4nz78j0fmz4sdchr5wxoywm29

I think we are converging on an API like the following (e.g. in Spark):

SparkActions.rewriteManifests(table)
  .sort("b", "a")
  .commit()
SparkActions.rewriteManifests(table)
  .sort(partitionData -> string)
  .minManifests(20)
  .maxManifests(40)
  .targetManifestSize(8MB)
  .commit()

This would let users fine-tune the sorting and colocation behavior they want for the Iceberg metadata tree.

Query engine

None

@rdblue
Contributor

rdblue commented Feb 1, 2024

This looks great to me!

@zachdisc

zachdisc commented Feb 7, 2024

Love the idea, I can take a crack at it!

@zachdisc

zachdisc commented Feb 8, 2024

Looking at the base API interface and core implementation, there is a clusterBy interface that is simply missing from the SparkActions implementation.

I think the API Jack proposed is cleaner, but it is probably more correct to just implement the same clusterBy interface. Thoughts?
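
For readers unfamiliar with the clustering idea discussed here, the following is a minimal, self-contained sketch of it: a key function maps each file entry to a cluster key, and entries that share a key are grouped so they can be written to the same manifest. It is not Iceberg code. `ManifestClusteringSketch`, `FileEntry`, and this `clusterBy` helper are hypothetical names used only for illustration, and the real implementation also has to handle manifest sizing and commits.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ManifestClusteringSketch {

  // Hypothetical stand-in for a data file entry; NOT an Iceberg class.
  static final class FileEntry {
    final String path;
    final String partition;

    FileEntry(String path, String partition) {
      this.path = path;
      this.partition = partition;
    }
  }

  // Groups entries by the key returned from keyFunc, keeping first-seen key order.
  // Each resulting group models the set of entries destined for one manifest.
  static <K> Map<K, List<FileEntry>> clusterBy(
      List<FileEntry> entries, Function<FileEntry, K> keyFunc) {
    Map<K, List<FileEntry>> clusters = new LinkedHashMap<>();
    for (FileEntry entry : entries) {
      clusters.computeIfAbsent(keyFunc.apply(entry), k -> new ArrayList<>()).add(entry);
    }
    return clusters;
  }

  public static void main(String[] args) {
    List<FileEntry> entries = List.of(
        new FileEntry("f1.parquet", "b=1"),
        new FileEntry("f2.parquet", "b=2"),
        new FileEntry("f3.parquet", "b=1"));

    // Cluster by partition string, similar in spirit to sort(partitionData -> string).
    Map<String, List<FileEntry>> clusters = clusterBy(entries, e -> e.partition);
    clusters.forEach((key, files) ->
        System.out.println(key + " -> " + files.size() + " file(s)"));
  }
}
```

The design point is that the caller only supplies the key function, so the same mechanism covers both clustering by a set of partition fields and clustering by an arbitrary derived string.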

@zachdisc

Okay, a first cut at the simple version is in the PR above.


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Oct 13, 2024
@ZachDischner

Still open and would like to incorporate this!
