Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce assign_fresh_ids flag and allow skipping fresh assignment of IDs on Table creation #1304

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

sungwy
Copy link
Collaborator

@sungwy sungwy commented Nov 8, 2024

Implements: #1284

@sungwy sungwy marked this pull request as ready for review November 8, 2024 22:50
@sungwy sungwy changed the title Introduce assign_fresh_ids and allow skipping fresh assignment of IDs on table creation Introduce assign_fresh_ids flag and allow skipping fresh assignment of IDs on Table creation Nov 8, 2024
@sungwy sungwy requested review from Fokko, HonahX and kevinjqliu and removed request for HonahX November 13, 2024 17:02
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay @sungwy. This looks good, I left two small comments. Thanks for adding all the tests 👍

@@ -122,32 +122,13 @@ schema = Schema(
),
)

from pyiceberg.partitioning import PartitionSpec, PartitionField
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it, thanks for cleaning this up!

pyiceberg/catalog/__init__.py Show resolved Hide resolved
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finally got a chance to look over this, sorry for the delay

Comment on lines +524 to +528
if assign_fresh_ids:
fresh_schema = assign_fresh_schema_ids(schema)
partition_spec = assign_fresh_partition_spec_ids(partition_spec, schema, fresh_schema)
sort_order = assign_fresh_sort_order_ids(sort_order, schema, fresh_schema)
schema = fresh_schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where assign_fresh_ids is ultimately used

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this function is called by _create_staged_table and create_table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both functions take schema: Union[Schema, "pa.Schema"], as input.

  • If pa.Schema is given, we want to convert and assign id (this is currently done by setting the assign_fresh_ids flag to True)
  • If Schema is given, currently the default is to assign the schema ids, assign_fresh_ids: bool = True.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposal is to not include assign_fresh_ids as a flag in functions other than new_table_metadata.
So when _create_staged_table and create_table is given

  • a pa.Schema, convert to Schema and set assign_fresh_ids to True in new_table_metadata
  • a Schema. Assume the user created Schema with the correct IDs (possibly verify some correctness characteristics such as uniqueness). And use the schema as is
    If a user wants to reassign IDs for Schema, this can be done outside the create_table functions and we can even provide a helper function to do so.

I feel like this way can help break apart the responsibilities of schema id assignment from the create_table methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LMK if this makes sense or if im missing something!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kevinjqliu thank you for the review! Yes, I agree that the code path would be simpler if we didn't expose assign_fresh_ids as a parameter for the API. However, I think there were some concerns that were raised in not surfacing that as an argument and having two code paths based strictly on the input parameter. #1284

I will add this to the agenda for the PyIceberg Sync on Tuesday and see if we that will help the community in reaching a consensus.

@Fokko Fokko self-requested a review November 26, 2024 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants