Introduce `assign_fresh_ids` flag and allow skipping fresh assignment of IDs on Table creation #1304

sungwy · 2024-11-08T05:28:13Z

Implements: #1284

Fokko

Sorry for the delay, @sungwy. This looks good; I left two small comments. Thanks for adding all the tests 👍

Fokko · 2024-11-20T19:12:08Z

mkdocs/docs/api.md

@@ -122,32 +122,13 @@ schema = Schema(
    ),
 )

-from pyiceberg.partitioning import PartitionSpec, PartitionField


Love it, thanks for cleaning this up!

pyiceberg/catalog/__init__.py

kevinjqliu

Finally got a chance to look over this, sorry for the delay.

kevinjqliu · 2024-11-23T19:45:37Z

pyiceberg/table/metadata.py

+    if assign_fresh_ids:
+        fresh_schema = assign_fresh_schema_ids(schema)
+        partition_spec = assign_fresh_partition_spec_ids(partition_spec, schema, fresh_schema)
+        sort_order = assign_fresh_sort_order_ids(sort_order, schema, fresh_schema)
+        schema = fresh_schema
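
For readers unfamiliar with the branch above, here is a minimal sketch of what fresh ID assignment amounts to. It does not use PyIceberg: `ToyField`, `ToyPartitionField`, `assign_fresh_ids_sketch`, and `new_table_metadata_sketch` are hypothetical stand-ins, and the 1..N numbering is an assumption made for illustration. The point is only that references into the schema (such as a partition field's source ID) have to be remapped together with the schema itself.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for a schema field (not PyIceberg's NestedField)."""
    field_id: int
    name: str


@dataclass(frozen=True)
class ToyPartitionField:
    """Hypothetical stand-in for a partition field that refers to a schema field by ID."""
    source_id: int
    name: str


def assign_fresh_ids_sketch(
    fields: List[ToyField], partition_fields: List[ToyPartitionField]
) -> Tuple[List[ToyField], List[ToyPartitionField]]:
    # Renumber fields 1..N in declaration order and remember old -> new IDs.
    id_map = {f.field_id: new_id for new_id, f in enumerate(fields, start=1)}
    fresh_fields = [replace(f, field_id=id_map[f.field_id]) for f in fields]
    # Anything that points at a field by ID must follow the renumbering.
    fresh_partition_fields = [
        replace(p, source_id=id_map[p.source_id]) for p in partition_fields
    ]
    return fresh_fields, fresh_partition_fields


def new_table_metadata_sketch(
    fields: List[ToyField],
    partition_fields: List[ToyPartitionField],
    assign_fresh_ids: bool = True,
) -> Tuple[List[ToyField], List[ToyPartitionField]]:
    # Mirrors the shape of the diff: renumber only when the flag is set.
    if assign_fresh_ids:
        fields, partition_fields = assign_fresh_ids_sketch(fields, partition_fields)
    return fields, partition_fields


fields = [ToyField(10, "id"), ToyField(20, "ts")]
spec = [ToyPartitionField(20, "ts_day")]

fresh_fields, fresh_spec = new_table_metadata_sketch(fields, spec)
kept_fields, kept_spec = new_table_metadata_sketch(fields, spec, assign_fresh_ids=False)
```

With the flag left at its default, the toy fields are renumbered and the partition field follows them; with `assign_fresh_ids=False`, the inputs pass through unchanged.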


This is where `assign_fresh_ids` is ultimately used, and this function (`new_table_metadata`) is called by `_create_staged_table` and `create_table`.

Both functions take `schema: Union[Schema, "pa.Schema"]` as input.

If a `pa.Schema` is given, we want to convert it and assign IDs (this is currently done by setting the `assign_fresh_ids` flag to `True`).

If a `Schema` is given, the current default is to assign fresh schema IDs: `assign_fresh_ids: bool = True`.

My proposal is to not include `assign_fresh_ids` as a flag in functions other than `new_table_metadata`.
So when `_create_staged_table` and `create_table` are given:

- a `pa.Schema`: convert it to a `Schema` and set `assign_fresh_ids` to `True` in `new_table_metadata`.
- a `Schema`: assume the user created the `Schema` with the correct IDs (possibly verifying some correctness characteristics, such as uniqueness) and use the schema as is.

If a user wants to reassign IDs for a `Schema`, this can be done outside the `create_table` functions, and we can even provide a helper function to do so.
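
The proposal above could be sketched roughly as follows. This is an illustration under stated assumptions, not PyIceberg code: `ToySchema`, `ToyArrowSchema`, `resolve_schema`, `check_unique_ids`, and `reassign_ids` are all hypothetical names, and the uniqueness check is just one example of the "correctness characteristics" mentioned.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class ToySchema:
    """Hypothetical stand-in for an Iceberg Schema: (field_id, name) pairs."""
    fields: Tuple[Tuple[int, str], ...]


@dataclass(frozen=True)
class ToyArrowSchema:
    """Hypothetical stand-in for pa.Schema: column names only, no field IDs."""
    names: Tuple[str, ...]


def reassign_ids(schema: ToySchema) -> ToySchema:
    """Helper a user could call explicitly, outside of table creation."""
    return ToySchema(
        tuple((new_id, name) for new_id, (_, name) in enumerate(schema.fields, start=1))
    )


def check_unique_ids(schema: ToySchema) -> None:
    """Reject schemas in which two fields share the same ID."""
    ids = [field_id for field_id, _ in schema.fields]
    if len(ids) != len(set(ids)):
        raise ValueError(f"Duplicate field IDs in schema: {ids}")


def resolve_schema(schema: Union[ToySchema, ToyArrowSchema]) -> ToySchema:
    """Pick the code path from the input type instead of from a flag."""
    if isinstance(schema, ToyArrowSchema):
        # There are no IDs to preserve, so fresh ones are always assigned.
        return ToySchema(
            tuple((new_id, name) for new_id, name in enumerate(schema.names, start=1))
        )
    # The caller supplied IDs: verify them and use the schema as is.
    check_unique_ids(schema)
    return schema


from_arrow = resolve_schema(ToyArrowSchema(("id", "ts")))
as_given = resolve_schema(ToySchema(((5, "id"), (9, "ts"))))
renumbered = reassign_ids(ToySchema(((5, "id"), (9, "ts"))))
```

In this shape the table-creation path never renumbers a user-supplied schema; a caller who wants fresh IDs invokes the helper first and passes in the result.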

I feel this approach helps separate the responsibility of schema ID assignment from the `create_table` methods.

LMK if this makes sense or if I'm missing something!

Hi @kevinjqliu, thank you for the review! Yes, I agree that the code path would be simpler if we didn't expose `assign_fresh_ids` as a parameter in the API. However, I think some concerns were raised about not surfacing it as an argument and instead having two code paths based strictly on the input parameter (see #1284).

I will add this to the agenda for the PyIceberg Sync on Tuesday and see if that will help the community reach a consensus.

sungwy added 2 commits November 8, 2024 05:27

assign_fresh_ids

5bf80a2

add all tests

983eeb6

sungwy marked this pull request as ready for review November 8, 2024 22:50

sungwy changed the title ~~Introduce assign_fresh_ids and allow skipping fresh assignment of IDs on table creation~~ Introduce assign_fresh_ids flag and allow skipping fresh assignment of IDs on Table creation Nov 8, 2024

sungwy requested review from Fokko, HonahX and kevinjqliu and removed request for HonahX November 13, 2024 17:02

sungwy mentioned this pull request Nov 13, 2024

Enhance catalog.create_table API to enable creation of table with matching field_ids to provided Schema #1284

Open

Fokko reviewed Nov 20, 2024


kevinjqliu reviewed Nov 23, 2024


Fokko self-requested a review November 26, 2024 17:21

sungwy Nov 23, 2024
