Implement column projection #1443

gabeiglio · 2024-12-18T20:26:02Z

This is a fix for issue #1401, in which table scans need to infer partition columns by following the column projection rules.

Fixes #1401


kevinjqliu

Added a few comments, please take a look! The PR looks great already. Thanks for working on this!


Fokko · 2024-12-29T06:23:27Z

pyiceberg/io/pyarrow.py

+    file: DataFile,
+    projected_schema: Schema,
+    projected_field_ids: Set[int],
+    file_project_schema: Schema,


Since we only need the IDs, I'd rather just pass in those for clarity:

Suggested change

-    file_project_schema: Schema,
+    file_project_schema: Set[int],


Fokko · 2024-12-29T07:48:05Z

pyiceberg/io/pyarrow.py

+        if partition_spec is not None:
+            for partition_field in partition_spec.fields_by_source_id(field_id):
+                if isinstance(partition_field.transform, IdentityTransform) and partition_field.name in file.partition.__dict__:
+                    projected_missing_fields[partition_field.name] = file.partition.__dict__[partition_field.name]


This is my mistake: the lookup should never be done by name, but by field ID. I added the lookup by name to the record, but it should not be used outside of tests.

Instead, we want to build a lookup table keyed by field ID:

iceberg-python/pyiceberg/schema.py

Lines 1224 to 1233 in e646500

def build_position_accessors(schema_or_type: Union[Schema, IcebergType]) -> Dict[int, Accessor]:
    """Generate an index of field IDs to schema position accessors.

    Args:
        schema_or_type (Union[Schema, IcebergType]): A schema or type to index.

    Returns:
        Dict[int, Accessor]: An index of field IDs to accessors.
    """
    return visit(schema_or_type, _BuildPositionAccessors())

It should be used something like this:

accessors = build_position_accessors(table_schema)
projected_missing_fields[partition_field.name] = record[accessors[partition_field.field_id]]
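To make the idea of a field-ID-to-position index concrete, here is a minimal, self-contained sketch. It does not use pyiceberg: `PositionAccessor` and `index_accessors_by_field_id` are hypothetical stand-ins invented for illustration, and the assumption is simply that an accessor remembers where a field sits inside a record.

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence


@dataclass(frozen=True)
class PositionAccessor:
    """Hypothetical stand-in for an accessor: it remembers one position in a record."""

    position: int

    def get(self, record: Sequence[Any]) -> Any:
        # Read the value stored at the remembered position.
        return record[self.position]


def index_accessors_by_field_id(field_ids: Sequence[int]) -> Dict[int, PositionAccessor]:
    """Map each field ID to an accessor for that field's position in the struct."""
    return {field_id: PositionAccessor(position) for position, field_id in enumerate(field_ids)}


# A struct with two fields whose IDs are 1 and 2, and a record laid out in the same order.
accessors = index_accessors_by_field_id([1, 2])
record = ("some-string", 42)

# The lookup goes through the field ID and never touches the field name.
value_for_field_2 = accessors[2].get(record)
```

The point of the sketch is only that renaming a field cannot break the lookup, because the key is the ID and the accessor carries the position.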

gabeiglio · Jan 2, 2025

Thanks for the input! That's right, the partition field shouldn't be accessed by name. But, if I'm understanding correctly, the position of a value in the manifest's partition record is not the same as the position of the field in the schema.

For example, let's say I have this schema:

schema {
  1 field1: string
  2 partition_field: int
}

And a DataFile like this:

DataFile {
  ...
  partition: Record(partition_field: 1)
  ...
}

Running build_position_accessors with the schema defined earlier will result in:

{
  1: Accessor(position=0, inner=None),
  2: Accessor(position=1, inner=None),
}

Then, doing accessor[partition_field.id] will result in Accessor(position=1, inner=None), but the value in the manifest's partition record is at position 0.
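The mismatch described above can be reproduced with plain Python. This is a sketch under stated assumptions, not the PR's implementation: `positions_by_field_id` and `lookup` are hypothetical helpers, and for simplicity the partition struct is keyed by the source column ID (2), whereas a real partition spec assigns its partition fields their own IDs. One possible way out, shown at the end, is to build the index from the partition struct rather than from the table schema.

```python
from typing import Any, Dict, Sequence


def positions_by_field_id(field_ids: Sequence[int]) -> Dict[int, int]:
    """Map each field ID to its position within the struct it belongs to."""
    return {field_id: position for position, field_id in enumerate(field_ids)}


def lookup(record: Sequence[Any], positions: Dict[int, int], field_id: int) -> Any:
    """Return the record value for field_id, or None if the position is out of range."""
    position = positions[field_id]
    return record[position] if position < len(record) else None


table_schema_ids = [1, 2]  # field1 (id 1), partition_field (id 2)
partition_record = (1,)    # Record(partition_field: 1), a single value at position 0

# Index built from the table schema: id 2 maps to position 1,
# which does not exist in the one-element partition record.
from_table_schema = positions_by_field_id(table_schema_ids)
wrong = lookup(partition_record, from_table_schema, 2)

# Index built from the partition struct (assumed here to hold only id 2):
# id 2 maps to position 0, which is where the value actually lives.
from_partition_struct = positions_by_field_id([2])
right = lookup(partition_record, from_partition_struct, 2)
```

With the table schema as the source of positions the lookup misses the value; with the partition struct as the source it finds it.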

Gabriel Igliozzi and others added 3 commits December 18, 2024 12:01

Initial commit for fix

f814ee1

Add test and commit lint changes

cf36660

Merge branch 'apache:main' into specPartitionIdentity

2fb6a16

Fokko self-requested a review December 18, 2024 20:38

Gabriel Igliozzi added 3 commits December 19, 2024 00:19

default-value bug fixes and adding more tests

7982465

Add continue, check file_schema before using it, group steps of projection together

e4d5882

Fix lint issues, reorder partition spec to be of higher importance than initial-default

694a52d

gabeiglio marked this pull request as ready for review December 19, 2024 15:12

kevinjqliu reviewed Dec 19, 2024


Fokko reviewed Dec 20, 2024


Fokko reviewed Dec 20, 2024


kevinjqliu self-requested a review December 23, 2024 19:04

Removed file_schema check and initial-default logic, separated projection logic to helper method, changed test to use high-level table scan

fee24ab

Fokko reviewed Dec 29, 2024


Fokko reviewed Dec 29, 2024


Fokko reviewed Dec 29, 2024


gabeiglio Jan 2, 2025
