Support Location Providers #1452

Open
wants to merge 9 commits into
base: main
from

smaheshwar-pltr
Contributor

@smaheshwar-pltr smaheshwar-pltr commented Dec 20, 2024

Closes #861.

As the issue suggests, introduces a LocationProvider interface with the default and object-store-optimised implementations (the latter can be enabled via the newly-introduced table properties). This is pluggable, just like FileIO.
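For readers unfamiliar with the idea, a minimal sketch of what such an interface could look like follows. It is not the code in this PR: `SketchDefaultLocationProvider` is a hypothetical name, and a plain `partition_path` string stands in for the `PartitionKey` argument that the real `new_data_location` accepts.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class LocationProvider(ABC):
    """Decides where newly written data files should live (illustrative sketch)."""

    def __init__(self, table_location: str, table_properties: Dict[str, str]) -> None:
        # Drop a trailing slash so joined paths never contain "//".
        self.table_location = table_location.rstrip("/")
        self.table_properties = table_properties

    @abstractmethod
    def new_data_location(self, data_file_name: str, partition_path: Optional[str] = None) -> str:
        """Return the full location for a new data file."""


class SketchDefaultLocationProvider(LocationProvider):
    """Hypothetical default: files go under <table_location>/data."""

    def new_data_location(self, data_file_name: str, partition_path: Optional[str] = None) -> str:
        prefix = f"{self.table_location}/data"
        if partition_path:
            return f"{prefix}/{partition_path}/{data_file_name}"
        return f"{prefix}/{data_file_name}"
```

A custom provider would subclass `LocationProvider` in the same way and be selected through a table property.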

Largely inspired by and consistent with the Java implementation.

@smaheshwar-pltr smaheshwar-pltr changed the title WIP: Support LocationProviders WIP: Support LocationProviders Dec 20, 2024
@smaheshwar-pltr smaheshwar-pltr changed the title WIP: Support LocationProviders WIP: Support Location Providers Dec 20, 2024
Comment on lines +1671 to +1673
module_name, class_name = ".".join(path_parts[:-1]), path_parts[-1]
module = importlib.import_module(module_name)
class_ = getattr(module, class_name)
Contributor Author

Hmm, wonder if we should reduce duplication between this and file IO loading.
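One way the duplication could be reduced is a small shared helper that both loaders call. The sketch below is only an illustration built from the three lines quoted above; `load_class` is a hypothetical name, not an existing PyIceberg function.

```python
import importlib
from typing import Any


def load_class(full_path: str) -> Any:
    """Import and return the attribute named by a dotted path such as 'pkg.module.ClassName'."""
    path_parts = full_path.split(".")
    if len(path_parts) < 2:
        # A bare name has no module component, so there is nothing to import.
        raise ValueError(f"Expected a fully qualified name like 'module.ClassName', got: {full_path}")
    module_name, class_name = ".".join(path_parts[:-1]), path_parts[-1]
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

Each loader would then only need to handle its own constructor arguments and error messages.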

@@ -2622,13 +2631,15 @@ def _dataframe_to_data_files(
property_name=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
default=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT,
)
location_provider = load_location_provider(table_location=table_metadata.location, table_properties=table_metadata.properties)
Contributor Author

Don't love this. I wanted to do something like this and cache on at least the Transaction (which this method is exclusively invoked by) but the problem I think is that properties can change on the Transaction, potentially changing the location provider to be used. I suppose we can update that provider on a property change (or maybe any metadata change) but unsure if this complexity is even worth it.

Contributor

That's an interesting edge case. It seems like an anti-pattern to change a table property and write in the same transaction, although it's currently allowed.
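If caching were ever considered worth the complexity, one possible shape is to key the cached provider on a snapshot of the inputs, so a property change naturally produces a fresh provider. This is a sketch under that assumption only; `CachedProviderLookup` and its `loader` argument are hypothetical and not part of this PR.

```python
from typing import Any, Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, Tuple[Tuple[str, str], ...]]


class CachedProviderLookup:
    """Reuses a provider until the table location or properties change (illustrative only)."""

    def __init__(self, loader: Callable[[str, Dict[str, str]], Any]) -> None:
        self._loader = loader
        self._key: Optional[CacheKey] = None
        self._provider: Any = None

    def get(self, table_location: str, table_properties: Dict[str, str]) -> Any:
        # Sorting makes the key independent of dict insertion order.
        key = (table_location, tuple(sorted(table_properties.items())))
        if key != self._key:
            self._provider = self._loader(table_location, dict(table_properties))
            self._key = key
        return self._provider
```

Building the key on every call costs a sort over the properties, so this trades repeated class loading for a cheap comparison.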

@smaheshwar-pltr smaheshwar-pltr changed the title WIP: Support Location Providers Support Location Providers Dec 20, 2024
from pyiceberg.utils.properties import property_as_bool


class DefaultLocationProvider(LocationProvider):
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

The biggest difference vs the Java implementations is that I've not supported write.data.path here. I think it's natural for write.metadata.path to be supported alongside this so this would be a larger and arguably location-provider-independent change? Can look into it as a follow-up.

Contributor

Thanks! It would be great to have write.data.path and write.metadata.path supported.

@@ -192,6 +195,14 @@ class TableProperties:
WRITE_PARTITION_SUMMARY_LIMIT = "write.summary.partition-limit"
WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT = 0

WRITE_LOCATION_PROVIDER_IMPL = "write.location-provider.impl"
Contributor Author

Though the docs say that the default is null, having a constant for this being None felt unnecessary

return (
f"{prefix}/{hashed_path}/{data_file_name}"
if self._include_partition_paths
else f"{prefix}/{hashed_path}-{data_file_name}"
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

Interesting that disabling include_partition_paths affects paths of non-partitioned data files. I've matched Java behaviour here but it does feel odd.
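To make the oddity concrete, here is the quoted expression lifted into a standalone function. `object_store_file_path` is a hypothetical name, and the prefix and hash values in the usage lines are made up for illustration.

```python
def object_store_file_path(prefix: str, hashed_path: str, data_file_name: str, include_partition_paths: bool) -> str:
    """Mirror of the conditional above: the flag changes the separator before the file name."""
    if include_partition_paths:
        return f"{prefix}/{hashed_path}/{data_file_name}"
    return f"{prefix}/{hashed_path}-{data_file_name}"


# No partition is involved in either call, yet the resulting paths differ.
with_flag = object_store_file_path("s3://bucket/tbl/data", "0101/0110/1001/10110010", "file.parquet", True)
without_flag = object_store_file_path("s3://bucket/tbl/data", "0101/0110/1001/10110010", "file.parquet", False)
```

With the flag on, the file sits in a directory named after the last hash segment; with it off, that segment becomes a prefix of the file name.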

TableProperties.WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT,
)

def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

Tried to make this as consistent as possible with its Java counterpart so that file locations are consistent too. This means hashing on both the partition key and the data file name below, and using the same hash function.

It seemed reasonable to port over the object storage stuff in this PR, given that the original issue #861 mentions this.
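For context, the general shape of turning a hash into entropy directories can be sketched as below. Two loud assumptions: `zlib.crc32` is only a stand-in so the example runs on the standard library (it is not the hash function the comment refers to), and taking the low 20 bits is a guess at the bit selection. The three constants are the ones in this diff; `hashed_path_sketch` is a hypothetical name.

```python
import zlib

HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3


def hashed_path_sketch(identifier: str) -> str:
    """Turn an identifier into '<4 bits>/<4 bits>/<4 bits>/<remaining bits>'."""
    # Stand-in hash (assumption): any stable 32-bit hash illustrates the layout.
    hash_code = zlib.crc32(identifier.encode("utf-8"))
    # Keep HASH_BINARY_STRING_BITS bits as a zero-padded binary string.
    mask = (1 << HASH_BINARY_STRING_BITS) - 1
    binary = format(hash_code & mask, f"0{HASH_BINARY_STRING_BITS}b")
    # Split the leading bits into fixed-width directory names.
    total_entropy_bits = ENTROPY_DIR_LENGTH * ENTROPY_DIR_DEPTH
    parts = [binary[i : i + ENTROPY_DIR_LENGTH] for i in range(0, total_entropy_bits, ENTROPY_DIR_LENGTH)]
    # Whatever bits remain form the final segment.
    parts.append(binary[total_entropy_bits:])
    return "/".join(parts)
```

Hashing the partition path together with the file name, as described above, spreads files from the same partition across many prefixes, which is the point of the object-store layout.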

Contributor

Since Iceberg is mainly focussed on object-stores, I'm leaning towards making the ObjectStorageLocationProvider the default. Java is a great source of inspiration, but it also holds a lot of historical decisions that are not easy to change, so we should reconsider this at PyIceberg.

@smaheshwar-pltr smaheshwar-pltr marked this pull request as ready for review December 20, 2024 14:09
Comment on lines 98 to 100
# Field name is not encoded but partition value is - this differs from the Java implementation
# https://github.com/apache/iceberg/blob/cdf748e8e5537f13d861aa4c617a51f3e11dc97c/core/src/test/java/org/apache/iceberg/TestLocationProvider.java#L304
assert partition_segment == "part#field=example%23val"
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

Put up #1457 - I'll remove this special-character testing (that the Java test counterpart does) here because it'll be tested in that PR.

return f"custom_location_provider/{data_file_name}"


def test_default_location_provider() -> None:
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

The tests in this file are inspired by https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestLocationProvider.java.

The hash functions are the same so those constants are unchanged.

@smaheshwar-pltr
Contributor Author

@Fokko, think this is ready for review now!

I've implemented this for write codepaths - add_files seems like it should just add the files specified without transforming locations.

@@ -1627,6 +1632,67 @@ class AddFileTask:
partition_field_value: Record


class LocationProvider(ABC):
Contributor

I would also expect this one to be in location.py? The table/__init__.py is already pretty big

Contributor

@kevinjqliu kevinjqliu left a comment

Thanks for the PR! Generally LGTM; I left a few nit comments.

This matches the behavior of the Java implementation. However, if we're reusing the same property (write.location-provider.impl), then there's a conflict when a table is loaded in both Java and Python. I wonder if we should add a Python-specific property; otherwise a custom location provider will only work in one of the implementations and might error in the other.
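A sketch of how a Python-specific property might avoid that conflict: the Python side reads only its own key and ignores the shared one, so a Java class name stored under the shared key is never imported from Python. The key `write.py-location-provider.impl` and the function name are assumptions made for illustration, not settled names.

```python
from typing import Dict, Optional

# Shared key discussed above; its value is typically a Java class name.
SHARED_IMPL_KEY = "write.location-provider.impl"
# Hypothetical Python-only key (assumption for this sketch).
PY_IMPL_KEY = "write.py-location-provider.impl"


def python_provider_impl(table_properties: Dict[str, str]) -> Optional[str]:
    """Return the dotted path of a custom Python provider, or None to use the built-in ones."""
    # Deliberately do not fall back to SHARED_IMPL_KEY: a Java class cannot be loaded here.
    return table_properties.get(PY_IMPL_KEY)
```

Under this scheme a table can carry both keys, and each implementation picks up only the provider it is able to load.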

Comment on lines +36 to +38
HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3
Contributor

nit: move these into ObjectStoreLocationProvider

@kevinjqliu kevinjqliu self-requested a review January 2, 2025 20:32
Development

Successfully merging this pull request may close these issues.

Support LocationProviders like the Java Iceberg Reference Implementaiton
3 participants