-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why shouldn't we return an UnboundPartitionSpec
instead?
#694
Comments
Hi, @Fokko, I created an issue to make it easier for us to continue the discussion. Also cc @liurenjie1024 and @c-thiel. |
@Xuanwo Thanks, appreciate it. I'm also preparing a PR for Java that illustrates the solution more clearly. |
I asked same question, and here is the answer from @c-thiel : And the reason I this is reasonable is here: #645 (comment) |
Here is the reason why I accept it:
I agree with @c-thiel that it's not appropriate to use |
I think I have most of my motivation in #645 (comment) and the comments referenced above. I follow the argument from Renjie. |
@liurenjie1024, @Fokko, @Xuanwo we should probably come to a conclusion here before the next release. |
Unbound (not bound to a schema) and schemaless are the same, so I find them confusing. Regarding #645 (comment) there are some incorrect assumptions there:
I don't understand this statement. If I look at the #[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionField {
/// A source column id from the table’s schema
pub source_id: i32,
/// A partition field id that is used to identify a partition field and is unique within a partition spec.
/// In v2 table metadata, it is unique across all partition specs.
#[builder(default, setter(strip_option))]
pub field_id: Option<i32>,
/// A partition name.
pub name: String,
/// A transform that is applied to the source column to produce a partition value.
pub transform: Transform,
}
/// Unbound partition spec can be built without a schema and later bound to a schema.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionSpec {
/// Identifier for PartitionSpec
pub(crate) spec_id: Option<i32>,
/// Details of the partition spec
pub(crate) fields: Vec<UnboundPartitionField>,
}
This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id. So going from the problem space to the solution space. I was looking at the Java code and the PyIceberg code, and at PyIceberg we took a slightly different approach that might be interesting for Iceberg-rust as well. As mentioned earlier, the source ID might evolve (assume compatible ones iceberg-rust/crates/iceberg/src/spec/partition.rs Lines 60 to 70 in 5034519
When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored. The downside is that with strongly typed languages, we need to know the type upfront, and otherwise we only know it when we read the actual Avro file (that has the schema with the type). Therefore I'm playing around with a solution for this on the Java side: apache/iceberg#11542 I hope this helps! |
They are not quite - a previously bound schema has a required
True, that was too strict. What I should have expressed it that it is not guaranteed to be compatible with the current schema, which is why Java uses
This would require a more permissive The
Summing it up, my approach would be to keep all three types - I simply don't see a good way around them. |
@liurenjie1024, @Xuanwo, @Fokko may I ask for another round of Feedback for this? |
If I understand correctly, one of the most important use case of
After discussios(incluiding in slack, github issues), I think I have a better understanding why java's
So my suggestion would be as following:
|
@Fokko didn't you try to remove the
At least for Apart from that, my preference would still be to introduce a lightweight additional type that clearly states what its used for than to add a With |
I've created apache/iceberg#11604 to reflect my idea on the Java side. When you have
If we the PR I mentioned above in, we can remove the Edit, in Java we're also in the process of decoupling the |
I think pyiceberg's approach of decoupling schema from partition spec is reasonable to me. |
And the
PartitionField
andUnboundPartitionField
are basically the same.If we cannot find the field anymore, the best thing to do is to include the file in the query plan.
Originally posted by @Fokko in #645 (comment)
The text was updated successfully, but these errors were encountered: