Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why shouldn't we return an UnboundPartitionSpec instead? #694

Closed
Tracked by #739
Xuanwo opened this issue Nov 11, 2024 · 13 comments · Fixed by #771
Closed
Tracked by #739

Why shouldn't we return an UnboundPartitionSpec instead? #694

Xuanwo opened this issue Nov 11, 2024 · 13 comments · Fixed by #771

Comments

@Xuanwo
Copy link
Member

Xuanwo commented Nov 11, 2024

          I'm not convinced that we need another type, such as `SchemalessPartitionSpec`. Why shouldn't we return an `UnboundPartitionSpec` instead? Looking at the signatures:
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}
pub struct SchemalessPartitionSpec {
    /// Identifier for PartitionSpec
    spec_id: i32,
    /// Details of the partition spec
    fields: Vec<PartitionField>,
}

And the PartitionField and UnboundPartitionField are basically the same.

Introducing SchemalessPartitionSpec might be our way to avoid apache/iceberg#4563.

If we cannot find the field anymore, the best thing to do is to include the file in the query plan.

Originally posted by @Fokko in #645 (comment)

@Xuanwo
Copy link
Member Author

Xuanwo commented Nov 11, 2024

Hi, @Fokko, I created an issue to make it easier for us to continue the discussion.

Also cc @liurenjie1024 and @c-thiel.

@Fokko
Copy link
Contributor

Fokko commented Nov 11, 2024

@Xuanwo Thanks, appreciate it. I'm also preparing a PR for Java that illustrates the solution more clearly.

@liurenjie1024
Copy link
Contributor

I asked same question, and here is the answer from @c-thiel :
#645 (comment)

And the reason I this is reasonable is here: #645 (comment)

@liurenjie1024
Copy link
Contributor

Here is the reason why I accept it:

SchemalessPartitionSpec is used when we load table metadata and build partition spec from table metadata. Since there is no schema id associated with partition spec, in java it's built with bindUncheck with current schema in current snapshot. That's to say, java implementation assumes that the partiton specs are valid with current schema without checking. I'm skeptical that it's a safe approach.

I agree with @c-thiel that it's not appropriate to use UnboundPartitionSpec since UnboundPartitionSpec means not validated, user constructed partition spec, which is not the case for partiton specs from table metadata.

@c-thiel
Copy link
Collaborator

c-thiel commented Nov 13, 2024

I think I have most of my motivation in #645 (comment) and the comments referenced above. I follow the argument from Renjie.
For me UnboundPartitionSpec has a different semantic than a previously bound spec, manifested also in the Optional Field ID. Using UnboundPartitionSpec in TableMetadata would introduce lines of error handling in other places, where we expect the FieldId to be present. As field_id is a required field in the spec for TableMetadata, we would have to use SchemalessPartitionSpec at least to deserialize the json before transforming it to UnboundPartitionSpec.

@c-thiel
Copy link
Collaborator

c-thiel commented Nov 16, 2024

@liurenjie1024, @Fokko, @Xuanwo we should probably come to a conclusion here before the next release.
@Fokko is there already something in your Java PR which we can peek at? What do you have in mind for a solution?

@Fokko
Copy link
Contributor

Fokko commented Nov 16, 2024

Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.

Regarding #645 (comment) there are some incorrect assumptions there:

If we agree on this, then we have a small Problem with TableMetadata: It contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important field_id Optional.

I don't understand this statement. If I look at the UnboundPartitionSpec:

#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionField {
    /// A source column id from the table’s schema
    pub source_id: i32,
    /// A partition field id that is used to identify a partition field and is unique within a partition spec.
    /// In v2 table metadata, it is unique across all partition specs.
    #[builder(default, setter(strip_option))]
    pub field_id: Option<i32>,
    /// A partition name.
    pub name: String,
    /// A transform that is applied to the source column to produce a partition value.
    pub transform: Transform,
}

/// Unbound partition spec can be built without a schema and later bound to a schema.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}

A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.

So going from the problem space to the solution space. I was looking at the Java code and the PyIceberg code, and at PyIceberg we took a slightly different approach that might be interesting for Iceberg-rust as well. As mentioned earlier, the source ID might evolve (assume compatible ones int -> long, float -> double, etc) over time. The most obvious example is the identity-partition where the sourceType is equal to the resultType. Therefore we allow binding with different schema's in PyIceberg: https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/partitioning.py#L203-L225 I think that might be a better approach for Rust as well, as this is now hard-coded:

#[derive(Debug, PartialEq, Eq, Clone)]
pub struct BoundPartitionSpec {
/// Identifier for PartitionSpec
spec_id: i32,
/// Details of the partition spec
fields: Vec<PartitionField>,
/// The schema this partition spec is bound to
schema: SchemaRef,
/// Type of the partition spec
partition_type: StructType,
}

When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored. The downside is that with strongly typed languages, we need to know the type upfront, and otherwise we only know it when we read the actual Avro file (that has the schema with the type). Therefore I'm playing around with a solution for this on the Java side: apache/iceberg#11542

I hope this helps!

@c-thiel
Copy link
Collaborator

c-thiel commented Nov 16, 2024

Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.

They are not quite - a previously bound schema has a required field_id (reference to PartitionField), while unbound has an optional field_id (Option<i32>) as it references UnboundPartitionField. TableMetadata.partition_specs requires field_id, so we shouldn't use UnboundPartitionSpec there.

A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.

True, that was too strict. What I should have expressed it that it is not guaranteed to be compatible with the current schema, which is why Java uses bindUnchecked().
I think with the current implementation we are quite close to what you propose - binding a SchemalessPartitionSpec or UnboundPartitionSpec to a new schema is easily possible by using spec.to_unbound().bind(new_schema). Currently this binding can fail. What is missing is the notion that you described with

When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored

This would require a more permissive bind method, or an additional flag, that would ignore unknown fields, similar to what you try to incorporate into Java with Any. The important point is that dynamic re-binding is already possible today.

The BoundPartitionSpec serves two additional purposes - it is a way to compute the binding and store the result typesafe. This is useful for

  1. Pre-computing the bind, as we currently do for default_spec against the current_schema - which is the most important combination. I don't see why we shouldn't store the resulting type as we need to assert compatibility anyway for this combination.
  2. Have the means to get the source field names for a partition specs. This is useful when a spec is transferred from one context to another. In Java this is used for addPartitionSpec which eventually runs this code: https://github.com/apache/iceberg/blob/acd7cc1126b192ccb53ad8198bda37e983aa4c6c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L768. Note that currently we don't have this semantic in the builder, which in java performs name-mapping to potentially change the source_id. Instead we pass an UnboundSpec which trusts the provides source_id and not the field name.

Summing it up, my approach would be to keep all three types - I simply don't see a good way around them.
However, we should add a more permissive bind method that incorporates better the semantic of binding to an incompatible schema. This would be an alternative to the bindUnchecked() in Java. Maybe we can just follow the PR you are working on in Java once that's ready. This method would not necessarily be able to determine a type for each field in a spec.

@c-thiel
Copy link
Collaborator

c-thiel commented Nov 30, 2024

@liurenjie1024, @Xuanwo, @Fokko may I ask for another round of Feedback for this?
I believe if we decide to not introduce a SchemalessPartitionSpec, we should do so before 0.4.0

@liurenjie1024
Copy link
Contributor

@liurenjie1024, @Xuanwo, @Fokko may I ask for another round of Feedback for this? I believe if we decide to not introduce a SchemalessPartitionSpec, we should do so before 0.4.0

If I understand correctly, one of the most important use case of SchemalessPartitionSpec is when loading table metadata, there are many historial partition spec, which have two important characteristics:

  1. They are not unbound partition spec, since they have been bound to some schema, not have to be current schema.
  2. Their partition type don't have to be computed immediately, since this maybe a time consuming process.

After discussios(incluiding in slack, github issues), I think I have a better understanding why java's buildUnchecked method maybe a reasonable solution:

  1. It resolves the second point I mentioned above.
  2. When we need to compute the partiton type of a partition spec, we verify it at that time. But from our discussion before, iceberg should guarantee that schema evolution to be always compatible to reachable partiton spec, e.g. it should be invalid to drop a field referenced by a reachable partiton spec.

So my suggestion would be as following:

  1. Remove SchemalessPartitonSpec
  2. Lazily compute partition type
  3. Add a crate private buildUnchecked method.

@c-thiel
Copy link
Collaborator

c-thiel commented Dec 3, 2024

@Fokko didn't you try to remove the buildUnchecked in Java? Did you succeed?

Add a crate private buildUnchecked method.

At least for Lakekeeper we would need buildUnchecked to be public. We are creating TableMetadata not from a big JSON blob, but instead from individual parts that we store in the Database in individual tables. This makes some operations around snapshots lightning fast, because we don't need to do a slow roundtrip to S3. We need the same fine-granular control publicly exposed that the MetadataBuilder is using internally as we would need to stay on a fork otherwise.

Apart from that, my preference would still be to introduce a lightweight additional type that clearly states what its used for than to add a buildUnchecked method with lazy accessors that might panic if whoever wrote the Metadata file did not implement the checks accordingly - this includes all java and pyiceberg versions that are currently out.

With SchemalessPartitionSpec we already have lazy field types, as we can just .bind() it at a later point in time.

@Fokko
Copy link
Contributor

Fokko commented Dec 3, 2024

I've created apache/iceberg#11604 to reflect my idea on the Java side. When you have PartitionSpec that references a field that has been dropped, then we can also drop it from the PartitionSpec, because we will never use it. Another option I'm noodling about, is replacing the type with the newly in V3 introduced UnknownType.

@Fokko didn't you try to remove the buildUnchecked in Java? Did you succeed?

If we the PR I mentioned above in, we can remove the buildUnchecked. What I think is even better is to defer the binding to a later point like we did in PyIceberg. This is performance wise probably also better, because you don't know yet which partition spec's you're going to encounter. I checked what the current situation is in PyIceberg, and there we need one similar check: apache/iceberg-python#1393. The Avro reader will just skip the partition field that references the dropped field.

Edit, in Java we're also in the process of decoupling the Transform from the source type. But since this is part of the API module the deprecation cycle is very long :)

@liurenjie1024
Copy link
Contributor

I think pyiceberg's approach of decoupling schema from partition spec is reasonable to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants