-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix error with multiple nested partition columns on Iceberg #24629
base: master
Are you sure you want to change the base?
Fix error with multiple nested partition columns on Iceberg #24629
Conversation
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: jinyang_li.
|
5269716
to
98c7606
Compare
98c7606
to
612c082
Compare
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: jinyang_li.
|
612c082
to
88a4559
Compare
dea4ec4
to
835d669
Compare
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergSystemTables.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergSystemTables.java
Outdated
Show resolved
Hide resolved
835d669
to
b391dba
Compare
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergSystemTables.java
Outdated
Show resolved
Hide resolved
b391dba
to
2986072
Compare
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergSystemTables.java
Outdated
Show resolved
Hide resolved
2986072
to
f2ccfcb
Compare
f2ccfcb
to
0afc91f
Compare
I've applied some fixes, ptal @ebyhr @raunaqmorarka @chenjian2664 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the root problem here is my original code assumed the fields from the iceberg spec would be returned in field ID order and that not necessarily true. The fix is to build the data paths, sort by field ID, and then build the final paths? Maybe we should mention the faulty assumption and the fix in the commit message.
I had some comments about adding more comments but the rest looks good to me.
@@ -92,16 +113,14 @@ private static boolean buildDataPaths(Set<Integer> partitionFieldIds, Types.Stru | |||
currentPaths.addLast(fieldOrdinal); | |||
org.apache.iceberg.types.Type type = field.type(); | |||
if (type instanceof Types.StructType nestedStruct) { | |||
hasPartitionFields = buildDataPaths(partitionFieldIds, nestedStruct, currentPaths, dataPaths) || hasPartitionFields; | |||
buildDataPaths(partitionFieldIds, nestedStruct, currentPaths, dataPaths); | |||
} | |||
// Map and List types are not supported in partitioning | |||
if (type.isPrimitiveType() && partitionFieldIds.contains(fieldId)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be else if
? I assume structure is mutually exclusive of these cases. If so, I think it makes more sense to be explicit. If not, we should explain what the consequences of that is.
for (Types.NestedField field : spec.schema().asStruct().fields()) { | ||
// Partition fields can only be nested in a struct | ||
if (field.type() instanceof Types.StructType nestedStruct) { | ||
if (buildDataPaths(partitionFieldIds, nestedStruct, new ArrayDeque<>(List.of(channel)), fieldInfo)) { | ||
channel++; | ||
} | ||
buildDataPaths(partitionFieldIds, nestedStruct, new ArrayDeque<>(ImmutableList.of(field.fieldId())), fieldInfo); | ||
} | ||
else if (field.type().isPrimitiveType() && partitionFieldIds.contains(field.fieldId())) { | ||
fieldInfo.put(field.fieldId(), ImmutableList.of(channel)); | ||
channel++; | ||
fieldInfo.put(field.fieldId(), ImmutableList.of(field.fieldId())); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it is just one step unrolled of the inner build data paths call, but I'm pretty sure it is not. I believe the difference is this out call is using field id, but he inner code is using ordinal position. This is preexisting, but maybe add a comment mentioning this, since the code now looks so similar.
} | ||
} | ||
return fieldInfo; | ||
|
||
// assign channel for top level fields based on the order of the field id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this step is doing more than the comment lets on. The paths are being rewritten from fieldId.structOrdinalX.structOrdinalY
to channel.structOrdinalX.structOrdinalY
, where channels are assigned based on the numeric ordering of the field ID. So I would mention this rewrites the path, changing the root path from field ID to sequential channel number (maybe phrase this better with help from copilot).
@@ -64,26 +66,45 @@ private static Map<Integer, List<Integer>> buildDataPaths(PartitionSpec spec) | |||
{ | |||
Set<Integer> partitionFieldIds = spec.fields().stream().map(PartitionField::sourceId).collect(toImmutableSet()); | |||
|
|||
int channel = 0; | |||
Map<Integer, List<Integer>> fieldInfo = new HashMap<>(); | |||
for (Types.NestedField field : spec.schema().asStruct().fields()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the change again, it seems that the sole purpose of the change is to ensure that channels are allocated in field ID order. If so, can we fix this by sorting the fields before this loop. I'm thinking something like this:
for (Types.NestedField field : spec.schema().asStruct().fields()) { | |
List<Types.NestedField> fields = spec.schema().asStruct().fields().stream() | |
.sorted(Comparator.comparingInt(Types.NestedField::fieldId)) | |
.collect(toImmutableList()); | |
for (Types.NestedField field : fields) { |
Description
Assign channel number based on the order of field ID.
Fixes #24628
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: