-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a FileIO implementation for WASB #360
Conversation
private static final String WASB_SCHEME = "wasb"; | ||
private static final String ABFS_SCHEME = "abfs"; | ||
|
||
public WasbTranslatingFileIO(FileIO io) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it looks like the current structure of FileIOFactory is forcing this to wrap all FileIOs including non-Azure-related ones, we might as well make the class itself more generalized and we could move the Azure-specificity into config.
What if we just called this thing SchemeTranslatingFileIO
that takes a map of source schemes to destination schemes? And get the map through config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see you also have the endpoint rewrite from blob
to dfs
which wouldn't fit into that. I agree we might not want to go down the road of overly general regexes either.
Per apache/iceberg#10127 (comment) the underlying Azure SDK doesn't actually seem to actually care about the endpoint and will internally sort into both a dfs
and a blob
client, so in theory a scheme-only replacement should still work, but I guess it could be more fragile.
@collado-mike What do you think? Should we just keep this very wasb-specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like the current structure of FileIOFactory is forcing this to wrap all FileIOs including non-Azure-related ones
It's doing a little more than just scheme translating, and this is only true if Polaris is actually configured by an admin to use this factory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I were designing this, I would have a TransformingFileIO
with a constructor like
public TransformingFileIO(Function<String, String> pathTransformer, FileIOFactory delegate) {
}
and the azure class would extend that class and pass in a function that did the wasb->abfs conversion. But that's the kind of code refactoring that can be done later without any compatibility or behavior impact, so... 🤷🏽♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this idea and actually @dennishuo and I had discussed something similar.
My concern for the time being is that it's unclear how we can pass in that pathTransformer
via the YAML config file, and that I don't necessarily want to put designing a syntax for arbitrary string transformation onto the critical path for making WASB work.
Agreed that this makes a lot of sense as something to design and follow up with.
polaris-service/src/main/java/org/apache/polaris/service/catalog/io/WasbTranslatingFileIO.java
Outdated
Show resolved
Hide resolved
@JsonTypeName("wasb") | ||
public class WasbTranslatingFileIOFactory implements FileIOFactory { | ||
@Override | ||
public FileIO loadFileIO(String ioImpl, Map<String, String> properties) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point it's starting to seem that we need a way to do a chain of delegation. Both this class (or the suggested generalized SchemeTranslatingFileIOFactory
) as well as MeasuredFileIOFactory
fundamentally delegators, so maybe we should've entrenched an interface for delegation instead of just loading from String ioImpl
.
If we make the delegation behavior configured as construction-time params or injection then we don't have to change the method signatures or callsites either. Basically like this:
mainFileIOFactory = new SchemeTranslatingFileIOFactory(
new MeasuredFileIOFactory(new DefaultFileIOFactory()),
Map.of("wasb", "abfs", "wasbs", "abfss"));
And then we get rid of the hard-coded CatalogUtil.loadFileIO
in both SchemeTranslatingFileIOFactory
and MeasuredFileIOFactory
:
@Override
public FileIO loadFileIO(String ioImpl, Map<String, String> properties) {
return new SchemeTranslatingFileIO(this.delegateFactory.loadFileIO(ioImpl, properties);
}
The drawback I guess is the config for setting the delegate factory is specific to each outer delegator factory:
io:
factoryType: scheme-translating
schemeTranslatingDelegateType: measured
measuredFactoryDelegateType: default
But this might be better than forcing delegation through a single config syntax since delegation might not be consistent for all delegator types. For example you might have something like:
multiplexingIoFactoryDelegateTypes: r2,s3n,viewfs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how it could be a bigger undertaking to properly entrench how we want to convey such a delegation chain though, so I don't feel too strongly about whether to tackle this in this PR or in a later followup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can avoid working on this now. A production use case will require things like metering, throttling, as well as azure support. As is, the configuration only supports one IO factory, meaning you couldn't get all three without writing a new class that does all three.
I think the configuration would work fine as
io:
factoryType: chain
delegates:
- measured
- wasb
- default
I think dropwizard should support detecting the generic type argument in something like
public void setDelegates(List<FileIOFactory> delegates);
so the above configuration should work out of the box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't work exactly out of the box, since FileIOFactory doesn't have a way to wrap a FileIO today. So it's not clear how the chain
implementation would actually chain them. In this feature branch I toyed with adding a new interface called SupportsFileIOWrapping
for this purpose, but I don't know that this is necessary just to get WASB working E2E
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine, but I think we have to support multiple delegating FileIOFactory
s.
@JsonTypeName("wasb") | ||
public class WasbTranslatingFileIOFactory implements FileIOFactory { | ||
@Override | ||
public FileIO loadFileIO(String ioImpl, Map<String, String> properties) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can avoid working on this now. A production use case will require things like metering, throttling, as well as azure support. As is, the configuration only supports one IO factory, meaning you couldn't get all three without writing a new class that does all three.
I think the configuration would work fine as
io:
factoryType: chain
delegates:
- measured
- wasb
- default
I think dropwizard should support detecting the generic type argument in something like
public void setDelegates(List<FileIOFactory> delegates);
so the above configuration should work out of the box.
private static final String WASB_SCHEME = "wasb"; | ||
private static final String ABFS_SCHEME = "abfs"; | ||
|
||
public WasbTranslatingFileIO(FileIO io) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I were designing this, I would have a TransformingFileIO
with a constructor like
public TransformingFileIO(Function<String, String> pathTransformer, FileIOFactory delegate) {
}
and the azure class would extend that class and pass in a function that did the wasb->abfs conversion. But that's the kind of code refactoring that can be done later without any compatibility or behavior impact, so... 🤷🏽♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed adding the chain of factories in a separate PR
Description
In #344, checks within Polaris were relaxed so that Azure WASB paths can be used to e.g. create a catalog, and so that WASB paths and ABFS paths that referred to the same object could be compared. This PR introduces a FileIO implementation that Polaris can use to interact with entities that use such a WASB path. The FileIO implementation simply rewrites WASB paths into ABFS paths, and delegations interactions with the ABFS paths to the Iceberg ADLSFileIO.
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
To use the new FileIO easily in Polaris, you can change the
factoryType
in the server config towasb
.With this change, the table creation API succeeds:
Without this FileIO, it will fail with an error such as:
Checklist:
Please delete options that are not relevant.