Add a FileIO implementation for WASB #360
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one | ||
* or more contributor license agreements. See the NOTICE file | ||
* distributed with this work for additional information | ||
* regarding copyright ownership. The ASF licenses this file | ||
* to you under the Apache License, Version 2.0 (the | ||
* "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
package org.apache.polaris.service.catalog.io; | ||
|
||
import java.util.Map; | ||
import org.apache.iceberg.io.FileIO; | ||
import org.apache.iceberg.io.InputFile; | ||
import org.apache.iceberg.io.OutputFile; | ||
import org.apache.polaris.core.storage.StorageLocation; | ||
import org.apache.polaris.core.storage.azure.AzureLocation; | ||
|
||
/** | ||
* A {@link FileIO} implementation that translates WASB paths into ABFS paths and then delegates to | ||
* another underlying FileIO implementation | ||
*/ | ||
public class WasbTranslatingFileIO implements FileIO { | ||
private final FileIO io; | ||
|
||
private static final String WASB_SCHEME = "wasb"; | ||
private static final String ABFS_SCHEME = "abfs"; | ||
|
||
public WasbTranslatingFileIO(FileIO io) { | ||
this.io = io; | ||
} | ||
|
||
private static String translate(String path) { | ||
if (path == null) { | ||
return null; | ||
} else { | ||
StorageLocation storageLocation = StorageLocation.of(path); | ||
if (storageLocation instanceof AzureLocation azureLocation) { | ||
String scheme = azureLocation.getScheme(); | ||
if (scheme.startsWith(WASB_SCHEME)) { | ||
scheme = scheme.replaceFirst(WASB_SCHEME, ABFS_SCHEME); | ||
} | ||
return String.format( | ||
"%s://%s@%s.dfs.core.windows.net/%s", | ||
scheme, | ||
azureLocation.getContainer(), | ||
azureLocation.getStorageAccount(), | ||
azureLocation.getFilePath()); | ||
} else { | ||
return path; | ||
} | ||
} | ||
} | ||
|
||
@Override | ||
public InputFile newInputFile(String path) { | ||
return io.newInputFile(translate(path)); | ||
} | ||
|
||
@Override | ||
public OutputFile newOutputFile(String path) { | ||
return io.newOutputFile(translate(path)); | ||
} | ||
|
||
@Override | ||
public void deleteFile(String path) { | ||
io.deleteFile(translate(path)); | ||
} | ||
|
||
@Override | ||
public Map<String, String> properties() { | ||
return io.properties(); | ||
} | ||
|
||
@Override | ||
public void initialize(Map<String, String> properties) { | ||
io.initialize(properties); | ||
} | ||
|
||
@Override | ||
public void close() { | ||
io.close(); | ||
} | ||
} |
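To make the effect of `translate` concrete, here is a minimal, self-contained sketch of the same rewrite. It does not use Polaris's `StorageLocation` or `AzureLocation`; the regular expression and the class name `WasbPathSketch` are assumptions made for illustration, and the real parsing rules may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone approximation of WasbTranslatingFileIO.translate.
final class WasbPathSketch {
  // Assumed URI shape: scheme://container@account.(blob|dfs).core.windows.net/path
  private static final Pattern AZURE_URI =
      Pattern.compile(
          "^(wasbs?|abfss?)://([^/@]+)@([^/.]+)\\.(?:blob|dfs)\\.core\\.windows\\.net(?:/(.*))?$");

  static String translate(String path) {
    if (path == null) {
      return null;
    }
    Matcher m = AZURE_URI.matcher(path);
    if (!m.matches()) {
      // Not an Azure location (under the assumed shape): pass through untouched.
      return path;
    }
    String scheme = m.group(1);
    if (scheme.startsWith("wasb")) {
      // wasb -> abfs, wasbs -> abfss
      scheme = "abfs" + scheme.substring("wasb".length());
    }
    String filePath = m.group(4) == null ? "" : m.group(4);
    // As in the PR, the endpoint is always rewritten to the dfs host.
    return String.format(
        "%s://%s@%s.dfs.core.windows.net/%s", scheme, m.group(2), m.group(3), filePath);
  }

  public static void main(String[] args) {
    System.out.println(translate("wasbs://data@acct.blob.core.windows.net/warehouse/t1/v1.json"));
    System.out.println(translate("s3://bucket/key"));
  }
}
```

Under these assumptions, both the scheme and the `blob` endpoint change for WASB paths, while non-Azure paths are returned as-is.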
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one | ||
* or more contributor license agreements. See the NOTICE file | ||
* distributed with this work for additional information | ||
* regarding copyright ownership. The ASF licenses this file | ||
* to you under the Apache License, Version 2.0 (the | ||
* "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
package org.apache.polaris.service.catalog.io; | ||
|
||
import com.fasterxml.jackson.annotation.JsonTypeName; | ||
import java.util.Map; | ||
import org.apache.hadoop.conf.Configuration; | ||
import org.apache.iceberg.CatalogUtil; | ||
import org.apache.iceberg.io.FileIO; | ||
|
||
/** A {@link FileIOFactory} that translates WASB paths to ABFS ones */ | ||
@JsonTypeName("wasb") | ||
public class WasbTranslatingFileIOFactory implements FileIOFactory { | ||
@Override | ||
public FileIO loadFileIO(String ioImpl, Map<String, String> properties) { | ||

At this point it's starting to seem that we need a way to do a chain of delegation. Both this class (or the suggested generalized
If we make the delegation behavior configured as construction-time params or injection then we don't have to change the method signatures or callsites either. Basically like this:
And then we get rid of the hard-coded
The drawback I guess is the config for setting the delegate factory is specific to each outer delegator factory:
But this might be better than forcing delegation through a single config syntax since delegation might not be consistent for all delegator types. For example you might have something like:

I can see how it could be a bigger undertaking to properly entrench how we want to convey such a delegation chain though, so I don't feel too strongly about whether to tackle this in this PR or in a later followup

I don't think we can avoid working on this now. A production use case will require things like metering, throttling, as well as azure support. As is, the configuration only supports one IO factory, meaning you couldn't get all three without writing a new class that does all three. I think the configuration would work fine as
I think dropwizard should support detecting the generic type argument in something like
so the above configuration should work out of the box.

It won't work exactly out of the box, since FileIOFactory doesn't have a way to wrap a FileIO today. So it's not clear how the
||
WasbTranslatingFileIO wrapped = | ||
new WasbTranslatingFileIO(CatalogUtil.loadFileIO(ioImpl, properties, new Configuration())); | ||
return wrapped; | ||
} | ||
} |
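The review comments on `loadFileIO` describe a chain of delegation in which each factory receives its delegate at construction time. The sketch below shows one way that could look. `SketchIO`, `SketchIOFactory`, and all class names are hypothetical stand-ins, not Iceberg or Polaris types, and the snippets the reviewers originally posted are not reproduced here.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins for Iceberg's FileIO and Polaris's FileIOFactory.
interface SketchIO {
  String read(String path);
}

interface SketchIOFactory {
  SketchIO load(String ioImpl, Map<String, String> properties);
}

// Innermost factory: produces the "real" IO (here it only echoes what it was asked to read).
final class BaseSketchFactory implements SketchIOFactory {
  @Override
  public SketchIO load(String ioImpl, Map<String, String> properties) {
    return path -> ioImpl + ":" + path;
  }
}

// Wraps whatever the delegate factory produces with a simplified scheme rewrite.
final class WasbSketchFactory implements SketchIOFactory {
  private final SketchIOFactory delegate;

  WasbSketchFactory(SketchIOFactory delegate) {
    this.delegate = delegate;
  }

  @Override
  public SketchIO load(String ioImpl, Map<String, String> properties) {
    SketchIO inner = delegate.load(ioImpl, properties);
    return path -> inner.read(path.replaceFirst("^wasb", "abfs"));
  }
}

// Wraps the delegate's IO with a read counter, standing in for metering.
final class MeteringSketchFactory implements SketchIOFactory {
  private final SketchIOFactory delegate;
  final AtomicLong reads = new AtomicLong();

  MeteringSketchFactory(SketchIOFactory delegate) {
    this.delegate = delegate;
  }

  @Override
  public SketchIO load(String ioImpl, Map<String, String> properties) {
    SketchIO inner = delegate.load(ioImpl, properties);
    return path -> {
      reads.incrementAndGet();
      return inner.read(path);
    };
  }
}

final class DelegationChainDemo {
  public static void main(String[] args) {
    MeteringSketchFactory metering =
        new MeteringSketchFactory(new WasbSketchFactory(new BaseSketchFactory()));
    SketchIO io = metering.load("adls", Map.of());
    System.out.println(io.read("wasbs://c@acct.blob.core.windows.net/t"));
    System.out.println("reads = " + metering.reads.get());
  }
}
```

Because the delegate is a constructor argument, the `load` signature and its call sites stay unchanged however many layers are stacked. How such a chain would be expressed in the YAML configuration is left open, as in the thread.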

Since it looks like the current structure of FileIOFactory is forcing this to wrap all FileIOs including non-Azure-related ones, we might as well make the class itself more generalized and we could move the Azure-specificity into config.
What if we just called this thing `SchemeTranslatingFileIO` that takes a map of source schemes to destination schemes? And get the map through config?

Oh I see you also have the endpoint rewrite from `blob` to `dfs` which wouldn't fit into that. I agree we might not want to go down the road of overly general regexes either.
Per apache/iceberg#10127 (comment) the underlying Azure SDK doesn't actually seem to care about the endpoint and will internally sort into both a `dfs` and a `blob` client, so in theory a scheme-only replacement should still work, but I guess it could be more fragile.
@collado-mike What do you think? Should we just keep this very wasb-specific?
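
For reference, the scheme-only variant discussed above (a map of source schemes to destination schemes) might be sketched as follows. The class and method names are hypothetical, and the map is hard-coded here rather than read from config.

```java
import java.util.Map;

// Hypothetical sketch of a scheme-only translation driven by a map.
final class SchemeMapSketch {
  static String translateScheme(String path, Map<String, String> schemeMap) {
    if (path == null) {
      return null;
    }
    int idx = path.indexOf("://");
    if (idx <= 0) {
      // No scheme present: nothing to translate.
      return path;
    }
    String target = schemeMap.get(path.substring(0, idx));
    // Only the scheme is swapped; the host (including "blob") is left untouched.
    return target == null ? path : target + path.substring(idx);
  }

  public static void main(String[] args) {
    Map<String, String> schemes = Map.of("wasb", "abfs", "wasbs", "abfss");
    System.out.println(translateScheme("wasbs://c@acct.blob.core.windows.net/t/f", schemes));
    System.out.println(translateScheme("s3://bucket/key", schemes));
  }
}
```

Note that this leaves the `blob` endpoint in place, which is exactly the part the comment above flags as potentially fragile.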

It's doing a little more than just scheme translating, and this is only true if Polaris is actually configured by an admin to use this factory.

If I were designing this, I would have a `TransformingFileIO` with a constructor like
and the azure class would extend that class and pass in a function that did the wasb->abfs conversion. But that's the kind of code refactoring that can be done later without any compatibility or behavior impact, so... 🤷🏽♂️
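
The constructor from this comment did not survive on the page, so the following is only a guess at the shape being proposed: a base class that takes a delegate plus a path-transforming function, and an Azure subclass that supplies the wasb->abfs conversion. `PathIO` is a hypothetical stand-in for Iceberg's `FileIO`, and every name other than `pathTransformer` (which appears later in the thread) is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical, reduced stand-in for Iceberg's FileIO.
interface PathIO {
  String open(String path);

  void delete(String path);
}

// Applies pathTransformer to every path before delegating.
class TransformingPathIO implements PathIO {
  private final PathIO delegate;
  private final UnaryOperator<String> pathTransformer;

  TransformingPathIO(PathIO delegate, UnaryOperator<String> pathTransformer) {
    this.delegate = delegate;
    this.pathTransformer = pathTransformer;
  }

  @Override
  public String open(String path) {
    return delegate.open(pathTransformer.apply(path));
  }

  @Override
  public void delete(String path) {
    delegate.delete(pathTransformer.apply(path));
  }
}

// Azure-specific subclass: passes in the wasb->abfs conversion.
final class WasbToAbfsPathIO extends TransformingPathIO {
  WasbToAbfsPathIO(PathIO delegate) {
    super(delegate, WasbToAbfsPathIO::wasbToAbfs);
  }

  static String wasbToAbfs(String path) {
    if (path == null || !(path.startsWith("wasb://") || path.startsWith("wasbs://"))) {
      return path;
    }
    return path.replaceFirst("^wasb", "abfs")
        .replaceFirst("\\.blob\\.core\\.windows\\.net", ".dfs.core.windows.net");
  }
}

// Test double that records the paths the delegate actually receives.
final class RecordingPathIO implements PathIO {
  final List<String> seen = new ArrayList<>();

  @Override
  public String open(String path) {
    seen.add(path);
    return path;
  }

  @Override
  public void delete(String path) {
    seen.add(path);
  }
}
```

Wrapping a `RecordingPathIO` in `WasbToAbfsPathIO` and calling `open` with a `wasbs://` path would record the corresponding `abfss://` path on the `dfs` endpoint.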

I really like this idea and actually @dennishuo and I had discussed something similar.
My concern for the time being is that it's unclear how we can pass in that `pathTransformer` via the YAML config file, and that I don't necessarily want to put designing a syntax for arbitrary string transformation onto the critical path for making WASB work.
Agreed that this makes a lot of sense as something to design and follow up with.