Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a FileIO implementation for WASB #360

Merged
merged 30 commits into from
Oct 10, 2024

Conversation

eric-maynard
Copy link
Contributor

Description

In #344, checks within Polaris were relaxed so that Azure WASB paths can be used to e.g. create a catalog, and so that WASB paths and ABFS paths that referred to the same object could be compared. This PR introduces a FileIO implementation that Polaris can use to interact with entities that use such a WASB path. The FileIO implementation simply rewrites WASB paths into ABFS paths, and delegations interactions with the ABFS paths to the Iceberg ADLSFileIO.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation update
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

To use the new FileIO easily in Polaris, you can change the factoryType in the server config to wasb.

With this change, the table creation API succeeds:

curl -X POST "http://localhost:8181/api/catalog/v1/wasb_test/namespaces/wasb_test_namespace/tables" \
>      -H "Authorization: Bearer ${token}" \
>      -H "Content-Type: application/json" \
>      -H "X-Iceberg-Access-Delegation: vended-credentials" \
>      -d '{
>           "name": "manual-table-1",
>            "schema": {
>              "type": "struct",
>              "fields": [
>                {"id": 101, "name": "id", "type": "string", "required": false}
>              ]
>            }
>          }' \
>      2> /dev/null && echo
{"metadata-location":"abfss://....","metadata":{"format-version":2,"table-uuid":"...","location":"wasbs://...", ...}}

Without this FileIO, it will fail with an error such as:

{"error":{"message":"Invalid ADLS URI: wasbs://...","type":"ValidationException","code":400}}

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings

private static final String WASB_SCHEME = "wasb";
private static final String ABFS_SCHEME = "abfs";

public WasbTranslatingFileIO(FileIO io) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it looks like the current structure of FileIOFactory is forcing this to wrap all FileIOs including non-Azure-related ones, we might as well make the class itself more generalized and we could move the Azure-specificity into config.

What if we just called this thing SchemeTranslatingFileIO that takes a map of source schemes to destination schemes? And get the map through config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see you also have the endpoint rewrite from blob to dfs which wouldn't fit into that. I agree we might not want to go down the road of overly general regexes either.

Per apache/iceberg#10127 (comment) the underlying Azure SDK doesn't actually seem to actually care about the endpoint and will internally sort into both a dfs and a blob client, so in theory a scheme-only replacement should still work, but I guess it could be more fragile.

@collado-mike What do you think? Should we just keep this very wasb-specific?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like the current structure of FileIOFactory is forcing this to wrap all FileIOs including non-Azure-related ones

It's doing a little more than just scheme translating, and this is only true if Polaris is actually configured by an admin to use this factory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I were designing this, I would have a TransformingFileIO with a constructor like

public TransformingFileIO(Function<String, String> pathTransformer, FileIOFactory delegate) {
}

and the azure class would extend that class and pass in a function that did the wasb->abfs conversion. But that's the kind of code refactoring that can be done later without any compatibility or behavior impact, so... 🤷🏽‍♂️

Copy link
Contributor Author

@eric-maynard eric-maynard Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this idea and actually @dennishuo and I had discussed something similar.

My concern for the time being is that it's unclear how we can pass in that pathTransformer via the YAML config file, and that I don't necessarily want to put designing a syntax for arbitrary string transformation onto the critical path for making WASB work.

Agreed that this makes a lot of sense as something to design and follow up with.

@JsonTypeName("wasb")
public class WasbTranslatingFileIOFactory implements FileIOFactory {
@Override
public FileIO loadFileIO(String ioImpl, Map<String, String> properties) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point it's starting to seem that we need a way to do a chain of delegation. Both this class (or the suggested generalized SchemeTranslatingFileIOFactory) as well as MeasuredFileIOFactory fundamentally delegators, so maybe we should've entrenched an interface for delegation instead of just loading from String ioImpl.

If we make the delegation behavior configured as construction-time params or injection then we don't have to change the method signatures or callsites either. Basically like this:

mainFileIOFactory = new SchemeTranslatingFileIOFactory(
    new MeasuredFileIOFactory(new DefaultFileIOFactory()),
    Map.of("wasb", "abfs", "wasbs", "abfss"));

And then we get rid of the hard-coded CatalogUtil.loadFileIO in both SchemeTranslatingFileIOFactory and MeasuredFileIOFactory:

@Override
public FileIO loadFileIO(String ioImpl, Map<String, String> properties) {
  return new SchemeTranslatingFileIO(this.delegateFactory.loadFileIO(ioImpl, properties);
}

The drawback I guess is the config for setting the delegate factory is specific to each outer delegator factory:

io:
  factoryType: scheme-translating
schemeTranslatingDelegateType: measured
measuredFactoryDelegateType: default

But this might be better than forcing delegation through a single config syntax since delegation might not be consistent for all delegator types. For example you might have something like:

multiplexingIoFactoryDelegateTypes: r2,s3n,viewfs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how it could be a bigger undertaking to properly entrench how we want to convey such a delegation chain though, so I don't feel too strongly about whether to tackle this in this PR or in a later followup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can avoid working on this now. A production use case will require things like metering, throttling, as well as azure support. As is, the configuration only supports one IO factory, meaning you couldn't get all three without writing a new class that does all three.

I think the configuration would work fine as

io:
  factoryType: chain
    delegates:
      - measured
      - wasb
      - default

I think dropwizard should support detecting the generic type argument in something like

public void setDelegates(List<FileIOFactory> delegates);

so the above configuration should work out of the box.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't work exactly out of the box, since FileIOFactory doesn't have a way to wrap a FileIO today. So it's not clear how the chain implementation would actually chain them. In this feature branch I toyed with adding a new interface called SupportsFileIOWrapping for this purpose, but I don't know that this is necessary just to get WASB working E2E

Copy link
Contributor

@collado-mike collado-mike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, but I think we have to support multiple delegating FileIOFactorys.

@JsonTypeName("wasb")
public class WasbTranslatingFileIOFactory implements FileIOFactory {
@Override
public FileIO loadFileIO(String ioImpl, Map<String, String> properties) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can avoid working on this now. A production use case will require things like metering, throttling, as well as azure support. As is, the configuration only supports one IO factory, meaning you couldn't get all three without writing a new class that does all three.

I think the configuration would work fine as

io:
  factoryType: chain
    delegates:
      - measured
      - wasb
      - default

I think dropwizard should support detecting the generic type argument in something like

public void setDelegates(List<FileIOFactory> delegates);

so the above configuration should work out of the box.

private static final String WASB_SCHEME = "wasb";
private static final String ABFS_SCHEME = "abfs";

public WasbTranslatingFileIO(FileIO io) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I were designing this, I would have a TransformingFileIO with a constructor like

public TransformingFileIO(Function<String, String> pathTransformer, FileIOFactory delegate) {
}

and the azure class would extend that class and pass in a function that did the wasb->abfs conversion. But that's the kind of code refactoring that can be done later without any compatibility or behavior impact, so... 🤷🏽‍♂️

Copy link
Contributor

@collado-mike collado-mike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed adding the chain of factories in a separate PR

@collado-mike collado-mike enabled auto-merge (squash) October 10, 2024 17:18
@collado-mike collado-mike merged commit 7b2cca4 into apache:main Oct 10, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants