
feat: support S3 Table Buckets with S3TablesCatalog #1429

Open

felixscherz wants to merge 32 commits into main
Conversation

@felixscherz (Contributor) commented Dec 14, 2024

Hi, this relates to #1404 and is very much a work-in-progress draft.

I created a first draft of an S3TablesCatalog that uses the S3 Table Bucket API for catalog operations; a sketch of the corresponding boto3 calls follows the feature list below.

Features implemented:

  • create/drop namespace
  • create/drop table
  • commit table
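
For context, here is a minimal sketch of how these operations map onto the boto3 s3tables client. It is illustrative only, not the code in this PR; the ARN, namespace, table name, and metadata location are placeholders.

    import boto3

    # Placeholder ARN; in the catalog this would come from the "warehouse" property.
    table_bucket_arn = "arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket"

    client = boto3.client("s3tables")  # requires boto3 >= 1.35.74

    # create namespace (drop uses the matching delete_namespace call)
    client.create_namespace(tableBucketARN=table_bucket_arn, namespace=["my_namespace"])

    # create table (drop uses the matching delete_table call)
    client.create_table(
        tableBucketARN=table_bucket_arn,
        namespace="my_namespace",
        name="my_table",
        format="ICEBERG",
    )

    # commit table: S3 Tables tracks the current metadata location, so a commit
    # swaps it atomically, using the version token as an optimistic lock
    table = client.get_table(
        tableBucketARN=table_bucket_arn, namespace="my_namespace", name="my_table"
    )
    client.update_table_metadata_location(
        tableBucketARN=table_bucket_arn,
        namespace="my_namespace",
        name="my_table",
        versionToken=table["versionToken"],
        metadataLocation="s3://example-warehouse/metadata/00001-example.metadata.json",
    )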

Current issues:

  • the s3tables API requires boto3 >= 1.35.74 (see the boto3 changelog), and poetry doesn't allow specifying version requirements in extras (a possible runtime guard is sketched after this list)
  • writing metadata with PyarrowFileIO runs into issues when uploading files to S3 (see below)
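
Since the extras constraint can't express the boto3 minimum, one option (an assumption on my part, not necessarily what this PR does) is to fail fast when the catalog is constructed:

    import boto3
    from packaging import version  # assumes packaging is available

    # Assumed guard: raise a clear error up front instead of failing later
    # inside boto3.client("s3tables") on an older boto3.
    if version.parse(boto3.__version__) < version.parse("1.35.74"):
        raise ImportError(
            f"S3TablesCatalog requires boto3 >= 1.35.74 for the s3tables API, found {boto3.__version__}"
        )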

writing metadata with PyarrowFileIO raises S3TablesUnsupportedHeader

I am working on supporting table creation but ran into the issue described in #1404 (comment), which I could initially work around. However, I then hit a second problem: S3 Table buckets don't seem to accept requests carrying the older x-amz-api-version: 2006-03-01 header (at least that is what the error looks like to me).

This is the pytest output:

    def test_create_table(table_bucket_arn, database_name: str, table_name: str, table_schema_nested: Schema):
        properties = {"warehouse": table_bucket_arn}
        catalog = S3TableCatalog(name="test_s3tables_catalog", **properties)
        identifier = (database_name, table_name)

        catalog.create_namespace(namespace=database_name)
        print(database_name, table_name)
        # this fails with
        # OSError: When completing multiple part upload for key 'metadata/00000-55a9c37c-b822-4a81-ac0e-1efbcd145dba.metadata.json' in bucket '14e4e036-d4ae-44f8-koana45eruw
        # Unable to parse ExceptionName: S3TablesUnsupportedHeader Message: S3 Tables does not support the following header: x-amz-api-version value: 2006-03-01
>       table = catalog.create_table(identifier=identifier, schema=table_schema_nested)

tests/catalog/test_s3tables.py:70:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyiceberg/catalog/s3tables.py:146: in create_table
    self._write_metadata(metadata, io, metadata_location, overwrite=True)
pyiceberg/catalog/__init__.py:946: in _write_metadata
    ToOutputFile.table_metadata(metadata, io.new_output(metadata_path), overwrite=overwrite)
pyiceberg/serializers.py:130: in table_metadata
    with output_file.create(overwrite=overwrite) as output_stream:
pyarrow/io.pxi:137: in pyarrow.lib.NativeFile.__exit__
    ???
pyarrow/io.pxi:207: in pyarrow.lib.NativeFile.close
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OSError: When completing multiple part upload for key 'metadata/00000-6c76904d-0f68-468e-97c0-5110db79e4ec.metadata.json' in bucket 'abb69116-611a-442b-uhe3jwgurwwrnxsbr1otw8gm14po1use1b--table-s3': AWS Error UNKNOWN (HTTP status 400) during CompleteMultipartUpload operation: Unable to parse ExceptionName: S3TablesUnsupportedHeader Message: S3 Tables does not support the following header: x-amz-api-version value: 2006-03-01

@felixscherz felixscherz marked this pull request as draft December 14, 2024 16:43
@felixscherz (Contributor, Author)

I was able to work around the issue above by using FsspecFileIO instead of the default PyarrowFileIO; with FsspecFileIO, the catalog can now create new tables. A sketch of the switch is below.
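
For anyone wanting to reproduce this, a minimal sketch of forcing FsspecFileIO through pyiceberg's standard py-io-impl property (the property-based FileIO switch is standard pyiceberg; S3TableCatalog and table_bucket_arn are the names from the test above):

    # Sketch: select FsspecFileIO explicitly so metadata writes avoid the
    # x-amz-api-version header sent by pyarrow's S3 filesystem.
    catalog = S3TableCatalog(
        name="test_s3tables_catalog",
        **{
            "warehouse": table_bucket_arn,
            "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
        },
    )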

@kevinjqliu (Contributor)

Thanks for working on this, @felixscherz! Feel free to tag me when it's ready for review :)

@felixscherz (Contributor, Author)

I think you can now review this PR if you have time, @kevinjqliu :)
The biggest issue for now is that testing is only possible against AWS itself, since moto does not support the s3tables API yet. I opened an issue on the moto side (getmoto/moto#8422) but have not had time to implement it myself.

I currently run tests by setting the ARN env variable to the ARN of an S3 table bucket created in my personal AWS account (a sketch of the fixture pattern follows the link):

https://github.com/felixscherz/iceberg-python/blob/feat/s3tables-catalog/tests/catalog/test_s3tables.py#L24-L31
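
For reference, a hedged sketch of that fixture pattern; the env variable name and skip behavior are assumptions based on the description above, and the linked test file is authoritative:

    import os

    import pytest

    @pytest.fixture
    def table_bucket_arn() -> str:
        # Assumed variable name; see the linked test file for the real setup.
        arn = os.environ.get("ARN")
        if not arn:
            pytest.skip("Set ARN to an S3 table bucket ARN to run these tests")
        return arn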

@felixscherz felixscherz marked this pull request as ready for review December 29, 2024 15:51
@felixscherz felixscherz changed the title WIP: feat: support S3 Table Buckets with S3TablesCatalog feat: support S3 Table Buckets with S3TablesCatalog Dec 29, 2024