
feat: support S3 Table Buckets with S3TablesCatalog #1429

Open

felixscherz wants to merge 32 commits into main
Conversation

@felixscherz (Contributor) commented Dec 14, 2024

Hi, this relates to #1404 and is very much a work-in-progress draft.

I created a first draft of an S3TablesCatalog that uses the S3 Table Bucket API for catalog operations; a sketch of the corresponding boto3 calls follows the feature list below.

Features implemented:

  • create/drop namespace
  • create/drop table
  • commit table
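
For context, here is a minimal sketch of how these operations map onto the boto3 s3tables client. It is illustrative only, not the code in this PR; the ARN, namespace, table name, and metadata location are placeholders.

    import boto3

    # Placeholder ARN; in the catalog this would come from the "warehouse" property.
    table_bucket_arn = "arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket"

    client = boto3.client("s3tables")  # requires boto3 >= 1.35.74

    # create namespace (drop uses the matching delete_namespace call)
    client.create_namespace(tableBucketARN=table_bucket_arn, namespace=["my_namespace"])

    # create table (drop uses the matching delete_table call)
    client.create_table(
        tableBucketARN=table_bucket_arn,
        namespace="my_namespace",
        name="my_table",
        format="ICEBERG",
    )

    # commit table: S3 Tables tracks the current metadata location, so a commit
    # swaps it atomically, using the version token as an optimistic lock
    table = client.get_table(
        tableBucketARN=table_bucket_arn, namespace="my_namespace", name="my_table"
    )
    client.update_table_metadata_location(
        tableBucketARN=table_bucket_arn,
        namespace="my_namespace",
        name="my_table",
        versionToken=table["versionToken"],
        metadataLocation="s3://example-warehouse/metadata/00001-example.metadata.json",
    )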

Current issues:

  • the s3tables API requires boto3 >= 1.35.74 (see the boto3 changelog), and poetry doesn't allow specifying version requirements in extras (a possible runtime guard is sketched after this list)
  • writing metadata with PyarrowFileIO runs into issues when uploading files to S3 (see below)
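
Since the extras constraint can't express the boto3 minimum, one option (an assumption on my part, not necessarily what this PR does) is to fail fast when the catalog is constructed:

    import boto3
    from packaging import version  # assumes packaging is available

    # Assumed guard: raise a clear error up front instead of failing later
    # inside boto3.client("s3tables") on an older boto3.
    if version.parse(boto3.__version__) < version.parse("1.35.74"):
        raise ImportError(
            f"S3TablesCatalog requires boto3 >= 1.35.74 for the s3tables API, found {boto3.__version__}"
        )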

writing metadata with PyarrowFileIO raises S3TablesUnsupportedHeader

I am working on supporting table creation but ran into the issue described in #1404 (comment), which I could initially work around. However, I then hit a second problem: S3 Table buckets don't seem to accept requests carrying the older x-amz-api-version: 2006-03-01 header (at least that is what the error looks like to me).

This is the pytest output:

    def test_create_table(table_bucket_arn, database_name: str, table_name: str, table_schema_nested: Schema):
        properties = {"warehouse": table_bucket_arn}
        catalog = S3TableCatalog(name="test_s3tables_catalog", **properties)
        identifier = (database_name, table_name)

        catalog.create_namespace(namespace=database_name)
        print(database_name, table_name)
        # this fails with
        # OSError: When completing multiple part upload for key 'metadata/00000-55a9c37c-b822-4a81-ac0e-1efbcd145dba.metadata.json' in bucket '14e4e036-d4ae-44f8-koana45eruw
        # Unable to parse ExceptionName: S3TablesUnsupportedHeader Message: S3 Tables does not support the following header: x-amz-api-version value: 2006-03-01
>       table = catalog.create_table(identifier=identifier, schema=table_schema_nested)

tests/catalog/test_s3tables.py:70:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyiceberg/catalog/s3tables.py:146: in create_table
    self._write_metadata(metadata, io, metadata_location, overwrite=True)
pyiceberg/catalog/__init__.py:946: in _write_metadata
    ToOutputFile.table_metadata(metadata, io.new_output(metadata_path), overwrite=overwrite)
pyiceberg/serializers.py:130: in table_metadata
    with output_file.create(overwrite=overwrite) as output_stream:
pyarrow/io.pxi:137: in pyarrow.lib.NativeFile.__exit__
    ???
pyarrow/io.pxi:207: in pyarrow.lib.NativeFile.close
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OSError: When completing multiple part upload for key 'metadata/00000-6c76904d-0f68-468e-97c0-5110db79e4ec.metadata.json' in bucket 'abb69116-611a-442b-uhe3jwgurwwrnxsbr1otw8gm14po1use1b--table-s3': AWS Error UNKNOWN (HTTP status 400) during CompleteMultipartUpload operation: Unable to parse ExceptionName: S3TablesUnsupportedHeader Message: S3 Tables does not support the following header: x-amz-api-version value: 2006-03-01

@felixscherz felixscherz marked this pull request as draft December 14, 2024 16:43
@felixscherz (Contributor, Author)

I was able to work around the issue above by using FsspecFileIO instead of the default PyarrowFileIO; with FsspecFileIO, the catalog can now create new tables. A sketch of the switch is below.
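
For anyone wanting to reproduce this, a minimal sketch of forcing FsspecFileIO through pyiceberg's standard py-io-impl property (the property-based FileIO switch is standard pyiceberg; S3TableCatalog and table_bucket_arn are the names from the test above):

    # Sketch: select FsspecFileIO explicitly so metadata writes avoid the
    # x-amz-api-version header sent by pyarrow's S3 filesystem.
    catalog = S3TableCatalog(
        name="test_s3tables_catalog",
        **{
            "warehouse": table_bucket_arn,
            "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
        },
    )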

@kevinjqliu (Contributor)

Thanks for working on this, @felixscherz! Feel free to tag me when it's ready for review :)

@felixscherz (Contributor, Author)

I think you can now review this PR if you have time, @kevinjqliu :)
The biggest issue for now is that testing is only possible against AWS itself, since moto does not support the s3tables API yet. I opened an issue on the moto side (getmoto/moto#8422) but have not had time to implement it myself.

I currently run tests by setting the ARN env variable to the ARN of an S3 table bucket created in my personal AWS account (a sketch of the fixture pattern follows the link):

https://github.com/felixscherz/iceberg-python/blob/feat/s3tables-catalog/tests/catalog/test_s3tables.py#L24-L31
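
For reference, a hedged sketch of that fixture pattern; the env variable name and skip behavior are assumptions based on the description above, and the linked test file is authoritative:

    import os

    import pytest

    @pytest.fixture
    def table_bucket_arn() -> str:
        # Assumed variable name; see the linked test file for the real setup.
        arn = os.environ.get("ARN")
        if not arn:
            pytest.skip("Set ARN to an S3 table bucket ARN to run these tests")
        return arn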

@felixscherz felixscherz marked this pull request as ready for review December 29, 2024 15:51
@felixscherz felixscherz changed the title WIP: feat: support S3 Table Buckets with S3TablesCatalog feat: support S3 Table Buckets with S3TablesCatalog Dec 29, 2024