
catalog.load_table raises Invalid JSON error #1328

Open
sandcobainer opened this issue Nov 15, 2024 · 9 comments

Comments

@sandcobainer

sandcobainer commented Nov 15, 2024

Question

Context: I'm trying to run a simple proof of concept with PyIceberg, Hive Metastore (seeded from an SQL dump of a Hive Metastore schema), and an S3 bucket of Iceberg tables. I set up a Docker Compose file with MySQL and Hive Metastore, and the containers seem to be running fine. I am able to read the catalog, databases, and tables with:

catalog = HiveCatalog(
    "s3-bucket-name",
    **{
        "uri":"hive2://localhost:9083",
        "s3.endpoint":"http://s3-website-us-east-1.amazonaws.com",
        "s3.access-key-id":"fake key",
        "s3.secret-access-key":"fake access key"
    },
)
print(catalog.list_namespaces())
print(catalog.list_tables('tenantdb'))

Results:
[('default',), ('default_database',), ('tenantdb',), ('test',), ('testdb',)]
[('tenantdb', 'pinglogs'), ('tenantdb', 'pinglogs1'), ('tenantdb', 'pinglogs_bad'), ('tenantdb', 'pinglogs2'), ('tenantdb', 'pinglogs3')]

Issue: Running pinglogs = catalog.load_table('tenantdb.pinglogs') raises a validation error
ValidationError: 1 validation error for TableMetadataWrapper Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]

I looked at the metadata json and can confirm it's a valid json file. Is there a way to debug what the downloaded metadata looks like or what else to pass to load_table?

@kevinjqliu
Contributor

Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]

I think this usually means the table metadata returned by the request is empty or invalid JSON.
To debug, can you check what the hive_table variable produces?

def load_table(self, identifier: Union[str, Identifier]) -> Table:
    """Load the table's metadata and return the table instance.

    You can also use this method to check for table existence using 'try catalog.table() except TableNotFoundError'.
    Note: This method doesn't scan data stored in the table.

    Args:
        identifier: Table identifier.

    Returns:
        Table: the table instance with its metadata.

    Raises:
        NoSuchTableError: If a table with the name does not exist, or the identifier is invalid.
    """
    identifier_tuple = self._identifier_to_tuple_without_catalog(identifier)
    database_name, table_name = self.identifier_to_database_and_table(identifier_tuple, NoSuchTableError)
    with self._client as open_client:
        hive_table = self._get_hive_table(open_client, database_name, table_name)
    return self._convert_hive_into_iceberg(hive_table)
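As an aside, an empty payload is enough to reproduce this class of error. A minimal stdlib sketch (the wording differs from the Rust-based JSON parser pydantic v2 uses, but the failure mode, hitting end of input before any value at position 0, is the same):

```python
import json

# Parsing an empty string fails at line 1, column 1 (char 0), mirroring the
# "EOF while parsing a value at line 1 column 0" error in the report above.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```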

@kevinjqliu
Contributor

The issue is most likely in reading the table metadata file:

if prop_metadata_location := properties.get(METADATA_LOCATION):
    metadata_location = prop_metadata_location
else:
    raise NoSuchPropertyException(f"Table property {METADATA_LOCATION} is missing")
io = self._load_file_io(location=metadata_location)
file = io.new_input(metadata_location)
metadata = FromInputFile.table_metadata(file)

@Fokko
Contributor

Fokko commented Nov 20, 2024

@sandcobainer Thanks for raising this, any chance that you could share the table metadata JSON?

@sandcobainer
Author

@Fokko I've checked whether my S3 credentials are the issue, but they don't seem to be. Here's the metadata file that I downloaded directly from the S3 bucket:
00001-32cbc4e6-ad0d-43c5-8e01-2a29c6f83941.metadata.json

@kevinjqliu it does look like an empty file is being read. Could it be something to do with permissions to read?

@kevinjqliu
Contributor

If it's empty on read, it's most likely a permission issue.

Here's something you can run to debug:

from pyiceberg.serializers import FromInputFile

metadata_location = "s3://...."
io = catalog._load_file_io(location=metadata_location)
file = io.new_input(metadata_location)
metadata = FromInputFile.table_metadata(file)

@sandcobainer
Author

sandcobainer commented Dec 2, 2024

@kevinjqliu I tried this snippet by pointing the metadata location directly at the S3 URI, and the error is the same. Does this mean it's an S3 access issue?

    raise ValidationError(e) from e
pyiceberg.exceptions.ValidationError: 1 validation error for TableMetadataWrapper
  Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/json_invalid

Correction: I only suspect that the parsed file is empty because of the error [type=json_invalid, input_value='', input_type=str]. Here the input_value is '', but I have no other way to determine whether this is an S3 permission issue or a parsing issue. Is anybody else able to replicate it?
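One way to separate the two cases is to look at the raw bytes before handing them to the JSON parser. A small illustrative helper (diagnose_metadata_bytes is a hypothetical name for this sketch, not part of PyIceberg):

```python
import json


def diagnose_metadata_bytes(raw: bytes) -> str:
    """Classify a metadata payload before JSON parsing.

    An empty payload points at an access/endpoint problem (the store
    returned nothing), while non-empty-but-unparseable bytes point at a
    genuine parsing problem.
    """
    if not raw:
        return "empty payload: likely an S3 access or endpoint issue"
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"non-empty but invalid JSON: {exc}"
    return "valid JSON: the file itself is fine"


# Feed it the bytes returned by file.open().read():
print(diagnose_metadata_bytes(b""))  # empty payload: likely an S3 access or endpoint issue
print(diagnose_metadata_bytes(b'{"format-version": 1}'))  # valid JSON: the file itself is fine
```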

@kevinjqliu
Contributor

does this mean it's an s3 access issue?

Likely. To debug, you can try reading the file directly:

print(file.open().read())

The output should match what you see when you read the file from S3 directly.

@sandcobainer
Author

sandcobainer commented Dec 5, 2024

So I ran boto3 vs PyIceberg's _load_file_io with the same credentials. Can someone try to reproduce this with an S3 file, please?

catalog = HiveCatalog(
    "bucket",
    **{
        "uri":"hive2://localhost:9083",
        "s3.endpoint":"http://s3-website-us-east-1.amazonaws.com",
        "s3.access-key-id":"key id",
        "s3.secret-access-key":"access key",
        "s3.session-token": "session-key"
    },
)
metadata_location = "s3://bucket/warehouse/tenantdb.db/pinglogs/metadata/00001-32cbc4e6-ad0d-43c5-8e01-2a29c6f83941.metadata.json"
io = catalog._load_file_io(location=metadata_location)
file = io.new_input(metadata_location)
print(file.open().read())

Output: b''

session = boto3.Session(profile_name="IO-Analytics")
credentials = session.get_credentials()
s3_client = session.client('s3')
s3_client.download_file('bucket', 'warehouse/tenantdb.db/pinglogs/metadata/00001-32cbc4e6-ad0d-43c5-8e01-2a29c6f83941.metadata.json', './metadata.json')
print(open('metadata.json').read())

which returns:

{
  "format-version" : 1,
  "table-uuid" : "a399a8d6-2a26-4068-a73d-af7e39725b35",
  "location" : "s3a://bucket/warehouse/tenantdb.db/pinglogs",
  "last-updated-ms" : 1729686218008,
  "last-column-id" : 22, ....

@kevinjqliu
Contributor

Output: b''

That would explain the validation error. It's weird to me that S3 returns 0 bytes instead of an error.

A couple of things you can try:

1. Check that the credentials are correct and that they can read that file. I noticed that in the second example you passed profile_name to get credentials, while the catalog uses:

       "s3.endpoint": "http://s3-website-us-east-1.amazonaws.com",
       "s3.access-key-id": "key id",
       "s3.secret-access-key": "access key",
       "s3.session-token": "session-key"

2. Check that the endpoint is set correctly:

       "s3.endpoint": "http://s3-website-us-east-1.amazonaws.com"

   You can verify it like this:

       aws s3 ls s3://bucket/warehouse/tenantdb.db/pinglogs/metadata/ --endpoint-url http://s3-website-us-east-1.amazonaws.com

3. FileIO. Print the FileIO type with print(type(io)). S3 can use both PyArrowFileIO and FsspecFileIO (https://py.iceberg.apache.org/configuration/#fileio); give the other one a try.

Hope this helps!
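For the third point, the FileIO implementation can be pinned explicitly with the py-io-impl property described on the linked configuration page. A sketch against the catalog from earlier in the thread (a config fragment; credentials are placeholders, and a running metastore is assumed):

```python
from pyiceberg.catalog.hive import HiveCatalog

catalog = HiveCatalog(
    "bucket",
    **{
        "uri": "hive2://localhost:9083",
        "s3.endpoint": "http://s3-website-us-east-1.amazonaws.com",
        "s3.access-key-id": "key id",
        "s3.secret-access-key": "access key",
        # Force fsspec instead of the default PyArrow implementation;
        # swap the value to "pyiceberg.io.pyarrow.PyArrowFileIO" to compare.
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    },
)
```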
