
catalog.load_table raises Invalid JSON error #1328

Open
sandcobainer opened this issue Nov 15, 2024 · 9 comments

Comments

@sandcobainer

sandcobainer commented Nov 15, 2024

Question

Context: I'm trying to run a simple proof of concept with PyIceberg, Hive Metastore (seeded from an SQL dump of a Hive Metastore schema), and an S3 bucket of Iceberg tables. I set up a Docker Compose file with MySQL and Hive Metastore, and the containers seem to be running fine. I am able to read the catalog, databases, and tables with:

catalog = HiveCatalog(
    "s3-bucket-name",
    **{
        "uri":"hive2://localhost:9083",
        "s3.endpoint":"http://s3-website-us-east-1.amazonaws.com",
        "s3.access-key-id":"fake key",
        "s3.secret-access-key":"fake access key"
    },
)
print(catalog.list_namespaces())
print(catalog.list_tables('tenantdb'))

Results:
[('default',), ('default_database',), ('tenantdb',), ('test',), ('testdb',)]
[('tenantdb', 'pinglogs'), ('tenantdb', 'pinglogs1'), ('tenantdb', 'pinglogs_bad'), ('tenantdb', 'pinglogs2'), ('tenantdb', 'pinglogs3')]

Issue: Running pinglogs = catalog.load_table('tenantdb.pinglogs') raises a validation error
ValidationError: 1 validation error for TableMetadataWrapper Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]

I looked at the metadata json and can confirm it's a valid json file. Is there a way to debug what the downloaded metadata looks like or what else to pass to load_table?

@kevinjqliu
Contributor

Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]

I think this usually means the table metadata returned by the request is empty or invalid JSON.
To debug, can you check what the hive_table variable produces?

def load_table(self, identifier: Union[str, Identifier]) -> Table:
    """Load the table's metadata and return the table instance.

    You can also use this method to check for table existence using 'try catalog.table() except TableNotFoundError'.
    Note: This method doesn't scan data stored in the table.

    Args:
        identifier: Table identifier.

    Returns:
        Table: the table instance with its metadata.

    Raises:
        NoSuchTableError: If a table with the name does not exist, or the identifier is invalid.
    """
    identifier_tuple = self._identifier_to_tuple_without_catalog(identifier)
    database_name, table_name = self.identifier_to_database_and_table(identifier_tuple, NoSuchTableError)
    with self._client as open_client:
        hive_table = self._get_hive_table(open_client, database_name, table_name)
    return self._convert_hive_into_iceberg(hive_table)
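As an aside, an empty payload is enough to reproduce this class of error. A minimal stdlib sketch (the wording differs from the Rust-based JSON parser pydantic v2 uses, but the failure mode, hitting end of input before any value at position 0, is the same):

```python
import json

# Parsing an empty string fails at line 1, column 1 (char 0), mirroring the
# "EOF while parsing a value at line 1 column 0" error in the report above.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```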

@kevinjqliu
Contributor

The issue is most likely in reading the table metadata file:

if prop_metadata_location := properties.get(METADATA_LOCATION):
    metadata_location = prop_metadata_location
else:
    raise NoSuchPropertyException(f"Table property {METADATA_LOCATION} is missing")
io = self._load_file_io(location=metadata_location)
file = io.new_input(metadata_location)
metadata = FromInputFile.table_metadata(file)

@Fokko
Contributor

Fokko commented Nov 20, 2024

@sandcobainer Thanks for raising this, any chance that you could share the table metadata JSON?

@sandcobainer
Author

@Fokko I've checked whether my S3 credentials are the issue, but they don't seem to be. Here's the metadata file that I downloaded directly from the S3 bucket:
00001-32cbc4e6-ad0d-43c5-8e01-2a29c6f83941.metadata.json

@kevinjqliu it does look like an empty file is being read. Could it be something to do with permissions to read?

@kevinjqliu
Contributor

If it's empty on read, it's most likely a permission issue.

Here's something you can run to debug:

from pyiceberg.serializers import FromInputFile

metadata_location = "s3://...."
io = catalog._load_file_io(location=metadata_location)
file = io.new_input(metadata_location)
metadata = FromInputFile.table_metadata(file)

@sandcobainer
Author

sandcobainer commented Dec 2, 2024

@kevinjqliu I tried this snippet by pointing the metadata location directly at the S3 URI, and the error is the same. Does this mean it's an S3 access issue?

    raise ValidationError(e) from e
pyiceberg.exceptions.ValidationError: 1 validation error for TableMetadataWrapper
  Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/json_invalid

Correction: I only suspect that the parsed file is empty because of the error [type=json_invalid, input_value='', input_type=str]. Here the input_value is '', but I have no other way to determine whether this is an S3 permission issue or a parsing issue. Is anybody else able to replicate it?
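One way to separate the two cases is to look at the raw bytes before handing them to the JSON parser. A small illustrative helper (diagnose_metadata_bytes is a hypothetical name for this sketch, not part of PyIceberg):

```python
import json


def diagnose_metadata_bytes(raw: bytes) -> str:
    """Classify a metadata payload before JSON parsing.

    An empty payload points at an access/endpoint problem (the store
    returned nothing), while non-empty-but-unparseable bytes point at a
    genuine parsing problem.
    """
    if not raw:
        return "empty payload: likely an S3 access or endpoint issue"
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"non-empty but invalid JSON: {exc}"
    return "valid JSON: the file itself is fine"


# Feed it the bytes returned by file.open().read():
print(diagnose_metadata_bytes(b""))  # empty payload: likely an S3 access or endpoint issue
print(diagnose_metadata_bytes(b'{"format-version": 1}'))  # valid JSON: the file itself is fine
```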

@kevinjqliu
Contributor

does this mean it's an s3 access issue?

Likely. To debug, you can try reading the file directly:

print(file.open().read())

The output should match what you see when you read the file from S3 directly.

@sandcobainer
Author

sandcobainer commented Dec 5, 2024

So I ran boto3 vs PyIceberg's _load_file_io with the same credentials. Can someone try to reproduce this with an S3 file, please?

catalog = HiveCatalog(
    "bucket",
    **{
        "uri":"hive2://localhost:9083",
        "s3.endpoint":"http://s3-website-us-east-1.amazonaws.com",
        "s3.access-key-id":"key id",
        "s3.secret-access-key":"access key",
        "s3.session-token": "session-key"
    },
)
metadata_location = "s3://bucket/warehouse/tenantdb.db/pinglogs/metadata/00001-32cbc4e6-ad0d-43c5-8e01-2a29c6f83941.metadata.json"
io = catalog._load_file_io(location=metadata_location)
file = io.new_input(metadata_location)
print(file.open().read())

Output: b''

session = boto3.Session(profile_name="IO-Analytics")
credentials = session.get_credentials()
s3_client = session.client('s3')
s3_client.download_file('bucket', 'warehouse/tenantdb.db/pinglogs/metadata/00001-32cbc4e6-ad0d-43c5-8e01-2a29c6f83941.metadata.json', './metadata.json')
print(open('metadata.json').read())

which returns:

{
  "format-version" : 1,
  "table-uuid" : "a399a8d6-2a26-4068-a73d-af7e39725b35",
  "location" : "s3a://bucket/warehouse/tenantdb.db/pinglogs",
  "last-updated-ms" : 1729686218008,
  "last-column-id" : 22, ....

@kevinjqliu
Contributor

Output: b''

That would explain the validation error. It's weird to me that S3 returns 0 bytes instead of an error.

A couple of things you can try:

1. Check that the credentials are correct and that they can read that file. I noticed that in the second example you passed profile_name to get credentials, while the catalog uses:

       "s3.endpoint": "http://s3-website-us-east-1.amazonaws.com",
       "s3.access-key-id": "key id",
       "s3.secret-access-key": "access key",
       "s3.session-token": "session-key"

2. Check that the endpoint is set correctly:

       "s3.endpoint": "http://s3-website-us-east-1.amazonaws.com"

   You can verify it like this:

       aws s3 ls s3://bucket/warehouse/tenantdb.db/pinglogs/metadata/ --endpoint-url http://s3-website-us-east-1.amazonaws.com

3. FileIO. Print the FileIO type with print(type(io)). S3 can use both PyArrowFileIO and FsspecFileIO (https://py.iceberg.apache.org/configuration/#fileio); give the other one a try.

Hope this helps!
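For the third point, the FileIO implementation can be pinned explicitly with the py-io-impl property described on the linked configuration page. A sketch against the catalog from earlier in the thread (a config fragment; credentials are placeholders, and a running metastore is assumed):

```python
from pyiceberg.catalog.hive import HiveCatalog

catalog = HiveCatalog(
    "bucket",
    **{
        "uri": "hive2://localhost:9083",
        "s3.endpoint": "http://s3-website-us-east-1.amazonaws.com",
        "s3.access-key-id": "key id",
        "s3.secret-access-key": "access key",
        # Force fsspec instead of the default PyArrow implementation;
        # swap the value to "pyiceberg.io.pyarrow.PyArrowFileIO" to compare.
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    },
)
```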
