HiveTableOperations may incorrectly consider a successful commit as failed #11866

lirui-apache · 2024-12-24T06:19:24Z

Apache Iceberg version

1.4.3

Query engine

Spark

Please describe the bug 🐞

We commit with NoLock, and we recently hit a case where HiveTableOperations treated a successful commit as a concurrent modification and cleaned up the metadata JSON file that had already been committed, leaving the table in an unusable state.

We run HMS in HA mode with 3 instances. The logs of these instances show that they were under heavy load at the time of the issue, and alter table requests from the committing job appear on 2 of them. So I believe the issue happened like this:

1. The 1st alter table request succeeded, but the HMS instance failed to deliver the success response.
2. The client failed over to another instance. Since the metadata location had already been changed, that instance returned an exception with a message like "The table has been modified. ..."
3. HiveTableOperations checked the exception message, decided this should be a CommitFailedException, and deleted the metadata JSON file it had created.
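The sequence above can be reproduced with a small simulation. This is only a sketch: `FakeMetastore`, `NaiveCommitter`, `TransportException` and `TableModifiedException` are hypothetical names made up for illustration, not Iceberg or Hive classes, and the two HMS instances are modelled as one object because they share the same backing state.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical exception types, standing in for a Thrift transport error
// and for the "The table has been modified" error from HMS.
class TransportException extends RuntimeException {
  TransportException(String msg) { super(msg); }
}

class TableModifiedException extends RuntimeException {
  TableModifiedException(String msg) { super(msg); }
}

// Hypothetical stand-in for the state shared by all HMS instances.
class FakeMetastore {
  String metadataLocation = "v1.json";
  boolean dropNextResponse = false;

  // Compare-and-swap on the metadata location.
  void alterTable(String expected, String updated) {
    if (!metadataLocation.equals(expected)) {
      throw new TableModifiedException("The table has been modified.");
    }
    metadataLocation = updated;
    if (dropNextResponse) {
      dropNextResponse = false;
      // The change is persisted, but the caller never sees the success response.
      throw new TransportException("response lost");
    }
  }
}

class NaiveCommitter {
  // Metadata files that currently exist in storage.
  final Set<String> files = new HashSet<>();

  // Returns true if the commit is reported as successful.
  boolean commit(FakeMetastore hms, String base, String next) {
    files.add(next); // write the new metadata JSON file
    try {
      try {
        hms.alterTable(base, next);
      } catch (TransportException e) {
        // Fail over and retry against another instance with the same state.
        hms.alterTable(base, next);
      }
      return true;
    } catch (TableModifiedException e) {
      // Treated as a concurrent modification: clean up the "uncommitted" file.
      files.remove(next);
      return false;
    }
  }
}
```

With `dropNextResponse` set, `commit` returns false and removes `v2.json` from `files`, while `metadataLocation` already points at `v2.json`, which is the unusable state described above.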

There might be two ways to fix the issue:

1. Don't configure HMS HA, or don't use RetryingMetaStoreClient, for committing, so that concurrent modification exceptions from HMS are more reliable. But then we may need to retry on Thrift exceptions ourselves.
2. Do a checkCommitStatus when we get a concurrent modification exception, to make sure the commit really failed. This is simpler, but I believe it adds an extra refresh from HMS.
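The second option could look roughly like the sketch below. It is not the actual HiveTableOperations code: `HaMetastore`, `CheckingCommitter`, `LostResponseException` and `ModifiedException` are hypothetical names, and the status check is simplified to comparing the refreshed metadata location with the one we tried to commit. A fuller check would probably also need to look at the table's metadata history, in case another commit landed on top of ours.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical exception types for a lost response and for the
// "The table has been modified" error.
class LostResponseException extends RuntimeException {}

class ModifiedException extends RuntimeException {}

// Hypothetical stand-in for the state shared by all HMS instances.
class HaMetastore {
  String metadataLocation = "v1.json";
  boolean dropNextResponse = false;

  // Compare-and-swap on the metadata location.
  void alterTable(String expected, String updated) {
    if (!metadataLocation.equals(expected)) {
      throw new ModifiedException();
    }
    metadataLocation = updated;
    if (dropNextResponse) {
      dropNextResponse = false;
      throw new LostResponseException(); // persisted, but response lost
    }
  }
}

class CheckingCommitter {
  // Metadata files that currently exist in storage.
  final Set<String> files = new HashSet<>();

  // Returns true if the commit actually took effect.
  boolean commit(HaMetastore hms, String base, String next) {
    files.add(next); // write the new metadata JSON file
    try {
      try {
        hms.alterTable(base, next);
      } catch (LostResponseException e) {
        hms.alterTable(base, next); // failover retry
      }
      return true;
    } catch (ModifiedException e) {
      // Extra refresh: did our own earlier attempt win?
      if (next.equals(hms.metadataLocation)) {
        return true; // the commit succeeded, keep the file
      }
      files.remove(next); // genuine conflict, safe to clean up
      return false;
    }
  }
}
```

In the lost-response case the commit is now reported as successful and the metadata file is kept; a genuine concurrent modification still cleans up and fails. The cost is the one extra read of the table from HMS on the error path.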

Willingness to contribute

I can contribute a fix for this bug independently
I would be willing to contribute a fix for this bug with guidance from the Iceberg community
I cannot contribute a fix for this bug at this time


lirui-apache · 2024-12-24T06:21:16Z

@pvary What do you think about the issue?

lirui-apache added the bug label Dec 24, 2024
