HiveTableOperations may incorrectly consider a successful commit as failed #11866

Open
2 of 3 tasks
lirui-apache opened this issue Dec 24, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@lirui-apache
Contributor

Apache Iceberg version

1.4.3

Query engine

Spark

Please describe the bug 🐞

We commit with NoLock, and we recently hit an issue where HiveTableOperations treated a successful commit as a concurrent modification and cleaned up the metadata JSON file that had already been committed, leaving the table in an unusable state.

We run HMS in HA mode with 3 instances. The logs of these instances show that they were under heavy load at the time of the issue, and that alter table requests from the committing job reached 2 of the instances. So I believe the issue happened like this:

  1. The first alter table request succeeded, but the HMS instance failed to deliver the success response.
  2. The client failed over to another instance. Because the metadata location had already been changed by the first request, that instance returned an exception with a message like "The table has been modified. ..."
  3. HiveTableOperations inspected the exception message, decided it should be a CommitFailedException, and deleted the metadata JSON file it had created.
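
The sequence above can be reproduced in a small, self-contained simulation. This is only a sketch under stated assumptions: `FakeMetastore`, `alterTable`, and `naiveCommit` are hypothetical names invented for illustration, not Iceberg or Hive APIs, and the lost response is modelled by throwing a transport-style exception after the update has already been applied.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

class LostResponseSketch {
  /** Hypothetical stand-in for the table pointer that all HMS instances share. */
  static class FakeMetastore {
    private String metadataLocation = "v1.metadata.json";

    /** Applies the update if the pointer still matches; optionally loses the response. */
    synchronized void alterTable(String expected, String updated, boolean loseResponse) {
      if (!metadataLocation.equals(expected)) {
        throw new IllegalStateException("The table has been modified.");
      }
      metadataLocation = updated;
      if (loseResponse) {
        // The update is already applied, but the caller never learns about it.
        throw new UncheckedIOException(new IOException("no response from HMS"));
      }
    }

    synchronized String currentLocation() {
      return metadataLocation;
    }
  }

  /** What a client concludes when it trusts the exception message alone. */
  static boolean naiveCommit(FakeMetastore hms, String base, String updated) {
    try {
      hms.alterTable(base, updated, true);    // step 1: applied, response lost
      return true;
    } catch (UncheckedIOException lost) {
      try {
        hms.alterTable(base, updated, false); // step 2: retry on another instance
        return true;
      } catch (IllegalStateException modified) {
        return false;                         // step 3: client would clean up its metadata file
      }
    }
  }

  public static void main(String[] args) {
    FakeMetastore hms = new FakeMetastore();
    boolean reported = naiveCommit(hms, "v1.metadata.json", "v2.metadata.json");
    System.out.println("client thinks committed: " + reported);      // false
    System.out.println("table points at: " + hms.currentLocation()); // v2.metadata.json
  }
}
```

The client reports a failed commit while the table already points at the new metadata file, which is the state in which deleting that file breaks the table.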

There might be two ways to fix the issue:

  1. Don't configure HMS HA, or don't use RetryingMetaStoreClient, for committing, so that concurrent modification exceptions from HMS are more reliable. We may then need to retry on Thrift exceptions ourselves.
  2. Call checkCommitStatus on concurrent modification exceptions to make sure the commit really failed. This is simpler, but I believe it adds an extra refresh from HMS.
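
A minimal sketch of the second option, assuming the status check boils down to re-reading the table's current metadata location and comparing it with the location we tried to commit. All names here (`CommitStatusSketch`, `TableModifiedException`, `commit`) are hypothetical and chosen for illustration; this is not the HiveTableOperations code, and the real checkCommitStatus may do more than this.

```java
import java.util.function.Supplier;

class CommitStatusSketch {
  enum Status { SUCCESS, FAILURE }

  /** Hypothetical exception standing in for the HMS "table has been modified" error. */
  static class TableModifiedException extends RuntimeException {
    TableModifiedException(String message) {
      super(message);
    }
  }

  /**
   * Runs the commit. On a concurrent-modification error, re-reads the current
   * metadata location (the extra refresh) before declaring the commit failed.
   */
  static Status commit(Runnable alterTable, Supplier<String> currentLocation, String newLocation) {
    try {
      alterTable.run();
      return Status.SUCCESS;
    } catch (TableModifiedException e) {
      // Only treat this as a failure if the table does not point at our metadata file.
      return newLocation.equals(currentLocation.get()) ? Status.SUCCESS : Status.FAILURE;
    }
  }

  public static void main(String[] args) {
    String mine = "v2.metadata.json";
    Runnable rejected = () -> {
      throw new TableModifiedException("The table has been modified.");
    };
    // Our earlier request was applied: the error is spurious.
    System.out.println(commit(rejected, () -> mine, mine));                  // SUCCESS
    // Someone else committed: the error is genuine.
    System.out.println(commit(rejected, () -> "other.metadata.json", mine)); // FAILURE
  }
}
```

This sketch ignores the case where another writer has already committed on top of our metadata file, so a complete check would likely need to look at more than the current location.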

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@lirui-apache lirui-apache added the bug Something isn't working label Dec 24, 2024
@lirui-apache
Contributor Author

@pvary What do you think about the issue?
