Apache Iceberg version
1.4.3
Query engine
Spark
Please describe the bug 🐞
We are using NoLock for committing, and we recently hit an issue where HiveTableOperations treated a successful commit as a concurrent modification and cleaned up the metadata JSON file that had already been committed, leaving the table in an unusable state.
We run HMS in HA mode with 3 instances. Their logs show they were under heavy load at the time of the issue, and we found alter table requests from the committing job on 2 of the instances. So I believe the issue happened like this:
1. The first alter table request succeeded, but the HMS instance failed to deliver the successful response to the client.
2. The client failed over to another instance, and since the metadata location had already been changed by the first request, that instance returned an exception with a message like "The table has been modified. ...".
3. HiveTableOperations inspected the exception message, decided this was a CommitFailedException, and deleted the metadata JSON file it had created, even though the table now pointed to it (sketched below).
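Here is a minimal sketch, in plain Java, of how classifying the failure by exception message goes wrong in this scenario. This is not the actual Iceberg code; `MetastoreOps` and its methods are hypothetical stand-ins for the real HMS call and cleanup logic:

```java
import org.apache.iceberg.exceptions.CommitFailedException;
import org.apache.thrift.TException;

class MessageBasedCommitClassifier {

    // Hypothetical stand-ins for the real HMS alter_table call and cleanup.
    interface MetastoreOps {
        void alterTable(String newMetadataLocation) throws TException;
        void deleteMetadataFile(String metadataLocation);
    }

    void commit(MetastoreOps ops, String newMetadataLocation) {
        try {
            // Request 1 succeeds on HMS instance A, but the response is lost;
            // the client fails over and replays alter_table on instance B.
            ops.alterTable(newMetadataLocation);
        } catch (TException e) {
            String msg = e.getMessage();
            if (msg != null && msg.contains("The table has been modified")) {
                // Instance B sees the location already written by request 1
                // and reports a "concurrent" modification -- but it was our
                // own write. Classifying this as CommitFailedException deletes
                // a metadata file the table now points to.
                ops.deleteMetadataFile(newMetadataLocation);
                throw new CommitFailedException(e, "The table has been modified");
            }
            throw new RuntimeException(e);
        }
    }
}
```

The dangerous step is the unconditional delete: once instance B reports "modified", the file that the table's metadata location now points to is removed.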
There might be two ways to fix the issue:
1. Don't configure HMS HA, or don't use RetryingMetaStoreClient for committing, so that concurrent modification exceptions from HMS become reliable again. But then we may need to retry Thrift exceptions ourselves.
2. Run a checkCommitStatus on concurrent modification exceptions, to confirm that the commit really failed before cleaning up. This is simpler, but I believe it adds an extra refresh from HMS (see the sketch below).
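To illustrate option 2, here is a hedged sketch of verifying the commit before cleanup. It assumes a checkCommitStatus-style helper that re-reads the table from HMS (the extra refresh mentioned above) and reports SUCCESS, FAILURE, or UNKNOWN; `Ops` and its methods are hypothetical stand-ins, not the actual HiveTableOperations API:

```java
import org.apache.iceberg.exceptions.CommitFailedException;
import org.apache.iceberg.exceptions.CommitStateUnknownException;
import org.apache.thrift.TException;

class VerifiedCommit {

    enum CommitStatus { SUCCESS, FAILURE, UNKNOWN }

    // Hypothetical stand-ins: alterTable() is the HMS call; checkCommitStatus()
    // re-reads the table to see whether our metadata location actually won.
    interface Ops {
        void alterTable(String newMetadataLocation) throws TException;
        CommitStatus checkCommitStatus(String newMetadataLocation);
        void deleteMetadataFile(String metadataLocation);
    }

    void commit(Ops ops, String newMetadataLocation) {
        try {
            ops.alterTable(newMetadataLocation);
        } catch (TException e) {
            String msg = e.getMessage();
            if (msg != null && msg.contains("The table has been modified")) {
                // Don't trust the message alone: check whether the "concurrent"
                // modification is in fact our own earlier, successful request.
                switch (ops.checkCommitStatus(newMetadataLocation)) {
                    case SUCCESS:
                        return; // request 1 already won; nothing to clean up
                    case FAILURE:
                        ops.deleteMetadataFile(newMetadataLocation);
                        throw new CommitFailedException(e, "The table has been modified");
                    default:
                        // When in doubt, leave the metadata file in place.
                        throw new CommitStateUnknownException(e);
                }
            }
            throw new RuntimeException(e);
        }
    }
}
```

Only a confirmed FAILURE triggers cleanup; an UNKNOWN result leaves the metadata file in place, which mirrors how Iceberg treats ambiguous commit states elsewhere by surfacing CommitStateUnknownException rather than deleting anything.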
Willingness to contribute
I can contribute a fix for this bug independently
I would be willing to contribute a fix for this bug with guidance from the Iceberg community
I cannot contribute a fix for this bug at this time