Verification error in TestV3WatchRestoreSnapshotUnsync #13922
Comments
I will take a deep dive into this tomorrow. |
@serathius @ptabor @spzala Please freeze the main branch until I figure out the root cause. |
Having problems with reproducing this even on Running
|
ok, I reproduced issue locally on |
Confirmed that issue is reproducible on |
Problem: I can't confirm whether this is a new issue in v3.5, as the ETCD_VERIFY checks were not present in v3.4 :( I would like to get confirmation from the author @ptabor:
|
I would expect them to be enabled in integration tests by the testing.BeforeTest(...) logic: etcd/tests/framework/integration/testing.go Line 104 in dd08e15
I think it's a valid (critical!) problem, related to the consistency-issue fix: https://github.com/etcd-io/etcd/runs/5968635467?check_suite_focus=true m0 was disconnected, missed some commits, and was then reconnected, so it gets the snapshot (from M2, the new leader):
This enables : etcd/server/etcdserver/snapshot_merge.go Line 42 in dd08e15
that calls: etcd/server/storage/backend/batch_tx.go Lines 242 to 245 in dd08e15
so it does NOT flush consistencyIndex... but there might be just-applied transactions in the backend. Thus the
Solutions:
|
@ptabor I don't think it's the root cause, because we persist the consistent_index in OnPreCommitUnsafe. UPDATE: Got it, the reason is that we did not call LockInsideApply!!! |
Great to hear we know the root cause @ahrtr! Awaiting the PR with fix. |
On second thought, I still don't think it's the root cause. If there is at least one entry in apply.Entries, then the index being applied must have been saved to consistent_index; accordingly it will be persisted to the db at OnPreCommitUnsafe: etcd/server/storage/backend/batch_tx.go Line 333 in dd08e15
If there are no entries in the current apply.Entries and createMergedSnapshotMessage is triggered, then the consistent_index will not be persisted, just as @ptabor pointed out. But from another perspective, it also means that there is no need to persist consistent_index, because its value hasn't changed. @serathius can you always/easily reproduce this issue? I tried a couple of times but couldn't reproduce it. |
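The mechanism debated in the comments above can be modelled with a toy example. This is a minimal sketch, not etcd code: `batchTx`, `server`, `commit` and `commitSkippingHooks` are hypothetical names invented for illustration, and the sketch does not claim to reproduce the actual root cause (see PR 13930 for that). It only shows why a commit path that bypasses the pre-commit hook can leave applied data on disk without the matching consistent index.

```go
package main

import "fmt"

// batchTx is a toy stand-in for a backend batch transaction.
// All names here are hypothetical and do not mirror etcd's real types.
type batchTx struct {
	pending   map[string]uint64 // writes buffered in the current transaction
	disk      map[string]uint64 // writes that have been committed
	preCommit func(tx *batchTx) // hook run just before a normal commit
}

func (tx *batchTx) put(key string, v uint64) { tx.pending[key] = v }

// flush moves all buffered writes to the committed state.
func (tx *batchTx) flush() {
	for k, v := range tx.pending {
		tx.disk[k] = v
	}
	tx.pending = map[string]uint64{}
}

// commit runs the pre-commit hook (if any) and then flushes.
func (tx *batchTx) commit() {
	if tx.preCommit != nil {
		tx.preCommit(tx)
	}
	tx.flush()
}

// commitSkippingHooks flushes without running the pre-commit hook.
func (tx *batchTx) commitSkippingHooks() { tx.flush() }

// server holds the in-memory consistent index and its transaction.
type server struct {
	consistentIndex uint64
	tx              *batchTx
}

func newServer() *server {
	s := &server{tx: &batchTx{
		pending: map[string]uint64{},
		disk:    map[string]uint64{},
	}}
	// The hook persists the in-memory index together with the data.
	s.tx.preCommit = func(tx *batchTx) {
		tx.put("consistent_index", s.consistentIndex)
	}
	return s
}

// apply simulates applying one entry: bump the in-memory index, buffer data.
func (s *server) apply(index uint64) {
	s.consistentIndex = index
	s.tx.put("data", index)
}

func main() {
	a := newServer()
	a.apply(5)
	a.tx.commit()
	fmt.Println(a.tx.disk["data"], a.tx.disk["consistent_index"]) // 5 5

	b := newServer()
	b.apply(5)
	b.tx.commitSkippingHooks()
	fmt.Println(b.tx.disk["data"], b.tx.disk["consistent_index"]) // 5 0
}
```

In the second case the data written at index 5 is committed while the stored index is still 0, which is the kind of mismatch a consistent-index check would flag. If nothing had been applied before the hook-free commit, data and index would both be unchanged, which matches the argument that nothing needs persisting in that case.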
Yes, I can always reproduce it with command:
One thing I needed to make sure of is that the TMPDIR environment variable points to an in-memory mounted filesystem; should be default though on standard filesystem partition (
|
I eventually figured out the root cause; please see the explanation in PR 13930. This should be a legacy issue, so we need to backport the fix to 3.5 as well. I will do that once this one is merged. |
We have recently enabled ETCD_VERIFY in integration tests (#13903) in the hope of automatically verifying the fix for the recent data inconsistency issue. Apart from introducing a new check dedicated to data inconsistency, we also enabled a previously existing check for the consistent index. Based on the recent wave of flakes on the main branch, we have an issue.
All the failures I found happen in the TestV3WatchRestoreSnapshotUnsync test:
Examples:
Our first objective is to verify the issue on the release-3.5 branch, as this would impact the upcoming v3.5.3 release. The second is to fix the issue on the main branch.