-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tests: reproduce a panic #18588
tests: reproduce a panic #18588
Conversation
Signed-off-by: Clement <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: clement2026 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @clement2026. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files
... and 87 files with indirect coverage changes @@ Coverage Diff @@
## main #18588 +/- ##
==========================================
- Coverage 68.73% 66.47% -2.26%
==========================================
Files 420 420
Lines 35474 35477 +3
==========================================
- Hits 24382 23583 -799
- Misses 9666 10437 +771
- Partials 1426 1457 +31 Continue to review full report in Codecov by Sentry.
|
Signed-off-by: Clement <[email protected]>
The panic is because you changed the production code in this PR. Previously etcd only performs compaction after each snapshot. Now you perform compaction on each application (applying entries). (Have you evaluated the CPU usage change?). You changed the behaviour, but I do not see a KEP, nor a design doc, let alone careful review by the community. Please draft a google doc or KEP to summary the motivation & design, and request review from the community. I understood that it's related to #17098, and also I see you raised some related small PRs. Each of the PR might seem harmless, but let's get the high level design clarified before moving on. |
Thanks for raising this concern. I understand the importance of KEPs and community review, especially for changes to critical apply code. As you noted, this PR is part of a series of performance improvements outlined in issue #17098. This issue clearly states the motivation for these changes, and includes data gathered by @clement2026. These PRs aim to address the specific problem of etcd not having a snapshot file at startup, as detailed in #18459. Creating a snapshot at startup offers several advantages:
You're right that individual PRs might seem harmless in isolation. However, they are interconnected and build upon each other to achieve the desired performance improvements. A comprehensive overview can be found in #17098. To address your request for a more formalized design document, I'm happy to draft a Google Doc or KEP summarizing the motivation, design decisions, and implementation details of this performance improvement initiative. I'll share it with the community for review and feedback. |
Thanks @serathius for clarifying the ongoing work! If there's anything I can help with, just let me know. |
@clement2026 Please reach out to me on the Kubernetes Slack. I'm little busy this week, but I would want to unblock your work as early as possible. |
After chatting with serathius, I've realized this PR isn’t needed anymore. I'll keep working on #17098 in a new PR. |
This PR is just for demo and discussion, please don't merge it.
@ahrtr Let's continue our discussion in #18459 (comment)
Please forgive me for not explaining the problem clearly. The panic issue I mentioned is not a bug in the main branch; it's actually a side effect of PR #18459.
The panic shows up because PR #18459 separates the raft log compact from the snapshot.
I've made minor changes to
server.go
, to separate the raft log compact from the snapshot. Now, the raft log gets compacted in each apply loop, retaining only SnapshotCatchUpEntries` entries.Two test cases are added to demonstrate when a panic should or shouldn't occur. Please take a look at the files changed.