Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: reproduce a panic #18588

Closed

Conversation

clement2026
Copy link
Contributor

@clement2026 clement2026 commented Sep 14, 2024

This PR is just for demo and discussion, please don't merge it.

@ahrtr Let's continue our discussion in #18459 (comment)

Please forgive me for not explaining the problem clearly. The panic issue I mentioned is not a bug in the main branch; it's actually a side effect of PR #18459.

The panic shows up because PR #18459 separates the raft log compact from the snapshot.

I've made minor changes to server.go, to separate the raft log compact from the snapshot. Now, the raft log gets compacted in each apply loop, retaining only SnapshotCatchUpEntries` entries.

Two test cases are added to demonstrate when a panic should or shouldn't occur. Please take a look at the files changed.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clement2026
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link

Hi @clement2026. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@clement2026 clement2026 changed the title reproduce raft panic tests: reproduce raft panic Sep 14, 2024
@clement2026 clement2026 changed the title tests: reproduce raft panic tests: reproduce a panic Sep 14, 2024
@codecov-commenter
Copy link

codecov-commenter commented Sep 14, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

Project coverage is 66.47%. Comparing base (981061a) to head (75a1cbd).

Current head 75a1cbd differs from pull request most recent head ba4bc88

Please upload reports for the commit ba4bc88 to get more accurate results.

Files with missing lines Patch % Lines
server/etcdserver/server.go 71.42% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines Coverage Δ
server/etcdserver/server.go 74.60% <71.42%> (-6.81%) ⬇️

... and 87 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18588      +/-   ##
==========================================
- Coverage   68.73%   66.47%   -2.26%     
==========================================
  Files         420      420              
  Lines       35474    35477       +3     
==========================================
- Hits        24382    23583     -799     
- Misses       9666    10437     +771     
- Partials     1426     1457      +31     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 981061a...ba4bc88. Read the comment docs.

@ahrtr
Copy link
Member

ahrtr commented Sep 16, 2024

The panic is because you changed the production code in this PR.

Previously etcd only performs compaction after each snapshot. Now you perform compaction on each application (applying entries). (Have you evaluated the CPU usage change?). You changed the behaviour, but I do not see a KEP, nor a design doc, let alone careful review by the community. Please draft a google doc or KEP to summary the motivation & design, and request review from the community.

I understood that it's related to #17098, and also I see you raised some related small PRs. Each of the PR might seem harmless, but let's get the high level design clarified before moving on.

@serathius
Copy link
Member

Thanks for raising this concern. I understand the importance of KEPs and community review, especially for changes to critical apply code.

As you noted, this PR is part of a series of performance improvements outlined in issue #17098. This issue clearly states the motivation for these changes, and includes data gathered by @clement2026. These PRs aim to address the specific problem of etcd not having a snapshot file at startup, as detailed in #18459.

Creating a snapshot at startup offers several advantages:

  • Minimal Disruption: It doesn't require altering the maybeSendSnapshot logic, preserving backward compatibility and avoiding potential edge cases.
  • Leverages Existing Mechanisms: It reuses etcd's existing snapshotting mechanism, ensuring safety and reliability.

You're right that individual PRs might seem harmless in isolation. However, they are interconnected and build upon each other to achieve the desired performance improvements. A comprehensive overview can be found in #17098.

To address your request for a more formalized design document, I'm happy to draft a Google Doc or KEP summarizing the motivation, design decisions, and implementation details of this performance improvement initiative. I'll share it with the community for review and feedback.

@clement2026
Copy link
Contributor Author

Thanks @serathius for clarifying the ongoing work! If there's anything I can help with, just let me know.

@serathius
Copy link
Member

@clement2026 Please reach out to me on the Kubernetes Slack.

I'm little busy this week, but I would want to unblock your work as early as possible.

@clement2026
Copy link
Contributor Author

After chatting with serathius, I've realized this PR isn’t needed anymore. I'll keep working on #17098 in a new PR.
Closing this.

@clement2026 clement2026 deleted the archive-reproduce-raft-panic branch September 24, 2024 09:21
@clement2026 clement2026 restored the archive-reproduce-raft-panic branch September 24, 2024 09:21
@clement2026 clement2026 deleted the archive-reproduce-raft-panic branch September 24, 2024 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants