Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: separate raft log compact from snapshot #18459

Conversation

clement2026
Copy link
Contributor

@clement2026 clement2026 commented Aug 17, 2024

Part of #17098

Key Changes

  • Added a new function compactRaftLog for raft log compaction.
  • Introduce 2 variables: RaftLogCompactionStep and DefaultRaftLogCompactionStep to manage how often the raft log is compacted.
  • Refactor some existing tests. These tests previously expected a compaction immediately after a snapshot to ensure a snapshot was sent to followers. This PR broke those tests, so I modified them to ensure they still work.
  • Add unit and integration tests for this PR.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clement2026
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link

Hi @clement2026. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@clement2026 clement2026 changed the title etcdserver: separate raft log compact from snapshot [WIP] etcdserver: separate raft log compact from snapshot Aug 17, 2024
@clement2026 clement2026 changed the title [WIP] etcdserver: separate raft log compact from snapshot WIP: etcdserver: separate raft log compact from snapshot Aug 17, 2024
@clement2026
Copy link
Contributor Author

Benchmark Summary

Here are the benchmark results for compacting the raft log every time applied index increases by 1, 10, 100, and 1000 (let's call this "step").

Branch Throughput
main(as the base) read -
write -
step_1 read [2.86%, 17.00%]
write [3.02%, 16.75%]
step_10 read [2.82%, 16.53%]
write [2.98%, 16.95%]
step_100 read [-2.09%, 9.77%]
write [-1.00%, 10.93%]
step_1000 read [-4.13%, 7.27%]
write [-4.91%, 7.65%]
main again read [-4.19%, 7.39%]
write [-4.41%, 9.03%]

Throughput performs best when the step is 1 or 10. There isn’t much difference between the two. I've set the default step value to 10 in the code.

@clement2026
Copy link
Contributor Author

clement2026 commented Aug 17, 2024

Benchmark Details

Here's how I ran the benchmark:

  1. Run benchmark on main, step_1, step_10, step_100, step_1000 and main branch in that order.
  2. Reboot the OS between each run.
  3. Collect .csv files and generate the heat maps.

step_1 vs main

Compact raft log every time applied index increases by 1.

step_1_read

step_10 vs main

Compact raft log every time applied index increases by 10.

step_10_read

step_100 vs main

Compact raft log every time applied index increases by 100.

step_100_read

step_1000 vs main

Compact raft log every time applied index increases by 1000.

step_1000_read

main vs main

Re-run benchmarks on the main branch to ensure results are consistent

main_2_read

step_1 vs step_10

Since the benchmark results for step_1 and step_10 are pretty close, here I compare them to see the difference:

step_1-vs-step_10_read


GitHub workflow: https://github.com/clement2026/etcd-benchmark-action/actions/runs/10424407957

The archive containing .csv files and the script: etcd-benchmark-20240817-06-00-29.zip

The benchmark was run on a cloud VM with:

  • 16GB RAM
  • 8 vCPUs
  • 150GB NVMe SSD
  • Ubuntu 24.04 LTS x64

@clement2026 clement2026 changed the title WIP: etcdserver: separate raft log compact from snapshot etcdserver: separate raft log compact from snapshot Aug 17, 2024
@clement2026 clement2026 marked this pull request as ready for review August 17, 2024 09:59
@codecov-commenter
Copy link

codecov-commenter commented Aug 17, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 39.13043% with 28 lines in your changes missing coverage. Please review.

Project coverage is 68.89%. Comparing base (62e4433) to head (0003fa2).
Report is 34 commits behind head on main.

Current head 0003fa2 differs from pull request most recent head 0e4d412

Please upload reports for the commit 0e4d412 to get more accurate results.

Files with missing lines Patch % Lines
client/v3/kubernetes/client.go 0.00% 27 Missing ⚠️
server/embed/etcd.go 0.00% 0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines Coverage Δ
server/config/config.go 79.76% <ø> (ø)
server/etcdserver/server.go 81.28% <100.00%> (-0.29%) ⬇️
server/features/etcd_features.go 60.00% <ø> (ø)
server/storage/mvcc/kvstore.go 89.76% <100.00%> (ø)
server/embed/etcd.go 75.77% <0.00%> (-0.05%) ⬇️
client/v3/kubernetes/client.go 0.00% <0.00%> (ø)

... and 22 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18459      +/-   ##
==========================================
+ Coverage   68.79%   68.89%   +0.10%     
==========================================
  Files         420      420              
  Lines       35490    35482       -8     
==========================================
+ Hits        24416    24446      +30     
+ Misses       9647     9613      -34     
+ Partials     1427     1423       -4     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 62e4433...0e4d412. Read the comment docs.

@jmhbnz
Copy link
Member

jmhbnz commented Aug 18, 2024

/ok-to-test

@clement2026
Copy link
Contributor Author

/retest pull-etcd-robustness-arm64

@k8s-ci-robot
Copy link

@clement2026: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test pull-etcd-build
  • /test pull-etcd-e2e-386
  • /test pull-etcd-e2e-amd64
  • /test pull-etcd-e2e-arm64
  • /test pull-etcd-govulncheck
  • /test pull-etcd-integration-1-cpu-amd64
  • /test pull-etcd-integration-2-cpu-amd64
  • /test pull-etcd-integration-4-cpu-amd64
  • /test pull-etcd-unit-test-386
  • /test pull-etcd-unit-test-amd64
  • /test pull-etcd-unit-test-arm64
  • /test pull-etcd-verify

The following commands are available to trigger optional jobs:

  • /test pull-etcd-robustness-amd64
  • /test pull-etcd-robustness-arm64

Use /test all to run all jobs.

In response to this:

/retest pull-etcd-robustness-arm64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@clement2026
Copy link
Contributor Author

/test pull-etcd-robustness-arm64

1 similar comment
@clement2026
Copy link
Contributor Author

/test pull-etcd-robustness-arm64

@clement2026
Copy link
Contributor Author

Reason for the failure

The robustness test failed at panic("need non-empty snapshot") in the raft library.

This happens before the leader creates any snapshot, and raftStorage.ents has already been compacted and lacks the entries a slow follower needs. So, the slow follower requests a snapshot instead, causing the leader to panic.

https://github.com/etcd-io/raft/blob/ba8d7e340d530ae106118f059d3da4d89f933d23/raft.go#L662C1-L689

// maybeSendSnapshot fetches a snapshot from Storage, and sends it to the given
// node. Returns true iff the snapshot message has been emitted successfully.
func (r *raft) maybeSendSnapshot(to uint64, pr *tracker.Progress) bool {
	if !pr.RecentActive {
		r.logger.Debugf("ignore sending snapshot to %x since it is not recently active", to)
		return false
	}


	snapshot, err := r.raftLog.snapshot()
	if err != nil {
		if err == ErrSnapshotTemporarilyUnavailable {
			r.logger.Debugf("%x failed to send snapshot to %x because snapshot is temporarily unavailable", r.id, to)
			return false
		}
		panic(err) // TODO(bdarnell)
	}
	if IsEmptySnap(snapshot) {
		panic("need non-empty snapshot")
	}
	sindex, sterm := snapshot.Metadata.Index, snapshot.Metadata.Term
	r.logger.Debugf("%x [firstindex: %d, commit: %d] sent snapshot[index: %d, term: %d] to %x [%s]",
		r.id, r.raftLog.firstIndex(), r.raftLog.committed, sindex, sterm, to, pr)
	pr.BecomeSnapshot(sindex)
	r.logger.Debugf("%x paused sending replication messages to %x [%s]", r.id, to, pr)


	r.send(pb.Message{To: to, Type: pb.MsgSnap, Snapshot: &snapshot})
	return true
}

Visualization

Here are some images to help us understand the situation better.

Main branch

Compaction only happens right after a snapshot is created.

Untitled-2024-08-16-18062

This PR

Compaction can happen before a snapshot is created.

Untitled-2024-08-16-1806 3

Solutions

  • Modify the implementation of func (r *raft) maybeSendSnapshot(to uint64, pr *tracker.Progress) bool to send an empty snapshot instead of causing a panic. However, I’m not sure about the expected behavior or potential risks.
  • Ensure a snapshot is created if it doesn’t exist when etcd starts up. This might sound silly, I know 😂

I'm stumped right now. Any ideas would be awesome!

@clement2026
Copy link
Contributor Author

/test pull-etcd-robustness-arm64

2 similar comments
@clement2026
Copy link
Contributor Author

/test pull-etcd-robustness-arm64

@clement2026
Copy link
Contributor Author

/test pull-etcd-robustness-arm64

@serathius
Copy link
Member

Ensure a snapshot is created if it doesn’t exist when etcd starts up. This might sound silly, I know

Sounds reasonable. We are separating mechanisms for raft log compaction and snapshots, meaning that now we need to guarantee that snapshot is always available. Creating empty snapshot at the etcd start sounds like a good first iteration.

@clement2026
Copy link
Contributor Author

Sounds reasonable. We are separating mechanisms for raft log compaction and snapshots, meaning that now we need to guarantee that snapshot is always available. Creating empty snapshot at the etcd start sounds like a good first iteration.

Thank you for the info! I'll give this a shot and open a new PR soon.

@clement2026 clement2026 force-pushed the issue-17098-separate-raft-log-compact branch 3 times, most recently from 3c14a8f to 0e4d412 Compare August 31, 2024 19:22
@k8s-ci-robot
Copy link

@clement2026: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-robustness-arm64 0e4d412 link false /test pull-etcd-robustness-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ahrtr
Copy link
Member

ahrtr commented Sep 13, 2024

Compaction can happen before a snapshot is created.

Untitled-2024-08-16-1806 3

leader sending a snapshot means that the leader sends the whole bbolt db snapshot. It doesn't care about what's the current applied index the follower is.

@clement2026
Copy link
Contributor Author

@ahrtr the image isn't loading; It's not set to public: https://private-user-images.githubusercontent.com/27894831/358932866-df2dbdb7-2165-492c-93b8-907e09651fd4.png

leader sending a snapshot means that the leader sends the whole bbolt db snapshot. It doesn't care about what's the current applied index the follower is.

Though the leader doesn't care about the follower, the leader itself can panic before sending out the snapshot.

Before sending a snapshot to a follower, the leader runs maybeSendSnapshot() and may panic here:

https://github.com/etcd-io/raft/blob/ba8d7e340d530ae106118f059d3da4d89f933d23/raft.go#L678-L680

	if IsEmptySnap(snapshot) {
		panic("need non-empty snapshot")
	}

All the leader wants is actually a bbolt db snapshot, not a raft log snapshot. But it seems I can't walk around maybeSendSnapshot() as it's part of the raft mechanism.

@ahrtr
Copy link
Member

ahrtr commented Sep 13, 2024

Though the leader doesn't care about the follower, the leader itself can panic before sending out the snapshot.

Before sending a snapshot to a follower, the leader runs maybeSendSnapshot() and may panic here:

https://github.com/etcd-io/raft/blob/ba8d7e340d530ae106118f059d3da4d89f933d23/raft.go#L678-L680

	if IsEmptySnap(snapshot) {
		panic("need non-empty snapshot")
	}

It's concerning we mix a potential bug with enhancement. Can you raise a separate issue to clearly clarify how to reproduce the issue?

@clement2026
Copy link
Contributor Author

It's concerning we mix a potential bug with enhancement. Can you raise a separate issue to clearly clarify how to reproduce the issue?

Sure, I'll create the issue a bit later.

@clement2026
Copy link
Contributor Author

After chatting with serathius, I've realized this PR isn’t needed anymore. I'll keep working on #17098 in a new PR.
Closing this.

@clement2026 clement2026 deleted the issue-17098-separate-raft-log-compact branch September 24, 2024 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants