[v24.2.x] tx/group compaction fixes #24689
This is unsafe because it does not perform the checks required to determine whether a particular transaction is in progress and is a candidate for abort. For example, if a transaction is committed by the coordinator and is pending commit on the group, using this escape hatch to abort it can cause correctness issues. Use it with caution, and only as an escape hatch for aborting transactions that the group has lost track of and that are OK to abort. This situation is usually indicative of a bug in the transaction implementation. (cherry picked from commit 8c5ecca)
(cherry picked from commit 3736f00)
(cherry picked from commit 835f3fc)
Consider group_metadata when determining whether a group transaction should be considered open. E.g., if a group is tombstoned, any transaction corresponding to the group is ignored. This invariant is also held in the actual group stm to ensure groups are not tombstoned before any pending transactions are cleaned up. (cherry picked from commit 9eee632)
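The rule described in this commit message can be sketched roughly as follows. This is a minimal Python illustration and not the Redpanda implementation (which is C++); `GroupState`, `pending_tx`, and `open_transactions` are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class GroupState:
    """Hypothetical per-group view rebuilt while scanning the log."""
    tombstoned: bool = False
    # producer id -> offset where its in-progress transaction began
    pending_tx: dict = field(default_factory=dict)


def open_transactions(groups):
    """Return in-progress transactions, skipping tombstoned groups."""
    result = {}
    for name, state in groups.items():
        if state.tombstoned:
            # A tombstoned group is treated as having no open transactions.
            continue
        for producer_id, begin_offset in state.pending_tx.items():
            result[(name, producer_id)] = begin_offset
    return result


groups = {
    "live-group": GroupState(tombstoned=False, pending_tx={1: 10}),
    "dead-group": GroupState(tombstoned=True, pending_tx={2: 5}),
}
open_tx = open_transactions(groups)
```

In this sketch only the transaction from `live-group` is reported as open, so a transaction left behind in a tombstoned group would not hold back compaction.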
(cherry picked from commit f7191ad)
(cherry picked from commit 6ca81b4)
(cherry picked from commit 6fc62bb)
This will result in hanging transactions and subsequent blocking of compaction. (cherry picked from commit 2b79687)
(cherry picked from commit ac22041)
If a group is tombstoned, all the producers for that group should be ignored. The current logic incorrectly recovers these producers and loads them up to expire later. (cherry picked from commit 7c8d633)
(cherry picked from commit c7f953e)
(cherry picked from commit 9958ca6)
.. for a given partition, to be hooked up with REST API in the next commit. (cherry picked from commit 6efd325)
(cherry picked from commit 23c8e29)
/v1/debug/producers/{namespace}/{topic}/{partition} .. includes low level debug information about producers for idempotency/transactional state. (cherry picked from commit 70e36eb)
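As a rough illustration of how the endpoint path above is filled in, the following Python sketch builds a request URL for one partition. The helper name `debug_producers_url` and the example base URL, namespace, and topic are assumptions made for this sketch; the response format is not described in this PR and is not modeled here.

```python
from urllib.parse import quote


def debug_producers_url(base_url, namespace, topic, partition):
    """Build the producers debug URL for a single partition."""
    path = "/v1/debug/producers/{}/{}/{}".format(
        quote(namespace, safe=""),  # percent-encode each path segment
        quote(topic, safe=""),
        int(partition),
    )
    return base_url.rstrip("/") + path


# Example values below are assumptions for illustration only.
url = debug_producers_url("http://localhost:9644", "kafka", "my-topic", 0)
```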
.. in this case the state machine proceeds on to applying from the log. (cherry picked from commit c833f50)
Bumps the supported snapshot version so that existing snapshots are invalidated, as they may contain a stale max_collectible_offset. This forces the stm to reconstruct the state from the log and recompute the correct max_collectible_offset. (cherry picked from commit 0051463)
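The effect of a version bump like this can be sketched as below. This is a simplified Python model under stated assumptions, not the actual stm code: the constant `SUPPORTED_SNAPSHOT_VERSION`, its value, the dictionary layout of a snapshot, and the function `load_snapshot` are all hypothetical.

```python
SUPPORTED_SNAPSHOT_VERSION = 2  # hypothetical value, bumped from 1


def load_snapshot(snapshot):
    """Return the snapshot state if usable, else None to force log replay."""
    if snapshot is None:
        return None
    if snapshot.get("version") != SUPPORTED_SNAPSHOT_VERSION:
        # A snapshot written under another version may carry a stale
        # max_collectible_offset, so it is ignored rather than loaded.
        return None
    return snapshot["state"]


old = {"version": 1, "state": {"max_collectible_offset": 5}}
new = {"version": 2, "state": {"max_collectible_offset": 42}}
old_state = load_snapshot(old)
new_state = load_snapshot(new)
```

When `load_snapshot` returns `None`, the caller in this sketch would rebuild its state by applying from the log, which recomputes max_collectible_offset from scratch.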
Are RandomNodeOperationsTest failures related?
Would be worth adding a note about why 3d65c39 was skipped
There are no test failures in the latest run. The jobs all fail, probably due to some vtools issues; the same is the case with other 24.2.x backports that I spot-checked.
Added a note in the PR summary.
Backport of PR #24637
Fixes #24685
Note on omission of 3d65c39
It exposed what seemed like a race condition in the log reader. The commit just blocks the apply fiber, which holds a reader instance to a log range. The fiber remains in that state during the entire upgrade, and it turns out that in that state a competing log reader for the same range just times out. In this case it was the recovery stm trying to move partitions, and the decommission was blocked. Just removed the commit altogether, since it was a conservative check added to avoid compaction in mixed mode and has no known correctness implications.
Backports Required
Release Notes
Bug Fixes