[v24.2.x] tx/group compaction fixes #24689

Merged
merged 20 commits into from
Jan 6, 2025

Conversation

bharathv
Contributor

@bharathv bharathv commented Jan 4, 2025

Backport of PR #24637
Fixes #24685

Note on omission of 3d65c39

It exposed what appeared to be a race condition in the log reader. The commit simply blocks the apply fiber, which holds a reader instance on a log range. The fiber remains in that state for the entire upgrade, and it turns out that while it does, a competing log reader for the same range just times out. In this case the competing reader was the recovery stm trying to move partitions, so the decommission was blocked. I removed the commit altogether, since it was a conservative check added to avoid compaction in mixed mode and has no known correctness implications.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Fixes an issue that blocked the compaction of consumer offsets with group transactions.

@bharathv bharathv added this to the v24.2.x-next milestone Jan 4, 2025
@bharathv bharathv requested a review from a team as a code owner January 4, 2025 03:20
This is unsafe because it does not do any of the required checks to see
if a particular transaction is in progress and is a candidate for abort.
For example, if a transaction is committed by the coordinator and is
pending commit on the group, using this escape hatch to abort the
transaction can cause correctness issues. To be used with caution as an
escape hatch for aborting transactions that the group has lost track of
and that are ok to abort. This situation is usually indicative of a bug
in the transaction implementation.

(cherry picked from commit 8c5ecca)
Consider group_metadata to determine if a group transaction should be
considered open. E.g., if a group is tombstoned, any transaction
corresponding to the group is ignored. This invariant is also held in
the actual group stm to ensure groups are not tombstoned before any
pending transactions are cleaned up.

(cherry picked from commit 9eee632)
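
The rule described in this commit message can be sketched as a replay over a simplified log. This is an illustrative Python model only, not the Redpanda implementation: the `Record` type, its record kinds (`group_metadata`, `tombstone`, `tx_begin`, `tx_end`), and the `open_transactions` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Simplified log record; the fields are assumptions for illustration."""
    kind: str           # "group_metadata", "tombstone", "tx_begin" or "tx_end"
    group: str
    producer: int = -1  # unused for group-level records


def open_transactions(log):
    """Return the (group, producer) pairs whose transactions count as open.

    Transactions that belong to a tombstoned group are ignored.
    """
    tombstoned = set()
    open_txs = set()
    for rec in log:
        if rec.kind == "group_metadata":
            # Fresh metadata means the group exists (again).
            tombstoned.discard(rec.group)
        elif rec.kind == "tombstone":
            tombstoned.add(rec.group)
            # Drop everything tracked for the deleted group.
            open_txs = {tx for tx in open_txs if tx[0] != rec.group}
        elif rec.kind == "tx_begin":
            open_txs.add((rec.group, rec.producer))
        elif rec.kind == "tx_end":
            open_txs.discard((rec.group, rec.producer))
    return {tx for tx in open_txs if tx[0] not in tombstoned}
```

In this model, a transaction left dangling on a tombstoned group no longer counts as open, so it would not hold back compaction.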
@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 4, 2025

Retry command for Build#60290

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/partition_movement_upgrade_test.py::PartitionMovementUpgradeTest.test_basic_upgrade
tests/rptest/tests/partition_movement_test.py::PartitionMovementTest.test_dynamic@{"num_to_upgrade":2}
tests/rptest/tests/upgrade_test.py::UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback@{"upgrade_after_rollback":false}
tests/rptest/tests/upgrade_test.py::UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback@{"upgrade_after_rollback":true}
tests/rptest/tests/upgrade_test.py::UpgradeWithWorkloadTest.test_rolling_upgrade
tests/rptest/tests/partition_movement_test.py::PartitionMovementTest.test_bootstrapping_after_move@{"num_to_upgrade":2}
tests/rptest/tests/partition_movement_test.py::PartitionMovementTest.test_move_consumer_offsets_intranode@{"num_to_upgrade":2}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":1,"num_to_upgrade":2}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":2,"num_to_upgrade":2}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":false,"mixed_versions":true,"with_tiered_storage":false}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":true,"mixed_versions":true,"with_tiered_storage":false}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":false,"mixed_versions":true,"with_tiered_storage":true}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":true,"mixed_versions":true,"with_tiered_storage":true}

@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 4, 2025

CI test results

test results on build#60290
test_id test_kind job_url test_status passed
gtest_raft_rpunit.gtest_raft_rpunit unit https://buildkite.com/redpanda/redpanda/builds/60290#0194303b-2fac-4a02-b0e3-c63938f7aff5 FLAKY 1/2
rptest.tests.partition_movement_test.PartitionMovementTest.test_bootstrapping_after_move.num_to_upgrade=2 ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4923-9a33-199973002815 FAIL 0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_bootstrapping_after_move.num_to_upgrade=2 ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4de7-aae6-76f08d243fc4 FAIL 0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic.num_to_upgrade=2 ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4a29-af14-3cfe65641d9f FAIL 0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic.num_to_upgrade=2 ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4cb7-a5c0-592302ff6699 FAIL 0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_move_consumer_offsets_intranode.num_to_upgrade=2 ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a5-4105-a425-0cd744cf9528 FAIL 0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_move_consumer_offsets_intranode.num_to_upgrade=2 ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c412-452f-87fa-c2607f22be50 FAIL 0/1
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=2.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c412-452f-87fa-c2607f22be50 FAIL 0/1
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=2.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4de7-aae6-76f08d243fc4 FAIL 0/1
rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4a29-af14-3cfe65641d9f FAIL 0/1
rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4cb7-a5c0-592302ff6699 FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=True.with_tiered_storage=False ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4cb7-a5c0-592302ff6699 FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=True.with_tiered_storage=True ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c414-4144-8c9a-e5433a3d0be4 FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=True.with_tiered_storage=False ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4cb7-a5c0-592302ff6699 FAIL 0/6
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=True.with_tiered_storage=True ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c414-4144-8c9a-e5433a3d0be4 FAIL 0/6
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4923-9a33-199973002815 FAIL 0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c414-4144-8c9a-e5433a3d0be4 FAIL 0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=False ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4a29-af14-3cfe65641d9f FAIL 0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=False ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c412-452f-87fa-c2607f22be50 FAIL 0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=True ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a4-4183-b73c-8706f34c24b7 FAIL 0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=True ducktape https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4de7-aae6-76f08d243fc4 FAIL 0/1
test results on build#60292
test_id test_kind job_url test_status passed
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=True.with_tiered_storage=False ducktape https://buildkite.com/redpanda/redpanda/builds/60292#019433bf-7925-4643-83b9-3b3cf1f0162e FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=True.with_tiered_storage=True ducktape https://buildkite.com/redpanda/redpanda/builds/60292#019433bf-7926-43f4-8c0b-20013c1dab8e FAIL 0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=True.with_tiered_storage=False ducktape https://buildkite.com/redpanda/redpanda/builds/60292#019433bf-7925-4643-83b9-3b3cf1f0162e FLAKY 1/6

@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 5, 2025

Retry command for Build#60292

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":false,"mixed_versions":true,"with_tiered_storage":true}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":false,"mixed_versions":true,"with_tiered_storage":false}

This will result in hanging transactions and subsequent blocking
of compaction.

(cherry picked from commit 2b79687)
If a group got tombstoned all the producers to that group should be
ignored. The current logic is incorrectly recovering producers and
loading them up to expire later.

(cherry picked from commit 7c8d633)
.. for a given partition, to be hooked up with REST API in the next
commit.

(cherry picked from commit 6efd325)
/v1/debug/producers/{namespace}/{topic}/{partition}

.. includes low level debug information about producers for
idempotency/transactional state.

(cherry picked from commit 70e36eb)
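
A minimal sketch of how the endpoint above might be queried from Python. Only the path template comes from the commit message; the base URL (`http://localhost:9644`), the `kafka` namespace, the `__consumer_offsets` topic, and the helper names are assumptions for illustration. The response schema is not shown in this PR, so it is treated as opaque JSON.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def debug_producers_url(base_url, namespace, topic, partition):
    """Build the debug-producers URL for one partition.

    Only the path template is taken from the commit message; base_url is
    whatever address the admin API is reachable on (an assumption here).
    """
    path = "/v1/debug/producers/{}/{}/{}".format(
        quote(namespace, safe=""), quote(topic, safe=""), int(partition)
    )
    return base_url.rstrip("/") + path


def fetch_debug_producers(base_url, namespace, topic, partition, timeout=10.0):
    """GET the endpoint and decode the body as JSON (schema not assumed)."""
    url = debug_producers_url(base_url, namespace, topic, partition)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical address and partition; adjust for a real cluster.
    print(debug_producers_url("http://localhost:9644", "kafka", "__consumer_offsets", 0))
```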
.. in this case the state machine proceeds on to applying from the log.

(cherry picked from commit c833f50)
Bumps the supported snapshot version so the existing snapshots are
invalidated as they may contain stale max_collectible_offset. This forces
the stm to reconstruct the state from the log and recompute correct
max_collectible_offset.

(cherry picked from commit 0051463)
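
The effect of a version bump like this can be sketched as follows. This is a simplified model under stated assumptions, not the stm code: the snapshot layout (a dict with `version` and `state`), the constant `SUPPORTED_SNAPSHOT_VERSION`, and the `replay_log` callback are hypothetical.

```python
SUPPORTED_SNAPSHOT_VERSION = 2  # hypothetical value, after the bump


def load_state(snapshot, replay_log):
    """Use the snapshot only if its version matches; otherwise rebuild.

    snapshot:   dict with "version" and "state" keys, or None (assumed layout)
    replay_log: zero-argument callable that reconstructs state from the log
    """
    if snapshot is not None and snapshot.get("version") == SUPPORTED_SNAPSHOT_VERSION:
        return snapshot["state"]
    # Older (or missing) snapshots are ignored, so a possibly stale
    # max_collectible_offset is recomputed from the log instead.
    return replay_log()
```

With this shape, a snapshot written before the bump is skipped on startup and the state is rebuilt by replaying the log.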
Contributor

@ztlpn ztlpn left a comment


Are RandomNodeOperationsTest failures related?

Would be worth adding a note about why 3d65c39 was skipped

@bharathv
Contributor Author

bharathv commented Jan 6, 2025

Are RandomNodeOperationsTest failures related?

There are no test failures in the latest run; the jobs all fail, probably due to some vtools issues. The same is the case with other 24.2.x backports that I spot checked.

Would be worth adding a note about why 3d65c39 was skipped

Added a note in the PR summary.

@piyushredpanda piyushredpanda merged commit 7d96ae3 into redpanda-data:v24.2.x Jan 6, 2025
15 of 19 checks passed
@BenPope BenPope modified the milestones: v24.2.x-next, v24.2.15 Jan 13, 2025