[v24.2.x] tx/group compaction fixes #24689

bharathv · 2025-01-04T03:20:21Z

Backport of PR #24637
Fixes #24685

Note on omission of 3d65c39

It exposed what seemed like a race condition in the log reader. The commit simply blocks the apply fiber, which holds a reader instance on a log range. The fiber remains in that state for the entire upgrade, and it turns out that in that state a competing log reader for the same range just times out. In this case it was the recovery stm trying to move partitions, so the decommission was blocked. I removed the commit altogether since it was a conservative check added to avoid compaction in mixed mode and has no known correctness implications.

Backports Required

Release Notes

Bug Fixes

Fixes an issue that blocked the compaction of consumer offsets with group transactions.

(cherry picked from commit 8947848)

(cherry picked from commit cdd274a)

(cherry picked from commit 24c8e89)

This is unsafe because it does not perform any of the checks required to determine whether a particular transaction is in progress and is a candidate for abort. For example, if a transaction is committed by the coordinator and pending commit on the group, using this escape hatch to abort the transaction can cause correctness issues. To be used with caution, as an escape hatch for aborting transactions that the group has lost track of and that are ok to abort. This situation is usually indicative of a bug in the transaction implementation. (cherry picked from commit 8c5ecca)

(cherry picked from commit 3736f00)

(cherry picked from commit 835f3fc)

Consider group_metadata to determine if a group transaction should be considered open. E.g., if a group is tombstoned, any transaction corresponding to the group is ignored. This invariant is also held in the actual group stm to ensure groups are not tombstoned before any pending transactions are cleaned up. (cherry picked from commit 9eee632)
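
A minimal sketch of the rule described above, written as a toy Python model rather than the actual C++ state machine. The class name `GroupTxTracker`, its methods, and the idea of exposing the earliest open begin offset are all hypothetical names chosen for illustration; only the behavior "a tombstoned group drops its transactions from the open set" comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GroupTxTracker:
    """Toy model of a per-partition tracker; not Redpanda's actual stm."""

    # group id -> {producer id -> begin offset of its open transaction}
    open_txs: Dict[str, Dict[int, int]] = field(default_factory=dict)

    def on_fence(self, group: str, producer_id: int, offset: int) -> None:
        # Record the begin offset the first time a producer opens a transaction.
        self.open_txs.setdefault(group, {}).setdefault(producer_id, offset)

    def on_commit_or_abort(self, group: str, producer_id: int) -> None:
        txs = self.open_txs.get(group)
        if txs is not None:
            txs.pop(producer_id, None)
            if not txs:
                del self.open_txs[group]

    def on_group_tombstone(self, group: str) -> None:
        # A tombstoned group means its transactions are no longer considered open.
        self.open_txs.pop(group, None)

    def earliest_open_begin_offset(self) -> Optional[int]:
        # Assumption: anything at or after this offset must not be compacted.
        offsets = [o for txs in self.open_txs.values() for o in txs.values()]
        return min(offsets) if offsets else None
```

In this model, a transaction left open on a group that is later tombstoned stops pinning the earliest open offset, which is the effect the commit message describes for compaction.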

(cherry picked from commit f7191ad)

(cherry picked from commit 6ca81b4)

vbotbuildovich · 2025-01-04T09:51:41Z

Retry command for Build#60290

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/partition_movement_upgrade_test.py::PartitionMovementUpgradeTest.test_basic_upgrade
tests/rptest/tests/partition_movement_test.py::PartitionMovementTest.test_dynamic@{"num_to_upgrade":2}
tests/rptest/tests/upgrade_test.py::UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback@{"upgrade_after_rollback":false}
tests/rptest/tests/upgrade_test.py::UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback@{"upgrade_after_rollback":true}
tests/rptest/tests/upgrade_test.py::UpgradeWithWorkloadTest.test_rolling_upgrade
tests/rptest/tests/partition_movement_test.py::PartitionMovementTest.test_bootstrapping_after_move@{"num_to_upgrade":2}
tests/rptest/tests/partition_movement_test.py::PartitionMovementTest.test_move_consumer_offsets_intranode@{"num_to_upgrade":2}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":1,"num_to_upgrade":2}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":2,"num_to_upgrade":2}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":false,"mixed_versions":true,"with_tiered_storage":false}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":true,"mixed_versions":true,"with_tiered_storage":false}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":false,"mixed_versions":true,"with_tiered_storage":true}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":true,"mixed_versions":true,"with_tiered_storage":true}

vbotbuildovich · 2025-01-04T11:08:28Z

CI test results

test results on build#60290

test_id	test_kind	job_url	test_status	passed
gtest_raft_rpunit.gtest_raft_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/60290#0194303b-2fac-4a02-b0e3-c63938f7aff5	FLAKY	1/2
rptest.tests.partition_movement_test.PartitionMovementTest.test_bootstrapping_after_move.num_to_upgrade=2	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4923-9a33-199973002815	FAIL	0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_bootstrapping_after_move.num_to_upgrade=2	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4de7-aae6-76f08d243fc4	FAIL	0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic.num_to_upgrade=2	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4a29-af14-3cfe65641d9f	FAIL	0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic.num_to_upgrade=2	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4cb7-a5c0-592302ff6699	FAIL	0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_move_consumer_offsets_intranode.num_to_upgrade=2	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a5-4105-a425-0cd744cf9528	FAIL	0/1
rptest.tests.partition_movement_test.PartitionMovementTest.test_move_consumer_offsets_intranode.num_to_upgrade=2	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c412-452f-87fa-c2607f22be50	FAIL	0/1
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=2.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c412-452f-87fa-c2607f22be50	FAIL	0/1
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=2.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4de7-aae6-76f08d243fc4	FAIL	0/1
rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4a29-af14-3cfe65641d9f	FAIL	0/1
rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4cb7-a5c0-592302ff6699	FAIL	0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=True.with_tiered_storage=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4cb7-a5c0-592302ff6699	FAIL	0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=True.with_tiered_storage=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c414-4144-8c9a-e5433a3d0be4	FAIL	0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=True.with_tiered_storage=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4cb7-a5c0-592302ff6699	FAIL	0/6
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=True.with_tiered_storage=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c414-4144-8c9a-e5433a3d0be4	FAIL	0/6
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4923-9a33-199973002815	FAIL	0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c414-4144-8c9a-e5433a3d0be4	FAIL	0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a6-4a29-af14-3cfe65641d9f	FAIL	0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c412-452f-87fa-c2607f22be50	FAIL	0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307c-90a4-4183-b73c-8706f34c24b7	FAIL	0/1
rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/60290#0194307d-c413-4de7-aae6-76f08d243fc4	FAIL	0/1

test results on build#60292

test_id	test_kind	job_url	test_status	passed
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=True.with_tiered_storage=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/60292#019433bf-7925-4643-83b9-3b3cf1f0162e	FAIL	0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.mixed_versions=True.with_tiered_storage=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/60292#019433bf-7926-43f4-8c0b-20013c1dab8e	FAIL	0/1
rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.mixed_versions=True.with_tiered_storage=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/60292#019433bf-7925-4643-83b9-3b3cf1f0162e	FLAKY	1/6

vbotbuildovich · 2025-01-05T01:27:56Z

Retry command for Build#60292

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":false,"mixed_versions":true,"with_tiered_storage":true}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"enable_failures":false,"mixed_versions":true,"with_tiered_storage":false}

(cherry picked from commit 6fc62bb)

Deleting a group while transactions are in progress results in hanging transactions and subsequent blocking of compaction. (cherry picked from commit 2b79687)

(cherry picked from commit ac22041)

If a group got tombstoned, all the producers to that group should be ignored. The current logic is incorrectly recovering producers and loading them up to expire later. (cherry picked from commit 7c8d633)

(cherry picked from commit c7f953e)

(cherry picked from commit 9958ca6)

.. for a given partition, to be hooked up with REST API in the next commit. (cherry picked from commit 6efd325)

(cherry picked from commit 23c8e29)

/v1/debug/producers/{namespace}/{topic}/{partition} .. includes low level debug information about producers for idempotency/transactional state. (cherry picked from commit 70e36eb)

.. in this case the state machine proceeds on to applying from the log. (cherry picked from commit c833f50)

Bumps the supported snapshot version so the existing snapshots are invalidated, as they may contain a stale max_collectible_offset. This forces the stm to reconstruct the state from the log and recompute the correct max_collectible_offset. (cherry picked from commit 0051463)
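
The effect of the version bump can be illustrated with a small Python sketch. Everything here is hypothetical (the `Snapshot` type, the version numbers, and the integer stand-in for state); it only models the idea that a snapshot with an unsupported version is rejected and the state is rebuilt by replaying the log.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SUPPORTED_SNAPSHOT_VERSION = 2  # hypothetical value, "bumped" from 1


@dataclass
class Snapshot:
    version: int
    last_applied_offset: int
    state: int  # stand-in for the real stm state


def recover(snapshot: Optional[Snapshot], log: List[Tuple[int, int]]) -> int:
    """Rebuild state from an accepted snapshot plus the log tail,
    or from the whole log if the snapshot is missing or rejected.

    `log` is a list of (offset, delta) pairs in offset order.
    """
    if snapshot is not None and snapshot.version == SUPPORTED_SNAPSHOT_VERSION:
        state, start = snapshot.state, snapshot.last_applied_offset + 1
    else:
        # Unsupported version: ignore the snapshot and replay from the start.
        state, start = 0, 0
    for offset, delta in log:
        if offset >= start:
            state += delta
    return state
```

With this shape, a snapshot written under the old version cannot leak a stale value into the recovered state, at the cost of a full replay on the first start after the upgrade.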

ztlpn

Are RandomNodeOperationsTest failures related?

Would be worth adding a note about why 3d65c39 was skipped

bharathv · 2025-01-06T18:53:10Z

Are RandomNodeOperationsTest failures related?

There are no failures in the latest run. The jobs all fail, probably due to some vtools issues; the same is the case with other 24.2.x backports that I spot checked.

Would be worth adding a note about why 3d65c39 was skipped

Added a note in the PR summary.

bharathv added 3 commits January 3, 2025 19:04

tx/group: track begin offset of transactions

a98a5d3

(cherry picked from commit 8947848)

tx/group: support describe_producers for group

e71e1f7

(cherry picked from commit cdd274a)

tx/tests/dt: test for describe producers

5dab07b

(cherry picked from commit 24c8e89)

bharathv added this to the v24.2.x-next milestone Jan 4, 2025

bharathv requested a review from a team as a code owner January 4, 2025 03:20

github-actions bot added the area/redpanda label Jan 4, 2025

bharathv force-pushed the 242x-comp branch from 137511f to 9e96b8a Compare January 4, 2025 03:23

bharathv added 4 commits January 3, 2025 19:26

group_tx_tracker/stm: plumb feature_table into the stm

e0bf89f

(cherry picked from commit 3736f00)

k/group_data_parser: reduce ignored batch logging to debug

6692d1b

(cherry picked from commit 835f3fc)

group_tx_tracker/stm: track additional information about fence batches

3e4a421

(cherry picked from commit f7191ad)

bharathv force-pushed the 242x-comp branch from 9e96b8a to 5a501f8 Compare January 4, 2025 03:26

group_tx_tracker/stm: heuristic to ignore stale tx_fence batches

d40ce01

(cherry picked from commit 6ca81b4)

bharathv force-pushed the 242x-comp branch from 5a501f8 to 9ff523f Compare January 4, 2025 07:29

bharathv force-pushed the 242x-comp branch from 9ff523f to 206a512 Compare January 4, 2025 22:33

bharathv added 11 commits January 5, 2025 15:12

group_tracker/stm: add a periodic GC loop to expire stale tx_fence txes

0f7e395

(cherry picked from commit 6fc62bb)

k/group: disallow group deletion while transactions in progress

aa2333b

Deleting a group while transactions are in progress results in hanging transactions and subsequent blocking of compaction. (cherry picked from commit 2b79687)

group_recovery_consumer/logging: tidy up logging

690d61f

(cherry picked from commit ac22041)

group_recovery/tx: fix group recovery for non existent groups

0292bab

If a group got tombstoned, all the producers to that group should be ignored. The current logic is incorrectly recovering producers and loading them up to expire later. (cherry picked from commit 7c8d633)

group/tx: add a ducktape test for compactibility of consumer_offsets

fae25ef

(cherry picked from commit c7f953e)

tx/producer_state: add getters for internal state

5dd5339

(cherry picked from commit 9958ca6)

tx/observability: add types and plumbing needed to get producer states

b9ea8dd

.. for a given partition, to be hooked up with REST API in the next commit. (cherry picked from commit 6efd325)

tx/admin: types for exposing producer info in REST api

b45c73d

(cherry picked from commit 23c8e29)

tx/observability: REST endpoint to fetch all producers from a partition

0d3d482

/v1/debug/producers/{namespace}/{topic}/{partition} .. includes low level debug information about producers for idempotency/transactional state. (cherry picked from commit 70e36eb)
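
As a rough illustration of how the endpoint above might be queried, here is a short Python sketch using only the standard library. The path template is taken from the commit message; the host, the admin port default of 9644, the `kafka` namespace in the usage note, and the assumption that the response body is JSON are not stated in this PR and should be treated as assumptions. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def producers_url(host: str, namespace: str, topic: str, partition: int,
                  port: int = 9644) -> str:
    # Path template comes from the commit message; host and port are assumptions.
    path = "/v1/debug/producers/{}/{}/{}".format(
        urllib.parse.quote(namespace, safe=""),
        urllib.parse.quote(topic, safe=""),
        partition,
    )
    return f"http://{host}:{port}{path}"


def fetch_producers(host: str, namespace: str, topic: str, partition: int):
    # Assumes the endpoint returns a JSON document; its schema is not shown here.
    url = producers_url(host, namespace, topic, partition)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

For example, `fetch_producers("localhost", "kafka", "__consumer_offsets", 0)` would request the producer state for partition 0 of the consumer offsets topic, assuming a broker with the admin API reachable on localhost.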

raft/persisted_stm: add ability for stms to reject local snapshot

bfa3603

.. in this case the state machine proceeds on to applying from the log. (cherry picked from commit c833f50)

bharathv force-pushed the 242x-comp branch from 206a512 to 8a681c1 Compare January 5, 2025 23:19

bharathv requested review from mmaslankaprv and ztlpn January 6, 2025 06:12

ztlpn approved these changes Jan 6, 2025

piyushredpanda merged commit 7d96ae3 into redpanda-data:v24.2.x Jan 6, 2025
15 of 19 checks passed

BenPope modified the milestones: v24.2.x-next, v24.2.15 Jan 13, 2025
