Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: add log lines about removed data #24659

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Dec 25, 2024

It's very difficult to retroactively debug why records are removed by compaction. This adds some logging about removed data.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • Adds logging to mention data removed by compaction.

This adds some accounting to the copy_data_segment_reducer to keep track
of records and batches that are removed. The goal here is ultimately to
provide more visibility into removed records, though this accounting
isn't yet plugged into anything.

A later change will expose numbers (likely with logging).
Removing records from the storage layer is currently very silent, making
it difficult to debug. This commit exposes the new
copy_data_segment_reducer stats in as a log message, at info level when
data is removed, and debug level when not (presumably we mostly care
about these stats when data is actually being removed).
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#60138
test_id test_kind job_url test_status passed
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/60138#0193fbc0-165c-46d3-8d19-3316953f53ef FLAKY 3/6

const auto& stats = res.reducer_stats;
if (stats.has_removed_data()) {
vlog(
gclog.info,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong feelings that this shouldn't be at the info level, but I can imagine that it would make compaction output for a log with a high number of segments much noisier.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely more noisy, but we already put up with an info level message when we create a segment, so maybe not too bad. are there situations where we re-compact a segment many times, or is it usually <= 2x?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants