Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve fabric streams cleanup on error and timeouts #5160

Merged
merged 1 commit into from
Aug 2, 2024

Conversation

nickva
Copy link
Contributor

@nickva nickva commented Aug 2, 2024

Previously, we performed cleanup only for specific errors such as ddoc_updated, and insufficient_storage. In case of other errors, or timeouts, there was a chance we would leak workers waiting to be either started or canceled. Those workers would then wait around until the 5 minute rexi timeout fires, and then they emit an error in the logs. It's not a big deal as that happens on errors only, and the processes are all waiting in receive, however, they do hold a Db handle open, so they can waste resources from that point of view.

To fix that, this commit extends cleanup to other errors and timeouts.

Moreover, in case of timeouts, we log fabric worker timeout errors. In order to do that we export the fabric_streams internal #stream_acc record to every fabric_streams user. That's a bit untidy, so make the timeout error return the defunct workers only, and so, we can avoid leaking the #stream_acc record outside the fabric_streams module.

Related to #5127

Copy link
Contributor

@jaydoane jaydoane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another very nice code cleanup! Hopefully this will eliminate some of the last sneaky rexi orphans.

handle_stream_start({ok, Error}, _, St) when Error == ddoc_updated; Error == insufficient_storage ->
WaitingWorkers = [W || {W, _} <- St#stream_acc.workers],
ReadyWorkers = [W || {W, _} <- St#stream_acc.ready],
cleanup(WaitingWorkers ++ ReadyWorkers),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this cleanup no longer necessary because of the new cleanup(Workers0) in the above Else clause?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we always do it on any non {ok, #stream_acc{} result, be it {ok, ddoc_updated}, {error, ...} or {timeout, ...}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated for a bit adding all those explicit cases for "non-success" responses but figured it also makes sense to have a general "whatever else happens we clean up" clause.

Comment on lines -41 to +38
fabric_util:log_timeout(
DefunctWorkers,
"replicator docs"
),
{timeout, DefunctWorkers} ->
fabric_util:log_timeout(DefunctWorkers, "replicator docs"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but it's curious that erlfmt is fine with this on a single line now, but before it took 3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised as well. I wonder if we had enforced an 80 columns things before and now we don't (or erlfmt changed). But thought it is a decent cleanup. Probably there are other cases we can prettify a bit.

Previously, we performed cleanup only for specific errors such as
`ddoc_updated`, and `insufficient_storage`. In case of other errors, or
timeouts, there was a chance we would leak workers waiting to be either started
or canceled. Those workers would then wait around until the 5 minute rexi
timeout fires, and then they emit an error in the logs. It's not a big deal as
that happens on errors only, and the processes are all waiting in receive,
however, they do hold a Db handle open, so they can waste resources from that
point of view.

To fix that, this commit extends cleanup to other errors and timeouts.

Moreover, in case of timeouts, we log fabric worker timeout errors. In order to
do that we export the `fabric_streams` internal `#stream_acc` record to every
`fabric_streams` user. That's a bit untidy, so make the timeout error return
the defunct workers only, and so, we can avoid leaking the `#stream_acc` record
outside the fabric_streams module.

Related to #5127
@nickva nickva force-pushed the improve-stream-cleanup branch from 31f0da3 to 73edaed Compare August 2, 2024 17:10
@nickva nickva merged commit 82321a5 into main Aug 2, 2024
17 checks passed
@nickva nickva deleted the improve-stream-cleanup branch August 2, 2024 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants