-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve fabric streams cleanup on error and timeouts #5160
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another very nice code cleanup! Hopefully this will eliminate some of the last sneaky rexi orphans.
handle_stream_start({ok, Error}, _, St) when Error == ddoc_updated; Error == insufficient_storage -> | ||
WaitingWorkers = [W || {W, _} <- St#stream_acc.workers], | ||
ReadyWorkers = [W || {W, _} <- St#stream_acc.ready], | ||
cleanup(WaitingWorkers ++ ReadyWorkers), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cleanup no longer necessary because of the new cleanup(Workers0)
in the above Else
clause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we always do it on any non {ok, #stream_acc{}
result, be it {ok, ddoc_updated}
, {error, ...}
or {timeout, ...}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I debated for a bit adding all those explicit cases for "non-success" responses but figured it also makes sense to have a general "whatever else happens we clean up" clause.
fabric_util:log_timeout( | ||
DefunctWorkers, | ||
"replicator docs" | ||
), | ||
{timeout, DefunctWorkers} -> | ||
fabric_util:log_timeout(DefunctWorkers, "replicator docs"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but it's curious that erlfmt is fine with this on a single line now, but before it took 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was surprised as well. I wonder if we had enforced an 80 columns things before and now we don't (or erlfmt changed). But thought it is a decent cleanup. Probably there are other cases we can prettify a bit.
Previously, we performed cleanup only for specific errors such as `ddoc_updated`, and `insufficient_storage`. In case of other errors, or timeouts, there was a chance we would leak workers waiting to be either started or canceled. Those workers would then wait around until the 5 minute rexi timeout fires, and then they emit an error in the logs. It's not a big deal as that happens on errors only, and the processes are all waiting in receive, however, they do hold a Db handle open, so they can waste resources from that point of view. To fix that, this commit extends cleanup to other errors and timeouts. Moreover, in case of timeouts, we log fabric worker timeout errors. In order to do that we export the `fabric_streams` internal `#stream_acc` record to every `fabric_streams` user. That's a bit untidy, so make the timeout error return the defunct workers only, and so, we can avoid leaking the `#stream_acc` record outside the fabric_streams module. Related to #5127
31f0da3
to
73edaed
Compare
Previously, we performed cleanup only for specific errors such as
ddoc_updated
, andinsufficient_storage
. In case of other errors, or timeouts, there was a chance we would leak workers waiting to be either started or canceled. Those workers would then wait around until the 5 minute rexi timeout fires, and then they emit an error in the logs. It's not a big deal as that happens on errors only, and the processes are all waiting in receive, however, they do hold a Db handle open, so they can waste resources from that point of view.To fix that, this commit extends cleanup to other errors and timeouts.
Moreover, in case of timeouts, we log fabric worker timeout errors. In order to do that we export the
fabric_streams
internal#stream_acc
record to everyfabric_streams
user. That's a bit untidy, so make the timeout error return the defunct workers only, and so, we can avoid leaking the#stream_acc
record outside the fabric_streams module.Related to #5127