Improve fabric streams cleanup on error and timeouts #5160

nickva · 2024-08-02T06:51:15Z

Previously, we performed cleanup only for specific errors such as ddoc_updated and insufficient_storage. On other errors or timeouts, there was a chance we would leak workers waiting to be either started or canceled. Those workers would then wait around until the 5 minute rexi timeout fired, and then emit an error in the logs. It's not a big deal, as that happens on errors only and the processes are all waiting in receive. However, they do hold a Db handle open, so they can waste resources from that point of view.

To fix that, this commit extends cleanup to other errors and timeouts.

Moreover, in case of timeouts, we log fabric worker timeout errors. In order to do that we export the fabric_streams internal #stream_acc record to every fabric_streams user. That's a bit untidy, so make the timeout error return the defunct workers only. That way we avoid leaking the #stream_acc record outside the fabric_streams module.

Related to #5127

jaydoane

Another very nice code cleanup! Hopefully this will eliminate some of the last sneaky rexi orphans.

jaydoane · 2024-08-02T15:56:23Z

src/fabric/src/fabric_streams.erl

-handle_stream_start({ok, Error}, _, St) when Error == ddoc_updated; Error == insufficient_storage ->
-    WaitingWorkers = [W || {W, _} <- St#stream_acc.workers],
-    ReadyWorkers = [W || {W, _} <- St#stream_acc.ready],
-    cleanup(WaitingWorkers ++ ReadyWorkers),


Is this cleanup no longer necessary because of the new cleanup(Workers0) in the above Else clause?

Yes, we always do it on any result that is not {ok, #stream_acc{}}, be it {ok, ddoc_updated}, {error, ...} or {timeout, ...}

I debated for a bit adding all those explicit cases for "non-success" responses but figured it also makes sense to have a general "whatever else happens we clean up" clause.
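
For readers following along, the "whatever else happens we clean up" shape described above can be sketched roughly as below. This is a minimal, self-contained illustration and not the actual fabric_streams code: the module name stream_cleanup_sketch, the RecvFun argument (standing in for the real receive loop) and the message-sending cleanup/1 are all assumptions made for the example. It also shows the timeout branch returning only the defunct workers, so the record never leaves the module.

```erlang
-module(stream_cleanup_sketch).
-export([start/2]).

%% Hypothetical stand-in for the internal accumulator record.
%% It is defined here only and never returned to callers.
-record(stream_acc, {workers = [], ready = []}).

%% RecvFun is an assumed stand-in for the receive loop; it returns
%% {ok, #stream_acc{}}, {timeout, #stream_acc{}} or anything else.
start(Workers0, RecvFun) ->
    case RecvFun(Workers0) of
        {ok, #stream_acc{ready = Ready}} ->
            %% Success: hand back the workers that are ready to stream.
            {ok, [W || {W, _} <- Ready]};
        {timeout, #stream_acc{workers = Waiting}} ->
            %% Timeout: clean up, then expose only the defunct workers.
            cleanup(Workers0),
            {timeout, [W || {W, _} <- Waiting]};
        Else ->
            %% Catch-all: {ok, ddoc_updated}, {error, ...} and so on.
            cleanup(Workers0),
            Else
    end.

%% Stand-in cleanup: only notifies the caller so the effect is visible.
%% A real implementation would stop the remote workers instead.
cleanup(Workers) ->
    [self() ! {cleaned, W} || W <- Workers],
    ok.
```

Because {ok, ddoc_updated} does not match the {ok, #stream_acc{}} pattern, it falls through to the final clause and is cleaned up like any other non-success result.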

jaydoane · 2024-08-02T15:58:28Z

src/couch_replicator/src/couch_replicator_fabric.erl

-                fabric_util:log_timeout(
-                    DefunctWorkers,
-                    "replicator docs"
-                ),
+            {timeout, DefunctWorkers} ->
+                fabric_util:log_timeout(DefunctWorkers, "replicator docs"),


Not a big deal, but it's curious that erlfmt is fine with this on a single line now, but before it took 3.

I was surprised as well. I wonder if we enforced an 80-column limit before and now we don't (or erlfmt changed). But I thought it was a decent cleanup. There are probably other cases we can prettify a bit.

nickva mentioned this pull request Aug 2, 2024

Potential pattern of ignoring stranded RPC workers #5127

Open

jaydoane approved these changes Aug 2, 2024

nickva force-pushed the improve-stream-cleanup branch from 31f0da3 to 73edaed Compare August 2, 2024 17:10

nickva merged commit 82321a5 into main Aug 2, 2024
17 checks passed

nickva deleted the improve-stream-cleanup branch August 2, 2024 17:55
