Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support jitter in client reconnect logic #163

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

KirkMartinez
Copy link

We have about 450 clients which all disconnected simultaneously when our Jenkins master went down.
When we brought it back up, most workers failed to connect due to the thundering herd. We could see from logs that most workers were trying to reconnect at the same time. This feature should alleviate that problem. It adds a new cmdline switch for the swarm client -useJitter (default: off) which uses a retryInterval randomly selected between zero and the retryInterval (as opposed to just using the retryInterval directly). The change still respects the absolute maximum (set by maxRetryInterval) and the selected backoff strategy. Happy to modify this if you have any concerns.

@KirkMartinez
Copy link
Author

@basil can you help me get the right people to review this PR? Sorry to @ mention you directly, but there are no OWNERS or CONTRIB for this repo and I'm not sure how to get help. If there is a better way, please let me know and I'll be happy to comply.

@basil
Copy link
Member

basil commented Dec 5, 2019

@basil can you help me get the right people to review this PR? Sorry to @ mention you directly, but there are no OWNERS or CONTRIB for this repo and I'm not sure how to get help. If there is a better way, please let me know and I'll be happy to comply.

Hey Kirk, thanks for your contribution! I'd love to have this be part of the next release. It's unfortunately a bit of a busy time for me, but I will try to review this sometime next week.

@justinseanmartin
Copy link

Bump on this, would be really nice to avoid the thundering herd that not staggering these causes.

@basil
Copy link
Member

basil commented Feb 13, 2020

Hi @KirkMartinez, sorry for the delay in reviewing this. I just took a look now, and while overall the change looks fine, I wanted to suggest a possible alternative.

The current change adds a boolean flag to enable jitter. When set, we use a reconnect interval between zero and retryInterval. One flaw with this scheme is that it doesn't allow for the minimum retry interval to be controlled.

May I suggest making the jitter an integer rather than a boolean, and adding it to the value of retryInterval instead of overriding the retryInterval? This way, the user can control the minimum retry interval (by setting retryInterval) and can also control the amount of jitter (by setting the jitter interval). In other words, the actual retry interval would be retryInterval + rand(0, jitterInterval). This should avoid thundering herds while also giving the user control over the minimum interval.

What do you think?

@moshavnik
Copy link

I'm a heavy user of the swarm plugin and would love to see this feature merged.
Is there a maintainer who can perform the merge and release the new version?

thanks :)

timsutton added a commit to timsutton/swarm-plugin that referenced this pull request May 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants