-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle launcher failures by running the disconnect logic #121
base: master
Are you sure you want to change the base?
Conversation
Launching the agent on the VM can fail. This results in the VM online, but unusable. Ensure the cleanup logic happens in this case, such that the VM will be recycled. In the disconnect and revert case, we need to revert regardless of power state.
Any changes to FYI one thing I see in my own environment is Jenkins sometimes complains that a vSphere VM failed to come online ... and then the VM comes online anyway (because the Windows VM was just being much slower than expected to start up). So, what tests have you done with this PR? |
This change has been running in an environment I support for about 5 months. Typical volume is 1.5K-2K agents/day. The first 3.5 months was with static provisioning using the RECONNECT_AND_REVERT idle action. About 6 weeks ago I transitioned everything to dynamic provisioning with run-once retention strategy (using my other 2 PRs merged with this one). We exclusively use SSH, so I had to rely on code analysis for JNLP. Based on the description of your environment, it sounds like you are using JNLP. I presume the complaint you mention shows up as "EXCEPTION while starting VM" followed by "ERROR: Unexpected error in launching an agent. This is probably a bug in Jenkins" stack trace. If my assumptions are correct, it seems that you have not configured a long enough I think the question therefore is, what should the vSphere plugin behavior be when the agent was not successfully connected during the allowed launch time? I feel the current behavior is not great, especially with static provisioning. In the case of SSH, it logs "Slave online", and since it isn't, nothing else ever happens. This leaves the static node out of service until manual intervention. In the case of JNLP, it throws an exception all the way up the stack. JNLP may connect later, but the plugin already gave up and stopped tracking it against the max online count. Or JNLP may never connect, leaving us in the same bad situation as SSH. In both cases, it is effectively not handling the failure. I think a better behavior is to allow the plugin to recover the VM by using whatever idle action was configured. With dynamic provisioning, the idle timeout checker covers up the launch failure by disconnecting the computer and then terminating the node. This PR is calling that disconnect process a bit earlier. Yes, it may get called an extra time, but the idle action is As an aside, the idle timeout checker is currently disconnecting nodes that are actively connecting. The timeout starts counting prior to launch, whereas I think most users would consider "idle" to mean online and not running any jobs. This is particularly problematic as the run-once retention strategy defaults to 2 minutes, and doesn't take into account delayed launch caused by One potential issue with this PR is that non-launcher exceptions will also trigger the disconnect logic. In some cases this may be desirable (such as the startVm call timing out after 60 seconds due to transient DRS placement slowness). In other cases, this could potentially create a loop on bad config (I'm thinking of a missing VM/snapshot). I can retool the PR to only disconnect after launcher failures, but I no longer have static provisioning load to validate with. |
Launching an agent on the VM can fail. This results in the VM online, but unusable.
Ensure the disconnect logic runs in this case, such that the VM can be recycled.
When using RECONNECT_AND_REVERT, we need to revert regardless of power state.