Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nodes pre-provisioning #127

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

korablin
Copy link

In order to have instances at the ready at all times and
thus speed up CI runs because instance provisioning is done
upfront, add a mechanism and the related parameter to configure
a minimum number of instances that must be ready at all times.

Also add a mechanism to reconcile total number of nodes
based on template/cloud capacity

There is similar functionality presented in openstack cloud plugin:
https://github.com/jenkinsci/openstack-cloud-plugin/blob/master/plugin/src/main/java/jenkins/plugins/openstack/compute/JCloudsPreCreationThread.java

In order to have instances at the ready at all times and
thus speed up CI runs because instance provisioning is done
upfront, add a mechanism and the related parameter to configure
a minimum number of instances that must be ready at all times.

Also add a mechanism to reconcile total number of nodes
based on template/cloud capacity

There is similar functionality in openstack cloud plugin:
https://github.com/jenkinsci/openstack-cloud-plugin/blob/master/plugin/src/main/java/jenkins/plugins/openstack/compute/JCloudsPreCreationThread.java
@pjdarton
Copy link
Member

Thanks for writing this - it's functionality that I approve of 👍
I'm familiar with the openstack plugin and I like that feature 😉

However, I'm not currently in a position to review this in detail right now, so I'll stick to general points:

public stuff: while this plugin code has loads more things public than it should, I don't want the problem made any worse. Please ensure that all externally-visible API additions are marked @Restricted(NoExternalUse.class) ... and use package access instead of public unless public is absolutely necessary.

WebUI: use help pages rather than description attributes. I find that description text bloats the (already long) page more than having a help icon on the right ... and this functionality could do with a big explanation of what it does so the help text would be a perfect place for that.

VSphereCloudRetentionStrategy.check might be better if it did it's thing and then, if the node isn't needed, delegated to super.check rather than copy/paste what super.check currently does ... or else, why not do what RunOnce does and call done(c)? If there is a good reason why it has to be different, please comment why.

Unit tests would be appreciated; I know we can't test against a real vSphere instance, but we can mock things and check the code is requesting the changes we expect.
(And, yes, I admit that most of this code has no unit tests; I'd like to change that by having new functionality have better QA as this plugin is enough of a maintenance-nightmare as it is)

* less 'public' : either @restricted(NoExternalUse) or protected
* add help pop up message to the web ui
* get rid of copy pasting when override retention strategy
* remove unused method (leftover)
@korablin
Copy link
Author

Hi @pjdarton!
Thanks for your fast and informative response. I have addressed your comments with the second commit except for the unit tests. I wouldn't mind to add them, but I don't see an easy way of mocking things that are needed to test this PR.
I would love to cover this code with tests once upstream will provide some guide/mechanism for easy testing.

Could you take a look once again at the code and see if something is still too 'public'? I am not very good at Java, so I could have missed a thing.

Thanks a lot in advance!

Regards,
Nikolai

Copy link
Member

@pjdarton pjdarton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good. All the newly-added public stuff now has @Restricted on it which is good.

WebUI resource files need tweaking.

Disclaimer: I've not tried this out myself yet; I'm just looking at the code changes on GitHub. I will (eventually) get around to trying it at work (and if it works, I'll make use of it right away!) but I'm not going to merge it until I've seen it working myself "for real".

@pjdarton
Copy link
Member

Note: if the Jenkins test fails and you don't think it's your fault, you should be able to get GitHub to re-run the test.
(and if GitHub doesn't let you do it easily, you can force it by closing the PR, waiting 10 minutes, then re-opening the same PR)

@korablin
Copy link
Author

korablin commented Feb 19, 2021

Hey @pjdarton! Glad you are thrilled to get this functionality added. I agree that more docs is better than less.
What do you think about this one?

@korablin
Copy link
Author

Screenshot 2021-02-19 at 21 41 38

@korablin
Copy link
Author

korablin commented Mar 3, 2021

Hey @pjdarton! Hope you are doing well, mate.
Could you take a look at my last commit?

Copy link
Member

@pjdarton pjdarton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Boolean vs boolean is a simple coding error and easily fixed.

I suspect that many new methods are more visible than they need to be. public should be the last-resort; default visibility is package-level and as most of the plugin's code is all in the same package that's often all that's needed. protected is generally only used to give access to subclasses.
...but we do need to ensure that all new non-private code is @Restricted(NoExternalUse.class) unless it's specifically part of the public API (so we don't restrict access to the new getter & setter fot the new field, but all the internal workings should be restricted)

I'm concerned that the "retain VM" functionality will actually defeat / prevent-from-working existing retention-strategy functionality that kills off VMs when they're no longer useful.
It may be better to have this functionality create new VMs to fulfil the instanceMin requirement but allow them to time out normally, instructing users 9in the documentation) that if they don't want instances to time out quickly like that then they'll need to increase the retention time.
The existing "if instancesPresent < instanceMin then ignore the retention strategy and keep it" logic check seems to be too simplistic.

Dealing with the asynchronicity of retention-strategy checks and the instanceMin provisioning background task appears to be improperly handled. I suspect that you'll need some more complicated state handling for this to work properly.
Consider having the periodic task evaluate which instances ought to be kept, and in what priority order, and have the retention strategies then act on that evaluation result "when they get called" ... or (and I suspect this will be the much simpler option) have the periodic task only be responsible for creating new instances if the minimum isn't met yet and let the retention strategies take care of disposing of them.

* Kill shouldSlaveBeRetained() func
* Undo changes to ret policies
* minor fixes
@korablin
Copy link
Author

korablin commented Apr 1, 2021

@pjdarton I have killed shouldSlaveBeRetained() function and removed changes that I did to ret policies, so their work is not affected by this change anymore.
Initially, I added those changes cause I wanted to make this behaviour to be consistent with OpenStack plugin, but I guess we could iterate on this later if people will be complaining.

Regarding your question on using package-level visibility: I don't think it is possible.
Methods that reconcile script uses are located in org.jenkinsci.plugins package, but the script itself is in org.jenkinsci.plugins.vsphere package which appear to me as different packages.
Correct me if I am wrong. Also I am not sure if this is the right place for the script.

@korablin korablin requested a review from pjdarton April 1, 2021 16:29
@korablin
Copy link
Author

korablin commented Apr 5, 2021

I am sorry for the belated response.
It would be nice to communicate quickly. I am willing to address your future comments asap.

@@ -215,7 +217,8 @@ public int getInstanceCap() {
return this.templates;
}

private vSphereCloudSlaveTemplate getTemplateForVM(final String vmName) {
@Restricted(NoExternalUse.class)
public vSphereCloudSlaveTemplate getTemplateForVM(final String vmName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think that) This method is only called from elsewhere within this class and the neighbouring vSphereCloudSlaveTemplate class; they're both in the same package.
I believe, therefore, that it doesn't need to be public and that it can be set to default visibility.
i.e.

    @Restricted(NoExternalUse.class)
    vSphereCloudSlaveTemplate getTemplateForVM(final String vmName) {

@@ -133,6 +133,12 @@ public Integer getLimitedTestRunCount() {
return LimitedTestRunCount;
}

@Restricted(NoExternalUse.class)
public vSphereCloudSlaveTemplate getTemplate() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that this method is used. I'm looking using my browser on github, and it says that "getTemplate(" only occurs here, where it's declared, and nowhere else in this PR.
If that's the case, it can/should be removed.

@korablin
Copy link
Author

korablin commented Apr 7, 2021

@pjdarton thanks for another iteration on this PR. I appreciate it.
So far, I pushed some minor changes removing public level of visibility (those things used to be public due to earlier functionality that I've killed).

Regarding the major things:

  • Tracking state of a request i.e. using templateState when deploying a VM + provisioning VMs per template instead of label
    This sounds right to me. Could you give a hint on how do you envision the implementation for this?
    Should it be a method similar to provision in vSphereCloud.java that takes template and number of instances as arguments and then operates with CloudProvisioningRecord ?

I would love to hear your opinion before I start implementing it.

Also, current combination of provisioning things the simple way (as I do now) + deletion of redundant nodes (using periodic job), kind of makes sure that we meet the instancesMin number, but of course there is a risk of creating excessive VMs that will be killed on the second run of the periodic job.
^^ I am mentioning this just in case, if we would consider merging the code as is and iterate on it later with further improvements - that shouldn't blow up your Jenkins :)

@pjdarton
Copy link
Member

pjdarton commented Apr 8, 2021

Re: hint on how do you envision the implementation

A cloud plugin's method public Collection<PlannedNode> provision(final Label label, int excessWorkload) is how Jenkins asks the cloud-plugin to provide more nodes; this method choses a template to use and then creates a new node (and then repeats until enough executors are in progress, finally returning a set of "planned nodes" to Jenkins).
Within that code, it maintains the internal records (templateState.pruneUnwantedRecords()), ensures it doesn't exceed the cloud capacity (and returns Collections.emptySet() if so) before getting on with deciding which template(s) match the label Jenkins is asking for and then creating new VMs from the template it decided on.

This PR is adding a new method that'll provide more nodes; it'll need to iterate over all the templates checking that there's enough + to satisfy the template's instancesMin field.

In this new case, we don't want to call List<vSphereCloudSlaveTemplate> templates = getTemplates(label) as, instead, we already know exactly which template we want (as it's iterating over all of them) ... but we could just make a list containing just the template we want before going on to call templateState.calculateProvisionableTemplates(templates) etc.

So, I think that what we do is

  1. Refactor the existing vSphereCloud provision method so that some of what it does is split out into separate methods (that our new code can call). This bit looks promising, but there may be others too.
  2. Define a new provisionInstanceMin method in vSphereCloud that'll do synchronized (this) { ensureLists(); } and then, within a synchronized (templateState) block, will iterate over each of the CloudProvisioningRecords in templateState, compare the instanceMin field's value with the number of CloudProvisioningRecord.size() (as that'll include all VMs, not just active nodes) minus the number of active nodes for this template and provision some new ones if more are needed (and there's space for them!). There's a fair amount of decision making etc that'll be identical to that done in the original provision method.
  3. I would suggest you remove the existing code that'll dispose of unwanted instances - disposal is hard to do safely (e.g. as shown in JENKINS-55066 and similar) and there are existing mechanisms that will slowly dispose of unwanted VMs safely.
  4. have the periodic code loop over the vSphereClouds and, for each of those, call the new provisionInstanceMin method.
  5. Finally, compare the new provisionInstanceMin method with the old vSphereCloud provision method and check again for common code that ought to be refactored into its own method that's then used by both.

...and be careful not to change the behaviour of the original provision method - no matter how much the code is rearranged, the functionality of that method needs to remain unchanged.

@korablin korablin force-pushed the node-pre-provision branch 2 times, most recently from 5ef81ba to d843288 Compare May 21, 2021 13:51
@korablin
Copy link
Author

korablin commented May 21, 2021

@pjdarton It's been a while... I am sorry again. Ever-changing priorities are ruinning plans sometimes.
Hopefully, with the last commit that refactors the whole thing for the better way, we will be able to merge it soon :)
Thanks for all of your comments! I have tried my best to address them and things look much simpler now.

I am willing to address further comments asap.
Thank you @pjdarton in advance!

@korablin korablin requested a review from pjdarton May 21, 2021 14:03
}
maxSlavesToProvisionBeforeCloudCapHit = Integer.valueOf(intValue - 1);
}
if (!cloudHasCapacity()) break;
Copy link
Author

@korablin korablin May 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not subtracting 1 here since maxSlavesToProvisionBeforeCloudCapHit is re-calculated on every call to cloudHasCapacity()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood - done like this (re-calculating every time) is not as CPU-efficient as it was, but I guess the additional workload is insigificant compared to everything else.

* Use native logic that takes into account state of the nodes
@korablin korablin force-pushed the node-pre-provision branch from d843288 to 3fe45d5 Compare May 28, 2021 11:18
Copy link
Member

@pjdarton pjdarton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a unit-test for the new method that's been added to CloudProvisioningAlgorithm (that code can be tested in isolation and has good coverage)

I would like to have a unit-test for the new provisoning code too - some test scenarios where the auto-provisioning is called (I think we can assume Jenkins will call it and don't need to test that Jenkins does so) with different situatuons and check that it decides (correctly) what to provision (stubbing/mocking out the bits that'd talk to a vSphere instance).
e.g. "if we've got no instances, no instance limit and instanceMin is 2 then we expect two instances to be provisioned" and "if we've got 2 instances, no instance limit, and instanceMin is 2 then we expect none to be provisioned" etc.
IME unit-tests are a good way of flushing out "out by 1" mistakes (that are hard to spot "in the real world") and they also help protect functionality from getting broken by later changes (changes done by people who don't use this feature and hence won't notice if they break it unless there's a unit test that tells them)

int provisionedNodes = record.getCurrentlyProvisioned().size() + record.getCurrentlyPlanned().size();
int requiredPreProvisionedNodes = record.getTemplate().getInstancesMin();
return requiredPreProvisionedNodes - provisionedNodes;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to need a unit test for this new method.

}
maxSlavesToProvisionBeforeCloudCapHit = Integer.valueOf(intValue - 1);
}
if (!cloudHasCapacity()) break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood - done like this (re-calculating every time) is not as CPU-efficient as it was, but I guess the additional workload is insigificant compared to everything else.

public void execute(TaskListener listener) {
for (Cloud cloud : Jenkins.getActiveInstance().clouds) {
if (!(cloud instanceof vSphereCloud)) continue;
for (vSphereCloudSlaveTemplate template : ((vSphereCloud) cloud).getTemplates()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than casting ((vSphereCloud)cloud) every time, cast once into a variable and use that.
e.g.

if (!(cloud instanceof vSphereCloud)) continue;
vSphereCloud vsCloud = (vSphereCloud) cloud;
for (vSphereCloudSlaveTemplate template : vsCloud .getTemplates()) {

Casts are ugly - they're best done as little as possible 😉

assertThat(stubVSphereTemplateState.countNodes(), equalTo(1));

// Here it says that there still should be provisioned 2 nodes, despite the fact there is 1 active already
stubVSphereCloud.preProvisionNodes(template);
Copy link
Author

@korablin korablin Jun 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logs:

Jun 01, 2021 5:54:23 PM org.jenkinsci.plugins.vSphereCloud InternalLog
INFO: STARTING VSPHERE CLOUD

Jun 01, 2021 5:54:23 PM org.jenkinsci.plugins.vsphere.tools.CloudProvisioningState logStateChange
WARNING: Marking templateCapacity10instanceMin21 as active : wasPreviouslyPlanned!=true

Jun 01, 2021 5:54:24 PM org.jenkinsci.plugins.vSphereCloud preProvisionNodes
INFO: preProvisionNodesForTemplate(null): should pre-provision 2 nodes

Jun 01, 2021 5:54:24 PM org.jenkinsci.plugins.vSphereCloud calculateMaxAdditionalSlavesPermitted
INFO: There are 0 VMs in this cloud. The instance cap for the cloud is 100, so we have room for more

Jun 01, 2021 5:54:24 PM org.jenkinsci.plugins.vSphereCloud calculateMaxAdditionalSlavesPermitted
INFO: There are 1 VMs in this cloud. The instance cap for the cloud is 100, so we have room for more

Copy link
Author

@korablin korablin Jun 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, that when preProvision is invoked it actually sees that "There are 1 VMs in this cloud", which was "provisioned" by the method just a moment ago.
So, I would love to achieve the same state without calling preProvision(); I've tried marking the associated record as provisionedSlaveNowActive or provisioningStarted, but that doesn't do the trick, neither calling VSpherePlannedNode.createInstance(stubVSphereTemplateState, nodeName, provisionable);

@korablin
Copy link
Author

korablin commented Jun 1, 2021

Hey @pjdarton ! Thanks for the comments :)
I've added a draft code with unit tests for the new provisioning method: "NodePreProvisionTest.java"
But I can't get it working properly, though. Could you help me to understand what is wrong there?

I couldn't find a way to convince preProvison method that there is something provisoned already.

@korablin
Copy link
Author

Hey @pjdarton! If I remove "NodePreProvisionTest.java" do you think this could be landed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants