Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add MachineDrainRule "WaitCompleted" #11545

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vincepri
Copy link
Member

@vincepri vincepri commented Dec 5, 2024

What this PR does / why we need it:

This PR adds the ability for drain to wait for the completion of specific pods. This is useful in scenario where drain is either handled outside the context of kubectl drain after a Node is cordoned, or for long running batch Jobs that should be allowed to terminate on their own.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 5, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 5, 2024
@vincepri vincepri force-pushed the pod-drain-waitcompleted branch 2 times, most recently from bab1c1b to 211839e Compare December 11, 2024 19:41
@vincepri vincepri added the area/machine Issues or PRs related to machine lifecycle management label Dec 12, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area PR is missing an area label label Dec 12, 2024
@vincepri vincepri added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. do-not-merge/needs-area PR is missing an area label labels Dec 12, 2024
api/v1beta1/machinedrainrules_types.go Show resolved Hide resolved
api/v1beta1/machinedrainrules_types.go Outdated Show resolved Hide resolved
internal/controllers/machine/drain/drain.go Outdated Show resolved Hide resolved
internal/controllers/machine/drain/filters.go Outdated Show resolved Hide resolved
internal/controllers/machine/drain/filters.go Show resolved Hide resolved
internal/controllers/machine/drain/drain.go Show resolved Hide resolved
@@ -498,6 +507,10 @@ func (r EvictionResult) ConditionMessage(nodeDrainStartTime *metav1.Time) string
conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, the following Pods will be evicted: %s",
conditionMessage, PodListToString(r.PodsToTriggerEvictionLater, 3))
}
if len(r.PodsToWaitCompleted) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabriziopandini Is this something that should be specifically handled on higher levels when the condition bubbles up? (like we handled PDB's etc.)

(could also be a follow-up PR)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct, we have to handle the new use case in:
-

if deletingCondition.Reason == clusterv1.MachineDeletingDrainingNodeV1Beta2Reason && time.Since(machine.Status.Deletion.NodeDrainStartTime.Time) > 5*time.Minute {
delayReasons := []string{}
if strings.Contains(deletingCondition.Message, "cannot evict pod as it would violate the pod's disruption budget.") {
delayReasons = append(delayReasons, "PodDisruptionBudgets")
}
if strings.Contains(deletingCondition.Message, "deletionTimestamp set, but still not removed from the Node") {
delayReasons = append(delayReasons, "Pods not terminating")
}
if strings.Contains(deletingCondition.Message, "failed to evict Pod") {
delayReasons = append(delayReasons, "Pod eviction errors")
}
if len(delayReasons) > 0 {
msg += fmt.Sprintf(", delay likely due to %s", strings.Join(delayReasons, ", "))
}
}

  • for _, machine := range machines {
    if !machine.GetDeletionTimestamp().IsZero() && time.Since(machine.GetDeletionTimestamp().Time) > time.Minute*15 {
    machineNames = append(machineNames, machine.GetName())
    deletingCondition := v1beta2conditions.Get(machine, clusterv1.MachineDeletingV1Beta2Condition)
    if deletingCondition != nil &&
    deletingCondition.Status == metav1.ConditionTrue &&
    deletingCondition.Reason == clusterv1.MachineDeletingDrainingNodeV1Beta2Reason &&
    machine.Status.Deletion != nil && time.Since(machine.Status.Deletion.NodeDrainStartTime.Time) > 5*time.Minute {
    if strings.Contains(deletingCondition.Message, "cannot evict pod as it would violate the pod's disruption budget.") {
    delayReasons.Insert("PodDisruptionBudgets")
    }
    if strings.Contains(deletingCondition.Message, "deletionTimestamp set, but still not removed from the Node") {
    delayReasons.Insert("Pods not terminating")
    }
    if strings.Contains(deletingCondition.Message, "failed to evict Pod") {
    delayReasons.Insert("Pod eviction errors")
    }
    }
    }
    }
    and similar func for MD, KCP

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabriziopandini This PR or follow-up? I lean towards follow-up, mostly just to not expand the scope of this PR further

api/v1beta1/machinedrainrules_types.go Outdated Show resolved Hide resolved
api/v1beta1/machinedrainrules_types.go Outdated Show resolved Hide resolved
internal/controllers/machine/drain/drain.go Outdated Show resolved Hide resolved
internal/controllers/machine/drain/filters.go Show resolved Hide resolved
internal/controllers/machine/drain/filters.go Outdated Show resolved Hide resolved
internal/controllers/machine/drain/filters.go Outdated Show resolved Hide resolved
internal/controllers/machine/drain/drain.go Show resolved Hide resolved
@vincepri vincepri changed the title WIP: ✨ Add MachineDrainRule "WaitCompleted" ✨ Add MachineDrainRule "WaitCompleted" Jan 7, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025
@vincepri vincepri added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed do-not-merge/needs-area PR is missing an area label labels Jan 7, 2025
@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from 211839e to 533803e Compare January 7, 2025 19:02
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from vincepri. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 7, 2025
@vincepri vincepri removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2025
@vincepri vincepri force-pushed the pod-drain-waitcompleted branch 2 times, most recently from 8826d5b to e22bc4d Compare January 9, 2025 15:18
@vincepri
Copy link
Member Author

vincepri commented Jan 9, 2025

/test ?

@k8s-ci-robot
Copy link
Contributor

@vincepri: The following commands are available to trigger required jobs:

/test pull-cluster-api-build-main
/test pull-cluster-api-e2e-blocking-main
/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-32-1-33-main
/test pull-cluster-api-test-main
/test pull-cluster-api-test-mink8s-main
/test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

/test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-apidiff-main
pull-cluster-api-build-main
pull-cluster-api-e2e-blocking-main
pull-cluster-api-test-main
pull-cluster-api-verify-main

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@vincepri
Copy link
Member Author

vincepri commented Jan 9, 2025

/test pull-cluster-api-e2e-main

@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from e22bc4d to 1ebe3c2 Compare January 9, 2025 20:10
@vincepri
Copy link
Member Author

vincepri commented Jan 9, 2025

/test pull-cluster-api-e2e-main

@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from 1ebe3c2 to ba555de Compare January 9, 2025 20:31
@vincepri
Copy link
Member Author

vincepri commented Jan 9, 2025

/test pull-cluster-api-e2e-main

@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from ba555de to 6650be9 Compare January 9, 2025 21:36
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 9, 2025
@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from 6650be9 to dbf82f8 Compare January 9, 2025 23:51
@vincepri
Copy link
Member Author

/test pull-cluster-api-e2e-main

@k8s-ci-robot
Copy link
Contributor

@vincepri: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-e2e-main dbf82f8 link true /test pull-cluster-api-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

// "Drain" means that the Pods to which this MachineDrainRule applies will be drained.
// If behavior is set to "Drain" the order in which Pods are drained can be configured
// with the order field. When draining Pods of a Node the Pods will be grouped by order
// and one group after another will be drained (by increasing order). Cluster API will
// wait until all Pods of a group are terminated / removed from the Node before starting
// with the next group.
// "Skip" means that the Pods to which this MachineDrainRule applies will be skipped during drain.
// "WaitCompleted" means that the pods to which this MachineDrainRule applies will never be evicted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a sentence that we assume Order 0 and that Order cannot be configured if behavior is WaitCompleted?

@@ -147,8 +147,8 @@ metadata:
namespace: default
spec:
drain:
# behavior can be Drain or Skip
behavior: Drain
# behavior can be Drain, Skip, or WaitCompleted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please add an additional user story above that covers WaitCompleted

@@ -259,6 +260,8 @@ The following describes the new algorithm that decides which Pods should be drai
* \=\> use `behavior: Skip`
* If the Pod has `cluster.x-k8s.io/drain: skip`
* \=\> use `behavior: Skip`
* If the Pod has `cluster.x-k8s.io/drain: wait-completed`
* \=\> use `behavior: WaitCompleted`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* \=\> use `behavior: WaitCompleted`
* \=\> use `behavior: WaitCompleted` and `order: 0`

@@ -498,6 +507,10 @@ func (r EvictionResult) ConditionMessage(nodeDrainStartTime *metav1.Time) string
conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, the following Pods will be evicted: %s",
conditionMessage, PodListToString(r.PodsToTriggerEvictionLater, 3))
}
if len(r.PodsToWaitCompleted) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabriziopandini This PR or follow-up? I lean towards follow-up, mostly just to not expand the scope of this PR further

@@ -498,6 +512,10 @@ func (r EvictionResult) ConditionMessage(nodeDrainStartTime *metav1.Time) string
conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, the following Pods will be evicted: %s",
conditionMessage, PodListToString(r.PodsToTriggerEvictionLater, 3))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move the if for PodsToWaitCompleted above the one for PodsToTriggerEvictionLater

The message for PodsToTriggerEvictionLater contains "After above Pods ..." and this only makes sense if a potential "Waiting for the following Pods..." message is actually above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm okay this is a bit more complicated.

The idea is basicallly that this condition message surfaces where the drain is currently stuck (and which Pods we are going to evict next).

The problem with the current behavior is that "WaitCompleted"-Pods are sometimes blocking further progress and sometimes they don't (it depends on the order of the Pods that we have to evict).

I would suggest to treat WaitCompleted Pods similar to PodsToTriggerEviction.

This means splitting them up into podsToWaitCompletedNow and podsToWaitCompletedLater in EvictPods (based on if ptr.Deref(pod.Status.DrainOrder, 0) == minDrainOrder)

I would then also update the log statement in l.303 accordingly.

Once that is done we can also write the condition message here accordingly. I would suggest:

	if len(r.PodsToWaitCompletedNow) > 0 {
		kind := "Pod"
		if len(r.PodsToWaitCompletedNow) > 1 {
			kind = "Pods"
		}
		// Note: the code computing stale warning for the machine deleting condition is making assumptions on the format/content of this message.
		// Same applies for other conditions where deleting is involved, e.g. MachineSet's Deleting and ScalingDown condition.
		conditionMessage = fmt.Sprintf("%s\n* %s %s: waiting for completion",
			conditionMessage, kind, PodListToString(r.PodsDeletionTimestampSet, 3))
	}	
	
	if len(r.PodsToTriggerEvictionLater) > 0 {
		conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, the following Pods will be evicted: %s",
			conditionMessage, PodListToString(r.PodsToTriggerEvictionLater, 3))
	}

	if len(r.PodsToWaitCompletedLater) > 0 {
		conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, waiting for the following Pods to complete without eviction: %s",
			conditionMessage, PodListToString(r.PodsToWaitCompletedLater, 3))
	}

Namespace: "unevictable-workload",
NodeSelector: map[string]string{nodeOwnerLabelKey: "KubeadmControlPlane-" + controlplane.Name},
ModifyDeployment: func(deployment *appsv1.Deployment) {
deployment.Spec.Template.Labels["cluster.x-k8s.io/drain"] = string(clusterv1.MachineDrainRuleDrainBehaviorWaitCompleted)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
deployment.Spec.Template.Labels["cluster.x-k8s.io/drain"] = string(clusterv1.MachineDrainRuleDrainBehaviorWaitCompleted)
deployment.Spec.Template.Labels["cluster.x-k8s.io/drain"] = "wait-completed"

nit: I know both works, but maybe let's test the one that we document in the API (same in l.256)

@@ -207,7 +207,7 @@ func NodeDrainTimeoutSpec(ctx context.Context, inputGetter func() NodeDrainTimeo
}

By("Deploy Deployment with unevictable Pods on control plane and MachineDeployment Nodes.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the test docs in l.79++

This should be

// NodeDrainTimeoutSpec goes through the following steps:
// * Create cluster with 3 CP & 1 worker Machine
// * Ensure Node label is set & NodeDrainTimeout is set to 0 (wait forever)
// * Deploy MachineDrainRules
// * Deploy Deployment with unevictable Pods on CP & MD Nodes
// * Deploy Deployment with unevictable Pods with `wait-completed` label on CP & MD Nodes
// * Deploy Deployment with evictable Pods with finalizer on CP & MD Nodes
// * Deploy additional resources if defined in input
// * Trigger Node drain by scaling down the control plane to 1 and MachineDeployments to 0
// * Get draining control plane and MachineDeployment Machines
// * Verify drain of Deployments with order 1
// * Verify drain of Deployments with order 5
// * Verify skipped Pods are still there and don't have a deletionTimestamp
// * Verify wait-completed Pods are still there and don't have a deletionTimestamp
// * Verify Node drains for control plane and MachineDeployment Machines are blocked (by PDBs and WaitCompleted Pods)
// * Set NodeDrainTimeout to 1s to unblock Node drain
// * Verify machine deletion is blocked by waiting for volume detachment (only if VerifyNodeVolumeDetach is enabled)
// * Unblocks waiting for volume detachment (only if VerifyNodeVolumeDetach is enabled)
// * Verify scale down succeeded because Node drains were unblocked.

(we'll have to change the order values, see comment below)

Expect(skippedMDPods.Items).To(HaveLen(1))
Expect(skippedMDPods.Items[0].DeletionTimestamp.IsZero()).To(BeTrue())
}

By("Verify Node drains for control plane and MachineDeployment Machines are blocked (only by PDBs)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
By("Verify Node drains for control plane and MachineDeployment Machines are blocked (only by PDBs)")
By("Verify Node drains for control plane and MachineDeployment Machines are blocked (by PDBs and WaitCompleted Pods)")

@@ -233,6 +233,31 @@ func NodeDrainTimeoutSpec(ctx context.Context, inputGetter func() NodeDrainTimeo
WaitForDeploymentAvailableInterval: input.E2EConfig.GetIntervals(specName, "wait-deployment-available"),
})
}
By("Deploy an unevictable Deployment with `wait-completed` label.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general the test can work roughly as is.

But as we are now assuming order 0 for WaitCompleted Pods we have to change the order of some other Pods to make sure they can be drained (currently the drain of these Pods is blocked by the waitcompleted Pods).

tl;dr I think the Pods with drain order 1 should get order -5, the Pods with drain order 5 should get order -1

Then all of these Pods can be evicted / drained


// DeployPodAndWait will deploy a Deployment on a ControlPlane or MachineDeployment.
func DeployPodAndWait(ctx context.Context, input DeployPodAndWaitInput) {
Expect(input.DeploymentName).ToNot(BeNil(), "Need a deployment name in DeployUnevictablePod")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Expect(input.DeploymentName).ToNot(BeNil(), "Need a deployment name in DeployUnevictablePod")
Expect(input.DeploymentName).ToNot(BeNil(), "Need a deployment name in DeployPodAndWait")

nit, same in the few lines below

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants