Feature enforceExpireAfter - respect to expireAfter ttl #1789
Comments
Have you considered using
Yes, we use a 2h terminationGracePeriod value. This does not work, probably because a delete call is never even made against the node (it is just "marked for deletion" by Karpenter).
@jmdeal after re-reading the TerminationGracePeriod documentation, it states:
So in my case, should this issue be changed to a bug instead of a feature request?
Yes, if the node has been draining for longer than your terminationGracePeriod, this would be a bug, not a feature. TGP should enforce a maximum grace time, which should meet your use case. Are you able to share the Karpenter logs / events that were emitted? /kind bug
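The expectation stated above (a node that has been draining for longer than terminationGracePeriod should be forcibly cleaned up) can be written down as a minimal Python sketch. This is only a model of the described behavior, not Karpenter's code: the function names `drain_deadline` and `exceeded_grace_period` and the timestamps are hypothetical, and the 2h value is the one mentioned in this thread.

```python
from datetime import datetime, timedelta, timezone


def drain_deadline(drain_started: datetime, grace_period: timedelta) -> datetime:
    """Latest time a draining node is expected to survive (hypothetical helper)."""
    return drain_started + grace_period


def exceeded_grace_period(now: datetime, drain_started: datetime,
                          grace_period: timedelta) -> bool:
    """True when the node has been draining longer than the grace period."""
    return now > drain_deadline(drain_started, grace_period)


# Illustrative values only: a 2h terminationGracePeriod and made-up timestamps.
started = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
tgp = timedelta(hours=2)

print(exceeded_grace_period(started + timedelta(hours=1), started, tgp))
print(exceeded_grace_period(started + timedelta(hours=3), started, tgp))
```

If the second check holds for a real node and the node still exists, that matches the situation described here as a bug rather than a missing feature.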
/remove-kind feature
Sure 👍, I will add logs from historical data by early next week (though I will probably have fresh info from Sunday/Monday). Thanks.
An example from today for a node that is part of a NodePool with
Endless logs of:
{"body":"Failed to drain node, 8 pods are waiting to be evicted","severity":"Warning","attributes":{"k8s.event.action":"","k8s.event.count":1546,"k8s.event.name":"ip-10-235-51-74.ec2.internal.1804f311a19b7e76","k8s.event.reason":"FailedDraining","k8s.event.start_time":"2024-11-07 00:17:41 +0000 UTC","k8s.event.uid":"73e98ab4-b698-4f16-90f1-db050a48d744","k8s.namespace.name":""},"resources":{"k8s.node.name":"","k8s.object.api_version":"v1","k8s.object.fieldpath":"","k8s.object.kind":"Node","k8s.object.name":"ip-10-235-51-74.ec2.internal","k8s.object.resource_version":"965976639","k8s.object.uid":"5abd9407-e06d-4e80-b0db-c48444e4f414"}}
{"body":"Cannot disrupt Node: state node is marked for deletion","severity":"Normal","attributes":{"k8s.event.action":"","k8s.event.count":1501,"k8s.event.name":"ip-10-235-51-74.ec2.internal.1804f3123cd3ae94","k8s.event.reason":"DisruptionBlocked","k8s.event.start_time":"2024-11-06 02:22:17 +0000 UTC","k8s.event.uid":"4d6ddbff-1a4b-4fa7-8eb3-dd8ba0c37753","k8s.namespace.name":""},"resources":{"k8s.node.name":"","k8s.object.api_version":"v1","k8s.object.fieldpath":"","k8s.object.kind":"Node","k8s.object.name":"ip-10-235-51-74.ec2.internal","k8s.object.resource_version":"963917187","k8s.object.uid":"5abd9407-e06d-4e80-b0db-c48444e4f414"}}
This node contains 8 pods: 7 of them are DaemonSet pods, and a single Deployment pod that contains the
Note that the nodes the above pod was scheduled on (i.e. NODE-A and NODE-B) show the following events (both lived for just under 4 days):
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
Same problem. Pods with the do-not-disrupt annotation block node deletion after expiry. Same error messages, same experience, easily reproducible.
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
/remove-lifecycle stale
Description
Karpenter has the expireAfter feature. Let's assume I've configured a 24h value for expireAfter. After 24 hours have passed the node should be terminated, but Karpenter will result in
This is because one or more of the pods on that node have the karpenter.sh/do-not-disrupt: true annotation.
The result is a node that is tainted with karpenter.sh/disrupted:NoSchedule: no new pods will be scheduled onto it, and it is in a "stuck" situation.
I would like a way to force Karpenter to spin up new nodes even if I use this annotation.
Would it be reasonable to add an enforceExpireAfter: true|false (default false) feature, so that if true is set, Karpenter will ignore/remove the do-not-disrupt annotation and just delete the node?
What problem are you trying to solve?
Forcefully delete nodes after the expireAfter TTL has passed.
How important is this feature to you?
Very. The lack of this feature results in underutilized nodes that cannot be automatically deleted.
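To make the request concrete, here is a minimal Python sketch of the decision the proposed enforceExpireAfter flag would change. It is a model of the proposal only, not an existing Karpenter API: `Pod`, `blocking_pods`, `may_delete_expired_node`, and the `enforce_expire_after` parameter are hypothetical names; the annotation key is the one quoted in this issue.

```python
from dataclasses import dataclass, field

# Annotation key quoted in this issue.
DO_NOT_DISRUPT = "karpenter.sh/do-not-disrupt"


@dataclass
class Pod:
    """Tiny stand-in for a pod; only the fields this sketch needs."""
    name: str
    annotations: dict = field(default_factory=dict)


def blocking_pods(pods):
    """Pods whose do-not-disrupt annotation would block node deletion."""
    return [p for p in pods if p.annotations.get(DO_NOT_DISRUPT) == "true"]


def may_delete_expired_node(pods, enforce_expire_after=False):
    """Decide whether an already expired node may be deleted.

    With the hypothetical flag off (the proposed default), any annotated
    pod keeps the node alive. With it on, the annotation is ignored.
    """
    if enforce_expire_after:
        return True
    return not blocking_pods(pods)


pods = [Pod("daemonset-pod"), Pod("worker", {DO_NOT_DISRUPT: "true"})]
print(may_delete_expired_node(pods))
print(may_delete_expired_node(pods, enforce_expire_after=True))
```

Keeping the default at false preserves the current behavior for anyone who relies on the annotation to protect long-running pods.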