feat: Add option to configure extra pod capacity for alternate cnis #6042
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
Sorry this closed and slipped through the cracks. I wanted to truly understand the use-cases here before we make this configurable, but I agree that if the default is the same, this should be easy to reason about.
Pull Request Test Coverage Report for Build 9407938329 (Coveralls)
The core change we are asking for here concerns the hardcoded assumption that each node has "2" pods of extra capacity because of the daemonsets. With Cilium in kube-proxyless mode, that assumption does not hold. I assume that rarer Multus CNI or chained CNI setups also need a different value.
@cnmcavoy does this solve your issue? kubernetes-sigs/karpenter#1305 @ellistarn can add more details here, but this could potentially be added as an overlay on the node. Do you think that this expected pod overhead differs by instance type, or should the overhead be the same across the cluster since it's related to the networking setup for the cluster?
I have some relevant context, as we just had a discussion about whether we should lower the IP configurations we generate in the Azure provider based on the static host-networking addons. I was just looking at IP allocation for static addons in our provider. We had a similar value, "StaticAzureCNIHostNetworkAddons", and we decided to just respect the max pods values we pass into kubelet and not add the deficit at all. I'm not sure how the AWS provider generates IP configurations or how ENI works at all, but maybe this is a useful datapoint. See this commit for more details. cc: @paulgmiller, who is our networking lead at AKS and might have more context.
"NOTE: There are some pods that we see on every node for networking (like kube-proxy and ip-masq-agent). We could be subtracting IPs for these pods, and we do for AKS Engine. In the case of AKS, however, we have a dynamic number of host networking addon pods. So we are explicitly choosing to not subtract these IPs like AKS Engine and some parts of AKS do. The AKS networking team regrets adding this behavior and Karpenter is a chance to do better, so we shall."
Interesting. If I am following your context with AKS, you are suggesting Karpenter should remove the hardcoded static limit entirely and present the stated EC2 pod capacity. The consequence would be that the pod capacity would not match the AWS ENI capacity of the node for those of us using ENI IPAM. What do you do when pods try to start on a node and can't receive an IP because the pod capacity != CNI capacity? I opened this PR because we hit this in production @ Indeed. I do not see how node-overlays addresses this bug report. As a user of Karpenter, I want the pod capacity of nodes to be correct. I do not want to set my own custom pod capacity. I want the node pod capacity to match the number of ENIs that can be attached to a node. Currently it does not, because there are hardcoded magic numbers in the math that assume too much. Does this help clarify what this request is about?
There's a broader challenge here: host-networked pods are not correctly accounted for. We make a bad assumption that there will be 2 (e.g. kube-proxy/vpc-cni), but in reality there are often far more than this due to other daemons, which artificially restricts nodes to fewer pods than they could run. Another way to achieve your feature request is via a negative overhead node overlay.
This will select all nodes and tell the scheduler that there's an additional pod slot available.
f306854 to a7ce3b5
a7ce3b5 to 96e8c1f
Pull Request Test Coverage Report for Build 11278139874 (Coveralls)
Fixes #5780
Description
Adds an option to configure extra pod capacity for alternative CNIs. Karpenter hardcodes 2 extra pods beyond the ENI IP limits to account for aws-cni and kube-proxy. This exposes that value as a configurable option while leaving the default untouched.
We want to be able to use Cilium with kube-proxy replacement mode enabled, so we need to be able to set this to "1" to account for 1 fewer host-networked pod.
How was this change tested?
make presubmit
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.