-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: CNS checks apiserver in healthz #3269
base: master
Are you sure you want to change the base?
feat: CNS checks apiserver in healthz #3269
Conversation
b78005a
to
20ed505
Compare
ctx := req.Context() | ||
// we just care that we're allowed to List NNCs so set limit to 1 to minimize | ||
// additional load on apiserver | ||
if err := cli.List(ctx, &v1alpha.NodeNetworkConfigList{}, &client.ListOptions{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if listing NNCs is the best we can do - this may be immediately problematic for cilium nodesubnet where we now run CNS but do not expect NNCs (or even to talk to the API server? @santhoshmprabhu)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anything in the cnsconfig that's reliable to tell us if we need to read NNCs? CNIConflistScenario string
? or maybe try and list pods (keeping limit at 1) if ipamv2 is enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ChannelMode==AzureHost
is how we configure nodesubnet. @rbtr, maybe ChannelMode==CRD || ChannelMode==MultiTenantCRD
is the right check for NNCs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a tl;dr for what these channel modes map to or what they mean?
// ChannelMode :- CNS channel modes
const (
Direct = "Direct"
Managed = "Managed"
CRD = "CRD"
MultiTenantCRD = "MultiTenantCRD"
AzureHost = "AzureHost"
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my understanding, NNCs basically define the CRD, hence my expectation is that CRD
and MultiTenantCRD
are the right checks. AzureHost
indicates nodesubnet - CNS gets IPs from wireserver. Other modes represent direct communication between DNC and CNS I believe, @rbtr would know more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Channel mode ~= how CNS talks to the SDN controlplane (I think it came from communication channel).
Direct is when DNC and CNS can reach each other, managed is mDNC, CRD/MTCRD are via NNC, MTPNC, etc.
I don't want the channel mode config polluting the healthserver, but if we could enable or inject the healthcheck from main based on it that's okay.
It may be as easy as initializing this check at the same spot as we initialize the NNC reconciler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rbtr @santhoshmprabhu lmk what you think now. The nnc checker is only initialized for channelMode CRD
. I guess we could also include MTCRD but I wanted to keep changes minimal for now (although maybe we should start with MTCRD to keep blast radius smaller).
I'm also open to only returning unhealthy if we get a 401 this is mostly to address the "expired token" issue.
1dffb35
to
e7602f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change should be behind a flag since it is otherwise a breaking change for non-k8s scenarios.
not every instance of CNS will need (or can) check NNCs. The `CRD` channel mode is used by AKS to indicate that CNS will be reading/watching NNCs. `AzureHost` is a newer mode that's used in nodesubnet where NNCs aren't used and therefore CNS has no reason to have its health depend on NNC access.
5f5812c
to
bcc574f
Compare
This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days |
Reason for Change:
CNS should regularly check if it is able to reach the apiserver to mitigate the risk of losing access
and then failing silently. The issue linked below is a scenario where CNS was not reloading the
refreshed SA token from disk resulting in controller-runtime's failure to update its cache and watch
pods and NNCs. This unauthorized failure went unnoticed until workloads were scaled up on the node,
at which point CNS was unable to update the NNC to request for more IPs so the pods were stuck in
Pending.
For now, this will just add a checker when the
ChannelMode
isCRD
. Any other modes will get thedefault
Ping
checker built-in to controller-runtime.Issue Fixed:
Follow up for Azure/AKS#4679
Requirements:
Notes: