feat: CNS checks apiserver in healthz #3269

tyler-lloyd · 2024-12-13T18:57:14Z

Reason for Change:

CNS should regularly check that it can reach the apiserver, to reduce the risk of losing access
and then failing silently. The issue linked below describes a scenario where CNS did not reload the
refreshed SA token from disk, so controller-runtime could no longer update its cache or watch
pods and NNCs. This authorization failure went unnoticed until workloads were scaled up on the node,
at which point CNS was unable to update the NNC to request more IPs, so the pods were stuck in
Pending.

For now, this only adds a checker when the ChannelMode is CRD. All other modes get the
default Ping checker built into controller-runtime.

Issue Fixed:

Follow up for Azure/AKS#4679

Requirements:

uses conventional commit messages
includes documentation
adds unit tests
relevant PR labels added

Notes:

rbtr · 2024-12-13T19:10:07Z

cns/healthserver/healthz.go

+			ctx := req.Context()
+			// we just care that we're allowed to List NNCs so set limit to 1 to minimize
+			// additional load on apiserver
+			if err := cli.List(ctx, &v1alpha.NodeNetworkConfigList{}, &client.ListOptions{


I wonder if listing NNCs is the best we can do - this may be immediately problematic for cilium nodesubnet where we now run CNS but do not expect NNCs (or even to talk to the API server? @santhoshmprabhu)...

Is there anything in the cnsconfig that reliably tells us whether we need to read NNCs? The CNIConflistScenario string? Or maybe try to list pods (keeping the limit at 1) if ipamv2 is enabled?

ChannelMode==AzureHost is how we configure nodesubnet. @rbtr, maybe ChannelMode==CRD || ChannelMode==MultiTenantCRD is the right check for NNCs?

Is there a tl;dr for what these channel modes map to or what they mean?

// ChannelMode :- CNS channel modes
const (
	Direct         = "Direct"
	Managed        = "Managed"
	CRD            = "CRD"
	MultiTenantCRD = "MultiTenantCRD"
	AzureHost      = "AzureHost"
)

In my understanding, NNCs basically define the CRD, hence my expectation is that CRD and MultiTenantCRD are the right checks. AzureHost indicates nodesubnet - CNS gets IPs from wireserver. Other modes represent direct communication between DNC and CNS I believe, @rbtr would know more.

Channel mode ~= how CNS talks to the SDN controlplane (I think it came from communication channel).
Direct is when DNC and CNS can reach each other, Managed is mDNC, and CRD/MTCRD go via NNC, MTPNC, etc.
I don't want the channel mode config polluting the healthserver, but if we could enable or inject the healthcheck from main based on it that's okay.
It may be as easy as initializing this check at the same spot as we initialize the NNC reconciler.

@rbtr @santhoshmprabhu lmk what you think now. The nnc checker is only initialized for channelMode CRD. I guess we could also include MTCRD but I wanted to keep changes minimal for now (although maybe we should start with MTCRD to keep blast radius smaller).

I'm also open to only returning unhealthy if we get a 401, since this is mostly to address the "expired token" issue.

cns/healthserver/healthz.go

cns/service/main.go

ramiro-gamarra

This change should be behind a flag since it is otherwise a breaking change for non-k8s scenarios.

Not every instance of CNS needs to (or can) check NNCs. The `CRD` channel mode is used by AKS to indicate that CNS will be reading/watching NNCs. `AzureHost` is a newer mode used in nodesubnet, where NNCs aren't used, so CNS has no reason to have its health depend on NNC access.

github-actions · 2025-01-02T00:01:31Z

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

tyler-lloyd force-pushed the tyler-lloyd/cns-health-checker branch from b78005a to 20ed505 Compare December 13, 2024 18:57

rbtr reviewed Dec 13, 2024

rbtr added enhancement cns Related to CNS. labels Dec 13, 2024

timraymond requested changes Dec 16, 2024

cns/healthserver/healthz.go (outdated)

cns/healthserver/healthz.go (outdated)

tyler-lloyd marked this pull request as ready for review December 16, 2024 17:23

tyler-lloyd requested a review from a team as a code owner December 16, 2024 17:23

tyler-lloyd requested a review from ramiro-gamarra December 16, 2024 17:23

tyler-lloyd force-pushed the tyler-lloyd/cns-health-checker branch from 1dffb35 to e7602f7 Compare December 16, 2024 17:26

ramiro-gamarra reviewed Dec 16, 2024

cns/service/main.go

ramiro-gamarra requested changes Dec 16, 2024

tyler-lloyd added 7 commits December 18, 2024 09:42

feat: CNS checks apiserver in healthz

085fbc5

test: add unit tests

375fb6a

refactor: return error from NewHealthzHandlerWithChecks instead of panicking

e099c53

chore: address lint errors

af97565

refactor: only get kubeConfig when in CRD mode

50e1057

chore: fix lint errors

bcc574f

tyler-lloyd force-pushed the tyler-lloyd/cns-health-checker branch from 5f5812c to bcc574f Compare December 18, 2024 14:42

tyler-lloyd requested review from timraymond and ramiro-gamarra December 18, 2024 14:43

github-actions bot added the stale Stale due to inactivity. label Jan 2, 2025

rbtr removed the stale Stale due to inactivity. label Jan 2, 2025

rbtr Dec 13, 2024

tyler-lloyd Dec 13, 2024

santhoshmprabhu Dec 16, 2024

tyler-lloyd Dec 16, 2024

santhoshmprabhu Dec 16, 2024

rbtr Dec 16, 2024

tyler-lloyd Dec 17, 2024

ramiro-gamarra left a comment

github-actions bot commented Jan 2, 2025
