Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: CNS checks apiserver in healthz #3269

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

tyler-lloyd
Copy link
Contributor

@tyler-lloyd tyler-lloyd commented Dec 13, 2024

Reason for Change:

CNS should regularly check if it is able to reach the apiserver to mitigate the risk of losing access
and then failing silently. The issue linked below is a scenario where CNS was not reloading the
refreshed SA token from disk resulting in controller-runtime's failure to update its cache and watch
pods and NNCs. This unauthorized failure went unnoticed until workloads were scaled up on the node,
at which point CNS was unable to update the NNC to request for more IPs so the pods were stuck in
Pending.

For now, this will just add a checker when the ChannelMode is CRD. Any other modes will get the
default Ping checker built-in to controller-runtime.

Issue Fixed:

Follow up for Azure/AKS#4679

Requirements:

Notes:

@tyler-lloyd tyler-lloyd force-pushed the tyler-lloyd/cns-health-checker branch from b78005a to 20ed505 Compare December 13, 2024 18:57
ctx := req.Context()
// we just care that we're allowed to List NNCs so set limit to 1 to minimize
// additional load on apiserver
if err := cli.List(ctx, &v1alpha.NodeNetworkConfigList{}, &client.ListOptions{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if listing NNCs is the best we can do - this may be immediately problematic for cilium nodesubnet where we now run CNS but do not expect NNCs (or even to talk to the API server? @santhoshmprabhu)...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anything in the cnsconfig that's reliable to tell us if we need to read NNCs? CNIConflistScenario string? or maybe try and list pods (keeping limit at 1) if ipamv2 is enabled?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChannelMode==AzureHost is how we configure nodesubnet. @rbtr, maybe ChannelMode==CRD || ChannelMode==MultiTenantCRD is the right check for NNCs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a tl;dr for what these channel modes map to or what they mean?

// ChannelMode :- CNS channel modes
const (
	Direct         = "Direct"
	Managed        = "Managed"
	CRD            = "CRD"
	MultiTenantCRD = "MultiTenantCRD"
	AzureHost      = "AzureHost"
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding, NNCs basically define the CRD, hence my expectation is that CRD and MultiTenantCRD are the right checks. AzureHost indicates nodesubnet - CNS gets IPs from wireserver. Other modes represent direct communication between DNC and CNS I believe, @rbtr would know more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Channel mode ~= how CNS talks to the SDN controlplane (I think it came from communication channel).
Direct is when DNC and CNS can reach each other, managed is mDNC, CRD/MTCRD are via NNC, MTPNC, etc.
I don't want the channel mode config polluting the healthserver, but if we could enable or inject the healthcheck from main based on it that's okay.
It may be as easy as initializing this check at the same spot as we initialize the NNC reconciler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbtr @santhoshmprabhu lmk what you think now. The nnc checker is only initialized for channelMode CRD. I guess we could also include MTCRD but I wanted to keep changes minimal for now (although maybe we should start with MTCRD to keep blast radius smaller).

I'm also open to only returning unhealthy if we get a 401 this is mostly to address the "expired token" issue.

@rbtr rbtr added enhancement cns Related to CNS. labels Dec 13, 2024
cns/healthserver/healthz.go Outdated Show resolved Hide resolved
cns/healthserver/healthz.go Outdated Show resolved Hide resolved
@tyler-lloyd tyler-lloyd marked this pull request as ready for review December 16, 2024 17:23
@tyler-lloyd tyler-lloyd requested a review from a team as a code owner December 16, 2024 17:23
@tyler-lloyd tyler-lloyd force-pushed the tyler-lloyd/cns-health-checker branch from 1dffb35 to e7602f7 Compare December 16, 2024 17:26
Copy link
Contributor

@ramiro-gamarra ramiro-gamarra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change should be behind a flag since it is otherwise a breaking change for non-k8s scenarios.

not every instance of CNS will need (or can) check
NNCs. The `CRD` channel mode is used by AKS to indicate
that CNS will be reading/watching NNCs. `AzureHost` is
a newer mode that's used in nodesubnet where NNCs aren't
used and therefore CNS has no reason to have its health
depend on NNC access.
@tyler-lloyd tyler-lloyd force-pushed the tyler-lloyd/cns-health-checker branch from 5f5812c to bcc574f Compare December 18, 2024 14:42
Copy link

github-actions bot commented Jan 2, 2025

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jan 2, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cns Related to CNS. enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants