
change the kubelet service crash loop behavior #2178

Open
neolit123 opened this issue Jun 10, 2020 · 28 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/release: Categorizes an issue or PR as relevant to SIG Release.
Comments

@neolit123
Member

neolit123 commented Jun 10, 2020

over time we have seen a number of complaints related to the crash loop of the kubelet service in the DEB/RPM packages. when the kubelet is installed, the service is enabled, but it fails because it's missing its config.yaml (KubeletConfiguration), unless something like kubeadm creates one for it.

this has caused problems for:

  • Windows support
  • supporting other service managers like OpenRC

after a discussion during the kubeadm office hours on 10.06.2020 we agreed that it might be a good idea to change this behavior and keep the service disabled by default. but this would require changes in both kubeadm and the kubelet systemd specs.

the idea we are thinking of is the following:

  • modify kubeadm to always enable the service if it's not enabled at kubeadm init/join time (see the sketch after this list).
    note that currently kubeadm just has a preflight check that fails if the service is not enabled and instructs the user how to enable it manually.
  • modify the kubelet systemd service files in the kubernetes/release repository to have the kubelet service disabled by default. this change will require a release note with "action-required", as non-kubeadm users would have to manually enable it (e.g. using: "systemctl enable kubelet").
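
a minimal sketch of what the first bullet could look like on a systemd host (hypothetical helper names, shelling out to systemctl; this is not the actual kubeadm implementation):

```go
// Hypothetical sketch only: enable the kubelet service during init/join if it
// is not already enabled, instead of failing a preflight check. Assumes a
// systemd host.
package main

import (
	"fmt"
	"os/exec"
)

// kubeletServiceEnabled reports whether the kubelet unit is enabled.
// "systemctl is-enabled" exits non-zero when the unit is disabled.
func kubeletServiceEnabled() bool {
	return exec.Command("systemctl", "is-enabled", "kubelet").Run() == nil
}

// ensureKubeletServiceEnabled enables the unit only when needed, so that
// init/join no longer has to abort and ask the user to enable it manually.
func ensureKubeletServiceEnabled() error {
	if kubeletServiceEnabled() {
		return nil
	}
	if out, err := exec.Command("systemctl", "enable", "kubelet").CombinedOutput(); err != nil {
		return fmt.Errorf("failed to enable the kubelet service: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := ensureKubeletServiceEnabled(); err != nil {
		fmt.Println(err)
	}
}
```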

/kind feature
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the kind/feature and priority/important-longterm labels Jun 10, 2020
@neolit123
Member Author

the kubeadm change is doable for 1.19, but arguably the k/release change needs wider agreement.

@neolit123 neolit123 added this to the v1.19 milestone Jun 10, 2020
@neolit123
Member Author

xref kubernetes/release#1352

@neolit123
Member Author

cc @rosti
so i think there is a limitation of our release process as i mentioned here:
kubernetes/release#1352

which means that the PR might be much better than the above proposal.

@BenTheElder
Member

@neolit123
Member Author

neolit123 commented Jun 10, 2020

it only starts and restarts the service but does not manage the "enable" status.
it does tell the user how to enable it, though.

@rosti

rosti commented Jun 11, 2020

The necessary modifications here are:

  1. Remove the kubelet service running check (from here)
  2. Enable the service during init and join. One way is to add the enabling code where the TryStartKubelet calls are. A safer and somewhat backwards-compatible alternative is to add new kubelet enable phases to init and join that would permanently enable the kubelet service only after the rest of the init/join process has completed successfully.
  3. Disable the kubelet service during kubeadm reset. Again, this can sit next to the TryStopKubelet call or live in a separate reset phase.

As said previously, it doesn't hurt for us to implement this. It's not that big of a change on our side. The problem is what to do with the packages, where the kubelet service continues to be automatically enabled and crash looping (since it doesn't have a config file).
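
A rough sketch of how steps 2 and 3 could look on a systemd host, assuming we shell out to systemctl; the phase function names are hypothetical and only illustrate the ordering (enable and start only after the kubelet config has been written, stop and disable on reset):

```go
// Hypothetical sketch of the enable/disable ordering; not actual kubeadm code.
package main

import (
	"fmt"
	"os/exec"
)

func systemctl(args ...string) error {
	if out, err := exec.Command("systemctl", args...).CombinedOutput(); err != nil {
		return fmt.Errorf("systemctl %v failed: %v: %s", args, err, out)
	}
	return nil
}

// runKubeletEnablePhase would run at the end of init/join, after the
// KubeletConfiguration and the other kubelet files have been written,
// so the service never starts without its config (step 2).
func runKubeletEnablePhase() error {
	if err := systemctl("enable", "kubelet"); err != nil {
		return err
	}
	return systemctl("restart", "kubelet")
}

// runResetKubeletPhase would run during "kubeadm reset", next to the
// existing TryStopKubelet call (step 3).
func runResetKubeletPhase() error {
	if err := systemctl("stop", "kubelet"); err != nil {
		return err
	}
	return systemctl("disable", "kubelet")
}

func main() {
	// In kubeadm these would be wired in as init/join and reset phases.
	if err := runKubeletEnablePhase(); err != nil {
		fmt.Println(err)
	}
	_ = runResetKubeletPhase
}
```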

@neolit123
Member Author

your proposed plan for kubeadm seems fine to me.

The problem is what to do with the packages where the kubelet service continues to be automatically enabled and crash looping (since it doesn't have a config file).

so i think ideally we should be making this change only for the latest MINOR.
but in kubernetes/release#1352 we are also discussing the fact that currently such changes are applied to all PATCH releases of the k8s packages.

currently the kubeadm package has the kubelet package as a dependency (there were discussions to change this too), which supposedly installs the same version of both packages for most users.

there could be users that are bypassing that and installing e.g. kubeadm 1.x and kubelet 1.x-1, due to factor X, and this is a supported skew. for such a skew the old kubelet service may be enabled by default (crashloop) but the new kubeadm could be managing "enable" already.

testing something like systemctl enable kubelet on a service that is already enabled doesn't produce an exit status != 0; for windows, Set-Service ... -StartupType .. should not return errors. no idea about OpenRC. in any case we may have to call the potential InitSystem#ServiceEnable() without catching its errors, but later fail on InitSystem#ServiceStart() (see the sketch below).

so overall even if a kubeadm binary encounters an older kubelet service, for the crashloop problem in particular this should be fine, unless i'm missing something.
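
a small sketch of the "best effort enable, strict start" idea above; the InitSystem interface only mirrors kubeadm's init system abstraction for illustration, and ServiceEnable is the potential method mentioned here, not something guaranteed to exist today:

```go
// Hypothetical sketch: ignore a failed enable (already-enabled services exit 0
// on systemd and should not error on Windows), but fail hard if the kubelet
// cannot be started.
package main

import "fmt"

// InitSystem mirrors the shape of kubeadm's init system abstraction for the
// purpose of this sketch; ServiceEnable is the potential new method.
type InitSystem interface {
	ServiceEnable(service string) error
	ServiceStart(service string) error
}

func enableAndStartKubelet(initSystem InitSystem) error {
	if err := initSystem.ServiceEnable("kubelet"); err != nil {
		// best effort: warn and continue rather than aborting init/join
		fmt.Printf("warning: could not enable the kubelet service: %v\n", err)
	}
	return initSystem.ServiceStart("kubelet")
}

// fakeInitSystem is a stand-in so the sketch compiles and runs on its own.
type fakeInitSystem struct{}

func (fakeInitSystem) ServiceEnable(string) error { return nil }
func (fakeInitSystem) ServiceStart(string) error  { return nil }

func main() {
	if err := enableAndStartKubelet(fakeInitSystem{}); err != nil {
		fmt.Println(err)
	}
}
```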

however for the kubelet flag removal and instance-specific CC problem, the kubeadm 1.x and kubelet 1.x-1 skew might bite people that are passing flags to a kubelet that no longer supports flags at all.
i can see us removing kubeletExtraArgs from the kubeadm API at some point. but that's a separate topic.

@xlgao-zju

Tested systemctl enable kubelet.service on a service that is already enabled, and got exit code 0.

I think @rosti 's plan is fine, and I'd like to help with this.

@xlgao-zju

/assign

@neolit123
Member Author

neolit123 commented Aug 10, 2020

note, if 1.20 is a "stabilization release" we should not be making this change.
i also think that a KEP is appropriate as it requires changes in both the packages (and in the krel tool to branch?) and kubeadm.

@neolit123
Member Author

/sig release cluster-lifecycle

@k8s-ci-robot k8s-ci-robot added the sig/release and sig/cluster-lifecycle labels Aug 10, 2020
@xlgao-zju

@neolit123

part one may not need a KEP:

modify kubeadm to always enable the service if its not enabled on kubeadm init/join runtime.

part two may need a KEP:

modify the kubelet systemd service files in the kubernetes/release repository to have the kubelet service disabled by default.

@neolit123
Member Author

yet, we can avoid doing part one if part two never happens, so documenting the proposed change in one place feels right.

@xlgao-zju

@neolit123 yes, you are right. so, where should we file the KEP?

@neolit123
Member Author

neolit123 commented Aug 10, 2020

the KEP process:
https://github.com/kubernetes/enhancements/tree/master/keps#kubernetes-enhancement-proposals-keps

it should be either here (A):
https://github.com/kubernetes/enhancements/tree/master/keps/sig-cluster-lifecycle/kubeadm
or here (B):
https://github.com/kubernetes/enhancements/tree/master/keps/sig-release

my preference is for A, but let's hold until we decide if we are making this change in 1.20:

note, if 1.20 is a "stabilization release" we should not be making this change.

@BenTheElder
Member

BenTheElder commented Aug 11, 2020

i can see us removing kubeletExtraArgs from the kubeadm API at some point. but that's a separate topic.

quick note that we cannot do that while the kubelet config is not respected in join.

+1 for #2178 (comment)
specifically 2.) kubelet enabling should happen after writing the file(s) / config, so as to avoid any crash loops.

this will also have the benefit of allowing CI tooling to look for panics in the logs without creating loads of noise. (something that would be pointless for us to add today, due to kubeadm)

@BenTheElder
Member

Let me know if I can help. I would like to eliminate the crashlooping in CI, and I think this will avoid a lot of confused users.

@xlgao-zju

@BenTheElder Since we will not change the kubernetes/release repository to have the kubelet service disabled by default for now, I think we can stop the kubelet when we do not have the kubelet config, and start the kubelet once we get the kubelet config?

@neolit123
Member Author

@xlgao-zju looks like 1.20 is a regular release so we can proceed with the KEP if you have the time:

my preference is for A, but let's hold until we decide if we are making this change in 1.20:

note, if 1.20 is a "stabilization release" we should not be making this change.

let me know if you have questions.
we can skip some of the details in the KEP, but IMO we need to focus more on how breaking this change can be.

also feedback from the release-eng team will be required.

@neolit123 neolit123 removed the sig/cluster-lifecycle label Sep 3, 2020
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label May 10, 2021
@BenTheElder
Member

so FWIW this is really easy to fix and works great so far. the issue is that the Kubernetes packaging and systemd spec sources are not ideal (there are various other issues referencing this), so it's not possible to roll this out only to future releases, and it's technically a breaking change.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Oct 3, 2021
@neolit123 neolit123 removed the lifecycle/stale label Oct 21, 2021
@neolit123 neolit123 modified the milestones: v1.23, v1.24 Nov 23, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Feb 21, 2022
@BenTheElder
Member

/remove-lifecycle stale
IMHO this is a good candidate for frozen, it's blocked on the packaging situation upstream improving, it's a fairly trivial change once upstream packaging lets us do per-kubernetes-version packaging (!) ... someday

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label Feb 22, 2022
@neolit123
Member Author

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen label Feb 22, 2022
@neolit123 neolit123 modified the milestones: v1.24, v1.25 Mar 29, 2022
@neolit123 neolit123 modified the milestones: v1.25, Next May 11, 2022
@neolit123 neolit123 modified the milestones: Next, v1.31 Mar 10, 2024
@neolit123
Member Author

modify the kubelet systemd service files in the kubernetes/release repository to have the kubelet service disabled by default. this change will require a release note with "action-required" as non-kubeadm users would have to manually enable it (e.g. using: "systemctl enable kubelet").

now that we have new packages this change is easier to do.
but according to this discussion the kubelet service is already disabled by default:
kubernetes/website#45489 (comment)
maybe something changed around the krel migration in k/release.

modify kubeadm to always enable the service if its not enabled on kubeadm init/join runtime.
note that, currently kubeadm just has a preflight check that fails if the service is not enabled and instructs the user how to enable it manually.

this can be done for 1.31:

  • init/join enable the kubelet service
  • reset disables it
