
change the kubelet service crash loop behavior #2178

Open
neolit123 opened this issue Jun 10, 2020 · 28 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/release: Categorizes an issue or PR as relevant to SIG Release.
Comments

@neolit123
Member

neolit123 commented Jun 10, 2020

over time we have seen a number of complaints related to the crash loop of the kubelet service in the DEB/RPM packages. when the kubelet is installed, the service is enabled, but it fails because it's missing its config.yaml (KubeletConfiguration), unless something like kubeadm creates one for it.

this has caused problems for:

  • Windows support
  • supporting other service managers like OpenRC

after a discussion during the kubeadm office hours on 10.06.2020 we agreed that it might be a good idea to change this behavior and keep the service disabled by default. but this would require changes in both kubeadm and the kubelet systemd specs.

the idea we are thinking of is the following:

  • modify kubeadm to always enable the service if it's not enabled at kubeadm init/join time (see the sketch after this list).
    note that currently kubeadm just has a preflight check that fails if the service is not enabled and instructs the user how to enable it manually.
  • modify the kubelet systemd service files in the kubernetes/release repository to have the kubelet service disabled by default. this change will require a release note with "action-required", as non-kubeadm users would have to manually enable it (e.g. using: "systemctl enable kubelet").
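
a minimal sketch of what the first bullet could look like on a systemd host (hypothetical helper names, shelling out to systemctl; this is not the actual kubeadm implementation):

```go
// Hypothetical sketch only: enable the kubelet service during init/join if it
// is not already enabled, instead of failing a preflight check. Assumes a
// systemd host.
package main

import (
	"fmt"
	"os/exec"
)

// kubeletServiceEnabled reports whether the kubelet unit is enabled.
// "systemctl is-enabled" exits non-zero when the unit is disabled.
func kubeletServiceEnabled() bool {
	return exec.Command("systemctl", "is-enabled", "kubelet").Run() == nil
}

// ensureKubeletServiceEnabled enables the unit only when needed, so that
// init/join no longer has to abort and ask the user to enable it manually.
func ensureKubeletServiceEnabled() error {
	if kubeletServiceEnabled() {
		return nil
	}
	if out, err := exec.Command("systemctl", "enable", "kubelet").CombinedOutput(); err != nil {
		return fmt.Errorf("failed to enable the kubelet service: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := ensureKubeletServiceEnabled(); err != nil {
		fmt.Println(err)
	}
}
```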

/kind feature
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the kind/feature and priority/important-longterm labels Jun 10, 2020
@neolit123
Member Author

the kubeadm change is doable for 1.19, but arguably the k/release change needs wider agreement.

@neolit123 neolit123 added this to the v1.19 milestone Jun 10, 2020
@neolit123
Member Author

xref kubernetes/release#1352

@neolit123
Member Author

cc @rosti
so i think there is a limitation of our release process as i mentioned here:
kubernetes/release#1352

which means that the PR might be much better than the above proposal.

@BenTheElder
Member

@neolit123
Member Author

neolit123 commented Jun 10, 2020

it only starts and restarts the service but does not manage the "enable" status.
it does tell the user how to enable it, though.

@rosti

rosti commented Jun 11, 2020

The necessary modifications here are:

  1. Remove the kubelet service running check (from here)
  2. Enable the service during init and join. One way is to add the enabling code where the TryStartKubelet calls are. A safer and somewhat backwards-compatible alternative is to add new kubelet enable phases to init and join that would permanently enable the kubelet service only after the rest of the init/join process has completed successfully.
  3. Disable the kubelet service during kubeadm reset. Again, this can sit next to the TryStopKubelet call or live in a separate reset phase.

As said previously, it doesn't hurt for us to implement this. It's not that big of a change on our side. The problem is what to do with the packages, where the kubelet service continues to be automatically enabled and crash looping (since it doesn't have a config file).
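
A rough sketch of how steps 2 and 3 could look on a systemd host, assuming we shell out to systemctl; the phase function names are hypothetical and only illustrate the ordering (enable and start only after the kubelet config has been written, stop and disable on reset):

```go
// Hypothetical sketch of the enable/disable ordering; not actual kubeadm code.
package main

import (
	"fmt"
	"os/exec"
)

func systemctl(args ...string) error {
	if out, err := exec.Command("systemctl", args...).CombinedOutput(); err != nil {
		return fmt.Errorf("systemctl %v failed: %v: %s", args, err, out)
	}
	return nil
}

// runKubeletEnablePhase would run at the end of init/join, after the
// KubeletConfiguration and the other kubelet files have been written,
// so the service never starts without its config (step 2).
func runKubeletEnablePhase() error {
	if err := systemctl("enable", "kubelet"); err != nil {
		return err
	}
	return systemctl("restart", "kubelet")
}

// runResetKubeletPhase would run during "kubeadm reset", next to the
// existing TryStopKubelet call (step 3).
func runResetKubeletPhase() error {
	if err := systemctl("stop", "kubelet"); err != nil {
		return err
	}
	return systemctl("disable", "kubelet")
}

func main() {
	// In kubeadm these would be wired in as init/join and reset phases.
	if err := runKubeletEnablePhase(); err != nil {
		fmt.Println(err)
	}
	_ = runResetKubeletPhase
}
```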

@neolit123
Member Author

your proposed plan for kubeadm seems fine to me.

The problem is what to do with the packages where the kubelet service continues to be automatically enabled and crash looping (since it doesn't have a config file).

so i think ideally we should be making this change only for the latest MINOR.
but in kubernetes/release#1352 we are also discussing the fact that currently such changes are applied to all PATCH releases of the k8s packages.

currently the kubeadm package has the kubelet package as a dependency (there were discussions to change this too), which supposedly installs the same version of both packages for most users.

there could be users that are bypassing that and installing e.g. kubeadm 1.x and kubelet 1.x-1, due to factor X, and this is a supported skew. for such a skew the old kubelet service may be enabled by default (crashloop) but the new kubeadm could be managing "enable" already.

testing something like systemctl enable kubelet on a service that is already enabled doesn't produce an exit status != 0; for windows, Set-Service ... -StartupType .. should not return errors. no idea about OpenRC. in any case we may have to call the potential InitSystem#ServiceEnable() without catching its errors, but later fail on InitSystem#ServiceStart() (see the sketch below).

so overall even if a kubeadm binary encounters an older kubelet service, for the crashloop problem in particular this should be fine, unless i'm missing something.
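
a small sketch of the "best effort enable, strict start" idea above; the InitSystem interface only mirrors kubeadm's init system abstraction for illustration, and ServiceEnable is the potential method mentioned here, not something guaranteed to exist today:

```go
// Hypothetical sketch: ignore a failed enable (already-enabled services exit 0
// on systemd and should not error on Windows), but fail hard if the kubelet
// cannot be started.
package main

import "fmt"

// InitSystem mirrors the shape of kubeadm's init system abstraction for the
// purpose of this sketch; ServiceEnable is the potential new method.
type InitSystem interface {
	ServiceEnable(service string) error
	ServiceStart(service string) error
}

func enableAndStartKubelet(initSystem InitSystem) error {
	if err := initSystem.ServiceEnable("kubelet"); err != nil {
		// best effort: warn and continue rather than aborting init/join
		fmt.Printf("warning: could not enable the kubelet service: %v\n", err)
	}
	return initSystem.ServiceStart("kubelet")
}

// fakeInitSystem is a stand-in so the sketch compiles and runs on its own.
type fakeInitSystem struct{}

func (fakeInitSystem) ServiceEnable(string) error { return nil }
func (fakeInitSystem) ServiceStart(string) error  { return nil }

func main() {
	if err := enableAndStartKubelet(fakeInitSystem{}); err != nil {
		fmt.Println(err)
	}
}
```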

however for the kubelet flag removal and instance-specific CC problem, the kubeadm 1.x and kubelet 1.x-1 skew might bite people that are passing flags to a kubelet that no longer supports flags at all.
i can see us removing kubeletExtraArgs from the kubeadm API at some point. but that's a separate topic.

@xlgao-zju

Tested systemctl enable kubelet.service on a service that is already enabled, and got exit code 0.

I think @rosti 's plan is fine, and I'd like to help with this.

@xlgao-zju

/assign

@neolit123
Member Author

neolit123 commented Aug 10, 2020

note, if 1.20 is a "stabilization release" we should not be making this change.
i also think that a KEP is appropriate as it requires changes in both the packages (and in the krel tool to branch?) and kubeadm.

@neolit123
Member Author

/sig release cluster-lifecycle

@k8s-ci-robot k8s-ci-robot added the sig/release and sig/cluster-lifecycle labels Aug 10, 2020
@xlgao-zju

@neolit123

part one may not need a KEP:

modify kubeadm to always enable the service if its not enabled on kubeadm init/join runtime.

part two may need a KEP:

modify the kubelet systemd service files in the kubernetes/release repository to have the kubelet service disabled by default.

@neolit123
Member Author

yet, we can avoid doing part one if part two never happens, so documenting the proposed change in one place feels right.

@xlgao-zju

@neolit123 yes, you are right. so, where should we file the KEP?

@neolit123
Member Author

neolit123 commented Aug 10, 2020

the KEP process:
https://github.com/kubernetes/enhancements/tree/master/keps#kubernetes-enhancement-proposals-keps

it should be either here (A):
https://github.com/kubernetes/enhancements/tree/master/keps/sig-cluster-lifecycle/kubeadm
or here (B):
https://github.com/kubernetes/enhancements/tree/master/keps/sig-release

my preference is for A, but let's hold until we decide if we are making this change in 1.20:

note, if 1.20 is a "stabilization release" we should not be making this change.

@BenTheElder
Member

BenTheElder commented Aug 11, 2020

i can see us removing kubeletExtraArgs from the kubeadm API at some point. but that's a separate topic.

quick note that we cannot do that while the kubelet config is not respected in join.

+1 for #2178 (comment)
specifically 2.) kubelet enabling should happen after writing the file(s) / config, so as to avoid any crash loops.

this will also have the benefit of allowing CI tooling to look for panics in the logs without creating loads of noise. (something that would be pointless for us to add today, due to kubeadm)

@BenTheElder
Member

Let me know if I can help. I would like to eliminate the crashlooping in CI, and I think this will avoid a lot of confused users.

@xlgao-zju

@BenTheElder Since we will not change the kubernetes/release repository to have the kubelet service disabled by default for now, I think we can stop the kubelet when we do not have the kubelet config, and start the kubelet once we get the kubelet config?

@neolit123
Member Author

@xlgao-zju looks like 1.20 is a regular release so we can proceed with the KEP if you have the time:

my preference is for A, but let's hold until we decide if we are making this change in 1.20:

note, if 1.20 is a "stabilization release" we should not be making this change.

let me know if you have questions.
we can skip some of the details in the KEP, but IMO we need to focus more on how breaking this change can be.

also feedback from the release-eng team will be required.

@neolit123 neolit123 removed the sig/cluster-lifecycle label Sep 3, 2020
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label May 10, 2021
@BenTheElder
Member

so FWIW this is really easy to fix and works great so far. the issue is that the Kubernetes packaging and systemd spec sources are not ideal (there are various other issues referencing this), so it's not possible to roll this out only to future releases, and it's technically a breaking change.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Oct 3, 2021
@neolit123 neolit123 removed the lifecycle/stale label Oct 21, 2021
@neolit123 neolit123 modified the milestones: v1.23, v1.24 Nov 23, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Feb 21, 2022
@BenTheElder
Member

/remove-lifecycle stale
IMHO this is a good candidate for frozen, it's blocked on the packaging situation upstream improving, it's a fairly trivial change once upstream packaging lets us do per-kubernetes-version packaging (!) ... someday

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label Feb 22, 2022
@neolit123
Member Author

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen label Feb 22, 2022
@neolit123 neolit123 modified the milestones: v1.24, v1.25 Mar 29, 2022
@neolit123 neolit123 modified the milestones: v1.25, Next May 11, 2022
@neolit123 neolit123 modified the milestones: Next, v1.31 Mar 10, 2024
@neolit123
Member Author

modify the kubelet systemd service files in the kubernetes/release repository to have the kubelet service disabled by default. this change will require a release note with "action-required" as non-kubeadm users would have to manually enable it (e.g. using: "systemctl enable kubelet").

now that we have new packages this change is easier to do.
but according to this discussion the kubelet service is already disabled by default:
kubernetes/website#45489 (comment)
maybe something changed around the krel migration in k/release.

modify kubeadm to always enable the service if its not enabled on kubeadm init/join runtime.
note that, currently kubeadm just has a preflight check that fails if the service is not enabled and instructs the user how to enable it manually.

this can be done for 1.31:

  • init/join enable the kubelet service
  • reset disables it
