From 9cacda92fe1fcc52a88202e818f16b9d00862d9c Mon Sep 17 00:00:00 2001
From: Gernot Feichter
Date: Fri, 12 Jul 2024 10:29:54 +0200
Subject: [PATCH] feat: autorecover from stuck situations

Signed-off-by: Gernot Feichter
---
 hips/hip-9999.md | 102 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 102 insertions(+)
 create mode 100644 hips/hip-9999.md

diff --git a/hips/hip-9999.md b/hips/hip-9999.md
new file mode 100644
index 00000000..9603d7bf
--- /dev/null
+++ b/hips/hip-9999.md
@@ -0,0 +1,102 @@
+---
+hip: 9999
+title: "Autorecover from stuck situations"
+authors: [ "Gernot Feichter " ]
+created: "2024-07-12"
+type: "feature"
+status: "draft"
+helm-version: 3
+---
+
+## Abstract
+
+This proposal lets helm auto-recover from stuck deployments, for both manual users and CI/CD pipelines.
+Such recovery is currently not possible unless users implement
+boilerplate code around their helm invocations.
+
+## Motivation
+
+If a helm deployment fails, I want to be able to retry it,
+ideally by running the same command again to keep things simple.
+
+There are two known situations in which a retry will NOT work:
+1. A helm upgrade/install process is killed while the release is in state `PENDING-UPGRADE` or `PENDING-INSTALL`.
+2. The initial helm release installation (as performed via `helm upgrade --install`) is in state `FAILED`.
+
+Known workarounds for recovering from such situations, which should become OBSOLETE once this HIP is implemented:
+1. `kubectl delete secret ''.` (Not possible if you don't want to lose all history)
+2. `helm delete` your release. (Not possible if you don't want to lose all history)
+3. `helm rollback` your release. (Not possible if it is the first installation)
+
+## Rationale
+
+The proposed solution uses a lock that is stored in k8s, so that every client knows whether a helm
+release is locked by itself or by another client, and for how long the lock is valid.
+
+It reuses existing helm parameters such as `--timeout` (which defaults to 5m) to determine how long a helm release
+may be stuck in a pending state.
+
+## Specification
+
+The `--timeout` parameter gets a deeper meaning.
+Previously, the `--timeout` parameter only had an effect on the helm process running on the respective client.
+After implementation, the `--timeout` value will be stored in the helm release object (secret) in k8s and
+have an indirect impact on possible parallel processes.
+
+`helm ls -a` shows two new columns; regular `helm ls` does NOT show them:
+- LOCKED TILL
+  calculated by the helm client: k8s server time + timeout parameter value
+- SESSION ID
+  Unique, random session id generated by the client
+
+Furthermore, if the helm client process is terminated (SIGTERM), it tries to clear the LOCKED TILL value and the
+SESSION ID and to set the release to a failed state before exiting, in order to free the lock.
+
+## Backwards compatibility
+
+It is assumed that the helm release object as stored in k8s will not break
+older clients if new fields are added while existing fields are untouched.
+
+Backwards compatibility will be tested during implementation!
+
+## Security implications
+
+The proposed solution should not have an impact on security.
+
+## How to teach this
+
+Since the way that helm is invoked is not altered, there will not be much to teach here.
+The usage of the timeout parameter is encouraged, but since the default timeout is already 5m, not even that
+needs to be encouraged.
+
+It should just reduce the amount of frustration when dealing with pending and failed helm releases.
+
+A retry of a failed command should just work (assuming the retry happens when no other client has a valid lock).
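
The retry rule above, together with the LOCKED TILL and SESSION ID fields from the Specification, could be sketched roughly as follows. This is a minimal illustration in Python, not the reference implementation: `ReleaseLock`, `acquire`, and `may_proceed` are hypothetical names, and treating a lock as expired at exactly LOCKED TILL is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional
import uuid


@dataclass(frozen=True)
class ReleaseLock:
    """Hypothetical view of the two lock fields stored with a release."""
    locked_till: Optional[datetime]  # LOCKED TILL; None means no lock is held
    session_id: Optional[str]        # SESSION ID of the client holding the lock


def acquire(server_now: datetime, timeout: timedelta) -> ReleaseLock:
    """Create a lock: LOCKED TILL = k8s server time + timeout value."""
    return ReleaseLock(locked_till=server_now + timeout, session_id=uuid.uuid4().hex)


def may_proceed(lock: ReleaseLock, server_now: datetime, session_id: str) -> bool:
    """Return True if a client with `session_id` may operate on the release."""
    if lock.locked_till is None or lock.session_id is None:
        return True  # lock was never set, or was cleared (e.g. on SIGTERM)
    if lock.session_id == session_id:
        return True  # the caller holds the lock itself
    # Assumption: another client's lock counts as expired at LOCKED TILL.
    return server_now >= lock.locked_till


if __name__ == "__main__":
    now = datetime(2024, 7, 12, 10, 0, tzinfo=timezone.utc)
    lock = acquire(now, timedelta(minutes=5))  # 5m is the default --timeout
    other = uuid.uuid4().hex                   # a second, concurrent client
    print(may_proceed(lock, now + timedelta(minutes=1), other))  # lock still valid
    print(may_proceed(lock, now + timedelta(minutes=6), other))  # lock expired
```

Under these assumptions, a concurrent client is blocked only while a foreign lock is still valid, so a release whose client was killed without cleaning up becomes retryable once the stored timeout has elapsed.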
+
+## Reference implementation
+
+helm: https://github.com/gerrnot/helm/tree/feat/autorecover-from-stuck-situations
+
+acceptance-testing: https://github.com/gerrnot/acceptance-testing/tree/feat/autorecover-from-stuck-situations
+
+## Rejected ideas
+
+None
+
+## Open issues
+
+[] HIP status `accepted`
+
+[x] Reference implementation
+
+[x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts)
+
+[] Test for kill scenario (forever stuck in pending)
+
+[] Backwards compatibility check (looking good already)
+
+## References
+
+https://github.com/helm/helm/issues/7476
+https://github.com/rancher/rancher/issues/44530
+https://github.com/helm/helm/issues/11863