feat: autorecover from stuck situations
gerrnot committed Jul 12, 2024
1 parent 36ac1d9 commit d9870d6
Showing 1 changed file with 96 additions and 0 deletions.
96 changes: 96 additions & 0 deletions hips/hip-9999.md
@@ -0,0 +1,96 @@
---
hip: 9999
title: "Autorecover from stuck situations"
authors: [ "Gernot Feichter <[email protected]>" ]
created: "2024-07-12"
type: "feature"
status: "draft"
helm-version: 3
---

## Abstract

This HIP aims to simplify release handling for both manual users and CI/CD pipelines
by letting helm auto-recover from stuck deployments. Today this is not possible unless users wrap
their helm invocations in boilerplate code.

## Motivation

If a helm deployment fails, I want to be able to retry it,
ideally by running the same command again to keep things simple.

There are two known situations in which a retry will NOT work:
1. A helm upgrade/install process is killed while the release is in state `pending-upgrade` or `pending-install`.
2. The initial helm release installation (as performed via `helm upgrade --install`) is in state `failed`.

Known workarounds for these situations, which should become OBSOLETE once this HIP is implemented:
1. `kubectl delete secret '<the name of the secret where helm stores release information>'` (Not possible if you don't want to lose all history)
2. `helm delete` your release. (Not possible if you don't want to lose all history)
3. `helm rollback` your release. (Not possible if it is the first installation)

## Rationale

The proposed solution uses a locking mechanism that is stored in k8s, so that every client can tell whether a helm
release is locked, whether it holds that lock itself, and for how long the lock is valid.

It reuses existing helm parameters such as `--timeout` (which defaults to 5m) to determine how long a helm release
may stay stuck in a pending state.

## Specification

The `--timeout` parameter takes on a broader meaning.
Previously, the `--timeout` parameter only affected the helm process running on the respective client.
After implementation, the `--timeout` value will be stored in the helm release object (secret) in k8s and
will therefore indirectly affect other helm processes that run in parallel.

`helm ls -a` shows two new columns; regular `helm ls` does NOT show them:
- LOCKED TILL: `<datetime>`, calculated by the helm client as k8s server time + timeout parameter value
- SESSION ID: unique, random session ID generated by the client

Furthermore, if the helm client process is terminated (SIGTERM), it tries to clear the LOCKED TILL and
SESSION ID values and to set the release into a failed state before exiting, in order to free the lock.

## Backwards compatibility

It is assumed that adding new fields to the helm release object as stored in k8s will not break
older clients, as long as existing fields are left untouched.

Backwards compatibility will be tested during implementation!
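
The assumption can be illustrated with Go's `encoding/json`, which ignores unknown object keys when decoding into a struct unless `DisallowUnknownFields` is enabled. The sketch below assumes that older clients decode the release record this way; the `oldRelease` type and the JSON keys `lockedTill` and `sessionID` are hypothetical and not taken from helm's actual storage format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldRelease mimics how an older client might model a release:
// it has no knowledge of the new lock fields.
type oldRelease struct {
	Name    string `json:"name"`
	Version int    `json:"version"`
}

// decodeOld decodes a stored release the way the hypothetical old client would.
func decodeOld(data []byte) (oldRelease, error) {
	var r oldRelease
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	// Payload as a newer client might write it; the two lock keys are hypothetical.
	payload := []byte(`{"name":"demo","version":3,"lockedTill":"2024-07-12T12:05:00Z","sessionID":"abc"}`)

	r, err := decodeOld(payload)
	fmt.Println(r.Name, r.Version, err) // prints: demo 3 <nil>
}
```

This only shows that decoding succeeds; whether an older client preserves the unknown fields when it writes the release back is a separate question for the compatibility check.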

## Security implications

The proposed solution should not have an impact on security.

## How to teach this

Since the way helm is invoked does not change, there is little to teach here.
Using the timeout parameter is encouraged, but since the default timeout is already 5m, even that
is optional.

The change should simply reduce frustration when dealing with pending and failed helm releases.

A retry of a failed command should just work (assuming the retry happens when no other client holds a valid lock).

## Reference implementation

TODO

## Rejected ideas

None

## Open issues

- [ ] HIP status `accepted`
- [ ] Reference implementation
- [ ] Backwards compatibility check

## References

- https://github.com/helm/helm/issues/7476
- https://github.com/rancher/rancher/issues/44530
- https://github.com/helm/helm/issues/11863
