H4HIP: Wait with kstatus #374

AustinAbro321 · 2024-12-12T17:29:27Z

Proposal to replace the current wait logic in Helm with kstatus.

Signed-off-by: Austin Abro <[email protected]>

gjenkins8

Thanks for the HIP! I have been wanting to write this one myself for some time. I agree, kstatus is where the Kubernetes community has put significant effort into thinking about Kubernetes resource "readiness". And Helm would do well to reuse this effort.

I have left some comments. They mostly center on which behavior changes (if any) users would notice compared with the existing mechanism, and how to mitigate or manage them.

gjenkins8 · 2024-12-14T01:36:38Z

hips/hip-0999.md

+
+<!-- TODO: Decide if we want more than alphabetical ordering, such as - The APIVersion/Kind of the resource will determine its priority for being logged. For example, the first log messages will always describe deployments. All deployments will be logged first. Once all deployments are in ready status, all stateful sets will be logged, and so forth.  -->
+
+## Backwards compatibility


Curiosity: will kstatus require more RBAC rules than the existing watch/ready mechanism?

Great question! I made this repo to test it out - https://github.com/AustinAbro321/kstatus-rbac-test. It looks to be pretty minimal. In my case, I tested a deployment, and only these RBAC permissions were necessary. I will add this to the doc.

    rules:
    - apiGroups: ["apps"]
      resources: ["deployments"]
      verbs: ["list", "watch"]
    - apiGroups: ["apps"]
      resources: ["replicasets"]
      verbs: ["list"]

What really surprised me was that events weren't necessary. I thought for sure they would be.

I added a section on this under Backwards compatibility. Let me know your thoughts, or if you want a deeper evaluation.

core/v1.Events and events/v1.Event are different from watch events. The former are a regular k8s resource providing information about specific occurrences, whereas the latter are strictly tied to watch API and inform about the type of event that happened (addition, modification, deletion) to watched resource.
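To make the distinction concrete: a watch-based waiter consumes only watch notifications (ADDED/MODIFIED/DELETED, each carrying the object's latest state), which is consistent with the observation above that no access to Event objects was needed. The sketch below is a stdlib-only simulation; `EventType` mirrors the string values of watch event types in `k8s.io/apimachinery/pkg/watch`, but `WatchEvent`, `applyEvents`, and `allReady` are hypothetical stand-ins, not the client-go API.

```go
package main

import "fmt"

// EventType mirrors the string values of Kubernetes watch event types.
// Redefined here so the example stays self-contained.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// WatchEvent pairs a watch event type with the latest readiness of one object.
// Hypothetical: a real watch event carries the full object, not a bool.
type WatchEvent struct {
	Type  EventType
	Name  string
	Ready bool
}

// applyEvents folds a stream of watch events into a cache that holds the
// most recent readiness of each object still present in the cluster.
func applyEvents(cache map[string]bool, events []WatchEvent) {
	for _, e := range events {
		switch e.Type {
		case Added, Modified:
			cache[e.Name] = e.Ready
		case Deleted:
			delete(cache, e.Name)
		}
	}
}

// allReady reports whether every cached object is ready.
func allReady(cache map[string]bool) bool {
	for _, ready := range cache {
		if !ready {
			return false
		}
	}
	return true
}

func main() {
	cache := map[string]bool{}
	applyEvents(cache, []WatchEvent{
		{Type: Added, Name: "deployment/web", Ready: false},
		{Type: Modified, Name: "deployment/web", Ready: true},
	})
	fmt.Println("all ready:", allReady(cache))
}
```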

The RBAC provided above looks reasonable. I'd assume the current wait mechanism uses only list, so watch would be the only required expansion. Going back to my other comment, we can transparently use watch where we have access and, when the RBAC rules are missing, print a warning and fall back to polling.
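A minimal sketch of the watch-first, poll-fallback behavior suggested above. Every name here (`StatusWaiter`, `watchWaiter`, `pollWaiter`, `fallbackWaiter`, `ErrForbidden`) is hypothetical; a real implementation would detect the missing permission from the API server's forbidden response (or a SelfSubjectAccessReview) rather than from a boolean field.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrForbidden stands in for an RBAC "forbidden" error from the API server.
// Hypothetical: real code would inspect the Kubernetes API error instead.
var ErrForbidden = errors.New("forbidden")

// StatusWaiter is a hypothetical abstraction over "wait until ready".
type StatusWaiter interface {
	Wait(resources []string) error
}

// watchWaiter simulates a watch-based waiter; canWatch models whether RBAC
// grants the "watch" verb.
type watchWaiter struct{ canWatch bool }

func (w watchWaiter) Wait(resources []string) error {
	if !w.canWatch {
		return ErrForbidden
	}
	return nil // a real implementation would watch until all resources are ready
}

// pollWaiter simulates a waiter that needs only "list"/"get".
type pollWaiter struct{}

func (pollWaiter) Wait(resources []string) error {
	return nil // a real implementation would poll until all resources are ready
}

// fallbackWaiter tries primary first; only when primary is forbidden does it
// warn and retry with fallback. Other errors are returned unchanged.
type fallbackWaiter struct {
	primary, fallback StatusWaiter
	warn              func(msg string)
}

func (f fallbackWaiter) Wait(resources []string) error {
	err := f.primary.Wait(resources)
	if errors.Is(err, ErrForbidden) {
		f.warn("missing RBAC permission to watch; falling back to polling")
		return f.fallback.Wait(resources)
	}
	return err
}

func main() {
	w := fallbackWaiter{
		primary:  watchWaiter{canWatch: false},
		fallback: pollWaiter{},
		warn:     func(msg string) { log.Println("WARNING:", msg) },
	}
	fmt.Println("wait error:", w.Wait([]string{"apps/v1/Deployment/web"}))
}
```

Keeping the fallback decision inside one wrapper means callers never choose a strategy themselves, which matches the point that users should not need to know how waiting works internally.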

gjenkins8 · 2024-12-14T01:37:06Z

hips/hip-0999.md

+
+<!-- TODO: Decide if we want more than alphabetical ordering, such as - The APIVersion/Kind of the resource will determine its priority for being logged. For example, the first log messages will always describe deployments. All deployments will be logged first. Once all deployments are in ready status, all stateful sets will be logged, and so forth.  -->
+
+## Backwards compatibility


Is there any situation where kstatus will not return ready, but existing logic would?

Besides the two cases called out here (kstatus waits until reconciliation is complete before returning ready, and it waits for custom resources), I can't think of any, but I am not 100% sure.

gjenkins8 · 2024-12-14T01:52:42Z

hips/hip-0999.md

+
+## Backwards compatibility
+
+Waiting for custom resources, and for reconciliation to complete for every resource, could cause charts that previously succeeded to time out.


I'm wondering if we want an opt-in (or opt-out) mechanism for charts to declare that they are compatible with the new readiness logic, at least initially. And/or a CLI flag for users to control the behavior?

While one of the premises of Helm 4 is that we want to move Helm functionality forward, we also need to remain compatible with existing user workflows as much as possible. So while it would certainly be okay to introduce new wait functionality, I think we would want a path for users to fall back to the old functionality if their situation warrants it, or for a chart to opt in to the new functionality if the chart author deems the chart compatible.

What we should do, IMHO, depends on how much we think kstatus is a drop-in replacement for the existing wait functionality (i.e. whether kstatus should become the default in Helm 4), and on whether it would be better for existing charts to opt in to the new functionality or for chart users to be able to opt out if they need to.

I will leave the final call to you. I suspect kstatus will be a drop-in replacement, but I'm not sure whether it will work 90%, 99%, or 99.9% of the time with existing deployments. I think it's most likely closer to the higher percentages, but I would love a way to test that and gain additional confidence.

My confidence so far comes from Zarf: we changed the logic so that kstatus runs by default for every chart that does not explicitly turn wait off. We did not expose a way to turn off kstatus separately, and I have not heard any users complain or report problems.

Helm seems to have a feature gate capability. I'd imagine we can start by shipping kstatus as an experimental feature, which would allow interested users to switch to the new logic, and then roll out the change gradually over several releases until kstatus becomes the new default.

gjenkins8 · 2024-12-14T02:00:22Z

hips/hip-0999.md

+
+## Motivation
+
+Certain workflows require custom resources to be ready. There is no way to tell Helm to wait for custom resources to be ready, so anyone that has this requirement must write their own logic to wait for their custom resources.


comment: I agree, this is something Helm needs to address in the future. IMHO custom resources are becoming more prevalent, e.g. as the Kubernetes community tries to keep fewer types "in-core" while still offering official ones (such as the Gateway API), or simply as folks extend Kubernetes APIs for the purpose at hand.
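kstatus can compute a status for arbitrary resources, including custom resources, by relying on generic conventions such as `status.observedGeneration` and the `Reconciling`/`Stalled` conditions. The following is a simplified, stdlib-only approximation of those conventions, written to show the idea; it is not kstatus's code, and the real library handles more cases (kind-specific rules for built-in types, terminating and missing resources, and so on). All types and function names here are hypothetical.

```go
package main

import "fmt"

// Condition is a simplified status condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// CustomResource holds only the fields this sketch inspects.
type CustomResource struct {
	Generation         int64  // metadata.generation
	ObservedGeneration *int64 // status.observedGeneration; nil if not reported
	Conditions         []Condition
}

// hasCondition reports whether a condition of the given type has the given status.
func hasCondition(conds []Condition, condType, status string) bool {
	for _, c := range conds {
		if c.Type == condType && c.Status == status {
			return true
		}
	}
	return false
}

// computeStatus returns "InProgress", "Failed", or "Current".
func computeStatus(cr CustomResource) string {
	// The controller has not yet processed the latest spec.
	if cr.ObservedGeneration != nil && *cr.ObservedGeneration < cr.Generation {
		return "InProgress"
	}
	if hasCondition(cr.Conditions, "Reconciling", "True") {
		return "InProgress"
	}
	if hasCondition(cr.Conditions, "Stalled", "True") {
		return "Failed"
	}
	return "Current"
}

func main() {
	observed := int64(1)
	cr := CustomResource{Generation: 2, ObservedGeneration: &observed}
	fmt.Println(computeStatus(cr)) // InProgress
}
```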

gjenkins8 · 2024-12-14T02:01:25Z

hips/hip-0999.md

+
+Certain workflows require custom resources to be ready. There is no way to tell Helm to wait for custom resources to be ready, so anyone that has this requirement must write their own logic to wait for their custom resources.
+
+Certain workflows require resources to be fully reconciled. For example, Helm waits for all new pods in an upgraded deployment to be ready. However, Helm does not wait for the previous pods in that deployment to be removed.


comment: not exactly sure how this fits as a motivation? I think it is trying to say Helm doesn't currently / correctly handle this situation, but kstatus would?

Yeah kstatus handles that situation, I will add that.
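The difference comes from comparing Deployment status counters against `spec.replicas`. When `status.replicas` is higher than `spec.replicas`, pods from the previous ReplicaSet are still counted, so the rollout is not fully reconciled even if every new pod is ready. A simplified, stdlib-only sketch in the spirit of kstatus's Deployment rules; the struct and function are hypothetical, and the real library checks more (for example the `Progressing` condition).

```go
package main

import "fmt"

// DeploymentCounts mirrors a few apps/v1 Deployment fields.
type DeploymentCounts struct {
	Generation         int64 // metadata.generation
	ObservedGeneration int64 // status.observedGeneration
	SpecReplicas       int32 // spec.replicas
	Replicas           int32 // status.replicas: non-terminated pods across old and new ReplicaSets
	UpdatedReplicas    int32 // status.updatedReplicas: pods from the current template
	AvailableReplicas  int32 // status.availableReplicas
}

// deploymentReady reports whether the rollout is fully reconciled, with a reason.
func deploymentReady(d DeploymentCounts) (bool, string) {
	switch {
	case d.ObservedGeneration < d.Generation:
		return false, "latest generation not yet observed"
	case d.UpdatedReplicas < d.SpecReplicas:
		return false, fmt.Sprintf("updated: %d/%d", d.UpdatedReplicas, d.SpecReplicas)
	case d.Replicas > d.SpecReplicas:
		return false, fmt.Sprintf("pending termination: %d", d.Replicas-d.SpecReplicas)
	case d.AvailableReplicas < d.SpecReplicas:
		return false, fmt.Sprintf("available: %d/%d", d.AvailableReplicas, d.SpecReplicas)
	}
	return true, "fully rolled out"
}

func main() {
	// All new pods are updated and available, but one old pod is still counted.
	midRollout := DeploymentCounts{Generation: 2, ObservedGeneration: 2, SpecReplicas: 3, Replicas: 4, UpdatedReplicas: 3, AvailableReplicas: 4}
	fmt.Println(deploymentReady(midRollout))
}
```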

gjenkins8 · 2024-12-14T02:04:34Z

hips/hip-0999.md

+
+## Specification
+
+From a CLI user's perspective, there will be no change in how waits are invoked; users will still use the `--wait` flag.


On the subject of compatibility below, and how waits are actioned, we might want e.g. `--wait=watch|poll|legacy`. IIUC, kstatus has a watch-based mechanism for determining readiness? And we may want to allow falling back to the "legacy" mechanism (to be decided). I would propose `--wait=watch` as the default.
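For illustration only, a sketch of how a valued `--wait` flag could be validated. It uses the standard library `flag` package with a custom `flag.Value` to stay self-contained; Helm's real CLI uses cobra/pflag, where a bare `--wait` would additionally need a default value (pflag's `NoOptDefVal`). The `waitStrategy` type and `parseWait` helper are hypothetical, and later comments in this thread argue against exposing the strategy to users at all.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// waitStrategy is a hypothetical enum for the proposed --wait values.
type waitStrategy string

const (
	waitWatch  waitStrategy = "watch"
	waitPoll   waitStrategy = "poll"
	waitLegacy waitStrategy = "legacy"
)

// String and Set implement flag.Value.
func (w *waitStrategy) String() string { return string(*w) }

func (w *waitStrategy) Set(s string) error {
	switch v := waitStrategy(s); v {
	case waitWatch, waitPoll, waitLegacy:
		*w = v
		return nil
	}
	return fmt.Errorf("invalid wait strategy %q (want watch, poll, or legacy)", s)
}

// parseWait parses args, defaulting to watch as proposed in the comment.
func parseWait(args []string) (waitStrategy, error) {
	fs := flag.NewFlagSet("install", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example's output
	strategy := waitWatch
	fs.Var(&strategy, "wait", "wait strategy: watch|poll|legacy")
	err := fs.Parse(args)
	return strategy, err
}

func main() {
	s, err := parseWait([]string{"--wait=poll"})
	fmt.Println(s, err)
}
```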

Are there cases where the watch version would not work?

I've not run into any issues with watch. I know Flux uses the poll method; I'm not sure whether watch existed when they adopted kstatus, or whether they had a reason to choose polling.

Given how extensively watches have been battle-tested in k8s, I think we can safely assume that using watches as the default and falling back to polling where that mechanism fails is sufficient. As mentioned in the other comment, we should not expose internal details of how the wait logic works to users.

Yeah that would definitely resolve the extra RBAC permissions. However, I do think it's worth considering the extra maintenance cost of adding both implementations. It might be worth adding both, but if we don't mind the extra 1-2 seconds between polls it may be worth just sticking with polling. Likewise, if we don't mind the additional "watch" RBAC permission required, it might make sense to only use watch.

For the transition period, you'll likely end up with both, just to ensure people 1. update their RBAC and 2. update their expectations wrt additional wait time, which wasn't previously taken into account. Eventually allowing you to drop the polling entirely. At least, that's how I'd roll something like that in kubectl, for example 😉

Ah I should clarify that there are two different types of polling here. The existing Helm wait implementation that has custom logic to poll resources, and the kstatus polling methods. I believe we'll keep the existing Helm implementation in the transition period, but I'm not sure we'll have both the kstatus polling and kstatus watcher.

That makes sense, I was alluding to the existing polling mechanism vs the new watch-based one, only.

mattfarina

Thanks for the HIP. I like the idea of using something from the Kubernetes community to know the status. When Helm's current code was built, nothing like this was available.

mattfarina · 2024-12-14T15:34:34Z

hips/hip-0999.md

+
+Leveraging an existing status management library maintained by the Kubernetes team will simplify the code and documentation that Helm needs to maintain and improve the functionality of `--wait`. 
+
+## Specification


I would like to see kstatus behind an adapter/interface. Helm should use it but not expose it in the API. There are two reasons I would like to see this:

Helm has been long-lived. Helm v3 has been GA for more than 5 years. Other projects come and go. If kstatus goes and something replaces it, we would like to be able to do that without it impacting the public API to the Helm SDK. While I don't expect a change like this, we have seen this kind of thing happen in the past.

kstatus has yet to reach 1.0.0 status. There could be breaking changes. We want to shield the Helm SDK public API from any of those changes.

Makes perfect sense, I'll add that to the doc.


AustinAbro321 · 2024-12-14T18:37:55Z

Thank you guys for the feedback, I am aiming to create a draft PR sometime next week so we can get a sense for what it will look like.


AustinAbro321 · 2025-01-06T17:34:06Z

@mattfarina @gjenkins8 I created a draft PR with my implementation and updated this proposal with some of the finer details. LMK what feedback / questions you have

Draft PR - helm/helm#13604

soltysh · 2025-01-08T12:15:34Z

hips/hip-0999.md

+
+## Specification
+
+From a CLI user's perspective, there will be no change in how waits are invoked; users will still use the `--wait` flag.


Given how extensively watches have been battle-tested in k8s, I think we can safely assume that using watches as the default and falling back to polling where that mechanism fails is sufficient. As mentioned in the other comment, we should not expose internal details of how the wait logic works to users.

soltysh · 2025-01-08T12:23:00Z

hips/hip-0999.md

+}
+```
+
+`WaitAndGetCompletedPodPhase` is an exported function that is not called anywhere within the Helm repository. It will be removed. 


I can't speak for helm maintainers, but it looks like this method is part of their public API, and not deprecated, so I'd be careful with removing it right away.

I believe that, since this is targeted at Helm v4, breaking changes in the public API are acceptable.

soltysh · 2025-01-08T12:24:06Z

hips/hip-0999.md

+
+`WaitAndGetCompletedPodPhase` is an exported function that is not called anywhere within the Helm repository. It will be removed. 
+
+`WatchUntilReady` is used only for hooks. It has custom wait logic different from the Helm 3 general logic. Ideally, this could be replaced with a regular `Wait()` call. If there is any historical context as to why this logic is the way it is, please share. 


Again, a similar comment: I'd be careful about breaking the API. Either deprecate it or just wire the method to invoke the same underlying code.

Ditto: same Helm 4 comment as above.

soltysh · 2025-01-08T12:33:09Z

hips/hip-0999.md

+
+<!-- TODO: Decide if we want more than alphabetical ordering, such as - The APIVersion/Kind of the resource will determine its priority for being logged. For example, the first log messages will always describe deployments. All deployments will be logged first. Once all deployments are in ready status, all stateful sets will be logged, and so forth.  -->
+
+## Backwards compatibility


core/v1.Events and events/v1.Event are different from watch events. The former are a regular k8s resource providing information about specific occurrences, whereas the latter are strictly tied to watch API and inform about the type of event that happened (addition, modification, deletion) to watched resource.

The RBAC provided above looks reasonable. I'd assume the current wait mechanism uses only list, so watch would be the only required expansion. Going back to my other comment, we can transparently use watch where we have access and, when the RBAC rules are missing, print a warning and fall back to polling.

soltysh · 2025-01-08T12:38:48Z

hips/hip-0999.md

+
+## Backwards compatibility
+
+Waiting for custom resources, and for reconciliation to complete for every resource, could cause charts that previously succeeded to time out.


Helm seems to have a feature gate capability. I'd imagine we can start by shipping kstatus as an experimental feature, which would allow interested users to switch to the new logic, and then roll out the change gradually over several releases until kstatus becomes the new default.


AustinAbro321 added 4 commits December 6, 2024 20:17

start helm hip

6d420e8

Signed-off-by: Austin Abro <[email protected]>

updates

39a043c

Signed-off-by: Austin Abro <[email protected]>

grammar

c9dbf03

Signed-off-by: Austin Abro <[email protected]>

hip

472f81a

Signed-off-by: Austin Abro <[email protected]>

pull-request-size bot added the size/M label Dec 12, 2024

banjoh mentioned this pull request Dec 13, 2024

H4HIP: Helm Sequencing Proposal #373

Open

gjenkins8 reviewed Dec 14, 2024


mattfarina reviewed Dec 14, 2024


updates

dcf6c8b

Signed-off-by: Austin Abro <[email protected]>

AustinAbro321 added 2 commits January 6, 2025 15:25

updates to new architecture

c049e80

Signed-off-by: Austin Abro <[email protected]>

updates to new architecture

0c7da3f

Signed-off-by: Austin Abro <[email protected]>

pull-request-size bot added size/L and removed size/M labels Jan 6, 2025

AustinAbro321 mentioned this pull request Jan 6, 2025

refactor wait to use kstatus helm/helm#13604

Draft

3 tasks

soltysh reviewed Jan 8, 2025


AustinAbro321 added 3 commits January 8, 2025 13:50

mention watch and polling

467dde6

Signed-off-by: Austin Abro <[email protected]>

mention poller vs watcher

bb2b190

Signed-off-by: Austin Abro <[email protected]>

update why around watch

5275514

Signed-off-by: Austin Abro <[email protected]>
