OCPBUGS-48641: config/v1/types_cluster_version: Explain image and version both set #2158

Open
wants to merge 1 commit into
base: master
from

Conversation

wking
Member

@wking wking commented Jan 20, 2025

Catching up with openshift/cluster-version-operator@9be6175c5f (openshift/cluster-version-operator#431), which uses the version property as a sanity check for "is this pullspec the version I'm expecting?". This protects users from compromised or man-in-the-middled upstream update services who attempt downgrade and similar attacks by misrepresenting a recommended update.

The text I'm adjusting landed in 354e2fb (#1339), but version-ignoring was never implemented, so nobody can be relying on that nominal behavior. And as the man-in-the-middle use case demonstrates, version-ignoring would be less safe than the version-match-enforcing behavior that the cluster-version operator has used since 2020.

Contributor

openshift-ci bot commented Jan 20, 2025

Hello @wking! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 20, 2025
@openshift-ci openshift-ci bot requested review from deads2k and JoelSpeed January 20, 2025 21:12
@wking wking changed the title config/v1/types_cluster_version: Explain image and version both set OCPBUGS-48641: config/v1/types_cluster_version: Explain image and version both set Jan 20, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jan 20, 2025
@openshift-ci-robot

@wking: This pull request references Jira Issue OCPBUGS-48641, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @shellyyang1989

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from shellyyang1989 January 20, 2025 21:42
@@ -62,7 +62,7 @@ type ClusterVersionSpec struct {
//
// Some of the fields are inter-related with restrictions and meanings described here.
// 1. image is specified, version is specified, architecture is specified. API validation error.
-// 2. image is specified, version is specified, architecture is not specified. You should not do this. version is silently ignored and image is used.
+// 2. image is specified, version is specified, architecture is not specified. The version metadata in the referenced image must match the specified version.
Member

From the changes below about Version, is it more complete this way?

Suggested change
-// 2. image is specified, version is specified, architecture is not specified. The version metadata in the referenced image must match the specified version.
+// 2. image is specified, version is specified, architecture is not specified. image is used if the version metadata in the referenced image matches the specified version. API validation error otherwise.

Member

@hongkailiu hongkailiu Jan 21, 2025

Lifted from Slack:

"""
I think "API validation error" in the Godocs is trying to describe the CEL kubebuilder:validation:XValidation:rule here, because those are enforced by the Kube API server's validation, and clients cannot push invalid combinations. For version, there's no CEL enforcement, it's just up to the CVO to decide how to handle the "image and version both set" situation, and report its thoughts in status.conditions
"""

:TIL

Contributor

What validations are made? Is it possible that they could be moved to CEL? Or does it need to introspect the image to validate?

Member Author

It needs to introspect the image, checking the version in the release image's release-metadata file against the spec version string that the cluster admin said they expected. See OCPBUGS-48641's:

Verifying payload failed version="4.17.99" image="quay.io/openshift-release-dev/ocp-release@sha256:82aa2a914d4cd964deda28b99049abbd1415f96c0929667b0499dd968864a8dd" failure=release image version 4.17.13 does not match the expected upstream version 4.17.99

error message for an example, where the CVO looks inside the sha256:82aa2a9... release image, sees that the release-metadata file claims that release image is 4.17.13, and then complains that the release image's 4.17.13 diverges from the spec version's 4.17.99 expectation.
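
To make that check concrete, here is a minimal Go sketch of the comparison. It is not the CVO's actual code: checkVersion and releaseMetadata are hypothetical names, and the sketch assumes the release-metadata file is JSON with a top-level "version" key. The error wording imitates the log line quoted above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// releaseMetadata mirrors only the one field this sketch needs.
// Assumption: the release-metadata file is JSON with a top-level "version" key.
type releaseMetadata struct {
	Version string `json:"version"`
}

// checkVersion is a hypothetical helper, not the CVO's real function.
// An empty expected version means spec version was not set, so there is
// nothing to compare against.
func checkVersion(rawMetadata []byte, expected string) error {
	var meta releaseMetadata
	if err := json.Unmarshal(rawMetadata, &meta); err != nil {
		return fmt.Errorf("parsing release metadata: %w", err)
	}
	if expected != "" && meta.Version != expected {
		return fmt.Errorf("release image version %s does not match the expected upstream version %s",
			meta.Version, expected)
	}
	return nil
}

func main() {
	// Hand-written sample standing in for the file inside the release image.
	raw := []byte(`{"version": "4.17.13"}`)
	fmt.Println(checkVersion(raw, "4.17.99")) // mismatch: returns an error
	fmt.Println(checkVersion(raw, "4.17.13")) // match: returns nil
}
```

Because the comparison needs the contents of the pulled image, it cannot be expressed as a CEL rule on the ClusterVersion object alone.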

Contributor

Ack, and is this validation via a webhook or part of the controller? The "API validation error otherwise" message seems odd here, especially if this is a controller-based validation.

@@ -702,16 +702,16 @@ type Update struct {
Architecture ClusterVersionArchitecture `json:"architecture"`

// version is a semantic version identifying the update version.
// version is ignored if image is specified and required if
Member Author

Poking a bit more deeply into where the nominal "ignored" came from, on 2022-11-08 I claimed it was ignored. I'm not sure what 2022-me was thinking there; possibly I was just focused on how the CVO looks up which image to use (and that logic doesn't run when image is explicitly set in spec), and I overlooked the sync-worker validation as it judges the requested desiredUpdate for ReleaseAccepted?

@hongkailiu
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2025
Contributor

openshift-ci bot commented Jan 21, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu, wking
Once this PR has been reviewed and has the lgtm label, please assign bparees for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking wking force-pushed the godocs-for-ClusterVersion-image-with-version branch from 965895d to 387fac3 Compare January 21, 2025 01:48
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2025
Contributor

openshift-ci bot commented Jan 21, 2025

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 21, 2025
Catching up with openshift/cluster-version-operator@9be6175c5f
(pkg/cvo/sync_worker: Make expected/actual version mismatch fatal,
2020-08-09, openshift/cluster-version-operator#431), which uses the
'version' property as a sanity check for "is this pullspec the version
I'm expecting?".  This protects users from compromised or
man-in-the-middled upstream update services who attempt downgrade and
similar attacks by misrepresenting a recommended update.

The text I'm adjusting landed in 354e2fb
(config/v1/types_cluster_version: Add Architecture to DesiredUpdate,
2022-12-07, openshift#1339), but version-ignoring was never implemented, so
nobody can be relying on that nominal behavior.  And as the
man-in-the-middle use case demonstrates, version-ignoring would be
less safe than the version-match-enforcing behavior that the
cluster-version operator has used since 2020.

I edited types_cluster_version.go by hand, and then updated the other
files with:

  $ hack/update-codegen-crds.sh
  $ hack/update-openapi.sh
  $ hack/update-swagger-docs.sh
@wking wking force-pushed the godocs-for-ClusterVersion-image-with-version branch from 387fac3 to 435a43a Compare January 21, 2025 05:31
Contributor

openshift-ci bot commented Jan 21, 2025

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller January 21, 2025 13:05
@@ -702,16 +702,16 @@ type Update struct {
Architecture ClusterVersionArchitecture `json:"architecture"`

// version is a semantic version identifying the update version.
-// version is ignored if image is specified and required if
-// architecture is specified.
+// version is required if architecture is specified.
Contributor

You could add a CEL rule to validate this. We can test that it ratchets so that existing broken resources do not suddenly become broken

Member Author

"version is ... required if architecture is specified" dates back to #1339, and seems orthogonal to the change I'm suggesting here. And actually, oc adm upgrade --to-multi-arch is setting both architecture and version, and I don't see a reason to block that; it's the same sanity-check of "yes, the image the cluster retrieved seems like the release the cluster admin was expecting" for folks where version numbers are more recognizable than image digests (everybody? Definitely me, anyway). Should I drop that unnecessary constraint from the docs in this pull request, or can I file a follow-up pull request dropping that constraint once this one merges? Or...?

Contributor

I agree it's orthogonal, but it's good to make compatible, incremental changes while we are touching these areas of the API.

What happens if version is missing when architecture is specified today? Ignoring CLI tooling that would set it, since folks can manipulate these resources themselves, would it cause CVO to return errors when it processes the object?

If so, adding a CEL rule as below would give more immediate feedback to a user, and is relatively free to us to implement. As of 4.18 this should ratchet itself, but we would need to test it.

// +kubebuilder:validation:XValidation:rule="!has(self.architecture) || has(self.version)",message="version is required when architecture is set"

A self-ratcheting version:

// +kubebuilder:validation:XValidation:rule="!has(self.architecture) || has(self.version) || has(oldSelf.architecture)",message="version is required when architecture is set"
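
For readers less familiar with CEL, the first (non-ratcheting) rule above is a plain implication: if architecture is set, version must be set too. A rough Go equivalent, with a hypothetical update struct whose nil pointers stand in for fields that CEL's has() would report as absent:

```go
package main

import "fmt"

// update is a hypothetical stand-in for the desiredUpdate fields.
// A nil pointer models a field that has() would report as missing.
type update struct {
	Architecture *string
	Version      *string
}

// versionRuleHolds mirrors the logic of
// !has(self.architecture) || has(self.version).
func versionRuleHolds(u update) bool {
	return u.Architecture == nil || u.Version != nil
}

func main() {
	arch, ver := "Multi", "4.17.12"
	fmt.Println(versionRuleHolds(update{}))                                  // neither field set
	fmt.Println(versionRuleHolds(update{Version: &ver}))                     // version only
	fmt.Println(versionRuleHolds(update{Architecture: &arch}))               // architecture only
	fmt.Println(versionRuleHolds(update{Architecture: &arch, Version: &ver})) // both set
}
```

Only the architecture-without-version case fails the rule. This sketch does not model the oldSelf term of the ratcheting variant.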

Member Author

What happens if version is missing when architecture is specified today?

Looks like that's already guarded here. With a Cluster Bot cluster from "launch 4.17.12 aws":

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/desiredUpdate", "value": {"architecture": "Multi"}}]'
The ClusterVersion "version" is invalid: spec.desiredUpdate: Invalid value: "object": no such key: version evaluating rule: Version must be set if Architecture is set

So I can leave the "version is required if architecture is specified" docs in place here, and don't need to add additional CEL.

// When image is set, architecture cannot be specified.
// If both version and image are set, the version metadata in the referenced image must match the specified version.
Contributor

What does version metadata actually mean?

Member Author

$ oc adm release info -o json quay.io/openshift-release-dev/ocp-release:4.17.12-x86_64 | jq -r .metadata.version
4.17.12

Docs discussing the release-metadata file that holds that as part of the release image.
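
For anyone who would rather do that lookup programmatically, here is a small Go sketch that pulls the same field out of the JSON printed by the command above. The releaseInfo type and versionFromReleaseInfo function are hypothetical, and the sketch assumes only what the jq path shows: a "metadata" object holding a "version" string.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// releaseInfo models only the .metadata.version path used by the jq
// filter above; the real output contains many more fields.
type releaseInfo struct {
	Metadata struct {
		Version string `json:"version"`
	} `json:"metadata"`
}

// versionFromReleaseInfo is a hypothetical helper that extracts
// .metadata.version from the JSON form of the release info.
func versionFromReleaseInfo(raw []byte) (string, error) {
	var info releaseInfo
	if err := json.Unmarshal(raw, &info); err != nil {
		return "", fmt.Errorf("parsing release info: %w", err)
	}
	if info.Metadata.Version == "" {
		return "", fmt.Errorf("release info has no metadata.version")
	}
	return info.Metadata.Version, nil
}

func main() {
	// Trimmed, hand-written sample rather than real command output.
	raw := []byte(`{"metadata": {"version": "4.17.12"}}`)
	version, err := versionFromReleaseInfo(raw)
	fmt.Println(version, err)
}
```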

Contributor

If an admin were unsure, would they be able to check this themselves from the docs on the API? Perhaps linking out to this doc would be a useful help for users of this API?

Member Author

@wking wking Jan 22, 2025

Not much precedent to linking out to the enhancements repo for more details:

api$ git --no-pager grep github.com/openshift/enhancements/blob/
README.md:conventions](https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#api),
machine/v1beta1/types_machine.go:       // https://github.com/openshift/enhancements/blob/master/enhancements/machine-instance-lifecycle.md

How about inlining something more here to make it clear that it's metadata extracted from the release image? Maybe "the version metadata extracted from the referenced image" would be sufficient? Or "the version extracted from the referenced image"? Or "the version string..."? Or...?

Labels
jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
5 participants