Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle steps state per build #167

Merged
merged 6 commits into from
Dec 16, 2024
Merged

Conversation

thatportugueseguy
Copy link
Collaborator

Description of the task

This PR changes the way monorobot stores state for the failed builds. We now store states per build per branch individually and calculate notifications with that information rather than having state per build step but shared across the branch for all the builds at the same time.

Some tests were updated and now have a different output because they weren't correct before.

I have left out logic for handling long running builds that finish later than subsequent shorter ones. For now we just ignore them, as I think this will not be a common issue and the PR is quite big in changes already. It's better to fix later when we have a better understanding of how those long running builds will impact the notifications.

@thatportugueseguy thatportugueseguy merged commit d2f47e3 into master Dec 16, 2024
2 of 3 checks passed
@thatportugueseguy thatportugueseguy deleted the step-state-per-build branch December 16, 2024 16:15
Copy link
Collaborator

@yasunariw yasunariw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update docs with build tracking logic (with basic examples)
I didn't follow the updates to the tests, too many unrelated changes

@@ -76,13 +76,13 @@ let hook_of_channel ctx channel_name =
(** [is_pipeline_allowed ctx repo_url ~pipeline] returns [true] if [status_rules]
doesn't define a whitelist of allowed pipelines in the config of [repo_url],
or if the list contains [pipeline]; returns [false] otherwise. *)
let is_pipeline_allowed ctx repo_url ~pipeline =
let is_pipeline_allowed ctx repo_url ~context =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are ctx and context? also the comment appears to be outdated

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx is of Context.t type and context is the notification context value. I didn't realize that the two arguments had the same name, this was part of changes to make things clearer regarding the usage of the notification.context and the pipeline name value.

I've updated to use the notification instead and access the context value inside.

re the comments, they are correct, it's just that the match case was written in the negative form. I've updated it to be in the positive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, yes I mean the part of the comment that refers to the old arg ~pipeline

Printf.sprintf "<%s|%s> " l step
let slack_step_link (s : State_t.failed_step) =
let step = Stre.drop_prefix s.name (pipeline ^ "/") in
Printf.sprintf "<%s|%s> " s.build_url step
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Printf is opened at top level

let rec get_name context build_url =
match String.length context with
| 0 -> Error "failed to get pipeline name from notification - empty string"
| _ when Devkit.Stre.exists build_url (context ^ "/") ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devkit is opened in scope

| Ok { pipeline_name; _ } ->
(* build urls are in the form of .../<base_pipeline_name>/builds/<build_number>.
Pipeline steps have an html anchor after the build number *)
let re = Re2.create_exn (Printf.sprintf ".*/%s/builds/(\\d+)#?" pipeline_name) in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't be compiling the regex for every event

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally, not, but this regex depends on the pipeline name, so it can't be compiled ahead of time, unfortunately. The option would be to create a map of regexes and only create on the first time we encounter a pipeline, but that feels a bit overkill, we don't seem to have any slowness due to this here.

Another option is to create a map with only the allowed pipelines at startup. So if the pipeline in the notification isn't allowed, it would fail here. But again, I'm not sure we are experiencing a downgrade in performance here.

WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure why we need both the context and build_url to find the build number

for buildkite, url has predictable structure of

https://buildkite.com/<org_name>/<pipeline_name>/builds/<build_id>#<job_id>

where #<job_id> is optional

default context also has predictable structure (though can be customized)

buildkite/<pipeline_name>/<job_name>

where /<job_name> is optional

So it seems to me like we can extract the <build_id> from the url in one line
and pipeline_check from the context in one line

And these assumptions about structure are meaningless outside of BK

So echoing my other comment, I suggest to

  1. detect provider from status notification (currently only support BK, detect by buildkite/ prefix in context)
  2. use method defined for provider to extract pipeline metadata from notification, handle the "smart" way that is the focus of this PR
  3. if unknown (unsupported) provider, handle the default way

Then in 2. we only care about BK, so parsing is simple, can expect predefined structure

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, i didn't want to go this way straight away as we had some support for other providers, but it doesn't make sense to try to support anything that might come our way. I've updated these to support primarily buildkite and have some other default way for other CIs.

Thanks

Comment on lines +72 to +76
(match Re2.find_submatches_exn re build_url with
| [| _; Some build_number |] -> Ok build_number
| _ | (exception _) ->
(* we either match on the first case or get an exception. *)
Error "failed to get build number from url")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(match Re2.find_submatches_exn re build_url with
| [| _; Some build_number |] -> Ok build_number
| _ | (exception _) ->
(* we either match on the first case or get an exception. *)
Error "failed to get build number from url")
(match Re2.find_first_exn ~sub:(`Index 1) re build_url with
| build_number -> Ok build_number
| exception _ -> Error "failed to get build number from url")

let buildkite_is_failed_re = Re2.create_exn {|^Build #\d+ failed|}

let parse_context ~context ~build_url =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we return a result type if the error message is ignored everywhere

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and get_build_number below do a lot of string manipulation just to obtain the pipeline name. Can it be simpler? It reads like the function wants to target Buildkite mainly but also somewhat support status contexts in general.
Instead if there is an interface to extract the necessary build info, and we implement a Buildkite module that programs to that interface, then we expect a predictable structure for both build context and url, and can extract the info in 3 lines instead of 30

| false -> build_status.failed_steps @ acc
in
let previous_failed_steps =
StringMap.fold (to_previous_failed_steps @@ int_of_string current_build_number) builds_maps []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current_build_number should return an int (as the function name suggests) to avoid this adhoc conversion

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought about it, but in the end the build number needs to be a string to be the key in the state json file (i.e., the build number key in the build_statuses map), and we'd get a string from the regex anyway, so these two need to convert at some point.

i didn't want to go with an IntMap wrapper module that needs to convert all the keys every time we write to disk, but looking at the actual state in the deployed app, there are very few builds stored at the same time, so it shouldn't be too much of an overhead. I'll add the IntMap wrapper module

let context = n.context in
let { Util.Build.is_pipeline_step; pipeline_name } = Util.Build.parse_context_exn ~context ~build_url in
let build_number = Util.Build.get_build_number_exn ~context ~build_url in
let is_finished = (not is_pipeline_step) && (n.state = Success || n.state = Failure || n.state = Error) in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pattern match

let repo_state = find_or_add_repo' state n.repository.url in
let pipeline =
(* We only need to track messages for the base pipeline, not the steps *)
match Util.Build.parse_context ~context:n.context ~build_url:(Option.default "" n.target_url) with
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc calling the function with build_url as empty string will never succeed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it shouldn't happen in buildkite, but for other CIs this value might be empty. See this case.

So here we protect against it.

Copy link
Collaborator

@yasunariw yasunariw Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better if we don't intentionally evaluate parse_context with an argument we know is invalid (creates more assumptions for the callee to manage)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creates more assumptions for the callee to manage

Makes sense. Good call, thanks

Comment on lines +112 to +113
(* When this function is called, the current build is finished *)
Option.get @@ StringMap.find_opt current_build_number builds_maps
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then the comment should be part of the exception raised when it's None

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already the case above at the beginning of the new_failed_steps function.

The comment is only here to assure the reader that we are not getting an exception here. i will update the comment to make it clearer

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then maybe just StringMap.find with comment

@@ -76,13 +76,13 @@ let hook_of_channel ctx channel_name =
(** [is_pipeline_allowed ctx repo_url ~pipeline] returns [true] if [status_rules]
doesn't define a whitelist of allowed pipelines in the config of [repo_url],
or if the list contains [pipeline]; returns [false] otherwise. *)
let is_pipeline_allowed ctx repo_url ~pipeline =
let is_pipeline_allowed ctx repo_url ~context =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, yes I mean the part of the comment that refers to the old arg ~pipeline

let repo_state = find_or_add_repo' state n.repository.url in
let pipeline =
(* We only need to track messages for the base pipeline, not the steps *)
match Util.Build.parse_context ~context:n.context ~build_url:(Option.default "" n.target_url) with
Copy link
Collaborator

@yasunariw yasunariw Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better if we don't intentionally evaluate parse_context with an argument we know is invalid (creates more assumptions for the callee to manage)

| Ok { pipeline_name; _ } ->
(* build urls are in the form of .../<base_pipeline_name>/builds/<build_number>.
Pipeline steps have an html anchor after the build number *)
let re = Re2.create_exn (Printf.sprintf ".*/%s/builds/(\\d+)#?" pipeline_name) in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure why we need both the context and build_url to find the build number

for buildkite, url has predictable structure of

https://buildkite.com/<org_name>/<pipeline_name>/builds/<build_id>#<job_id>

where #<job_id> is optional

default context also has predictable structure (though can be customized)

buildkite/<pipeline_name>/<job_name>

where /<job_name> is optional

So it seems to me like we can extract the <build_id> from the url in one line
and pipeline_check from the context in one line

And these assumptions about structure are meaningless outside of BK

So echoing my other comment, I suggest to

  1. detect provider from status notification (currently only support BK, detect by buildkite/ prefix in context)
  2. use method defined for provider to extract pipeline metadata from notification, handle the "smart" way that is the focus of this PR
  3. if unknown (unsupported) provider, handle the default way

Then in 2. we only care about BK, so parsing is simple, can expect predefined structure

Comment on lines +112 to +113
(* When this function is called, the current build is finished *)
Option.get @@ StringMap.find_opt current_build_number builds_maps
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then maybe just StringMap.find with comment

@@ -1,24 +1,31 @@
{
"pipeline_statuses": {
"buildkite/pipeline2/failed-step": {
"pipeline2": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure it's a good idea to lose the buildkite/ prefix from the state, it is useful to know that it is a BK pipeline (and imagine in the future the monorepo has two providers with pipelines both named monorepo)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent was to separate the pipeline name from the provider context, but it is more future proof to keep it, indeed. I will add this change to the backlog, as I have other tasks that need to be more prioritary at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants