Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add recovery strategy for error waiting for container: unexpected EOF #703

Closed
wants to merge 3 commits into from

Conversation

claraberendsen
Copy link
Contributor

Description

This PR adds the functionality for recovering the agent by restarting it when an specific error occurs on the log. In particular the case that was applied is error waiting for container: unexpected EOF.

@claraberendsen
Copy link
Contributor Author

@nuclearsandwich I have a draft of the process here. I need some insight on how to test this... I have a separate PR that I think should go first to this one, that adds the necessary plugins for this to work.
At the moment I created a test job in the create_jobs.py file to test that the configuration is correct. However I'm not certain on how to build that and run it, since I don't know when and where is this file being run. Would appreciate your insight.

@claraberendsen claraberendsen self-assigned this May 19, 2023
@nuclearsandwich
Copy link
Member

I have a separate PR that I think should go first to this one, that adds the necessary plugins for this to work.

Can you link / reference that here. As discussed in the Infrastructure meeting we're blocked on Jenkins plugin upgrades generally but that's an important item to address in the near future.

I need some insight on how to test this...

  1. Get this draft ready for a test deployment from the current branch
  2. Deploy this PR scoped just to test_ci_linux
  3. Manually reconfigure test_ci_linux to echo the expected error string
  4. Run test_ci_linux and validate the behavior
  5. Re-deploy to remove the manual reconfiguration and run an (expected) successful job to confirm that nothing breaks when the job doesn't exhibit the failure.

I'm in favor of just hand-hacking the test_ci_linux job to display the error string and exit, rather than adding a separate test job just for this.
The jobs get re-configured on every deploy and I don't think there's currently enough of a justification to create job for doing those kinds of tests.

At the moment I created a test job in the create_jobs.py file to test that the configuration is correct. However I'm not certain on how to build that and run it, since I don't know when and where is this file being run. Would appreciate your insight.

This script is generally run by ROS 2 devs locally after merging a PR and making a change. Running it requires Jenkins admin configuration and a configured python environment. There are details and links to further details in the readme.

Copy link
Member

@nuclearsandwich nuclearsandwich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first pass review in addition to the general feedback and attempts to answer the raised questions.

Comment on lines +220 to +223
create_job(os_name, 'test_ci_recovery_' + os_name, 'ci_job.xml.em', {
'cmake_build_type': 'None',
'with_eof_recovery': True
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #703 (comment) I'm in favor of hand-hacking test_ci_linux rather than creating a one-time job for testing.

throw ex
}
}
println('# END SECTION: NVIDIA MISMATCH RECOVERY')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it was copy-pasted from the build.osrfoundation.org recovery without updating the section output.

I'm also not sure that this script needs to be wrapped in a script since the Groovy DSL isn't used.

Comment on lines +10 to +14
@[if script_file]@
<source class="hudson.plugins.groovy.FileSystemScriptSource">
<scriptFile>@ESCAPE(script_file)</scriptFile>
</source>
@[end if]@
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this snippet copied from ros_buildfarm? Is there an anticipated use case for this branch of the snippet?

@claraberendsen
Copy link
Contributor Author

Closing this since we have tracked the actual error and this strategy is not the appropriate solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants