
Increase build timeouts #213

Closed

Conversation


@jlblancoc jlblancoc commented Mar 7, 2022

This issue happens on both the ros1 and ros2 build farms, but let's start with this one to hear your opinion on the change.
We have this package which, after recent changes that enable more features in the build, times out after 120 minutes.

Possible solutions I see:

  • Increase the timeouts! Easiest, but you should evaluate the potential impact. That's this PR.
  • Investigate why ccache seems not to be doing its job (AFAIK). See for example this log:

09:09:13 -- Build files have been written to: /tmp/ws/build_isolated/mrpt2/install
09:09:13 ==> make in '/tmp/ws/build_isolated/mrpt2/install'
09:09:13 Scanning dependencies of target DocumentationFiles
09:09:13 [ 0%] Built target DocumentationFiles
...
10:06:19 -- Build files have been written to: /tmp/ws/build_isolated/mrpt2/devel
10:06:19 ==> make in '/tmp/ws/build_isolated/mrpt2/devel'
10:06:19 Scanning dependencies of target DocumentationFiles
10:06:19 [ 0%] Built target DocumentationFiles
...
(1+ hour again!)

I'm not sure if configuring cmake and building twice, once under $pkg/install and once under $pkg/devel, is expected behavior, nor whether ccache should be smart enough to detect that the same sources are being compiled (perhaps there is some different macro definition, etc., and then ccache is actually behaving correctly?).
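A quick way to check whether ccache is being consulted at all (assuming ccache is installed and on the PATH inside the build container) would be to zero and then dump its statistics around the second build:

    # Zero the ccache statistics, run the build, then look at the hit/miss counters.
    # If the second configure+build pass were reusing the cache, the "cache hit"
    # counters would be clearly non-zero.
    ccache -z                      # reset statistics
    # ... run the devel/install builds here ...
    ccache -s                      # show statistics (hits, misses, cache size)
    # Also worth confirming which cache directory is actually in use:
    echo "${CCACHE_DIR:-$HOME/.ccache}"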

Both issues are orthogonal. Just increasing the timeout will make me happy, but clearing up the other question would be great too!

Signed-off-by: Jose Luis Blanco-Claraco <[email protected]>
@jlblancoc
Author

Any opinion on this proposal to increase the timeout a bit?
Without it, these new package releases (in ros2 and in ros1) will probably fail due to timeout...

Thanks.

@clalancette

Any opinion on this proposal to increase the timeout a bit?
Without it, these new package releases (in ros2 and in ros1) will probably fail due to timeout...

This raises the timeout from 2 hours to 3 hours, which is relatively OK by me. I'd want confirmation from @nuclearsandwich before moving forward, though, since this has consequences for the resources the buildfarms use.

@nuclearsandwich

Investigate why ccache seems not to be doing its job (AFAIK). See for example this log:

ccache sharing between builds is no longer enabled by default (as of ros-infrastructure/ros_buildfarm#844) and cannot be re-enabled as is due to issues with ccache's security model.

This raises the timeout from 2 hours to 3 hours, which is relatively OK by me. I'd want confirmation from @nuclearsandwich

I think the "recommended" solution would be to break the package up along modular boundaries so that it isn't as large of a build (55 min), but I also know that that's not necessarily a realistic demand or constraint to place on the project, especially as our infrastructure changes are the reason that the second build has to start with an empty cache.

However, two things give me pause.

We'd have to raise this timeout for every devel job, not just this package, and that means we'd be spending an extra hour spinning on every job that hangs. I would prefer to do a sweep to see which consistently failing devel jobs are failing due to timeouts before we decide to bump here.

The second reservation is questioning whether three hours would even be sufficient.
Does mrpt2 have a test suite? How long does it take to run? The second build will likely take 50 minutes as well given that the build cache is cleaned between them, so the test suite would need to complete in under ~30min (there's about 20min of overhead, setting up image builds and whatnot).

@jlblancoc
Author

Thanks, Steven, for the thoughtful answer. I expected something along those lines, and I understand the reservations.

Since this PR was opened, I made further findings:

  • "*-dev" jobs are the ones that always time out (they are just on the border line, sometimes they success within 1h50m - 1h57m). Increasing to something like 2h15m or 2h20m would probably be enough.
  • "*-bin" (releases) builds never fail, but are close to: they take 1h30m to 1h50m.

However, increasing the timeout for all packages just because of this one is perhaps not the best approach without trying anything else first...

Splitting it into several packages would be ideal, but it would require quite a lot of effort and, as usual, there is other urgent stuff...

Does mrpt2 have a test suite? How long does it take to run?

Building the test suite and running it actually amounts to a significant fraction of the total...
Just one quick measurement from my local machine:

  • build just the libraries (make -j12):

      real	4m28.443s
      user	46m26.883s
    
  • build the apps (depend on libs) (-j12):

      real	2m17.284s
      user	25m9.338s
    
  • build the tests (-j12):

      real	1m10.145s
      user	11m5.900s
    
  • run the tests (-j12):

      real	0m29.169s
      user	0m44.506s
    

So building and running the tests amounts to roughly 20% of the total time (about 1m40s out of ~8m25s of wall-clock time in the numbers above).

As a quick alternative, I'll try to disable all the tests (by default) when the package detects it's being built from a ROS environment. Since we have CI already covering the package in non-ROS builds, it shouldn't be a big deal, and if it solves the timeouts, I'll be happy...
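Something along these lines (a rough sketch only; MRPT_BUILD_TESTING is an illustrative option name, using ROS_DISTRO as the detection signal is an assumption, and the same check could live in the package's CMakeLists.txt rather than in a wrapper script):

    # Rough sketch: default the test build off when a ROS environment is detected.
    # MRPT_BUILD_TESTING is an illustrative option name; ROS_DISTRO is used here
    # only as the "am I being built from a ROS environment?" signal.
    EXTRA_CMAKE_ARGS=""
    if [ -n "${ROS_DISTRO:-}" ]; then
        EXTRA_CMAKE_ARGS="-DMRPT_BUILD_TESTING=OFF"
    fi
    cmake .. ${EXTRA_CMAKE_ARGS}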

@nuclearsandwich

As a quick alternative, I'll try to disable all the tests (by default) when the package detects it's being built from a ROS environment. Since we have CI already covering the package in non-ROS builds, it shouldn't be a big deal, and if it solves the timeouts, I'll be happy...

I hate disabling tests as a solution, even if it makes some sense since test coverage is checked elsewhere. However, in that case, does it make sense to turn off the devel jobs for mrpt2 entirely, or is there value in checking that they build in a ROS workspace outside of the binarydeb jobs?

Splitting it into several packages would be ideal, but it would require quite a lot of effort and, as usual, there is other urgent stuff...

I also have loads on my plate, but I wonder if I could at least get ccache functioning for the devel jobs so we're not building twice in a row without ccache.

@jlblancoc
Author

Peeking at ros-infrastructure/ros_buildfarm#844, here's one naive idea for ccache: instead of creating and using /home/buildfarm/.ccache, why not create a new directory for each build under /tmp, using the package name and build ID or whatever to make it unique, then expose it inside the docker container at the expected location via -v xx:xx, so ccache uses it only during the life of that container?
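Roughly something like this (all the variable names and the image/command placeholders are illustrative, not the actual ros_buildfarm job layout):

    # One throwaway ccache directory per build on the host
    # (PKG_NAME and BUILD_ID are placeholders for whatever the job already knows):
    CCACHE_HOST_DIR="/tmp/ccache_${PKG_NAME}_${BUILD_ID}"
    mkdir -p "${CCACHE_HOST_DIR}"

    # Bind-mount it into the container at the location ccache expects, so the
    # repeated builds inside this one job can share it:
    docker run \
        -v "${CCACHE_HOST_DIR}:/home/buildfarm/.ccache" \
        -e CCACHE_DIR=/home/buildfarm/.ccache \
        <job-image> <build-command>

    # The cache only lives for this job and is never shared across packages or
    # builds, so it can simply be removed once the container exits:
    rm -rf "${CCACHE_HOST_DIR}"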

If it makes sense, I might even try to implement and test it (though I'm not used to running ros_buildfarm locally!).

@nuclearsandwich

Peeking at ros-infrastructure/ros_buildfarm#844, here's one naive idea for ccache: instead of creating and using /home/buildfarm/.ccache, why not create a new directory for each build under /tmp, using the package name and build ID or whatever to make it unique, then expose it inside the docker container at the expected location via -v xx:xx, so ccache uses it only during the life of that container?

This is more or less where I was headed. Great minds and all that.

@nuclearsandwich

@jlblancoc
Author

For the record:

  • disabling unit tests didn't work; those builds still hit the 120-minute limit.
  • all "release" builds for this package build fine in about 1h30m.

I will close and discard this PR in favor of ros-infrastructure/ros_buildfarm#966, which would be beneficial for all packages anyway.

Thanks!

@jlblancoc jlblancoc closed this May 27, 2022