This repository has been archived by the owner on Dec 10, 2024. It is now read-only.

Limiting the building threads to 1 for compiling in the ROS buildfarm #1

Closed
j-rivero opened this issue Dec 28, 2023 · 6 comments

Comments

@j-rivero
Collaborator

While testing the compilation of Drake, I found that it can take several dozen GB of RAM, especially when processing the Python bindings, since Bazel launches many compilation threads in parallel.

The Drake buildfarm (if I'm not wrong) uses a user.bazelrc that defines a --jobs parameter calculated from the number of processors and the logic in bazel.cmake. In the ROS buildfarm, the rule is to use single-threaded builds for memory and CPU predictability.

To make the Bazel build in Drake single-threaded, one option is to use the ament_vendor CMake API to apply a simple patch against tools/bazel.rc:

diff --git a/tools/bazel.rc b/tools/bazel.rc
index 59aedf4..909aac4 100644
--- a/tools/bazel.rc
+++ b/tools/bazel.rc
@@ -1,6 +1,9 @@
 # Don't use bzlmod yet.
 common --enable_bzlmod=false
 
+# Limit the building threads to 1
+build --jobs=1
+
 # Default to an optimized build.
 build -c opt

I did not find a better way, using environment variables or other approaches, that doesn't require patching the source code.

@jwnimmer-tri
Contributor

> ... bazel will launch a bunch of compilation threads.

Yes. By default, Bazel will try to use all of the CPU and RAM resources on the machine.

> In the ROS buildfarm the rule is to use single threaded builds for memory and cpu predictability.

Given the current packaging build timing, I'd estimate that a ROS build of Drake using a single-threaded build will take approximately 4 hours (assuming no caching from prior builds). Is that satisfactory?

> I did not find a better way ...

If the buildfarm builds are only supposed to use 1 CPU, then to me the obvious way to implement that would be to only provide a single virtualized CPU in the build machine VMs, at the infrastructure level. Why would the buildfarm VMs provide >1 CPU when the policy is that more than one CPU must not be used? Solving this by dialing back every build tool's individual limit seems like playing whack-a-mole.

In any case, if we assume that this needs to be a bazel-specific option, then see the docs at https://bazel.build/run/bazelrc. Instead of patching the source tree, we can put the build --jobs=1 line into an rcfile in either /etc or $HOME. Since this is a buildfarm-specific rule, having the buildfarm set up the file in /etc to match its policies seems like the right place.
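
A rough sketch of the rcfile approach, not the buildfarm's actual provisioning code. It assumes the locations given in the Bazel docs linked above for Linux: /etc/bazel.bazelrc for the system rcfile and $HOME/.bazelrc for the user rcfile. The function name limit_bazel_jobs is hypothetical.

```shell
# Sketch: add "build --jobs=1" to a Bazel rcfile, at most once.
# limit_bazel_jobs is a hypothetical helper name used only in this example.
limit_bazel_jobs() {
    rc_file="$1"
    line='build --jobs=1'

    # Create the file if it does not exist yet.
    touch "$rc_file" || return 1

    # Append the line only when an identical line is not already present.
    if ! grep -qxF -- "$line" "$rc_file"; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}
```

For a single user this could be called as limit_bazel_jobs "$HOME/.bazelrc"; a machine-wide setup would pass /etc/bazel.bazelrc instead, which normally needs root. Running it twice leaves a single build --jobs=1 line in the file.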

@j-rivero
Collaborator Author

> In the ROS buildfarm the rule is to use single threaded builds for memory and cpu predictability.

> Given the current packaging build timing, I'd estimate that a ROS build of Drake using a single-threaded build will take approximately 4 hours (assuming no caching from prior builds). Is that satisfactory?

4 hours might be problematic. If I'm not wrong, the limit for ROS buildfarm release jobs is currently set to 120 minutes for Rolling amd64. I'll check with the rest of the infra team, and I will open another issue to discuss potential reductions of this time.
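
To make the gap concrete, here is the arithmetic as a small sketch. Both inputs are the rough figures quoted in this thread (an estimate of about 4 hours and a limit of 120 minutes that has not been double-checked), not measurements.

```shell
# Rough figures from this thread, not measured values.
estimate_min=$((4 * 60))   # ~4 h single-threaded build estimate
limit_min=120              # release job limit mentioned for Rolling amd64

# How far the estimate overshoots the limit, in minutes.
over_min=$((estimate_min - limit_min))
echo "Estimated build exceeds the limit by ${over_min} minutes"
```

With these figures the build would need to take about half the estimated time, or the limit would need to double, for the job to finish.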

> I did not find a better way ...

> If the buildfarm builds are only supposed to use 1 CPU, then to me the obvious way to implement that would be to only provide a single virtualized CPU in the build machine VMs, at the infrastructure level. Why would the buildfarm VMs provide >1 CPU when the policy is that more than one CPU must not be used? Solving this by dialing back every build tool's individual limit seems like playing whack-a-mole.

There is parallelization in the ROS buildfarm, but it happens at the executor level rather than at the build level: it can build several packages in parallel, but it uses a single thread for each package.

> In any case, if we assume that this needs to be a bazel-specific option, then see the docs at https://bazel.build/run/bazelrc. Instead of patching the source tree, we can put the build --jobs=1 line into an rcfile in either /etc or $HOME. Since this is a buildfarm-specific rule, having the buildfarm set up the file in /etc to match its policies seems like the right place.

+1 I'll send the PR for patching the ROS buildfarm agents.

@j-rivero
Collaborator Author

j-rivero commented Jan 4, 2024

Drafted a PR to be discussed with the ROS infra team: ros-infrastructure/ros_buildfarm#1016

@jwnimmer-tri
Contributor

> 4 hours might be problematic, ... I'll check with the rest of the infra team but will open another issue to discuss potential reductions of this time.

Are there any updates on this side of the question?

I do anticipate that the Drake build will keep growing in size (build time) in future versions, so I'd like to get out in front of any potential challenges there.

@j-rivero
Collaborator Author

> Are there any updates on this side of the question?

We have discussed this internally in the OSRF infra team. The decision not to support long (and/or memory-intensive) builds was made consciously, to ease the operation (and the cost) of the ROS buildfarm by encouraging users to optimize for resource consumption and build times. This places us in a special use case here. That said, we have plans to support the Drake compilation:

  • Short term: a per-package timeout implementation could be enough to support the Drake build by extending the allowed build time. Even without that implementation, we could manually change the timeout value if it is needed sooner.
  • Long term: ros_buildfarm needs to implement allocating a package to a full single worker in an agent (currently shared by 4), with several compilation threads. That dispatching mechanism requires a nontrivial implementation effort.

@j-rivero
Collaborator Author

ros-infrastructure/ros_buildfarm#1016 was merged.
