Update PkgCI test_amd to use MI300x conductor cluster #19517

yamiyysu · 2024-12-18T18:17:51Z

We want to migrate the workflows that use MI300 and do not require cache support to our conductor cluster. A new runner with one GPU has been created:

label: linux-mi300-gpu-1
namespace: arc-iree-gpu-1
gitconfig url: https://github.com/iree-org/iree

This PR updates the workflow's runs-on label to use the new runner.

ScottTodd · 2024-12-18T19:12:31Z

.github/workflows/pkgci_test_amd_mi300.yml

@@ -19,7 +19,7 @@ on:

 jobs:
  test_mi300:
-    runs-on: nodai-amdgpu-mi300-x86-64
+    runs-on: linux-mi300-gpu-1


https://github.com/iree-org/iree/actions/runs/12399057884/job/34613576470?pr=19517#step:8:195

iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols.c:160: UNAVAILABLE; HIP runtime library 'amdhip64.dll'/'libamdhip64.so' not available: please ensure installed and in dynamic library search path: Tried: libamdhip64.so iree/runtime/src/iree/base/internal/dynamic_library_posix.c:165: NOT_FOUND; failed to load dynamic library (possibly not found on any search path): libamdhip64.so: cannot open shared object file: No such file or directory; creating driver for device 'hip'; resolving dependencies for 'module'

What drivers and software are installed on these new runners? Should we run under Docker (either in the runners themselves or in the job)?

Yeah, discussing with Jodie offline. We could use the ROCm GitHub runner Docker image when instantiating the runner, but I think it is probably best to specify the Docker image in the job (workflow file), so that everyone can see which ROCm version is being used.

I think it may depend on how long it takes to download and start the docker image. If it takes 5 minutes to download a 70GB docker image, configuring that in the runner itself could help hide some latency from workflows?

We could revive these dockerfiles as needed:

https://github.com/iree-org/base-docker-images/blob/main/dockerfiles/amdgpu_ubuntu_jammy_x86_64.Dockerfile

https://github.com/iree-org/base-docker-images/blob/main/dockerfiles/amdgpu_ubuntu_jammy_ghr_x86_64.Dockerfile

If we need to use Docker, I'd definitely prefer to use either upstream public images if no extra deps are needed or those iree-org ones.

Yeah, I was thinking of just using the latest released image: rocm/dev-ubuntu-22.04:6.3. This one shouldn't be too bad to pull down, I think. We can give it a shot and see how long it takes.

That would be great. These jobs do need CMake and ninja but otherwise the deps are pretty minimal.

Agreed that it's clearer to set a container image in the workflow, but I'm not sure about the image download time. Updated and let's see how long it takes.
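
For reference, a minimal sketch of what setting the container image in the job could look like. This is not the actual change in this PR. The device options and the install step are assumptions: the devices are the ones commonly passed when running ROCm containers, and they may be unnecessary or handled differently on these cluster runners. The package names assume an Ubuntu-based image where the job runs as root.

```yaml
jobs:
  test_mi300:
    runs-on: linux-mi300-gpu-1
    container:
      # Image named in the discussion above.
      image: rocm/dev-ubuntu-22.04:6.3
      # Assumption: expose the GPU device nodes to the container.
      options: --device /dev/kfd --device /dev/dri
    steps:
      # Assumption: the image does not ship CMake or ninja, which these jobs need.
      - name: Install build dependencies
        run: |
          apt-get update
          apt-get install -y cmake ninja-build
```

Keeping the image tag in the workflow file makes the ROCm version visible in review, at the cost of pulling the image in each job unless the runner caches it.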

Update workflow to use conductor runner

be7d840

yamiyysu requested a review from ScottTodd as a code owner December 18, 2024 18:17

saienduri self-requested a review December 18, 2024 18:19

saienduri changed the title ~~Update one workflow to use conductor runner~~ Update one workflow to use MI300x conductor cluster Dec 18, 2024

saienduri changed the title ~~Update one workflow to use MI300x conductor cluster~~ Update PkgCI test_amd to use MI300x conductor cluster Dec 18, 2024

Corrected to use label not namespace

8e583ab

ScottTodd reviewed Dec 18, 2024

Explicit set container image for dependencies

760a675

ScottTodd mentioned this pull request Dec 20, 2024

Update IREE third-party/benchmark for RISC-V Compatibility #19538

Merged

yamiyysu added 2 commits January 3, 2025 17:52

Add missing depedencies when switching docker image

84dfa4a

Remove trailing white spaces

5f49858
