Update PkgCI test_amd to use MI300x conductor cluster #19517

Open
wants to merge 5 commits into base: main

Conversation

@yamiyysu yamiyysu commented Dec 18, 2024

We want to migrate the workflows that use MI300 and do not require cache support to our conductor cluster. A new runner with one GPU has been created.

This PR updates the runner label (runs-on).

@yamiyysu yamiyysu requested a review from ScottTodd as a code owner December 18, 2024 18:17
@saienduri saienduri self-requested a review December 18, 2024 18:19
@saienduri saienduri changed the title Update one workflow to use conductor runner Update one workflow to use MI300x conductor cluster Dec 18, 2024
@saienduri saienduri changed the title Update one workflow to use MI300x conductor cluster Update PkgCI test_amd to use MI300x conductor cluster Dec 18, 2024
@@ -19,7 +19,7 @@ on:

 jobs:
   test_mi300:
-    runs-on: nodai-amdgpu-mi300-x86-64
+    runs-on: linux-mi300-gpu-1
Member

https://github.com/iree-org/iree/actions/runs/12399057884/job/34613576470?pr=19517#step:8:195

 iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols.c:160: UNAVAILABLE; HIP runtime library 'amdhip64.dll'/'libamdhip64.so' not available: please ensure installed and in dynamic library search path: 
  Tried: libamdhip64.so
    iree/runtime/src/iree/base/internal/dynamic_library_posix.c:165: NOT_FOUND; failed to load dynamic library (possibly not found on any search path): libamdhip64.so: cannot open shared object file: No such file or directory; creating driver for device 'hip'; resolving dependencies for 'module'

What drivers and software are installed on these new runners? Should we run under Docker (either in the runners themselves or in the job)?

Collaborator

Yeah, discussing with Jodie offline. We can use the ROCm gh runner Docker image for the runner instantiation, but I think it is probably best to specify the Docker image in the job (workflow file), so it is visible to everyone which ROCm version is being used.

Member

I think it may depend on how long it takes to download and start the docker image. If it takes 5 minutes to download a 70GB docker image, configuring that in the runner itself could help hide some latency from workflows?

We could revive these dockerfiles as needed:

If we need to use Docker, I'd definitely prefer to use either upstream public images if no extra deps are needed or those iree-org ones.

Collaborator

Yeah, I was thinking about just using the latest released image: rocm/dev-ubuntu-22.04:6.3. This one shouldn't be too bad to pull down, I think. We can give it a shot and see how long it takes.

Member

That would be great. These jobs do need CMake and ninja but otherwise the deps are pretty minimal.
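
As a rough sketch of how those two dependencies could be added on top of a stock image, a job step might install them before the tests run. The step name is hypothetical, and the snippet assumes a Debian/Ubuntu-based container where the job runs as root. It is not taken from this PR.

```yaml
# Hypothetical step; assumes an Ubuntu-based container running as root.
steps:
  - name: Install build dependencies
    run: |
      apt-get update
      # cmake and ninja-build are the Ubuntu package names for CMake and Ninja.
      apt-get install -y --no-install-recommends cmake ninja-build
```

Installing at job time keeps the workflow on an unmodified upstream image, at the cost of a short package download on every run.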

Author

Agreed that it's clearer to set a container image in the workflow, but I'm not sure about the image download time. Updated and let's see how long it takes.
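
For illustration, setting the container image in the workflow might look roughly like the sketch below. The job name, runner label, and image tag come from this thread. The `options` line is an assumption based on the device passthrough flags commonly used with ROCm containers, and the exact flags these runners need are not confirmed here.

```yaml
jobs:
  test_mi300:
    runs-on: linux-mi300-gpu-1
    container:
      # Image tag suggested in this thread.
      image: rocm/dev-ubuntu-22.04:6.3
      # Assumption: typical ROCm GPU passthrough flags; adjust for the
      # actual runner setup.
      options: --device=/dev/kfd --device=/dev/dri --group-add video
```

With this layout the image pull happens during the job's container initialization, so the job log should show how long the download takes.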

3 participants