
70B-Prefill-2048-input-token unshared: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import; #19569

Closed
pdhirajkumarprasad opened this issue Dec 30, 2024 · 2 comments
Labels
bug 🐞 Something isn't working

Comments

@pdhirajkumarprasad

What happened?

When running 70B prefill with 2048 input tokens on the unsharded model, it fails with the following error:

iree/runtime/src/iree/hal/drivers/hip/event_semaphore.c:673: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import; 
[ 0] bytecode module.prefill_bs4:90 prefill_70b_unsharded.mlir:727:3
Abort (core dumped)

The failure occurs with iree-benchmark-module; iree-run-module works fine.

Commands:

python3 -m sharktank.examples.export_paged_llm_v1 \
  --bs=4 \
  --irpa-file=/data/llama3.1/weights/70b/fp16/llama3.1_70b_f16.irpa \
  --output-mlir=prefill_70b_unsharded.mlir \
  --output-config=prefill_70b_unsharded.json \
  --skip-decode

iree-compile prefill_70b_unsharded.mlir \
  --iree-hip-target=gfx942 \
  -o=prefill_70b_unsharded.vmfb \
  --iree-hal-target-device=hip \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=false \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions 


ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
iree-benchmark-module \
  --hip_use_streams=true \
  --module=prefill_70b_unsharded.vmfb \
  --parameters=model=/data/llama3.1/weights/70b/fp16/llama3.1_70b_f16.irpa \
  --device=hip://4 \
  --function=prefill_bs4 \
  --input=@/shark-dev/70b//prefill_args_bs4_2048_stride_32/tokens.npy \
  --input=@/shark-dev/70b//prefill_args_bs4_2048_stride_32/seq_lens.npy \
  --input=@/shark-dev/70b//prefill_args_bs4_2048_stride_32/seq_block_ids.npy \
  --input=@/shark-dev/70b//prefill_args_bs4_2048_stride_32/cs_f16.npy --benchmark_repetitions=8

Run the above commands on the Shark MI300X machine.

Build: a43d893

Steps to reproduce your issue

No response

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response

@pdhirajkumarprasad pdhirajkumarprasad added the bug 🐞 Something isn't working label Dec 30, 2024
@AWoloszyn AWoloszyn self-assigned this Dec 31, 2024
@AWoloszyn
Contributor

Looking at rocm-smi, there is another process running on hip://4 that is taking up 23% of the available VRAM. With the 70B model, iree-run-module (or iree-benchmark-module with --benchmark_repetitions=1) hits 98% memory usage. A larger number of repetitions increases memory usage enough to hit the limit.

Using a GPU that is not currently in-use lets it complete.
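
A quick way to check for this before benchmarking is a sketch like the following (assumes the ROCm rocm-smi tool is on PATH; the --showmemuse flag reports per-GPU memory utilization, which helps avoid picking an already-busy device such as hip://4 here):

```shell
# Print per-GPU VRAM utilization so an idle device can be chosen
# for --device=hip://N. Falls back gracefully if rocm-smi is absent.
if command -v rocm-smi >/dev/null 2>&1; then
  rocm-smi --showmemuse
else
  echo "rocm-smi not found; cannot inspect GPU memory usage"
fi
```

A device already near 100% utilization will likely abort the semaphore wait as seen above once the benchmark's repetitions push allocation over the limit.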

@pdhirajkumarprasad
Author

The issue is not seen anymore, so closing this.
