
Llama 3.1 8b decode fails during benchmarking #19533

Open
aviator19941 opened this issue Dec 19, 2024 · 1 comment
Labels
bug 🐞 Something isn't working

Comments

@aviator19941
Contributor

What happened?

The updated 8b f16 bs4 TP1 IR compiles with 4c00a22, but benchmarking fails with this error:

Running ../iree-build-no-trace/tools/iree-benchmark-module
Run on (192 X 3810.79 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x96)
  L1 Instruction 32 KiB (x96)
  L2 Unified 1024 KiB (x96)
  L3 Unified 32768 KiB (x16)
Load Average: 9.97, 10.13, 12.66
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
:0:rocdevice.cpp            :2984: 2446864599827 us: [pid:1329572 tid:0x75d371000640] Callback: Queue 0x75d370500000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
Aborted (core dumped)

Steps to reproduce your issue

  1. wget the 8b f16 bs4 IR

  2. Compile using 4c00a22:

../iree-build-no-trace/tools/iree-compile \
8b_f16_bs4_tp1_tokens_128_stride_32.mlir  \
--iree-hip-target=gfx942  \
-o=8b_f16_bs4_tp1_tokens_128_stride_32.vmfb \
--iree-hal-target-device=hip \
--iree-dispatch-creation-enable-aggressive-fusion=true  \
--iree-global-opt-propagate-transposes=true  \
--iree-opt-aggressively-propagate-transposes=true  \
--iree-opt-data-tiling=false   \
--iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))'   \
--iree-hal-indirect-command-buffers=true   \
--iree-stream-resource-memory-model=discrete   \
--iree-hip-legacy-sync=false   \
--iree-hal-memoization=true   \
--iree-opt-strip-assertions
  3. Benchmark decode:
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7    \
../iree-build-no-trace/tools/iree-benchmark-module   \
--hip_use_streams=true   \
--module=8b_f16_bs4_tp1_tokens_128_stride_32.vmfb   \
--parameters=model=/data/llama3.1/weights/8b/fp16/llama3.1_8b_fp16.irpa   \
--device=hip://4   \
--function=decode_bs4   \
--input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32/next_tokens.npy   \
--input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32/seq_lens.npy   \
--input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32/start_positions.npy   \
--input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32/seq_block_ids.npy   \
--input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32/cs_f16.npy   \
--benchmark_repetitions=3
  4. See benchmarking error:
Running ../iree-build-no-trace/tools/iree-benchmark-module
Run on (192 X 3810.79 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x96)
  L1 Instruction 32 KiB (x96)
  L2 Unified 1024 KiB (x96)
  L3 Unified 32768 KiB (x16)
Load Average: 9.97, 10.13, 12.66
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
:0:rocdevice.cpp            :2984: 2446864599827 us: [pid:1329572 tid:0x75d371000640] Callback: Queue 0x75d370500000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
Aborted (core dumped)
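Since the failure is a device-side memory fault (HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION), one quick triage step, not taken from the issue itself but a plausible first check, is to confirm the five .npy inputs passed to decode_bs4 have the shapes and dtypes the function expects before suspecting the compiled module. A minimal sketch (the helper name `describe_inputs` is hypothetical):

```python
import numpy as np

# Hypothetical helper: summarize each benchmark input file so a shape or
# dtype mismatch can be ruled out before debugging the device-side fault.
def describe_inputs(paths):
    summaries = []
    for p in paths:
        arr = np.load(p)
        summaries.append((p, arr.shape, str(arr.dtype)))
    return summaries
```

Running this over the files under /data/llama3.1/weights/8b/decode_args_bs4_128_stride_32/ and comparing against the decode_bs4 signature in the MLIR would show whether the inputs themselves are malformed.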

What component(s) does this issue relate to?

Runtime

Version information

4c00a22

Additional context

No response

@aviator19941 aviator19941 added the bug 🐞 Something isn't working label Dec 19, 2024
@aviator19941 aviator19941 changed the title Llama 3.1 8b decode fails to benchmark Llama 3.1 8b decode fails during benchmarking Dec 19, 2024