
HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29 #19564

Open
pdhirajkumarprasad opened this issue Dec 27, 2024 · 5 comments
Labels
bug 🐞 Something isn't working

Comments


pdhirajkumarprasad commented Dec 27, 2024

What happened?

For the Llama 8B-FP16 prefill sharded model, I am getting the following error at runtime:

:0:rocdevice.cpp            :2984: 3125069028064 us: [pid:1435220 tid:0x709b7a600640] Callback: Queue 0x709b49500000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29

Steps to reproduce your issue

python3 -m sharktank.examples.export_paged_llm_v1 \
  --bs=4 \
  --irpa-file=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.irpa \
  --output-mlir=8b_decode_sharded.mlir \
  --output-config=8b_decode_sharded.json


iree-compile \
  8b_decode_sharded.mlir \
  --iree-hip-target=gfx942 \
  -o=8b_decode_sharded.vmfb \
  --iree-hal-target-device="hip[0]" \
  --iree-hal-target-device="hip[1]" \
  --iree-hal-target-device="hip[2]" \
  --iree-hal-target-device="hip[3]" \
  --iree-hal-target-device="hip[4]" \
  --iree-hal-target-device="hip[5]" \
  --iree-hal-target-device="hip[6]" \
  --iree-hal-target-device="hip[7]" \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=false \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions



ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
iree-benchmark-module \
  --hip_use_streams=true \
  --module=8b_decode_sharded.vmfb \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.irpa \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank0.irpa \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank1.irpa \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank2.irpa \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank3.irpa \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank4.irpa \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank5.irpa \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank6.irpa \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank7.irpa \
  --device=hip://0 \
  --device=hip://1 \
  --device=hip://2 \
  --device=hip://3 \
  --device=hip://4 \
  --device=hip://5 \
  --device=hip://6 \
  --device=hip://7 \
  --function=prefill_bs4 \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/next_tokens.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/seq_lens.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/seq_block_ids.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/cs_f16_shard_0.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/cs_f16_shard_1.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/cs_f16_shard_2.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/cs_f16_shard_3.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/cs_f16_shard_4.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/cs_f16_shard_5.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/cs_f16_shard_6.npy \
  --input=@/data/llama3.1/weights/8b/decode_args_bs4_128_stride_32_tp8/cs_f16_shard_7.npy --benchmark_repetitions=8

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response

pdhirajkumarprasad added the bug 🐞 Something isn't working label on Dec 27, 2024

IanWood1 commented Dec 27, 2024

Which IREE version is this from? A similar issue was reported in #19533 and was supposed to be fixed by #19535, but maybe the identified commit wasn't the only cause.

pdhirajkumarprasad (Author) commented:

I am using f1e1866, and the same issue is present in 70B as well.

pdhirajkumarprasad (Author) commented:

With a43d893, this error is no longer present for 8B, but I am still seeing the issue with the 70B prefill sharded model with tp8. Here is a simple IR reproducer:

#map = affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
module @module {
  util.global private @__auto.token_embd.weight {stream.affinity = #hal.device.promise<@__device_0>} = #stream.parameter.named<"model"::"token_embd.weight"> : tensor<128256x8192xf16>
  util.global private @__auto.blk.0.attn_output.weight.shard.0 {stream.affinity = #hal.device.promise<@__device_0>} = #stream.parameter.named<"model"::"blk.0.attn_output.weight.shard.0"> : tensor<8192x1024xf16>
  func.func @prefill_bs4(%arg0: !torch.vtensor<[4,?],si64> {iree.abi.affinity = #hal.device.promise<@__device_0>}, %arg1: !torch.vtensor<[4],si64> {iree.abi.affinity = #hal.device.promise<@__device_0>}, %arg2: !torch.vtensor<[4,?],si64> {iree.abi.affinity = #hal.device.promise<@__device_0>}, %arg3: !torch.tensor<[?,655360],f16> {iree.abi.affinity = #hal.device.promise<@__device_0>}, %arg4: !torch.tensor<[?,655360],f16> {iree.abi.affinity = #hal.device.promise<@__device_1>}, %arg5: !torch.tensor<[?,655360],f16> {iree.abi.affinity = #hal.device.promise<@__device_2>}, %arg6: !torch.tensor<[?,655360],f16> {iree.abi.affinity = #hal.device.promise<@__device_3>}, %arg7: !torch.tensor<[?,655360],f16> {iree.abi.affinity = #hal.device.promise<@__device_4>}, %arg8: !torch.tensor<[?,655360],f16> {iree.abi.affinity = #hal.device.promise<@__device_5>}, %arg9: !torch.tensor<[?,655360],f16> {iree.abi.affinity = #hal.device.promise<@__device_6>}, %arg10: !torch.tensor<[?,655360],f16> {iree.abi.affinity = #hal.device.promise<@__device_7>}) -> !torch.vtensor<[4,?,8,128],f16> attributes {torch.assume_strict_symbolic_shapes} {
    %__auto.token_embd.weight = util.global.load @__auto.token_embd.weight : tensor<128256x8192xf16>
    %0 = torch_c.from_builtin_tensor %__auto.token_embd.weight : tensor<128256x8192xf16> -> !torch.vtensor<[128256,8192],f16>
    %__auto.blk.0.attn_output.weight.shard.0 = util.global.load @__auto.blk.0.attn_output.weight.shard.0 : tensor<8192x1024xf16>
    %40 = torch_c.from_builtin_tensor %__auto.blk.0.attn_output.weight.shard.0 : tensor<8192x1024xf16> -> !torch.vtensor<[8192,1024],f16>
    %5793 = torch.symbolic_int "s1" {min_val = 2, max_val = 4095} : !torch.int
    torch.bind_symbolic_shape %arg0, [%5793], affine_map<()[s0] -> (4, s0 * 32)> : !torch.vtensor<[4,?],si64>
    torch.bind_symbolic_shape %arg2, [%5793], affine_map<()[s0] -> (4, s0)> : !torch.vtensor<[4,?],si64>
    %5797 = torch_c.to_builtin_tensor %arg0 : !torch.vtensor<[4,?],si64> -> tensor<4x?xi64>
    %c1 = arith.constant 1 : index
    %dim = tensor.dim %5797, %c1 : tensor<4x?xi64>
    %5798 = flow.tensor.transfer %5797 : tensor<4x?xi64>{%dim} to #hal.device.promise<@__device_0>
    %5799 = torch_c.from_builtin_tensor %5798 : tensor<4x?xi64> -> !torch.vtensor<[4,?],si64>
    %int-1 = torch.constant.int -1
    %false = torch.constant.bool false
    %false_30 = torch.constant.bool false
    %5845 = torch.aten.embedding %0, %5799, %int-1, %false, %false_30 : !torch.vtensor<[128256,8192],f16>, !torch.vtensor<[4,?],si64>, !torch.int, !torch.bool, !torch.bool -> !torch.vtensor<[4,?,8192],f16>
    %int1_134 = torch.constant.int 1
    %5949 = torch.aten.size.int %arg0, %int1_134 : !torch.vtensor<[4,?],si64>, !torch.int -> !torch.int
    %int4 = torch.constant.int 4
    %5950 = torch.aten.mul.int %int4, %5949 : !torch.int, !torch.int -> !torch.int
    %int8192 = torch.constant.int 8192
    %5951 = torch.prim.ListConstruct %5950, %int8192 : (!torch.int, !torch.int) -> !torch.list<int>
    %5952 = torch.aten.view %5845, %5951 : !torch.vtensor<[4,?,8192],f16>, !torch.list<int> -> !torch.vtensor<[?,8192],f16>
    %5953 = torch.aten.mm %5952, %40 : !torch.vtensor<[?,8192],f16>, !torch.vtensor<[8192,1024],f16> -> !torch.vtensor<[?,1024],f16>
    %int4_135 = torch.constant.int 4
    %int1024 = torch.constant.int 1024
    %5954 = torch.prim.ListConstruct %int4_135, %5949, %int1024 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %5955 = torch.aten.view %5953, %5954 : !torch.vtensor<[?,1024],f16>, !torch.list<int> -> !torch.vtensor<[4,?,1024],f16>
    %int4_236 = torch.constant.int 4
    %int8 = torch.constant.int 8
    %int128_237 = torch.constant.int 128
    %6103 = torch.prim.ListConstruct %int4_236, %5949, %int8, %int128_237 : (!torch.int, !torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %6104 = torch.aten.view %5955, %6103 : !torch.vtensor<[4,?,1024],f16>, !torch.list<int> -> !torch.vtensor<[4,?,8,128],f16>
    return %6104 : !torch.vtensor<[4,?,8,128],f16>
  }
}
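For reference, the reduced IR above boils down to an embedding lookup, a matmul against one output-projection shard, and a reshape into [bs, seq, heads, head_dim]. A rough numpy sketch of the same dataflow, using tiny stand-in shapes (the real weights are 128256x8192 and 8192x1024 with 8 heads of dim 128; this is purely illustrative, not the actual repro):

```python
import numpy as np

# Tiny stand-ins for the real shapes; purely illustrative.
vocab, hidden, shard_out = 16, 8, 4      # real: 128256, 8192, 1024
heads, head_dim = 2, shard_out // 2      # real: 8, 128
bs, seq = 4, 6

rng = np.random.default_rng(0)
embed_weight = rng.standard_normal((vocab, hidden)).astype(np.float16)        # token_embd.weight
attn_out_shard = rng.standard_normal((hidden, shard_out)).astype(np.float16)  # blk.0.attn_output.weight.shard.0
tokens = rng.integers(0, vocab, size=(bs, seq))                               # %arg0

x = embed_weight[tokens]                 # torch.aten.embedding -> [bs, seq, hidden]
x = x.reshape(bs * seq, hidden)          # torch.aten.view      -> [bs*seq, hidden]
x = x @ attn_out_shard                   # torch.aten.mm        -> [bs*seq, shard_out]
x = x.reshape(bs, seq, heads, head_dim)  # final views          -> [bs, seq, heads, head_dim]
print(x.shape)  # (4, 6, 2, 2)
```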

commands:

iree-compile \
  70b_prefill_sharded.mlir \
  --iree-hip-target=gfx942 \
  -o=70b_prefill_sharded.vmfb \
  --iree-hal-target-device="hip[0]" \
  --iree-hal-target-device="hip[1]" \
  --iree-hal-target-device="hip[2]" \
  --iree-hal-target-device="hip[3]" \
  --iree-hal-target-device="hip[4]" \
  --iree-hal-target-device="hip[5]" \
  --iree-hal-target-device="hip[6]" \
  --iree-hal-target-device="hip[7]" \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=false \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions

iree-benchmark-module \
  --hip_use_streams=true \
  --module=70b_prefill_sharded.vmfb \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank0.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank1.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank2.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank3.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank4.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank5.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank6.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank7.irpa \
  --device=hip://0 \
  --device=hip://1 \
  --device=hip://2 \
  --device=hip://3 \
  --device=hip://4 \
  --device=hip://5 \
  --device=hip://6 \
  --device=hip://7 \
  --function=prefill_bs4 \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//next_tokens.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//seq_lens.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//seq_block_ids.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//cs_f16_shard_0.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//cs_f16_shard_1.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//cs_f16_shard_2.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//cs_f16_shard_3.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//cs_f16_shard_4.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//cs_f16_shard_5.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//cs_f16_shard_6.npy \
  --input=@/data/llama3.1/weights/70b/decode_args_bs4_128_stride_32_tp8//cs_f16_shard_7.npy


IanWood1 commented Jan 5, 2025

It looks like the repro uses the decode inputs when running prefill. I think this is the cause of the issue. I ran the repro with a modified command and didn't see the error:

iree-benchmark-module \
  --hip_use_streams=true \
  --module=70b_prefill_sharded.vmfb \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank0.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank1.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank2.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank3.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank4.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank5.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank6.irpa \
  --parameters=model=/data/llama3.1/weights/70b/fp16/tp8/llama3.1_70b_fp16_tp8_parameters.rank7.irpa \
  --device=hip://0 \
  --device=hip://1 \
  --device=hip://2 \
  --device=hip://3 \
  --device=hip://4 \
  --device=hip://5 \
  --device=hip://6 \
  --device=hip://7 \
  --function=prefill_bs4 \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8/tokens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//seq_lens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//seq_block_ids.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//cs_f16_shard_0.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//cs_f16_shard_1.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//cs_f16_shard_2.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//cs_f16_shard_3.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//cs_f16_shard_4.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//cs_f16_shard_5.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//cs_f16_shard_6.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32_tp8//cs_f16_shard_7.npy
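A quick way to catch this kind of mixup before benchmarking is to dump the dtype and shape of each .npy input file: prefill token arrays span the full sequence, while decode next_tokens typically carry a single new token per sequence. A minimal sketch with hypothetical stand-in files (the shapes below are illustrative assumptions, not taken from the actual dumps):

```python
import os
import tempfile
import numpy as np

def input_shape(path):
    """Return (dtype, shape) of a saved .npy input."""
    arr = np.load(path)
    return arr.dtype, arr.shape

# Hypothetical stand-ins: prefill tokens are [bs, seq_len], while decode
# next_tokens hold one new token per sequence. Feeding the decode file to
# prefill_bs4 mismatches the expected dynamic dimension.
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, "tokens.npy"), np.zeros((4, 128), dtype=np.int64))
np.save(os.path.join(tmp, "next_tokens.npy"), np.zeros((4, 1), dtype=np.int64))

print(input_shape(os.path.join(tmp, "tokens.npy")))       # prefill-style input
print(input_shape(os.path.join(tmp, "next_tokens.npy")))  # decode-style input
```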


IanWood1 commented Jan 5, 2025

Edit: both prefill and decode hit an assert in IREE's runtime when running with the modified command. I think this may be related to other llama issues (#19573):

#5  0x00007ffff782871b in __assert_fail_base (fmt=0x7ffff79dd130 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x55555559186f "!!(iree_hal_resource_is(base_value, &iree_hal_hip_buffer_vtable))", file=file@entry=0x555555565330 "iree/runtime/src/iree/hal/drivers/hip/hip_buffer.c",
    line=line@entry=33, function=function@entry=0x555555565363 "iree_hal_hip_buffer_t *iree_hal_hip_buffer_cast(iree_hal_buffer_t *)") at ./assert/assert.c:92
#6  0x00007ffff7839e96 in __GI___assert_fail (assertion=0x55555559186f "!!(iree_hal_resource_is(base_value, &iree_hal_hip_buffer_vtable))",
    file=0x555555565330 "iree/runtime/src/iree/hal/drivers/hip/hip_buffer.c", line=33, function=0x555555565363 "iree_hal_hip_buffer_t *iree_hal_hip_buffer_cast(iree_hal_buffer_t *)") at ./assert/assert.c:101 
#7  0x000055555561f19a in iree_hal_hip_buffer_cast (base_value=<optimized out>) at /home/ianwood2/iree/runtime/src/iree/hal/drivers/hip/hip_buffer.c:33
#8  iree_hal_hip_buffer_device_pointer (base_buffer=<optimized out>) at /home/ianwood2/iree/runtime/src/iree/hal/drivers/hip/hip_buffer.c:188
#9  0x0000555555628988 in iree_hal_hip_stream_command_buffer_copy_buffer (base_command_buffer=<optimized out>, source_ref=..., target_ref=..., flags=<optimized out>)
    at /home/ianwood2/iree/runtime/src/iree/hal/drivers/hip/stream_command_buffer.c:438
#10 0x0000555555624c4f in iree_hal_hip_multi_queue_command_buffer_copy_buffer (base_command_buffer=<optimized out>, source_ref=..., target_ref=..., flags=<optimized out>)
    at /home/ianwood2/iree/runtime/src/iree/hal/drivers/hip/hip_multi_queue_command_buffer.c:286
#11 0x000055555561d01e in iree_hal_hip_device_perform_queue_read_now (user_data=<optimized out>, status=<optimized out>) at /home/ianwood2/iree/runtime/src/iree/hal/drivers/hip/hip_device.c:1872
#12 0x000055555562042f in iree_hal_hip_dispatch_thread_main (param=<optimized out>) at /home/ianwood2/iree/runtime/src/iree/hal/drivers/hip/dispatch_thread.c:66
#13 0x000055555563881b in iree_thread_start_routine (param=<optimized out>) at /home/ianwood2/iree/runtime/src/iree/base/internal/threading_pthreads.c:119
#14 0x00007ffff7894ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x00007ffff7926850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Update: tested with #19583 and could successfully run iree-benchmark-module.
