How does launch_transform4d_0213 work inside Transformer Kernel #1415
li-yi-dong asked this question in Q&A · Unanswered · 0 replies
I'm confused by the function `launch_transform4d_0213` inside the Transformer Kernel. Its input is the output of `_attn_context.Forward` (`buf_1`), whose size is `hidden_size * seq_length * batch_size`. But inside `launch_transform4d_0213`, it launches a CUDA kernel with `dim3 grid_dims(batch_size, heads * ((seq_length - 1) / 8 + 1), trans_count)`.
Notice that `blockIdx.y` can easily be larger than `heads`. Inside the CUDA kernel, `d0_stride = hidden_dim * seq_length`, and the input is accessed by `float4 vals_vec = in_vec[cnt * d0_stride * gridDim.x + d0 * d0_stride + d1 * d1_stride + d2 * d2_stride + d3]`, which means `d1 * d1_stride` should never be larger than `d0_stride`. Since `d1_stride = d0_stride / heads`, `d1` should never exceed `heads`. But `d1 = blockIdx.y / ((seq_length - 1) / blockDim.y + 1)`. For `heads` > 8, `((seq_length - 1) / blockDim.y + 1)` = 1, and `d1` would equal `blockIdx.y`, which may exceed `heads`.
This makes the input index larger than meaningful. Am I misunderstanding the output of `_attn_context.Forward`, or `launch_transform4d_0213`? Below are the source codes: