How does launch_transform4d_0213 work inside Transformer Kernel #1415
li-yi-dong asked this question in Q&A · Unanswered · 0 replies
I'm confused by the function `launch_transform4d_0213` inside the Transformer Kernel. Its input is the output of `_attn_context.Forward` (`buf_1`), whose size is `hidden_size * seq_length * batch_size`. But inside `launch_transform4d_0213`, it launches a CUDA kernel with `dim3 grid_dims(batch_size, heads * ((seq_length - 1) / 8 + 1), trans_count)`.
Notice that `blockIdx.y` can easily be larger than `heads`. Inside the CUDA kernel, `d0_stride = hidden_dim * seq_length`, and the input is accessed by `float4 vals_vec = in_vec[cnt * d0_stride * gridDim.x + d0 * d0_stride + d1 * d1_stride + d2 * d2_stride + d3]`, which means `d1 * d1_stride` should never be larger than `d0_stride`. Since `d1_stride = d0_stride / heads`, `d1` should never exceed `heads`. But `d1 = blockIdx.y / ((seq_length - 1) / blockDim.y + 1)`. For `heads` > 8, `((seq_length - 1) / blockDim.y + 1)` = 1, and `d1` would equal `blockIdx.y`, which may exceed `heads`.
This makes the input index larger than meaningful. Am I misunderstanding the output of `_attn_context.Forward`, or `launch_transform4d_0213`? Below are the source codes: