Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA]: Provide generic and safe C++ interfaces for warp shuffle #2976

Open
1 task done
fbusato opened this issue Nov 27, 2024 · 18 comments · May be fixed by #3210
Open
1 task done

[FEA]: Provide generic and safe C++ interfaces for warp shuffle #2976

fbusato opened this issue Nov 27, 2024 · 18 comments · May be fixed by #3210
Assignees
Labels
feature request New feature or request. good first issue Good for newcomers.

Comments

@fbusato
Copy link
Contributor

fbusato commented Nov 27, 2024

Is this a duplicate?

Area

libcu++

Is your feature request related to a problem? Please describe.

CUDA provides warp shuffle intrinsics that support a limited set of types. Secondly, they there are not check to validate the inputs
see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions

Describe the solution you'd like

Provide:

  • cuda::shfl(T var, int srcLane, unsigned mask = 0xFFFFFFFF, int width=warpSize)
  • cuda::shfl_up(T var, int delta, unsigned mask = 0xFFFFFFFF, int width=warpSize)
  • cuda::shfl_down(T var, int delta, unsigned mask = 0xFFFFFFFF, int width=warpSize)
  • cuda::shfl_xor(T var, int laneMask, unsigned mask = 0xFFFFFFFF, int width=warpSize)

Features and checks:

  • All shuffle operations work with any trivially copyable types with arbitrary sizes + extended floating points
  • width is a power of two and 1 <= width <= WarpSize
  • mask is a subset of __activemask()
  • cuda::shfl
    • srcLane is part of mask
    • 0 <= srcLane < width ([optional] no modulo behavior)
  • cuda::shfl_up
    • 1 < delta < width
    • max(laneid - delta, 0) is part of mask
  • cuda::shfl_down
    • 1 < delta < width
    • min(laneid + delta, width) is part of mask
  • cuda::shfl_xor
    • clamp(laneid ^ laneMask, 0, width) is part of mask
  • check that mask value is the same for all participating lanes (__match_all_sync()) [optional]

Describe alternatives you've considered

An alternative could use mask at the end of the parameter list

Additional context

No response

@fbusato fbusato added the feature request New feature or request. label Nov 27, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 27, 2024
@miscco miscco added the good first issue Good for newcomers. label Nov 30, 2024
@soumikiith
Copy link

soumikiith commented Dec 11, 2024

Hi, if I understand correctly, you might want something like this. I have only done this for one of the functions mentioned. If this is correct or if any additional features need to be added, I would happily like to take up this problem.

template <typename T>
__device__ T shfl(T var, int srcLane, unsigned mask = 0xFFFFFFFF, int width = warpSize) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    assert(width <= warpSize && "Width must not exceed warp size");
    assert(isPowerofTwo(width) && "width must be power of 2");

    return __shfl_sync(mask, var, srcLane, width);

   //The rest of the code comes here.
}

NOTE: The code provided is like an MWE to understand whether it meets the actual expectations.

@miscco
Copy link
Collaborator

miscco commented Dec 11, 2024

Hi @soumikiith ,

this is definitely going in the right direction. We would need some additional constraints though

  • The docs state the supported types, so I am not sure whether we can actually go with is_trivially_copyable
  • We would want to SFINAE away the functions if they do not meet the requirements (we have a set of macros _CCCL_TEMPLATE and _CCCL_REQUIRES that make this easier)
  • We need to use macros for a lot of stuff, like __device__ needs to be _CCCL_DEVICE and assert would become _CCCL_ASSERT
  • We want this most likely in either <cuda/cmath> or <cuda/functional>. But we need to ensure that we only define those functions of there is an actual cuda compiler used (See uses of _CCCL_HAS_CUDA_COMPILER)

@soumikiith
Copy link

Hi @miscco ,
Thank you for your feedback. I will look into this. I genuinely thank you because of the MACROS. I did not know that. I will surely implement using them.
Thank you again!!.

@miscco
Copy link
Collaborator

miscco commented Dec 11, 2024

Also regarding extended floating point support, have a look at e.g cmath extensions

@fbusato
Copy link
Contributor Author

fbusato commented Dec 11, 2024

please note that __shfl_sync only supports 32-bit values at HW-level. Any other type size must be cast from/to a common type (e.g. array of uint32_t).
This requires bit_cast or ::memcpy (for __half, __nv_bfloat16, composition of them, etc.)

@miscco
Copy link
Collaborator

miscco commented Dec 11, 2024

We can bitcast __half and __nvbfloat just fine and should easily extend that to any other extended floating point vector type

@fbusato
Copy link
Contributor Author

fbusato commented Dec 11, 2024

We can bitcast __half and __nvbfloat just fine and should easily extend that to any other extended floating point vector type

we need to extend this mechanism to raw array, cuda::std::array, cuda::std::pair, composition of them, etc..

@soumikiith
Copy link

Hi, I am currently finishing off on the issue. Sorry for the delayed response. I just want to check whether my implementation is aligned with the expectations of the issue. I am sharing a minimal code. If anything needs to be addressed, please let me know. Waiting for any further comments or any review. Thank You.

namespace cuda {
    // Helper to check if a type is supported for warp shuffle operations
    template <typename T>
    struct is_supported_type : std::false_type {};
   //more code here.
    #if defined(_CCCL_HAS_NVFP16)
        template <> struct is_supported_type<__half> : std::true_type {};
        template <> struct is_supported_type<__half2> : std::true_type {};
    #endif

//more code supporting floating vector types here.
 
    //Helper function to validate shuffle inputs
    inline void validate_shuffle_inputs(int width)
    {
        _CCCL_ASSERT((width > 0 && (width & (width - 1)) == 0), "Width must be a power of two");
    }

    template <typename T>
    _CCCL_DEVICE T shfl(T var, int srcLane, unsigned mask = 0xFFFFFFFF, int width = warpSize) {
    _CCCL_ASSERT(is_supported_type<T>::value, "T must be a supported type for warp shuffle operations");
    validate_shuffle_inputs(width);

//more code here

    if constexpr(sizeof(T) == 2)
    {
        float packed_data = _CCCL_BUILTIN_BIT_CAST(float, var);
        float result = __shfl_sync(mask, packed_data, srcLane, width);
        return _CCCL_BUILTIN_BIT_CAST(T, result);
    }
//more code here
}

P.S. I am implementing the code inside <cuda/cmath> as you mentioned.

@miscco
Copy link
Collaborator

miscco commented Dec 16, 2024

Sorry for the delayed response.

No need to apologize, we deeply appreciate you putting your time in.

I just want to check whether my implementation is aligned with the expectations of the issue. I am sharing a minimal code. If anything needs to be addressed, please let me know. Waiting for any further comments or any review. Thank You.

Generally we prefer to discuss within a draft PR if there are still design issues. We recently changed our CI so that it does not run on draft PRs so there is no drawback in opening a draft PR, even if it is just a sketch.

struct is_supported_type : std::false_type {};

//more code here.
#if defined(_CCCL_HAS_NVFP16)
template <> struct is_supported_type<__half> : std::true_type {};
template <> struct is_supported_type<__half2> : std::true_type {};
#endif

We are generally prefering variable templates over types. Note that we might need this for C++11 support and compilers with insufficient variable templates support

//Helper function to validate shuffle inputs
inline void validate_shuffle_inputs(int width)
{
    _CCCL_ASSERT((width > 0 && (width & (width - 1)) == 0), "Width must be a power of two");
}

This should just be directly inlined into the code. This would add a function call in debug mode when its not necessary

P.S. I am implementing the code inside <cuda/cmath> as you mentioned.

To be sure, we want this code to live in something like <cuda/__cmath/safe_shuffle.h> and which is then included into <cuda/cmath>

@vrajvaghela89
Copy link

"Hi, I'd like to work on this issue. Could you assign it to me?"

@miscco
Copy link
Collaborator

miscco commented Dec 19, 2024

@vrajvaghela89 there is already @soumikiith working on it

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Dec 20, 2024
@pciolkosz
Copy link
Contributor

pciolkosz commented Dec 21, 2024

@fbusato Why not use cooperative groups shfl? Is it because of some very unusual mask pattern?
CG shfl has support for arbitrary trivially copyable types, not sure about some of the other requirements. But some of the points are probably already enforced by the tile model.

@fbusato
Copy link
Contributor Author

fbusato commented Dec 21, 2024

that's a good point. Do you mean using CG as a backend?
anyway, is_trivially_copyable is not sufficient. I don't understand the restriction related to sizeof(T) <= 32

@pciolkosz
Copy link
Contributor

Either backend or just use CG where the proposed functionality would be used.
I think the restriction for size <= 32 comes from performance analysis, I don't remember the details, but its possible that for larger types its better to use shared memory instead, since you can use a few very wide store/loads, instead of shuffling every 4 bytes. @wmaxey might remember more where the restriction comes from.
Regarding trivially_copyable, I think shuffle has the exact same restrictions like memcpy, but I can see some cases where some opt-in for a type that is technically not trivially copyable could be useful.

@fbusato
Copy link
Contributor Author

fbusato commented Dec 21, 2024

I guess size <= 32 performance issue is related to register usage

Regarding trivially_copyable, I think shuffle has the exact same restrictions like memcpy, but I can see some cases where some opt-in for a type that is technically not trivially copyable could be useful.

the main use cases are the extended floating point types

@miscco
Copy link
Collaborator

miscco commented Jan 3, 2025

I don't understand the restriction related to sizeof(T) <= 32

I do not follow, the spec explicitly allows the extended floating point types, but also explicitly forbids all other arithmetic types with sizeof(T) < sizeof(int)

@fbusato
Copy link
Contributor Author

fbusato commented Jan 3, 2025

I do not follow, the spec explicitly allows the extended floating point types, but also explicitly forbids all other arithmetic types with sizeof(T) < sizeof(int)

Shuffles map to HW operations. At HW level, they can move only 32-bit at once. Everything else is emulated.

@fbusato
Copy link
Contributor Author

fbusato commented Jan 3, 2025

There is also another (very nice) optimization that we can include. Standard warp shuffle operations don't allow interacting with the PTX predicate, which is very common to build efficient functionalities like reduce and scan.
I will create a PR to provide PTX shuffle.

@cccl-authenticator-app cccl-authenticator-app bot moved this from In Review to In Progress in CCCL Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request. good first issue Good for newcomers.
Projects
Status: In Progress
Development

Successfully merging a pull request may close this issue.

5 participants