[FEA]: Enable using custom data types with cuda.parallel #3135
Comments
I hacked together a prototype. Below is what it looks like to use:
"""
Using cuda.parallel to operate on structs ("dataclasses") on the GPU.
"""
import numpy as np
import cupy as cp
import cuda.parallel.experimental as cudax
from gpudataclass import gpudataclass
# The @gpudataclass decorator registers `Pixel` as a user-defined
# numba type.
@gpudataclass
class Pixel:
    r: np.dtype("int32")
    g: np.dtype("int32")
    b: np.dtype("int32")
# This is the comparator we want to pass to `reduce`. It takes
# two Pixel objects as input and returns the one with the
# larger `g` component as output:
def max_g_value(x, y):
    return x if x.g > y.g else y
# Next, we need to initialize data on the device. We'll construct
# a CuPy array of size (10, 3) to represent 10 RGB values
# and view it as a structured dtype:
dtype = np.dtype([("r", "int32"), ("g", "int32"), ("b", "int32")])
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=cp.int32).view(dtype)
# Create an empty array to store the output:
d_out = cp.zeros(1, dtype)
# The initial value is provided as a Pixel object:
h_init = Pixel(0, 0, 0)
# Now, we can perform the reduction:
# compute temp storage:
reducer = cudax.reduce_into(d_rgb, d_out, max_g_value, h_init)
temp_storage_bytes = reducer(None, d_rgb, d_out, len(d_rgb), h_init)
# do the reduction:
d_temp_storage = cp.zeros(temp_storage_bytes, dtype=np.uint8)
_ = reducer(d_temp_storage, d_rgb, d_out, len(d_rgb), h_init)
# results:
print()
print("Input RGB values:")
print("-----------------")
print(d_rgb.get())
print()
print("Value with largest g component:")
print("-------------------------------")
print(d_out.get())
print()
output:
The code for the example above, and the
It would be great to get some feedback on whether this is generally a good direction for the API/implementation and what features we want to support in an MVP.
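For checking results without a GPU, the same reduction can be mirrored on the host. This is only a sketch: `HostPixel` and `reference_reduce` are hypothetical names introduced here, not part of the prototype, and the comparator is copied from the example above.

```python
from dataclasses import dataclass
from functools import reduce


# Hypothetical host-side stand-in for the Pixel struct used on the device.
@dataclass(frozen=True)
class HostPixel:
    r: int
    g: int
    b: int


# Same comparator as in the example above: keep the pixel with the larger g.
def max_g_value(x, y):
    return x if x.g > y.g else y


def reference_reduce(pixels, init):
    """Fold the comparator over the pixels, starting from the initial value."""
    return reduce(max_g_value, pixels, init)


pixels = [HostPixel(10, 200, 30), HostPixel(5, 250, 90), HostPixel(77, 13, 1)]
best = reference_reduce(pixels, HostPixel(0, 0, 0))
print(best)
```

Comparing `best` against the contents of `d_out` would be one way to sanity-check the device result for small inputs.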
Wow, this looks better than an MVP! In your comment you wrote:
Did you mean
I think what you have is great. There are only two things that come to mind looking through the code you posted in the comment above, and one isn't even related to your work:
Is this a duplicate?
Area
cuda.parallel (Python)
Is your feature request related to a problem? Please describe.
In C++, Thrust enables using algorithms with custom data types:
Example (ChatGPT generated)
We'd like to support the same use case from Python using cuda.parallel.
Describe the solution you'd like
First, we should implement a POC that shows this is possible purely from the Python side. Likely this would look similar to this numba extension example, which defines a custom numba data type and passes it to a user-defined function.
Second, we should decide what the API should look like. We probably don't want users to have to define custom numba data types and the typing/lowering for those. We should investigate what we can do on their behalf.
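One way to picture "doing it on the user's behalf" is a decorator that derives a fixed memory layout from class annotations. The sketch below uses only ctypes and is not the numba typing/lowering itself. `gpu_struct` and `_CTYPES_BY_NAME` are hypothetical names, and the string annotations are an assumption made to keep the example dependency-free.

```python
import ctypes

# Hypothetical mapping from annotation names to fixed-width C types.
_CTYPES_BY_NAME = {
    "int32": ctypes.c_int32,
    "int64": ctypes.c_int64,
    "float32": ctypes.c_float,
    "float64": ctypes.c_double,
}


def gpu_struct(cls):
    """Hypothetical decorator: build a ctypes.Structure from annotations."""
    fields = [
        (name, _CTYPES_BY_NAME[type_name])
        for name, type_name in cls.__annotations__.items()
    ]

    class _Struct(ctypes.Structure):
        _fields_ = fields

    # Keep the user's class name so reprs and error messages stay readable.
    _Struct.__name__ = cls.__name__
    return _Struct


@gpu_struct
class Pixel:
    r: "int32"
    g: "int32"
    b: "int32"


p = Pixel(1, 2, 3)
print(ctypes.sizeof(Pixel), p.g)
```

A real implementation would additionally have to register the type with numba so that attribute access such as `x.g` compiles inside the user-defined function; that part is not shown here.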
Issues around h_init for reduction
In a discussion with @gevtushenko, it came up that Thrust's reduce algorithm requires the initial value for the reduction to be passed as a host value. It is an open question how we would pass an appropriate value from Python to the underlying C++ layer. We would either need to define a ctypes struct type corresponding to the numba type, or have the C++ layer accept a pointer to device memory for the h_init argument.
Admittedly, I'm not exactly sure what other issues abound here and will update this issue as I explore/learn more.
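The first option (a ctypes struct mirroring the numba type) could look roughly like the following. This is a sketch under assumptions: `PixelStruct` is a hypothetical name, and how the C++ layer would actually receive the pointer and size is not specified in this issue.

```python
import ctypes


# Hypothetical ctypes mirror of a struct with three int32 fields (r, g, b).
class PixelStruct(ctypes.Structure):
    _fields_ = [
        ("r", ctypes.c_int32),
        ("g", ctypes.c_int32),
        ("b", ctypes.c_int32),
    ]


# Host-side initial value for the reduction.
h_init = PixelStruct(0, 0, 0)

# An opaque pointer and a byte size are what a C entry point would
# typically need in order to copy the value (assumption).
h_init_ptr = ctypes.cast(ctypes.pointer(h_init), ctypes.c_void_p)
h_init_size = ctypes.sizeof(h_init)

# The raw bytes can also be extracted and round-tripped for inspection.
raw = bytes(h_init)
restored = PixelStruct.from_buffer_copy(raw)
print(h_init_size, raw, restored.g)
```

The second option, a device pointer for h_init, would avoid the ctypes mirror but would require the value to be written to device memory first.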
Describe alternatives you've considered
No response
Additional context
No response