Compute follows data (CFD) means that a computation is performed on the device where its input data resides.
CFD is unambiguous when all the data resides on the same device (i.e., in SYCL terms, all arrays are associated with the same queue); otherwise CFD is ambiguous and an error should be raised.
```python
arLib_cfd.func(ar1, ar2, ar3, ...)  # computation is offloaded to the common queue
```
NB: Adopting CFD precludes arrays from being host-only data.
Data should have the following attributes:
1. `__sycl_usm_array_interface__`
2. `queue` -> `dpctl.SyclQueue`
3. `device` -> `dpctl.SyclDevice`
4. `usm_type` -> `"shared" | "device" | "host"`
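For illustration only, below is a minimal sketch of how a package might verify these attributes before applying CFD; the helper name and the exact error types are hypothetical, not part of any existing API:

```python
import dpctl

def check_cfd_compliant(obj):
    # Hypothetical duck-type check for the attributes listed above.
    if not hasattr(obj, "__sycl_usm_array_interface__"):
        raise TypeError("object does not expose __sycl_usm_array_interface__")
    if not isinstance(getattr(obj, "queue", None), dpctl.SyclQueue):
        raise TypeError("'queue' attribute must be a dpctl.SyclQueue")
    if not isinstance(getattr(obj, "device", None), dpctl.SyclDevice):
        raise TypeError("'device' attribute must be a dpctl.SyclDevice")
    if getattr(obj, "usm_type", None) not in ("shared", "device", "host"):
        raise ValueError("'usm_type' must be 'shared', 'device' or 'host'")
```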
The definition of equivalent SYCL queues is as follows.
Two SYCL queues are equivalent if they have the same:
- SYCL context
- SYCL device
- queue properties
The `dpctl` package will provide a facility to check queue equivalence.
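A minimal sketch of such a check, assuming the relevant queue properties are in-order execution and profiling enablement (the helper name is illustrative; the eventual dpctl facility may differ):

```python
import dpctl

def queues_are_equivalent(q1: dpctl.SyclQueue, q2: dpctl.SyclQueue) -> bool:
    # Equivalent queues share the same SYCL context, SYCL device,
    # and queue properties.
    return (
        q1.sycl_context == q2.sycl_context
        and q1.sycl_device == q2.sycl_device
        and q1.is_in_order == q2.is_in_order
        and q1.has_enable_profiling == q2.has_enable_profiling
    )
```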
Q: What happens when queues are equivalent but not the same? In that case, which queue will be used to submit the kernel?
A: If two SYCL queues are equivalent, the kernel can be submitted to either one. The choice is arbitrary and is left to individual package implementations.
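One possible policy, sketched below with the hypothetical `queues_are_equivalent` helper from above: verify that all queues are pairwise equivalent, then submit to the queue of the first argument.

```python
def select_execution_queue(arrays):
    # Hypothetical policy: after verifying pairwise equivalence,
    # arbitrarily pick the queue of the first array for submission.
    q0 = arrays[0].queue
    for ar in arrays[1:]:
        if not queues_are_equivalent(q0, ar.queue):
            raise ValueError("arrays have non-equivalent SYCL queues")
    return q0
```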
The rules for applying CFD semantics and selecting the queue for offloading are as follows.
Assume we have the following example kernel function:

```python
import numba_dppy

@numba_dppy.kernel
def f(a, b, c):
    i = numba_dppy.get_global_id(0)
    c[i] = a[i] * b[i]
```
- Users are not allowed to mix arrays with non-equivalent SYCL queues.
In the following examples, `gpu_queue` and `gpu_queue_1` are not equivalent; `gpu_queue_1` and `gpu_queue_2` are equivalent.
```python
a = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue")
b = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue")
c = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue")

f(a, b, c)  # offloading to gpu_queue
```
```python
a = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue")
b = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue_1")
c = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue_1")

f(a, b, c)  # raising error as queues are not equivalent
```
```python
a = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue_1")
b = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue_2")
c = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue_2")

f(a, b, c)  # offloading to gpu_queue_1 or gpu_queue_2 depending on implementation
```
- All unified shared memory types (USM-types) are accessible from the device. Users can mix arrays with different USM-types as long as they were allocated using the same SYCL queue.
Considerations: Packages can warn about a performance penalty (if there is any) when users mix USM-types.
```python
a = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="shared", queue="gpu_queue")
b = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="device", queue="gpu_queue")
c = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="host", queue="gpu_queue")

f(a, b, c)  # offloading to gpu_queue even if USM-types are different
```
- Packages should provide the following helper functions to make data compatible for CFD.
```python
data.asarray(queue, usm_type="device")
# OR
pkg.asarray(data, queue, usm_type="device")
```
In the following example, `gpu_queue_1` and `gpu_queue_2` are not equivalent.
```python
a = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="device", queue="gpu_queue_1")
b = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="device", queue="gpu_queue_2")
c = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="device", queue="gpu_queue_2")

f(a, b, c)  # raising error as queues are not equivalent
```
Users can explicitly convert data to make it CFD compatible.
```python
a_queue_2 = dpctl.tensor.asarray(a, queue="gpu_queue_2")  # potentially zero-copy

f(a_queue_2, b, c)  # offloading to gpu_queue_2
```
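A sketch of how such a helper might behave internally, illustrating the potentially zero-copy conversion shown above. The body is illustrative and reuses the hypothetical `queues_are_equivalent` helper from earlier; `dpctl.tensor.asarray` is an existing function:

```python
import dpctl.tensor

def asarray(data, queue, usm_type="device"):
    # Return `data` unchanged (zero-copy) when it is already USM data
    # allocated on an equivalent queue with the requested USM-type;
    # otherwise copy it to the requested queue.
    if (
        hasattr(data, "__sycl_usm_array_interface__")
        and queues_are_equivalent(data.queue, queue)
        and data.usm_type == usm_type
    ):
        return data
    return dpctl.tensor.asarray(data, usm_type=usm_type, sycl_queue=queue)
```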
- Users are allowed to pass `numpy.ndarray`(s) to a `@numba_dppy.kernel`. Users will have to specify the SYCL queue that should be used to copy the `numpy.ndarray` data and to submit the kernel. `numba_dppy` provides a context manager to allow users to specify the SYCL queue.
The following examples reuse the kernel function from above.
```python
import numpy

a = numpy.asarray([1, 2, 3, 4])
b = numpy.asarray([1, 2, 3, 4])
c = numpy.asarray([1, 2, 3, 4])

f(a, b, c)  # raising error as the queue to submit the kernel cannot be determined
```
Users should use the context manager to specify the SYCL queue.
```python
with numba_dppy.offload_to_sycl_device("gpu_queue"):
    f(a, b, c)  # offloading to gpu_queue
```
- Users are not allowed to mix `numpy.ndarray` and `dpctl.tensor.usm_ndarray`.
```python
a = numpy.asarray([1, 2, 3, 4])
b = numpy.asarray([1, 2, 3, 4])
c = dpctl.tensor.usm_ndarray([1, 2, 3, 4], type="device", queue="gpu_queue_2")

f(a, b, c)  # raising error as the queue to submit the kernel cannot be determined

with numba_dppy.offload_to_sycl_device("gpu_queue"):
    f(a, b, c)  # raising error as the queue to submit the kernel cannot be determined
```
Although the SYCL queue `gpu_queue_2` could be used to copy the `numpy.ndarray` data and to submit the kernel, doing so would break CFD, as the SYCL queue cannot be inferred for the data `a` and `b`.
```python
with numba_dppy.offload_to_sycl_device("gpu_queue_2"):
    f(a, b, c)  # raising error as the SYCL queue cannot be inferred for a and b
```
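Under these rules, the way to mix such data is to convert the `numpy.ndarray`(s) explicitly first, using the helper functions described earlier (queue names are placeholders, as in the examples above):

```python
a_queue_2 = dpctl.tensor.asarray(a, queue="gpu_queue_2")
b_queue_2 = dpctl.tensor.asarray(b, queue="gpu_queue_2")

f(a_queue_2, b_queue_2, c)  # offloading to gpu_queue_2
```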
Precedence of USM-types: `device` < `shared` < `host`. When arrays with different USM-types are mixed, the resulting USM-type is the lesser of the two in this ordering, as shown in the table below:
| USM-type | Device | Shared | Host |
|----------|--------|--------|--------|
| Device | Device | Device | Device |
| Shared | Device | Shared | Shared |
| Host | Device | Shared | Host |
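A minimal sketch of this coercion rule (the helper name and the ordering dictionary are illustrative):

```python
# Rank of each USM-type in the precedence ordering device < shared < host.
_USM_PRECEDENCE = {"device": 0, "shared": 1, "host": 2}

def coerce_usm_types(*usm_types):
    # The combined USM-type is the argument with the lowest rank,
    # reproducing every cell of the table above.
    return min(usm_types, key=_USM_PRECEDENCE.__getitem__)

assert coerce_usm_types("host", "shared") == "shared"
assert coerce_usm_types("device", "host") == "device"
```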