Update the documentation for building Pybind11 SYCL Backend with CUDA #1843

Open
sreerajkksd opened this issue Sep 17, 2024 · 2 comments

Comments

@sreerajkksd

Hi, I'm trying to build the pybind11 extension from dpctl's onemkl_gemv example for the CUDA backend:
https://github.com/IntelPython/dpctl/tree/master/examples/pybind11/onemkl_gemv

The example as written does not pass all of its test cases. The build succeeds with the following changes, but some tests still fail:

	--- a/examples/pybind11/onemkl_gemv/CMakeLists.txt
	+++ b/examples/pybind11/onemkl_gemv/CMakeLists.txt
	@@ -41,6 +41,9 @@ pybind11_add_module(${py_module_name}
	     ${_sources}
	 )
	 add_sycl_to_target(TARGET ${py_module_name} SOURCES ${_sources})
	+target_compile_options(${py_module_name} PRIVATE -fsycl-targets=nvptx64-nvidia-cuda)
	+target_link_options(${py_module_name} PRIVATE -fsycl-targets=nvptx64-nvidia-cuda)
	+
	 target_compile_definitions(${py_module_name} PRIVATE -DMKL_ILP64)
	 target_include_directories(${py_module_name}
	     PUBLIC ${MKL_INCLUDE_DIR} sycl_gemm

I also had to pass an additional flag when building sycl_gemv:

-DDpctl_DIR=<DPCTL_DIR>/cmake

Sample reproducer:

SYCL_PI_TRACE=1 python3 -c 'import dpctl; import dpctl.tensor as dpt; import numpy as np; from sycl_gemm import gemv; q = dpctl.SyclQueue(); Mnp, vnp = np.random.randn(5, 3), np.random.randn(3); M = dpt.asarray(Mnp, sycl_queue=q); v = dpt.asarray(vnp, sycl_queue=q); r = dpt.empty((5,), dtype=v.dtype, sycl_queue=q); hev, ev = gemv(q, M, v, r, []); hev.wait(); rnp = dpt.asnumpy(r);' 
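
For readability, the one-liner above can be unrolled into a script. This is only a sketch: it assumes the `sycl_gemm` module from the example has been built and is importable, and the final NumPy comparison is an addition that is not part of the original reproducer.

```python
def run_reproducer():
    """Unrolled version of the one-line reproducer from this report."""
    # Imports are local so this file can be loaded on machines without dpctl.
    import numpy as np
    import dpctl
    import dpctl.tensor as dpt
    from sycl_gemm import gemv  # extension module built from the example

    q = dpctl.SyclQueue()  # default-selected device (the A100 in this report)

    # Host inputs: a 5x3 matrix and a length-3 vector
    Mnp, vnp = np.random.randn(5, 3), np.random.randn(3)

    # Device copies allocated on the same queue
    M = dpt.asarray(Mnp, sycl_queue=q)
    v = dpt.asarray(vnp, sycl_queue=q)
    r = dpt.empty((5,), dtype=v.dtype, sycl_queue=q)

    # The call that raises PI_ERROR_INVALID_BINARY in this report
    hev, ev = gemv(q, M, v, r, [])
    hev.wait()

    rnp = dpt.asnumpy(r)
    # Not in the original reproducer: compare against the NumPy result
    return np.allclose(rnp, Mnp @ vnp)
```

Running it under `SYCL_PI_TRACE=1`, as in the command above, should produce the same trace.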

While executing this, it failed with:

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 15.47.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_level_zero.so [ PluginVersion: 15.47.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 15.49.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 15.47.1 ]
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]:   device: NVIDIA A100 80GB PCIe
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]:   device: NVIDIA A100 80GB PCIe
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]:   device: NVIDIA A100 80GB PCIe
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Native API failed. Native API returns: -42 (PI_ERROR_INVALID_BINARY) -42 (PI_ERROR_INVALID_BINARY)
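
When comparing traces between machines, it can help to reduce a log like the one above to a few fields. The helper below is a hypothetical convenience, not part of dpctl or DPC++. It extracts only what the log already states (loaded plugins and their versions, selected devices, and the native error) and draws no conclusion about the cause.

```python
import re

# Patterns follow the line shapes in the trace quoted above.
PLUGIN_RE = re.compile(
    r"Plugin found and successfully loaded: (\S+) \[ PluginVersion: ([\d.]+) \]"
)
DEVICE_RE = re.compile(r"SYCL_PI_TRACE\[all\]:\s+device: (.+)")
ERROR_RE = re.compile(r"Native API returns: (-?\d+) \((\w+)\)")


def summarize_trace(text):
    """Return loaded plugins, selected devices and the first native error."""
    plugins = dict(PLUGIN_RE.findall(text))
    devices = sorted({name.strip() for name in DEVICE_RE.findall(text)})
    match = ERROR_RE.search(text)
    error = (int(match.group(1)), match.group(2)) if match else None
    return {"plugins": plugins, "devices": devices, "error": error}
```

Feeding it the captured stderr of the reproducer gives a small dictionary that is easier to paste into a report than the full trace.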

Tracing this back to the source that is invoked, the failure happens when executing the following code (on GitHub):

    if (v_typenum == api.UAR_DOUBLE_) {
        using T = double;
        sycl::event gemv_ev = oneapi::mkl::blas::row_major::gemv(
            q, oneapi::mkl::transpose::nontrans, n, m, T(1),
            reinterpret_cast<T *>(mat_typeless_ptr), m,
            reinterpret_cast<T *>(v_typeless_ptr), 1, T(0),
            reinterpret_cast<T *>(r_typeless_ptr), 1, depends);
        res_ev = gemv_ev;
    }

... and SYCL_PI_TRACE=-1 reported:

    ---> piextDeviceSelectBinary(
            <unknown> : 0x67c2de0
            <unknown> : 0x68d3780
            <unknown> : 1
            <unknown> : 0x7ffcb6131ebc
    ) --->  pi_result : -42
            [out]<unknown> ** : 0x68d3780[ 0x7f37efe416b0 ... ]

python -m dpctl --full-list reports the following:

> python -m dpctl --full-list
Platform  0 ::
    Name        Intel(R) OpenCL
    Version     OpenCL 3.0 LINUX
    Vendor      Intel(R) Corporation
    Backend     opencl
    Num Devices 1
      # 0
        Name                Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
        Version             2024.18.7.0.11_160000
        Filter string       opencl:cpu:0
Platform  1 ::
    Name        NVIDIA CUDA BACKEND
    Version     CUDA 12.5
    Vendor      NVIDIA Corporation
    Backend     ext_oneapi_cuda
    Num Devices 1
      # 0
        Name                NVIDIA A100 80GB PCIe
        Version             CUDA 12.5
        Filter string       cuda:gpu:0
@oleksandr-pavlyk
Collaborator

oleksandr-pavlyk commented Sep 17, 2024

@sreerajkksd Thank you for your interest. I'll answer briefly and refer you to our poster at SciPy 2024: https://intelpython.github.io/portable-data-parallel-extensions-scipy-2024/

The poster's companion material, https://github.com/IntelPython/example-portable-data-parallel-extensions/tree/main, contains examples of building Python extensions with DPC++ that target NVIDIA GPUs, including one that uses oneMKL.


This example in dpctl is written to be built with the oneAPI MKL library (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html). The BLAS portion of this library provides implementations for x86-64 CPUs and for SPIR-capable devices. In particular, the library does not contain offload sections for NVIDIA or AMD GPUs.

The oneMKL interface library, https://github.com/oneapi-src/oneMKL, is a C++ library that uses the oneAPI MKL library for CPUs and SPIR devices, cuBLAS/cuSOLVER for NVIDIA GPUs, and rocBLAS/rocSOLVER for AMD GPUs. It needs to be built, and I'd refer you to the poster material and documentation for more details.

It is a good idea to reference that material in the README of this dpctl example, though! Thanks for the suggestion.
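
Given the distinction above, a caller could check the queue's backend before invoking a `gemv` built against the oneAPI MKL library, so that an unsupported device fails early with a readable message. The sketch below is illustrative only. The allow-list `SPIR_BACKENDS` and both helper names are hypothetical, and treating `opencl` and `level_zero` as the SPIR-capable backends is an assumption. The only dpctl API used is the device's `filter_string` property, whose values match the "Filter string" column in the report above.

```python
# Assumption: these filter-string backends are the SPIR-capable ones.
SPIR_BACKENDS = {"opencl", "level_zero"}


def can_use_oneapi_mkl(filter_string):
    """True if a filter string such as 'opencl:cpu:0' names a SPIR backend."""
    backend = filter_string.split(":", 1)[0]
    return backend in SPIR_BACKENDS


def require_spir_backend(q):
    """Raise early if queue `q` targets a backend outside SPIR_BACKENDS.

    `q` is expected to be a dpctl.SyclQueue for a root device.
    """
    fs = q.sycl_device.filter_string
    if not can_use_oneapi_mkl(fs):
        raise RuntimeError(
            f"device '{fs}' is not in the assumed SPIR-capable set; "
            "the oneMKL interface library would be needed for it"
        )
```

With the devices listed in this issue, `opencl:cpu:0` would pass the check and `cuda:gpu:0` would be rejected before reaching the BLAS call.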

@ndgrigorian
Collaborator

> I also had to add an additional flag as well while building sycl_gemv:
> -DDpctl_DIR=<DPCTL_DIR>/cmake

Yes, I also had to add the following option when building:
-DDpctl_ROOT=$(python -m dpctl --cmakedir)

The example should be updated to account for this as well.
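
For build scripts driven from Python, the option above can be assembled as follows. This is a minimal sketch: the two function names are hypothetical, only the `-DDpctl_ROOT` option and the `python -m dpctl --cmakedir` query come from this thread, and any other options the example's README requires are left out.

```python
import subprocess
import sys


def query_dpctl_cmakedir():
    """Ask the installed dpctl for its CMake directory."""
    result = subprocess.run(
        [sys.executable, "-m", "dpctl", "--cmakedir"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def cmake_configure_command(source_dir, dpctl_cmakedir):
    """Build the cmake configure argv with Dpctl_ROOT set explicitly."""
    return ["cmake", source_dir, f"-DDpctl_ROOT={dpctl_cmakedir}"]
```

A caller would pass `cmake_configure_command(".", query_dpctl_cmakedir())` to `subprocess.run` from the build directory.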
