Error with GPU-only Image Decoding in NVIDIA DALI Pipeline #5697
Comments
Hi @aafaqin, Thank you for reaching out. |
Hi @JanuszL,
Thank you for the clarification. I understand now that mixed mode is essential because the initial decoding stages run on the CPU before GPU processing can take place.
Since our goal is to improve performance through GPU utilization, I am curious whether we can integrate GPUDirect Storage (GDS) with DALI to transfer data directly from storage to GPU memory, bypassing the CPU. Could this remove the need for CPU involvement in the initial decoding steps, or could the pipeline be adjusted to support such a configuration?
Additionally, we are exploring ways to write images to disk with JPEG compression and are considering nvJPEG combined with cuFile for efficient disk writes. Would you recommend this approach, or is there an alternative within DALI or NVIDIA's other libraries that would suit our needs better?
Looking forward to your insights.
Best regards |
Hi @aafaqin,
I'm afraid this is not currently possible, as the decoding process requires some work to happen on the CPU first: stream parsing and, in the case of the hybrid (non-hardware) decoding approach, Huffman decoding of the coefficients.
DALI hasn't approached encoding yet. Technically it should be feasible; however, I'm not sure whether the encoded images end up in CPU or GPU memory. You may try using nvImageCodec for decoding and kvikio for GDS access. |
Thanks for the help so far. On the same code, I am trying out different approaches, for example:
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn
import nvidia.dali.types as types

class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, external_data):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        # self.input = fn.external_source(source=external_data, num_outputs=2, dtype=[types.UINT8, types.INT32])
        self.input = fn.external_source(source=external_data, num_outputs=2,
                                        dtype=[types.UINT8, types.INT32],
                                        parallel=True, prefetch_queue_depth=16, batch=True)

    def define_graph(self):
        self.jpegs, self.labels = self.input
        # "mixed" decoder: encoded input on the CPU, decoded output on the GPU
        self.decode = fn.decoders.image(self.jpegs, device="mixed", output_type=types.RGB)
        self.resize = fn.resize(self.decode, device="gpu", resize_x=1600, resize_y=1600)
        # self.prem = fn.transpose(self.resize, perm=[2,0,1], dtype=types.FLOAT)
        self.cmnp = fn.crop_mirror_normalize(self.resize, device="gpu",
                                             dtype=types.FLOAT,
                                             output_layout="CHW",
                                             crop=(1600, 1600),
                                             mean=[0.0, 0.0, 0.0],
                                             std=[255.0, 255.0, 255.0])
        return self.cmnp, self.labels
Still, only one CPU core is being used (at 100% utilisation). I have a 64-core CPU; how can I spread the work across the cores? |
Hi @aafaqin,
It means you use only 1 DALI thread (see |
I've set the num_threads in the DALI pipeline to match the number of CPU cores (64 in my case) and verified the DALI_AFFINITY_MASK. Despite this, I am not seeing any significant performance improvement when increasing the batch size. The average processing speed per image remains unchanged, regardless of adjustments to the batch size. Do you have any insights on what could be causing this bottleneck? Could it be related to how external inputs are being processed or perhaps the GPU-CPU synchronization? Any suggestions to optimize this further would be greatly appreciated. |
Can you try capturing a profile of the processing with Nsight and share what it looks like? |
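Alongside a profile, a rough throughput sweep can help confirm whether per-image speed really stays flat as the batch size grows. The helper below is only a minimal sketch and not part of DALI: `images_per_second` and its parameters are hypothetical names, and `run_batch` stands for any zero-argument callable that processes one batch.

```python
import time


def images_per_second(run_batch, batch_size, iterations=20, warmup=3):
    """Return the average throughput of run_batch() in images per second.

    run_batch  -- hypothetical zero-argument callable that processes one batch
    batch_size -- number of images handled by each call
    """
    # Warm-up calls are excluded from timing so one-off setup cost
    # (allocations, lazy initialisation) does not skew the result.
    for _ in range(warmup):
        run_batch()

    start = time.perf_counter()
    for _ in range(iterations):
        run_batch()
    elapsed = time.perf_counter() - start

    return batch_size * iterations / elapsed
```

With a built pipeline, something like `images_per_second(pipe.run, batch_size=64)` could be repeated for several batch sizes. If the number does not move, the bottleneck is likely outside the GPU operators, for example in the code feeding the external source.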
Describe the question.
I’m encountering an issue while running a DALI pipeline with GPU-only decoding. The pipeline works when the fn.decoders.image operator is set to "mixed" mode, but it fails with device="gpu" mode, throwing an error about incompatible device storage for the input. Here’s the setup and error details:
Code:
Error:
RuntimeError: Assert on "IsCompatibleDevice(dev, inp_dev, op_type)" failed:
The input 0 for gpu operator nvidia.dali.fn.decoders.image is stored on incompatible device "cpu". Valid device is "gpu".
GPU and Platform Information:
CUFile GDS Check: Here are the results from running gdscheck:
Additional Notes: The pipeline works when device="mixed" is used for fn.decoders.image, but switching to device="gpu" causes the error. I’m using external data for fn.external_source, which may be causing the device compatibility issue. The goal is to decode directly on the GPU to optimize performance.
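For reference, the working configuration described above can be sketched as follows. This is a hedged illustration rather than the original code from the issue: `build_decode_pipeline` and `source` are hypothetical names, and `source` is assumed to yield pairs of encoded-JPEG and label batches as NumPy arrays. The point is that `fn.external_source` hands back CPU-resident data by default, which is what the "mixed" decoder expects as input.

```python
def build_decode_pipeline(source, batch_size=8, num_threads=4, device_id=0):
    """Build a DALI pipeline whose decoder takes CPU input and emits GPU output.

    `source` is a hypothetical callable or iterable yielding
    (encoded_jpeg_batch, label_batch) pairs.
    """
    # Imported lazily so this module still loads on machines without DALI.
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn
    import nvidia.dali.types as types

    @pipeline_def(batch_size=batch_size, num_threads=num_threads, device_id=device_id)
    def decode_pipeline():
        # external_source produces data in CPU memory by default.
        jpegs, labels = fn.external_source(
            source=source, num_outputs=2, dtype=[types.UINT8, types.INT32]
        )
        # "mixed" means: encoded input on the CPU, decoded output on the GPU.
        images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
        # Operators fed with GPU data run on the GPU.
        images = fn.resize(images, resize_x=1600, resize_y=1600)
        return images, labels

    pipe = decode_pipeline()
    pipe.build()
    return pipe
```

Everything downstream of the decoder already runs on the GPU in this layout, so only stream parsing and the early decoding stages remain on the CPU.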