[Issue]: tried to use nn.DataParallel, however it crashed #1421
Comments
Hi @jdgh000, I was able to reproduce your issue and have opened an internal ticket for further investigation.
Hi @jdgh000, it looks like you are running on a laptop with integrated graphics. You can check if …
Thanks, let me know.
As @zichguan-amd mentioned, this has to do with the example being run on your APU rather than a dedicated graphics card. Correct me if I'm wrong, but I believe you're running on a 5900HX. Could you try running directly on your dGPU by adding this line at the top of your Python script?
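The suggested line itself was cut off in the page capture. A plausible sketch, assuming the intent was to restrict ROCm to the discrete GPU via `HIP_VISIBLE_DEVICES` (the ROCm analogue of `CUDA_VISIBLE_DEVICES`; the device index `0` is also an assumption):

```python
import os

# Hypothetical reconstruction: hide the integrated GPU so PyTorch only
# enumerates the discrete card. This must run before `import torch`,
# because the HIP runtime reads the variable at initialization.
os.environ["HIP_VISIBLE_DEVICES"] = "0"  # "0" assumed to be the dGPU index
```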
This is not an APU for sure; the CPU model I entered is wrong. The GPU is an MI250. Since the CPU model is not that important, I just typed the suggested value.
Name: AMD EPYC 7763 64-Core Processor
In that case, can you run with …
I saw the prompt and tried it a few times, but it does not seem to output much more than without it, with either TRACE or INFO.
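The exact variable requested was cut off in the capture; a guess, assuming the maintainer meant the standard NCCL logging knobs (which RCCL also honours), set before PyTorch initializes its communication backend:

```python
import os

# Assumption: the requested setting was NCCL_DEBUG; RCCL reuses NCCL's
# environment variables. Set these before importing torch so the comms
# backend picks them up at initialization.
os.environ["NCCL_DEBUG"] = "INFO"            # or "TRACE" for maximum verbosity
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"  # limit logging to init/collectives
```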
It seems to be failing in one of these: …
With …
It is already torch 2.6.1 and ROCm 6.2.4.
You said you reproduced it; shouldn't you be able to look into this instead of poking around blindly? I am not able to do the experimental steps at this point; I have reported enough for you to see it on your side. Secondly, your reasoning and logic here are very weak: you already saw it on your system, but later tried to attribute it to an APU. The claim that it is due to an APU is already negated by the fact that you reproduced it on your side. @zichguan-amd please don't have me try fruitless steps, i.e. debug environment variables and upgrading; that is just spinning the wheels. Please instead follow the reasoning and logic to address this issue!
We were only able to reproduce this issue when using integrated graphics, so we kindly ask you to provide more details in order for us to help you find a fix.
It is not working on ROCm. On an NVIDIA RTX GPU:
What made you think you were able to reproduce it only on integrated graphics? It does not say that anywhere above. I gave you all the relevant information: GPU model and ROCm version; you just ignored those and asked again. What you say here makes no sense, because you just changed the story, now saying it is only reproducible on integrated graphics. Could you paste the logs from both integrated and discrete GPUs? I don't think you can, because it makes no sense!
Problem Description
Ran the following example:
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html with a small modification, but it failed during the run:
If I wrap the model in nn.DataParallel the crash occurs; without it, the run works:
model = nn.DataParallel(model)
code:
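The actual script was not captured here; below is a minimal sketch based on the linked tutorial, with the `nn.DataParallel` wrapper the report identifies as the trigger. The model, shapes, and script structure are illustrative assumptions, not the reporter's exact code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the tutorial's model (assumed, since the original
# snippet was not pasted in the issue).
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ToyModel()
if torch.cuda.device_count() > 1:
    # This is the wrapper the report says triggers the crash on ROCm;
    # it replicates the model across all visible GPUs per forward pass.
    model = nn.DataParallel(model)
model.to(device)

out = model(torch.randn(4, 10, device=device))
print(out.shape)  # torch.Size([4, 2])
```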
Operating System
rhel9
CPU
9500hx ryzen
GPU
mi250
ROCm Version
ROCm 6.2.0
ROCm Component
rccl
Steps to Reproduce
Run the example code with nn.DataParallel (actual code pasted in the problem description):
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response