Add support for CUDA-based GPU build #3160
Conversation
@ChipKerchner Thanks for your contribution! Before we review, can you please remove the unrelated commits from this PR? I'm not sure what is causing them. I think that you will have to…
Conflicts: src/c_api.cpp and src/boosting/gbdt.cpp should be taken from the cuda branch. Not sure what is going on with R-package/src/lightgbm_R.cpp (recent commit?). How do I resolve these since they aren't showing up on my branch?
@jameslamb I had to rebase from one GitHub repository to another; this is the result. I'm not a git master, so I'll try, but I may ask for help.
Please run the following from your local clone:

```shell
# get current state of LightGBM master branch
git remote add upstream git@github.com:microsoft/LightGBM.git
git checkout master
git pull upstream master

# update the master branch on your fork
git push origin master

# update this feature branch
git checkout cuda
git rebase master

# push your changes
git push origin cuda --force
```

If you've never done this before, be sure to back up your code somewhere.
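To check that the rebase left only the intended commits, one option (a hedged sketch, not from the thread; it assumes the branch names used in the commands above) is to list the commits that exist on `cuda` but not on `master`:

```shell
# Show only commits unique to the cuda feature branch; after a clean
# rebase this should list just the commits intended for this PR.
git log --oneline master..cuda

# Double-check that no unrelated files are touched relative to master.
git diff --stat master..cuda
```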
I did as you suggested. Hopefully everything is fine now.
Great! It looks like only the changes for your PR are in the history now, thank you.
Thanks for cleaning up the git stuff! I've left a few additional minor requests here.
Can you please get continuous integration working? Right now many of the CI jobs are failing, which might mean that this pull request in its current state has breaking changes.
Some time soon you will hear from the C++ maintainers with questions about the approach and the problem this pull request solves.
I believe the first step should be getting @huanzhang12's approval of the concept, as he is the main expert in GPU computing here. Otherwise, many further actions and reviews could be pointless.
Already started a conversation a while back.
Ah, I see, OK! But those were just words, and here is the code.
Could someone tell me why the automated checks are failing and what I should do to fix them? I have a minimal x86_64 (AMD64) CUDA testing environment (MINGW64 with MSVC Debug 2019, 14.2). Main development was performed on ppc64le.
@jameslamb I looked at the logs and such, but I'm not sure what is failing, how I can reproduce it on my system(s), or how to fix it. Help would be much appreciated. Update: I fixed a problem with accidentally overriding gpu_use_dp for the GPU version. These are the problems I don't quite understand yet… Anyone who can help, please let me know.
I looked through the failing jobs, and I can say this: I was surprised to see how many changes are being added that are not inside… Building on what @StrikerRUS said in #3160 (comment), I will reserve more comments until @huanzhang12 can give a thorough review.
@StrikerRUS Since you were recently working on parts of the continuous integration setup and I have no way to duplicate it, could you help me by explaining why various parts are failing or crashing?
@ChipKerchner I see that the failing tests are from this file: https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_consistency.py. Since the error is a segfault and it doesn't depend on the OS, I believe the root cause is something fundamental in the C++ code. To reproduce the issue, just run…
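For anyone trying to reproduce this locally, a minimal invocation might look like the following (a sketch, not from the thread; it assumes the Python package built from this branch is installed and `pytest` is available):

```shell
# From the root of the LightGBM repository, run only the failing test file.
# A segfault will kill the pytest process itself rather than produce an
# ordinary test failure, which matches the CI symptom described above.
python -m pytest tests/python_package_test/test_consistency.py -v
```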
I am able to do this:
Azure starts up and:
I'm not sure what to do next. Is there anyone that can help me? @StrikerRUS @guolinke |
@ChipKerchner Thanks for enabling multi-GPU training! I hope it indeed works! However, I haven't yet gotten your input on the speedup factor gained from multiple GPUs (#3160 (comment)).
Please consider addressing some more minor comments related to code quality, and then I think this PR can finally be merged if no one else is planning to provide a review.
I have nothing else to recommend (just the question about copyrights; apologies if that has already been addressed).
I don't know enough to thoroughly review most of the code in this PR, but since all the R CI jobs are passing I'm satisfied that it doesn't impact the R package 😀
Thanks for all the hard work @ChipKerchner !
@ChipKerchner Thank you and all people who were working on this PR so much for all your hard work and patience! We really appreciate this valuable contribution!
I don't have any other comments.
@jameslamb Please update your requested-changes review status.
I am going to merge this. Thank you so much! @ChipKerchner @austinpagan
@ChipKerchner Hello! Could you please provide the GitHub nickname of a person we can contact regarding CUDA bugs?
@StrikerRUS Could you give a little more detail about the nature of the bugs, which files they may be in, and how to reproduce them?
@ChipKerchner Thanks a lot for the quick reply! The error is
and happens somewhere around here
This bug can be reproduced on both a multi-GPU configuration (see #3450 (comment)) and a single-GPU machine (see #3424 (comment)). The easiest way to reproduce it is to run…
Simply adjust… If you do not mind, let's move our further discussion to #3450, because this PR already has more than 400 comments.
Gentle ping @ChipKerchner.
@StrikerRUS Sorry, we won't be able to look at this until at least next month.
@ChipKerchner OK, got it! Thanks for your response!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, please open a new issue at https://github.com/microsoft/LightGBM/issues, including a reference to this one.
This is the initial CUDA work. It should work similarly to the GPU/OCL work.
To compile, use `USE_CUDA=1`. Python unit tests should include `'device': 'cuda'` where needed.
All unit tests pass for CPU, GPU/OCL, and CUDA. CPU and CUDA were tested on ppc64le, and GPU/OCL was tested on x86_64.
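A build-and-usage sketch consistent with the description above (hedged: exact CMake invocation may differ between LightGBM versions and platforms; `USE_CUDA=1` is the flag named in this PR):

```shell
# Build LightGBM with the CUDA device support added by this PR
# (assumes CMake and a CUDA toolkit are installed).
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
mkdir build && cd build
cmake -DUSE_CUDA=1 ..
make -j4
```

On the Python side, selecting the new device is then a matter of passing `'device': 'cuda'` in the training parameters, e.g. `lgb.train({'device': 'cuda', ...}, train_set)`.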