Add support for CUDA-based GPU build #3160
Conversation
@ChipKerchner Thanks for your contribution! Before we review, can you please remove the unrelated commits from this PR? I'm not sure what is causing them. I think that you will have to…
Conflicts: src/c_api.cpp and src/boosting/gbdt.cpp should be taken from the cuda branch. Not sure what is going on with R-package/src/lightgbm_R.cpp (recent commit?). How do I resolve these since they aren't showing up on my branch?
@jameslamb I had to rebase from one GitHub repository to another; this is the result. I'm not a git master, so I'll try, but I may ask for help.
Please run the following from your local clone:

```shell
# get current state of LightGBM master branch
git remote add upstream git@github.com:microsoft/LightGBM.git
git checkout master
git pull upstream master

# update the master branch on your fork
git push origin master

# update this feature branch
git checkout cuda
git rebase master

# push your changes
git push origin cuda --force
```

If you've never done this before, be sure to back up your code somewhere.
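To check that the rebase left only the intended commits, one option (a hedged sketch, not from the thread; it assumes the branch names used in the commands above) is to list the commits that exist on `cuda` but not on `master`:

```shell
# Show only commits unique to the cuda feature branch; after a clean
# rebase this should list just the commits intended for this PR.
git log --oneline master..cuda

# Double-check that no unrelated files are touched relative to master.
git diff --stat master..cuda
```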
I did as you suggested. Hopefully everything is fine now.
Great! It looks like only the changes for your PR are in the history now, thank you.
Thanks for cleaning up the git stuff! I've left a few additional minor requests here.
Can you please get continuous integration working? Right now many of the CI jobs are failing, which might mean that this pull request in its current state has breaking changes.
Some time soon you will hear from the C++ maintainers with questions about the approach and the problem this pull request solves.
I believe the first step should be getting @huanzhang12's approval of the concept, as he is the main expert in GPU computing here. Otherwise, many further actions and reviews could be pointless.
Already started a conversation a while back.
Ah, I see, OK! But those were just words, and here is the code.
Could someone tell me why the automated checks are failing and what I should do to fix them? I have a minimal x86_64 (AMD64) CUDA testing environment (MINGW64 with MSVC Debug 2019, 14.2). Main development was performed on ppc64le.
@jameslamb I looked at the logs and such, but I'm not sure what is failing, how I can reproduce it on my system(s), or how to fix it. Help would be much appreciated. Update: I fixed a problem with accidentally overriding gpu_use_dp for the GPU version. These are the problems I don't quite understand yet… Anyone who can help, please let me know.
I looked through the failing jobs, and I can say this: I was surprised to see how many changes are being added that are not inside… Building on what @StrikerRUS said in #3160 (comment), I will reserve more comments until @huanzhang12 can give a thorough review.
@StrikerRUS Since you were recently working on parts of the continuous integration setup and I have no way to duplicate it, could you help me by explaining why various parts are failing or crashing?
@ChipKerchner I see that the failing tests are from this file: https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_consistency.py. Since the error is a segfault and it doesn't depend on the OS, I believe the root cause is something fundamental in the C++ code. To reproduce the issue, just run…
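For anyone trying to reproduce this locally, a minimal invocation might look like the following (a sketch, not from the thread; it assumes the Python package built from this branch is installed and `pytest` is available):

```shell
# From the root of the LightGBM repository, run only the failing test file.
# A segfault will kill the pytest process itself rather than produce an
# ordinary test failure, which matches the CI symptom described above.
python -m pytest tests/python_package_test/test_consistency.py -v
```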
I am able to do this:
Azure starts up and:
I'm not sure what to do next. Is there anyone that can help me? @StrikerRUS @guolinke |
@ChipKerchner Thanks for enabling multi-GPU training! I hope it indeed works! However, I haven't yet gotten your input on the speedup factor gained from multiple GPUs (#3160 (comment)).
Please consider addressing some more minor comments related to code quality, and then I think this PR can finally be merged if no one else is planning to provide a review.
I have nothing else to recommend (just the question about copyrights; apologies if that has already been addressed).
I don't know enough to thoroughly review most of the code in this PR, but since all the R CI jobs are passing I'm satisfied that it doesn't impact the R package 😀
Thanks for all the hard work @ChipKerchner !
@ChipKerchner Thank you and all people who were working on this PR so much for all your hard work and patience! We really appreciate this valuable contribution!
I don't have any other comments.
@jameslamb Please update your requested-changes review status.
I am going to merge this. Thank you so much! @ChipKerchner @austinpagan
@ChipKerchner Hello! Could you please provide the GitHub nickname of a person we can contact regarding CUDA bugs?
@StrikerRUS Could you give a little more detail about the nature of the bugs, which files they may be in, and how to reproduce them?
@ChipKerchner Thanks a lot for the quick reply! The error is
and happens somewhere around here
This bug can be reproduced on both a multi-GPU configuration (see #3450 (comment)) and a single-GPU machine (see #3424 (comment)). The easiest way to reproduce it is to run…
Simply adjust… If you do not mind, let's move our further discussion to #3450, because this PR already has more than 400 comments.
Gentle ping @ChipKerchner.
@StrikerRUS Sorry, we won't be able to look at this until at least next month.
@ChipKerchner OK, got it! Thanks for your response!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, please open a new issue at https://github.com/microsoft/LightGBM/issues, including a reference to this one.
This is the initial CUDA work. It should work similarly to the GPU/OCL work.
To compile, use `USE_CUDA=1`. Python unit tests should include `'device': 'cuda'` where needed.
All unit tests pass for CPU, GPU/OCL, and CUDA. CPU and CUDA were tested on ppc64le, and GPU/OCL was tested on x86_64.
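A build-and-usage sketch consistent with the description above (hedged: exact CMake invocation may differ between LightGBM versions and platforms; `USE_CUDA=1` is the flag named in this PR):

```shell
# Build LightGBM with the CUDA device support added by this PR
# (assumes CMake and a CUDA toolkit are installed).
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
mkdir build && cd build
cmake -DUSE_CUDA=1 ..
make -j4
```

On the Python side, selecting the new device is then a matter of passing `'device': 'cuda'` in the training parameters, e.g. `lgb.train({'device': 'cuda', ...}, train_set)`.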