MIG deployment of Triton causes "CacheManager Init Failed. Error: -17" #7906

Open
LSC527 opened this issue Dec 25, 2024 · 0 comments

LSC527 commented Dec 25, 2024

Description
The same deployment behaves differently depending on whether the GPU has MIG enabled. With MIG, DCGM is unable to start:

CacheManager Init Failed. Error: -17
W1225 10:48:27.718944 4706 metrics.cc:811] "DCGM unable to start: DCGM initialization error"

Similar to #3506, but not caused by insufficient memory.
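
As a first check (not from the original logs), the MIG instances and DCGM's view of them can be listed with standard NVIDIA tooling; if dcgmi fails here as well, the problem is in DCGM's MIG handling rather than in Triton itself:

nvidia-smi -L        # lists GPUs and their MIG instances with UUIDs
dcgmi discovery -l   # lists the devices DCGM can see (requires the dcgmi CLI in the image)
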
Triton Information
nvcr.io/nvidia/tritonserver:24.11-py3

To Reproduce
GPUs w/ MIG

sudo docker run -it --rm --network=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -e NVIDIA_VISIBLE_DEVICES=0:0 nvcr.io/nvidia/tritonserver:24.11-py3 \
  tritonserver --model-repository {my_model_path}

(NVIDIA_VISIBLE_DEVICES=0:0 selects the first MIG instance on GPU 0.)

outputs:

I1225 10:48:25.952289 4706 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7f24b8000000' with size 268435456"
I1225 10:48:25.954209 4706 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1225 10:48:25.958281 4706 model_lifecycle.cc:473] "loading: onnx:1"
I1225 10:48:25.960593 4706 onnxruntime.cc:2875] "TRITONBACKEND_Initialize: onnxruntime"
I1225 10:48:25.960634 4706 onnxruntime.cc:2885] "Triton TRITONBACKEND API version: 1.19"
I1225 10:48:25.960657 4706 onnxruntime.cc:2891] "'onnxruntime' TRITONBACKEND API version: 1.19"
I1225 10:48:25.960665 4706 onnxruntime.cc:2921] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1225 10:48:25.977518 4706 onnxruntime.cc:2986] "TRITONBACKEND_ModelInitialize: onnx (version 1)"
I1225 10:48:25.978169 4706 onnxruntime.cc:984] "skipping model configuration auto-complete for 'onnx': inputs and outputs already specified"
I1225 10:48:25.978790 4706 onnxruntime.cc:3051] "TRITONBACKEND_ModelInstanceInitialize: onnx_0_0 (GPU device 0)"
I1225 10:48:27.703699 4706 model_lifecycle.cc:849] "successfully loaded 'onnx'"
I1225 10:48:27.703793 4706 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1225 10:48:27.703839 4706 server.cc:631]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                 |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6 |
|             |                                                                 | .000000","default-max-batch-size":"4"}}                                                                                |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

I1225 10:48:27.703886 4706 server.cc:674]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
| onnx  | 1       | READY  |
+-------+---------+--------+

CacheManager Init Failed. Error: -17
W1225 10:48:27.718944 4706 metrics.cc:811] "DCGM unable to start: DCGM initialization error"
I1225 10:48:27.719361 4706 metrics.cc:783] "Collecting CPU metrics"
I1225 10:48:27.719448 4706 tritonserver.cc:2598]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                               |
| server_version                   | 2.52.0                                                                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tens |
|                                  | or_data parameters statistics trace logging                                                                                                                          |
| model_repository_path[0]         | {my_model_path}                                                                                                 |
| model_control_mode               | MODE_NONE                                                                                                                                                            |
| strict_model_config              | 0                                                                                                                                                                    |
| model_config_name                |                                                                                                                                                                      |
| rate_limit                       | OFF                                                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                             |
| min_supported_compute_capability | 6.0                                                                                                                                                                  |
| strict_readiness                 | 1                                                                                                                                                                    |
| exit_timeout                     | 30                                                                                                                                                                   |
| cache_enabled                    | 0                                                                                                                                                                    |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1225 10:48:27.723652 4706 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I1225 10:48:27.723879 4706 http_server.cc:4729] "Started HTTPService at 0.0.0.0:8000"
I1225 10:48:27.764810 4706 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
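
Note that the server still comes up; only GPU metrics collection fails. As a possible workaround (a suggestion, not a fix), GPU metrics can be disabled so the DCGM path is skipped entirely; --allow-gpu-metrics is a standard tritonserver flag:

sudo docker run -it --rm --network=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -e NVIDIA_VISIBLE_DEVICES=0:0 nvcr.io/nvidia/tritonserver:24.11-py3 \
  tritonserver --model-repository {my_model_path} --allow-gpu-metrics=false

This keeps the HTTP/gRPC endpoints and CPU metrics, but no GPU telemetry is exported.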

GPUs w/o MIG

sudo docker run -it --rm --network=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -e NVIDIA_VISIBLE_DEVICES=0 nvcr.io/nvidia/tritonserver:24.11-py3 \
  tritonserver --model-repository {my_model_path}

outputs:

I1225 10:41:12.658976 138 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7f0058000000' with size 268435456"
I1225 10:41:12.661708 138 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1225 10:41:12.667006 138 model_lifecycle.cc:473] "loading: onnx:1"
I1225 10:41:12.671093 138 onnxruntime.cc:2875] "TRITONBACKEND_Initialize: onnxruntime"
I1225 10:41:12.671117 138 onnxruntime.cc:2885] "Triton TRITONBACKEND API version: 1.19"
I1225 10:41:12.671123 138 onnxruntime.cc:2891] "'onnxruntime' TRITONBACKEND API version: 1.19"
I1225 10:41:12.671127 138 onnxruntime.cc:2921] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1225 10:41:12.688318 138 onnxruntime.cc:2986] "TRITONBACKEND_ModelInitialize: onnx (version 1)"
I1225 10:41:12.688871 138 onnxruntime.cc:984] "skipping model configuration auto-complete for 'onnx': inputs and outputs already specified"
I1225 10:41:12.689461 138 onnxruntime.cc:3051] "TRITONBACKEND_ModelInstanceInitialize: onnx_0_0 (GPU device 0)"
I1225 10:41:14.331226 138 model_lifecycle.cc:849] "successfully loaded 'onnx'"
I1225 10:41:14.331320 138 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1225 10:41:14.331363 138 server.cc:631]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                 |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6 |
|             |                                                                 | .000000","default-max-batch-size":"4"}}                                                                                |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

I1225 10:41:14.331410 138 server.cc:674]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
| onnx  | 1       | READY  |
+-------+---------+--------+

I1225 10:41:14.357465 138 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA A30"
I1225 10:41:14.365078 138 metrics.cc:783] "Collecting CPU metrics"
I1225 10:41:14.365165 138 tritonserver.cc:2598]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                               |
| server_version                   | 2.52.0                                                                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tens |
|                                  | or_data parameters statistics trace logging                                                                                                                          |
| model_repository_path[0]         | {my_model_path}                                                                                                 |
| model_control_mode               | MODE_NONE                                                                                                                                                            |
| strict_model_config              | 0                                                                                                                                                                    |
| model_config_name                |                                                                                                                                                                      |
| rate_limit                       | OFF                                                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                             |
| min_supported_compute_capability | 6.0                                                                                                                                                                  |
| strict_readiness                 | 1                                                                                                                                                                    |
| exit_timeout                     | 30                                                                                                                                                                   |
| cache_enabled                    | 0                                                                                                                                                                    |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1225 10:41:14.369295 138 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I1225 10:41:14.369542 138 http_server.cc:4729] "Started HTTPService at 0.0.0.0:8000"
I1225 10:41:14.410425 138 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"

Expected behavior
No DCGM error when running with MIG enabled; GPU metrics should start just as in the non-MIG case (the "Collecting metrics for GPU 0" line in the second log).
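
For triage it may help to include the driver and DCGM versions, since MIG telemetry support depends on both. These are standard commands (run inside the container; dcgmi only if the CLI is present in the image):

nvidia-smi --query-gpu=driver_version --format=csv,noheader
dcgmi --version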
