Releases: NVIDIA-Merlin/HugeCTR
Merlin: HugeCTR 23.02
What's New in Version 23.02
HPS Enhancements:
- Enabled the HPS TensorFlow plugin.
- Enabled `max_norm` clipping for the HPS TensorFlow plugin.
- Optimized the performance of the HPS HashMap fetch operation.
- Enabled the HPS Profiler.
Google Cloud Storage (GCS) Support:
Added support for Google Cloud Storage (GCS) for both training and inference. For more details, check out the GCS section in the training with remote file system notebook.
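The notebook covers the details; as a rough orientation only, here is a hedged sketch of pointing a Parquet reader at a GCS bucket. The `FileSystemType_t.GCS` member, the `server`/`port` meanings, and the `gs://` paths are assumptions for illustration, not confirmed API.

```python
import hugectr

# Hedged sketch: GCS-backed training data. The enum member and field meanings
# below are assumptions; see the remote file system notebook for real usage.
data_source_params = hugectr.DataSourceParams(
    source=hugectr.FileSystemType_t.GCS,  # assumed enum value
    server="storage.googleapis.com",      # assumed endpoint field
    port=443,                             # assumed
)

reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["gs://my-bucket/criteo/train/_file_list.txt"],   # hypothetical paths
    eval_source="gs://my-bucket/criteo/val/_file_list.txt",
    check_type=hugectr.Check_t.Non,
    data_source_params=data_source_params,
)
```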
Issues Fixed:
- Fixed a bug in the HPS static table that led to wrong results when the batch size is larger than 256.
- Fixed a preprocessing issue in the `wdl_prediction` notebook.
- Corrected how devices are set and managed in HPS and InferenceModel.
- Fixed the debug build error.
- Fixed the build error related to CUDA 12.0.
- Fixed reported issues with the Multi-Process HashMap in the notebooks and a couple of related minor issues.
Known Issues:
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.
- Dumping Adam optimizer states to AWS S3 is not supported.
Merlin: HugeCTR V4.3.1 (Merlin 22.12.1)
What's New in Version 4.3
In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.3`.
Afterward, the library will use calendar versioning only, such as `v23.01`.
Support for BERT and Variants:
This release includes support for BERT in HugeCTR.
The documentation includes updates to the MultiHeadAttention layer and adds documentation for the SequenceMask layer.
For more information, refer to the samples/bst directory of the repository on GitHub.
HPS Plugin for TensorFlow integration with TensorFlow-TensorRT (TF-TRT):
This release includes plugin support for integration with TensorFlow-TensorRT.
For sample code, refer to the Deploy SavedModel using HPS with Triton TensorFlow Backend notebook.
Deep & Cross Network Layer version 2 Support:
This release includes support for Deep & Cross Network version 2.
For conceptual information, refer to https://arxiv.org/abs/2008.13535.
The documentation for the MultiCross Layer is updated.
Enhancements to Hierarchical Parameter Server:
- RedisClusterBackend now supports TLS/SSL communication.
For sample code, refer to the Hierarchical Parameter Server Demo notebook. The notebook is updated with step-by-step instructions that show you how to set up HPS to use Redis with (and without) encryption. The Volatile Database Parameters documentation for HPS is updated with the `enable_tls`, `tls_ca_certificate`, `tls_client_certificate`, `tls_client_key`, and `tls_server_name_identification` parameters (a hedged configuration sketch appears after this list).
- MultiProcessHashMapBackend includes a fix for a bug that prevented configuring the shared memory size when using JSON file-based configuration.
- On-device input keys are supported now so that an extra host-to-device copy is removed to improve performance.
- A dependency on the XX-Hash library is removed. The library is no longer used by HugeCTR.
- Added static table support to the embedding cache.
The static table is suitable when the embedding table can be placed entirely in GPU memory.
In this case, the static table is more than three times faster than the embedding cache lookup.
The static table does not support embedding updates.
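For orientation, here is a hedged sketch of how the new TLS parameters might appear in an HPS parameter-server configuration. Only the `enable_tls` and `tls_*` key names come from the note above; the surrounding `volatile_db` structure, endpoint, and paths are assumptions.

```python
import json

# Hedged sketch of an HPS configuration fragment with the new TLS parameters.
# Everything except the enable_tls / tls_* key names is assumed for illustration.
hps_config = {
    "volatile_db": {
        "type": "redis_cluster",                            # assumed key/value
        "address": "127.0.0.1:7000",                        # hypothetical endpoint
        "enable_tls": True,
        "tls_ca_certificate": "/certs/ca.crt",              # hypothetical paths
        "tls_client_certificate": "/certs/client.crt",
        "tls_client_key": "/certs/client.key",
        "tls_server_name_identification": "redis.localhost",
    }
}

with open("hps_redis_tls_demo.json", "w") as f:
    json.dump(hps_config, f, indent=2)
```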
Support for New Optimizers:
- Added support for the SGD, Momentum SGD, Nesterov Momentum, AdaGrad, RMSProp, Adam, and FTRL optimizers for the dynamic embedding table (DET). For sample code, refer to the `test_embedding_table_optimizer.cpp` file in the test/utest/embedding_collection/ directory of the repository on GitHub.
- Added support for the FTRL optimizer for dense networks.
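As a hedged sketch only: `hugectr.CreateOptimizer` is the usual entry point, but the exact enum member for FTRL and its hyperparameter names are assumptions here; check the optimizer documentation for the supported arguments.

```python
import hugectr

# Hedged sketch: select FTRL for a dense network. The Ftrl enum member and the
# hyperparameter names below are assumptions for illustration.
optimizer = hugectr.CreateOptimizer(
    optimizer_type=hugectr.Optimizer_t.Ftrl,  # assumed enum member
    beta=0.9,                                  # hypothetical hyperparameters
    lambda1=0.1,
    lambda2=0.1,
)
# The optimizer is then passed to hugectr.Model(solver, reader, optimizer).
```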
Data Reading from S3 for Offline Inference:
In addition to reading during training, HugeCTR now supports reading data from remote file systems such as HDFS and S3 during offline inference by using the DataSourceParams API. The HugeCTR Training and Inference with Remote File System Example is updated to demonstrate the new functionality.
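A hedged sketch of the idea: build a `DataSourceParams` object that describes the S3 connection and hand it to the inference configuration. The enum member and field meanings are assumptions; the updated example notebook shows the supported usage.

```python
import hugectr

# Hedged sketch: describe the S3 data source for offline inference.
# The enum member and the meaning of server/port are assumptions.
data_source_params = hugectr.DataSourceParams(
    source=hugectr.FileSystemType_t.S3,  # assumed enum value
    server="us-east-1",                  # assumed: S3 region passed as "server"
    port=9000,                           # assumed
)

# The object is then passed through the inference configuration (for example,
# via InferenceParams) so the reader can resolve paths such as
# "s3://my-bucket/criteo/test/_file_list.txt" during offline inference.
```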
Documentation Enhancements:
- The setup instructions for running the example notebooks are revised for clarity.
- The example notebooks are updated to use a data preprocessing script that simplifies the user experience.
- Documentation for the MLP Layer is new.
- Several 2022 talks and blogs are added to the HugeCTR Talks and Blogs page.
Issues Fixed:
- The original CUDA device with NUMA bind that was set before a call to some HugeCTR APIs is now recovered correctly. This issue sometimes led to a problem when you mixed calls to HugeCTR and other CUDA-enabled libraries.
- Fixed an occasional CUDA kernel launch failure in embedding when HugeCTR was installed with the DEBUG macro.
- Fixed an SOK build error that was related to TensorFlow v2.1.0 and higher. The issue was that the C++ API and C++ standard were updated to use C++17.
- Fixed a CUDA 12 related compilation error.
Known Issues:
- HugeCTR can lead to a runtime error if client code calls the RMM `rmm::mr::set_current_device_resource()` method. The error is due to the Parquet data reader in HugeCTR also calling `rmm::mr::set_current_device_resource()`. As a result, the device becomes visible to other libraries in the same process. Refer to GitHub issue #356 for more information. As a workaround, you can set the environment variable `HCTR_RMM_SETTABLE` to `0` to prevent HugeCTR from setting a custom RMM device resource, if you know that `rmm::mr::set_current_device_resource()` is called by client code other than HugeCTR. But be cautious because the setting can reduce the performance of Parquet reading. A minimal sketch of the workaround appears at the end of this list.
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and GitHub issue #243.
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.
- Dumping Adam optimizer states to AWS S3 is not supported.
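For the RMM-related known issue above, the workaround is just an environment variable. A minimal sketch, assuming it is set before HugeCTR is imported in the client process:

```python
import os

# Workaround from the known issue above: stop HugeCTR from installing its own
# RMM device resource. Must be set before hugectr is imported / training starts.
os.environ["HCTR_RMM_SETTABLE"] = "0"

import hugectr  # noqa: E402  (imported after the environment variable is set)
```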
Merlin: HugeCTR V4.3 (Merlin 22.12)
What's New in Version 4.3
In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.3`.
Afterward, the library will use calendar versioning only, such as `v23.01`.
Support for BERT and Variants:
This release includes support for BERT in HugeCTR.
The documentation includes updates to the MultiHeadAttention layer and adds documentation for the SequenceMask layer.
For more information, refer to the samples/bst directory of the repository on GitHub.
HPS Plugin for TensorFlow integration with TensorFlow-TensorRT (TF-TRT):
This release includes plugin support for integration with TensorFlow-TensorRT.
For sample code, refer to the Deploy SavedModel using HPS with Triton TensorFlow Backend notebook.
Deep & Cross Network Layer version 2 Support:
This release includes support for Deep & Cross Network version 2.
For conceptual information, refer to https://arxiv.org/abs/2008.13535.
The documentation for the MultiCross Layer is updated.
Enhancements to Hierarchical Parameter Server:
- RedisClusterBackend now supports TLS/SSL communication.
For sample code, refer to the Hierarchical Parameter Server Demo notebook. The notebook is updated with step-by-step instructions that show you how to set up HPS to use Redis with (and without) encryption. The Volatile Database Parameters documentation for HPS is updated with the `enable_tls`, `tls_ca_certificate`, `tls_client_certificate`, `tls_client_key`, and `tls_server_name_identification` parameters.
- MultiProcessHashMapBackend includes a fix for a bug that prevented configuring the shared memory size when using JSON file-based configuration.
- On-device input keys are supported now so that an extra host-to-device copy is removed to improve performance.
- A dependency on the XX-Hash library is removed. The library is no longer used by HugeCTR.
- Added static table support to the embedding cache.
The static table is suitable when the embedding table can be placed entirely in GPU memory.
In this case, the static table is more than three times faster than the embedding cache lookup.
The static table does not support embedding updates.
Support for New Optimizers:
- Added support for the SGD, Momentum SGD, Nesterov Momentum, AdaGrad, RMSProp, Adam, and FTRL optimizers for the dynamic embedding table (DET). For sample code, refer to the `test_embedding_table_optimizer.cpp` file in the test/utest/embedding_collection/ directory of the repository on GitHub.
- Added support for the FTRL optimizer for dense networks.
Data Reading from S3 for Offline Inference:
In addition to reading during training, HugeCTR now supports reading data from remote file systems such as HDFS and S3 during offline inference by using the DataSourceParams API. The HugeCTR Training and Inference with Remote File System Example is updated to demonstrate the new functionality.
Documentation Enhancements:
- The setup instructions for running the example notebooks are revised for clarity.
- The example notebooks are updated to use a data preprocessing script that simplifies the user experience.
- Documentation for the MLP Layer is new.
- Several 2022 talks and blogs are added to the HugeCTR Talks and Blogs page.
Issues Fixed:
- The original CUDA device with NUMA bind that was set before a call to some HugeCTR APIs is now recovered correctly. This issue sometimes led to a problem when you mixed calls to HugeCTR and other CUDA-enabled libraries.
- Fixed an occasional CUDA kernel launch failure in embedding when HugeCTR was installed with the DEBUG macro.
- Fixed an SOK build error that was related to TensorFlow v2.1.0 and higher. The issue was that the C++ API and C++ standard were updated to use C++17.
- Fixed a CUDA 12 related compilation error.
Known Issues:
- HugeCTR can lead to a runtime error if client code calls the RMM `rmm::mr::set_current_device_resource()` method. The error is due to the Parquet data reader in HugeCTR also calling `rmm::mr::set_current_device_resource()`. As a result, the device becomes visible to other libraries in the same process. Refer to GitHub issue #356 for more information. As a workaround, you can set the environment variable `HCTR_RMM_SETTABLE` to `0` to prevent HugeCTR from setting a custom RMM device resource, if you know that `rmm::mr::set_current_device_resource()` is called by client code other than HugeCTR. But be cautious because the setting can reduce the performance of Parquet reading.
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and GitHub issue #243.
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.
- Dumping Adam optimizer states to AWS S3 is not supported.
Merlin: HugeCTR V4.2 (Merlin 22.11)
What's New in Version 4.2
In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.2`.
Afterward, the library will use calendar versioning only, such as `v23.01`.
Change to HPS with Redis or Kafka:
This release includes a change to the Hierarchical Parameter Server that affects deployments that use the `RedisClusterBackend` or model parameter streaming with Kafka. A third-party library that was used for the HPS partition selection algorithm is replaced to improve performance. The new algorithm can produce different partition assignments for volatile databases. As a result, volatile database backends that retain data between application startups, such as the `RedisClusterBackend`, must be reinitialized. Model streaming with Kafka is equally affected. To avoid issues with updates, reset all respective queue offsets to the `end_offset` before you reinitialize the `RedisClusterBackend`.
Enhancements to the Sparse Operation Kit in DeepRec:
This release includes updates to the Sparse Operation Kit to improve the performance of the embedding variable lookup operation in DeepRec.
The API for the `lookup_sparse()` function is changed to remove the `hotness` argument. The `lookup_sparse()` function is enhanced to calculate the number of non-zero elements dynamically. For more information, refer to the sparse_operation_kit directory of the DeepRec repository on GitHub.
Enhancements to 3G Embedding:
This release includes the following enhancements to 3G embedding:
- The API is changed. The `EmbeddingPlanner` class is replaced with the `EmbeddingCollectionConfig` class. For examples of the API, see the tests in the test/embedding_collection_test directory of the repository on GitHub.
- The API is enhanced to support dumping and loading weights during the training process. The methods are `Model.embedding_dump(path: str, table_names: list[str])` and `Model.embedding_load(path: str, table_names: list[str])`. The `path` argument is a directory in the file system that you can dump weights to or load weights from. The `table_names` argument is a list of embedding table names as strings.
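A hedged usage sketch based on the signatures quoted above; the table names and the checkpoint directory are hypothetical, and `model` is assumed to be a configured and compiled `hugectr.Model`.

```python
# Hedged sketch: dump embedding weights mid-training and load them back later.
# `model` is a configured hugectr.Model; names and path are hypothetical.
table_names = ["sparse_embedding_user", "sparse_embedding_item"]

model.embedding_dump("/tmp/emb_ckpt", table_names)   # write the listed tables to a directory
model.embedding_load("/tmp/emb_ckpt", table_names)   # restore them from the same directory
```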
New Volatile Database Type for HPS:
This release adds a `db_type` value of `multi_process_hash_map` to the Hierarchical Parameter Server. This database type supports sharing embeddings across process boundaries by using shared memory and the `/dev/shm` device file. Multiple processes running HPS can read and write to the same hash map. For an example, refer to the Hierarchical Parameter Server Demo notebook.
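A hedged configuration sketch: only the `multi_process_hash_map` value comes from the note (which refers to it as the `db_type`); the key names and shared-memory settings shown here are assumptions.

```python
import json

# Hedged sketch of a volatile database block that selects the multi-process
# hash map backed by /dev/shm. Key names other than the type value are assumed.
hps_config = {
    "volatile_db": {
        "type": "multi_process_hash_map",                 # value from the note
        "shared_memory_name": "hctr_mp_hashmap_demo",     # hypothetical
        "shared_memory_size": 16 * 1024**3,               # hypothetical: 16 GiB
    }
}

print(json.dumps(hps_config, indent=2))
```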
Enhancements to the HPS Redis Backend:
In this release, the Hierarchical Parameter Server can open multiple connections in parallel to each Redis node.
This enhancement enables HPS to take advantage of overlapped processing optimizations in the I/O module of Redis servers.
In addition, HPS can now take advantage of Redis hash tags to co-locate embedding values and metadata.
This enhancement can reduce the number of accesses to Redis nodes and the number of per-node round trip communications that are needed to complete transactions.
As a result, the enhancement increases insertion performance.
MLPLayer is New:
This release adds an MLP layer with the `hugectr.Layer_t.MLP` class. This layer is very flexible and makes it easier to use a group of fused fully-connected layers and enable the related optimizations. For each fused fully-connected layer in `MLPLayer`, the output dimension, bias, and activation function are all adjustable. `MLPLayer` supports FP32, FP16, and TF32 data types. For an example, refer to the dgx_a100_mlp.py file in the `samples/dlrm` directory of the GitHub repository to learn how to use the layer.
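A hedged sketch of adding the layer through the Python interface: `hugectr.Layer_t.MLP` is from the note, while the argument names (`num_outputs`, the activation enum, and the tensor names) are assumptions; dgx_a100_mlp.py shows the real usage.

```python
import hugectr

# Hedged sketch: a fused MLP block of three fully-connected layers.
# Argument names are assumptions; Layer_t.MLP comes from the release note.
mlp = hugectr.DenseLayer(
    layer_type=hugectr.Layer_t.MLP,
    bottom_names=["dense_input"],         # hypothetical tensor names
    top_names=["mlp_out"],
    num_outputs=[1024, 512, 256],         # assumed: per-layer output dimensions
    act_type=hugectr.Activation_t.Relu,   # assumed activation enum
    use_bias=True,
)
# model.add(mlp)  # added to a configured hugectr.Model elsewhere
```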
Sparse Operation Kit installable from PyPi:
Version `1.1.4` of the Sparse Operation Kit is installable from PyPi in the merlin-sok package.
Multi-task Model Support added to the ONNX Model Converter:
This release adds support for multi-task models to the ONNX converter.
This release also includes an enhancement to the preprocess_census.py script in the `samples/mmoe` directory of the GitHub repository.
Issues Fixed:
- Using the HPS Plugin for TensorFlow with `MirroredStrategy` and running the Hierarchical Parameter Server Demo notebook triggered an issue with ReplicaContext and caused a crash. The issue is fixed and resolves GitHub issue #362.
- The 4_nvt_process.py sample in the `samples/din/utils` directory of the GitHub repository is updated to use the latest NVTabular API. This update resolves GitHub issue #364.
- An illegal memory access related to 3G embedding and the dgx_a100_ib_nvlink.py sample in the `samples/dlrm` directory of the GitHub repository is fixed.
- An error in HPS with the `lookup_fromdlpack()` method is fixed. The error was related to calculating the number of keys and vectors from the corresponding DLPack tensors.
- An error in the HugeCTR backend for Triton Inference Server is fixed. A crash was triggered when the initial size of the embedding cache was smaller than the allowed minimum size.
- An error related to using a ReLU layer with an odd input size in mixed precision mode could trigger a crash. The issue is fixed.
- An error related to using an asynchronous reader with the AsyncParam class and specifying an `io_alignment` value that is smaller than the block device sector size is fixed. Now, if the specified `io_alignment` value is smaller than the block device sector size, `io_alignment` is automatically set to the block device sector size.
- Unreported memory leaks in the GRU layer and collectives are fixed.
- Several broken documentation links related to HPS are fixed.
Known Issues:
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.
- Dumping Adam optimizer states to AWS S3 is not supported.
Merlin: HugeCTR V4.1.1 (Merlin 22.10)
What's New in Version 4.1.1
Simplified Interface for 3G Embedding Table Placement Strategy:
3G embedding now provides an easier way for you to configure an embedding table placement strategy. Instead of using JSON, you can configure the embedding table placement strategy by using function arguments. You only need to provide the `shard_matrix`, `table_group_strategy`, and `table_placement_strategy` arguments. With these arguments, 3G embedding can group different tables together and place them according to the `shard_matrix` argument. For an example, refer to the dlrm_train.py file in the `test/embedding_collection_test` directory of the repository on GitHub. For comparison, refer to the same file from the v4.0 branch of the repository.
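As a hedged illustration of the shape of these arguments only (the call that consumes them is not shown because its exact form is not confirmed here; dlrm_train.py is the reference), a two-GPU placement for three tables might look like this:

```python
# Hedged sketch: one row per GPU, one column per embedding table; a 1 means the
# table (or a shard of it) is placed on that GPU. The grouping and placement
# strategy values below are hypothetical.
shard_matrix = [
    [1, 0, 1],  # GPU 0: table 0 and a shard of table 2
    [0, 1, 1],  # GPU 1: table 1 and a shard of table 2
]
table_group_strategy = [[0, 1], [2]]     # hypothetical grouping of tables
table_placement_strategy = ["dp", "mp"]  # hypothetical: data- vs. model-parallel

# These arguments are then passed to the embedding collection configuration;
# see test/embedding_collection_test/dlrm_train.py for the actual call.
```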
New MMoE and Shared-Bottom Samples:
This release includes a new shared-bottom model, an example program, preprocessing scripts, and updates to documentation.
For more information, refer to the `README.md`, `mmoe_parquet.py`, and other files in the `samples/mmoe` directory of the repository on GitHub. This release also includes a fix to the calculation and reporting of AUC for multi-task models, such as MMoE.
Support for AWS S3 File System:
The Parquet DataReader can now read datasets from the Amazon Web Services S3 file system.
You can also load and dump models from and to S3 during training.
The documentation for the `DataSourceParams` class is updated. To view sample code, refer to the HugeCTR Training with Remote File System Example.
Simplification of File System Usage:
You no longer need to pass `DataSourceParams` for model loading and dumping. The `FileSystem` class automatically infers the correct file system type (local, HDFS, or S3) based on the path URI that you specified when you built the model. For example, the path `hdfs://localhost:9000/` is inferred as an HDFS file system, and the path `https://mybucket.s3.us-east-1.amazonaws.com/` is inferred as an S3 file system.
Support for Loading Models from Remote File Systems to HPS:
This release enables you to load models from HDFS and S3 remote file systems to HPS during inference.
To use the new feature, specify an HDFS or S3 path URI in `InferenceParams`.
Support for Exporting Intermediate Tensor Values into a Numpy Array:
This release adds the `check_out_tensor` function to `Model` and `InferenceModel`. You can use this function to check out intermediate tensor values using the Python interface. This function is especially helpful for debugging. For more information, refer to `Model.check_out_tensor` and `InferenceModel.check_out_tensor`.
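A hedged sketch of the debugging flow; the tensor name and the second argument are assumptions, and `model` is assumed to be a compiled `hugectr.Model` that has run at least one iteration.

```python
import numpy as np
import hugectr

# Hedged sketch: inspect an intermediate tensor after a training iteration.
# The tensor name and the Tensor_t argument are assumptions for illustration.
# model.train()
values = model.check_out_tensor("mlp_out", hugectr.Tensor_t.Train)  # assumed signature
print(np.asarray(values).shape, np.asarray(values).dtype)
```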
On-Device Input Keys for HPS Lookup:
The HPS lookup supports input embedding keys that are in GPU memory during inference. This enhancement removes a host-to-device copy by using the DLPack `lookup_fromdlpack()` interface. With the interface, the input DLPack capsule of embedding keys can be a GPU tensor.
Documentation Enhancements:
- The graphic for the Hierarchical Parameter Server library that shows its relationship to other software packages is enhanced.
- The sample notebook for Deploy SavedModel using HPS with Triton TensorFlow Backend is added to the documentation.
- Style updates to the Hierarchical Parameter Server API documentation.
Issues Fixed:
- The `InteractionLayer` class is fixed so that it works correctly with `num_feas > 30`.
- The cuBLASLt configuration is corrected by increasing the workspace size and adding the epilogue mask.
- The NVTabular-based preprocessing script for our samples that demonstrate feature crossing is fixed.
- The async data reader is fixed. Previously, it would hang and cause a corruption issue due to an improper I/O block size and an I/O alignment problem. The `AsyncParam` class is changed to implement the fix. The `io_block_size` argument is replaced by the `max_nr_request` argument, and the actual I/O block size that the async reader uses is computed accordingly. For more information, refer to the `AsyncParam` class documentation.
Known Issues:
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.
- Dumping Adam optimizer states to AWS S3 is not supported.
Merlin: HugeCTR V4.1 (Merlin 22.10)
What's New in Version 4.1
Simplified Interface for 3G Embedding Table Placement Strategy:
3G embedding now provides an easier way for you to configure an embedding table placement strategy. Instead of using JSON, you can configure the embedding table placement strategy by using function arguments. You only need to provide the `shard_matrix`, `table_group_strategy`, and `table_placement_strategy` arguments. With these arguments, 3G embedding can group different tables together and place them according to the `shard_matrix` argument. For an example, refer to the dlrm_train.py file in the `test/embedding_collection_test` directory of the repository on GitHub. For comparison, refer to the same file from the v4.0 branch of the repository.
New MMoE and Shared-Bottom Samples:
This release includes a new shared-bottom model, an example program, preprocessing scripts, and updates to documentation.
For more information, refer to the `README.md`, `mmoe_parquet.py`, and other files in the `samples/mmoe` directory of the repository on GitHub. This release also includes a fix to the calculation and reporting of AUC for multi-task models, such as MMoE.
Support for AWS S3 File System:
The Parquet DataReader can now read datasets from the Amazon Web Services S3 file system.
You can also load and dump models from and to S3 during training.
The documentation for the `DataSourceParams` class is updated. To view sample code, refer to the HugeCTR Training with Remote File System Example.
Simplification of File System Usage:
You no longer need to pass `DataSourceParams` for model loading and dumping. The `FileSystem` class automatically infers the correct file system type (local, HDFS, or S3) based on the path URI that you specified when you built the model. For example, the path `hdfs://localhost:9000/` is inferred as an HDFS file system, and the path `https://mybucket.s3.us-east-1.amazonaws.com/` is inferred as an S3 file system.
Support for Loading Models from Remote File Systems to HPS:
This release enables you to load models from HDFS and S3 remote file systems to HPS during inference.
To use the new feature, specify an HDFS or S3 path URI in `InferenceParams`.
Support for Exporting Intermediate Tensor Values into a Numpy Array:
This release adds the `check_out_tensor` function to `Model` and `InferenceModel`. You can use this function to check out intermediate tensor values using the Python interface. This function is especially helpful for debugging. For more information, refer to `Model.check_out_tensor` and `InferenceModel.check_out_tensor`.
On-Device Input Keys for HPS Lookup:
The HPS lookup supports input embedding keys that are in GPU memory during inference. This enhancement removes a host-to-device copy by using the DLPack `lookup_fromdlpack()` interface. With the interface, the input DLPack capsule of embedding keys can be a GPU tensor.
Documentation Enhancements:
- The graphic for the Hierarchical Parameter Server library that shows its relationship to other software packages is enhanced.
- The sample notebook for Deploy SavedModel using HPS with Triton TensorFlow Backend is added to the documentation.
- Style updates to the Hierarchical Parameter Server API documentation.
Issues Fixed:
- The `InteractionLayer` class is fixed so that it works correctly with `num_feas > 30`.
- The cuBLASLt configuration is corrected by increasing the workspace size and adding the epilogue mask.
- The NVTabular-based preprocessing script for our samples that demonstrate feature crossing is fixed.
- The async data reader is fixed. Previously, it would hang and cause a corruption issue due to an improper I/O block size and an I/O alignment problem. The `AsyncParam` class is changed to implement the fix. The `io_block_size` argument is replaced by the `max_nr_request` argument, and the actual I/O block size that the async reader uses is computed accordingly. For more information, refer to the `AsyncParam` class documentation.
Known Issues:
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.
- Dumping Adam optimizer states to AWS S3 is not supported.
Merlin: HugeCTR V4.0 (Merlin 22.09)
What's New in Version 4.0
3G Embedding Stabilization:
Since the introduction of the next generation of HugeCTR embedding in v3.7, several updates and enhancements were made, including code refactoring to improve usability. The enhancements for this release are as follows:
- Optimized the performance for sparse lookup in terms of inter-warp load imbalance. Sparse Operation Kit (SOK) takes advantage of the enhancement to improve performance.
- This release includes a fix for determining the maximum embedding vector size in the `GlobalEmbeddingData` and `LocalEmbeddingData` classes.
- Version 1.1.4 of Sparse Operation Kit can be installed with Pip and includes the enhancements mentioned in the preceding bullets.
Embedding Cache Initialization with Configurable Ratio:
In previous releases, the default value for the `cache_refresh_percentage_per_iteration` parameter of `InferenceParams` was `0.1`. In this release, the default value is `0.0` and the parameter serves an additional purpose. If you set the parameter to a value greater than `0.0` and also set `use_gpu_embedding_cache` to `True` for a model, when the Hierarchical Parameter Server (HPS) starts, HPS initializes the embedding cache for the model on the GPU by loading a subset of the embedding vectors from the sparse files for the model. When embedding cache initialization is used, HPS creates log records at the INFO level when it starts. The logging records are similar to `EC initialization for model: "<model-name>", num_tables: <int>` and `EC initialization on device: <int>`. This enhancement reduces the duration of the warm-up phase.
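A hedged sketch of the relevant fields: `use_gpu_embedding_cache` and `cache_refresh_percentage_per_iteration` come from the note, while the other argument names follow common `InferenceParams` usage and the paths and model name are hypothetical.

```python
from hugectr.inference import InferenceParams

# Hedged sketch: pre-fill 20% of the GPU embedding cache when HPS starts.
# Paths and model name are hypothetical; the two highlighted parameters are
# from the release note.
inference_params = InferenceParams(
    model_name="wdl",
    max_batchsize=1024,
    hit_rate_threshold=1.0,
    dense_model_file="/models/wdl/_dense_0.model",
    sparse_model_files=["/models/wdl/0_sparse_0.model"],
    device_id=0,
    use_gpu_embedding_cache=True,                 # required for initialization
    cache_size_percentage=0.5,
    i64_input_key=True,
    cache_refresh_percentage_per_iteration=0.2,   # > 0.0 enables the pre-fill
)
```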
Lazy Initialization of HPS Plugin for TensorFlow:
In this release, when you deploy a TensorFlow `SavedModel` with Triton Inference Server, HPS is implicitly initialized when the loaded model is executed for the first time. In previous releases, you needed to run `hps.Init(ps_config_file, global_batch_size)` explicitly. For more information, see the API documentation for `hierarchical_parameter_server.Init`.
Enhancements to the HDFS Backend:
- The HDFS Backend is now called `IO::HadoopFileSystem`.
- This release includes fixes for memory leaks.
- This release includes refactoring to generalize the interface for HDFS and S3 as remote filesystems.
- For more information, see `hadoop_filesystem.hpp` in the `include/io` directory of the repository on GitHub.
Dependency Clarification for Protobuf and Hadoop:
Hadoop and Protobuf are true `third_party` modules now. Developers can now avoid unnecessary and frequent cloning and deletion.
Finer Granularity Control for Overlap Behavior:
We deprecated the old `overlapped_pipeline` knob and introduced four new knobs, `train_intra_iteration_overlap`, `train_inter_iteration_overlap`, `eval_intra_iteration_overlap`, and `eval_inter_iteration_overlap`, to help users better control the overlap behavior. For more information, see the API documentation for `Solver.CreateSolver`.
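A hedged sketch of setting the new knobs on the solver; the four knob names are from the note, and the remaining `CreateSolver` arguments are a minimal, assumed subset.

```python
import hugectr

# Hedged sketch: enable overlap inside and across training iterations while
# keeping evaluation overlap off. Other CreateSolver arguments are omitted.
solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize=8192,
    lr=0.001,
    train_intra_iteration_overlap=True,
    train_inter_iteration_overlap=True,
    eval_intra_iteration_overlap=False,
    eval_inter_iteration_overlap=False,
)
```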
Documentation Improvements:
- Removed two deprecated tutorials, `triton_tf_deploy` and `dump_to_tf`.
- Previously, the graphics on the Performance page did not appear. This issue is fixed in this release.
- Previously, the API documentation for the HPS Plugin for TensorFlow did not show the class information. This issue is fixed in this release.
Issues Fixed:
- Fixed a build error that was triggered in debug mode. The error was caused by the newly introduced 3G embedding unit tests.
- When using the Parquet DataReader, if a Parquet dataset file specified in `_metadata.json` does not exist, HugeCTR no longer crashes. The new behavior is to skip the missing file and display a warning message. This change relates to GitHub issue 321.
Known Issues:
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.
Merlin: HugeCTR V3.9.1 (Merlin 22.08)
- Fixed a compatibility issue with cuDF 22.06.
- Some documentation refactoring.
Merlin: HugeCTR V3.9 (Merlin 22.08)
What's New in Version 3.9
Updates to 3G Embedding:
- Sparse Operation Kit (SOK) is updated to use the HugeCTR 3G embedding as a developer preview feature. For more information, refer to the Python programs in the sparse_operation_kit/experiment/benchmark/dlrm directory of the repository on GitHub.
- Dynamic embedding table mode is added. The mode is based on cuCollections with some functionality enhancements. A dynamic embedding table grows its size when the table is full so that you no longer need to configure the memory usage information for embedding. For more information, refer to the embedding_storage/dynamic_embedding_storage directory of the repository on GitHub.
Enhancements to the HPS Plugin for TensorFlow:
This release includes improvements to the interoperability of SOK and HPS. The plugin now supports the sparse lookup layer. The documentation for the HPS plugin is enhanced as follows:
- An introduction to the plugin is new.
- New notebooks that demonstrate how to use the HPS plugin are added.
- API documentation for the plugin is new.
Enhancements to the HPS Backend for Triton Inference Server:
This release adds support for integrating the HPS Backend and the TensorFlow Backend through the ensemble mode with Triton Inference Server.
The enhancement enables deploying a TensorFlow model with large embedding tables with Triton by leveraging HPS.
For more information, refer to the sample programs in the hps-triton-ensemble directory of the HugeCTR Backend repository on GitHub.
New Multi-Node Tutorial:
The multi-node training tutorial is new.
The tutorial shows how to use HugeCTR to train a model with multiple nodes and is based on our most recent Docker container. It should be useful to users who do not have a cluster with a job scheduler installed, such as Slurm Workload Manager. The update addresses an issue that was first reported in GitHub issue 305.
Support Offline Inference for MMoE:
This release includes MMoE offline inference where both per-class AUC and average AUC are provided.
When the number of class AUCs is greater than one, the output includes a line like the following example: `[HCTR][08:52:59.254][INFO][RK0][main]: Evaluation, AUC: {0.482141, 0.440781}, macro-averaging AUC: 0.46146124601364136`
Enhancements to the API for the HPS Database Backend:
This release includes several enhancements to the API for the `DatabaseBackend` class. For more information, see `database_backend.hpp` and the header files for other database backends in the `HugeCTR/include/hps` directory of the repository. The enhancements are as follows:
- You can now specify a maximum time budget, in nanoseconds, for queries so that you can build an application that must operate within strict latency limits. Fetch queries return execution control to the caller if the time budget is exhausted. The unprocessed entries are indicated to the caller through a callback function.
- The `dump` and `load_dump` methods are new. These methods support saving and loading embedding tables from disk. The methods support a custom binary format and the RocksDB SST table file format. These methods enable you to import and export embedding table data between your custom tools and HugeCTR.
- The `find_tables` method is new. The method enables you to discover all table data that is currently stored for a model in a `DatabaseBackend` instance. A new overloaded method for `evict` is added that can process the results from `find_tables` to quickly and simply drop all the stored information that is related to a model.
Documentation Enhancements:
- The documentation for the `max_all_to_all_bandwidth` parameter of the `HybridEmbeddingParam` class is clarified to indicate that the bandwidth unit is per-GPU. Previously, the unit was not specified.
Issues Fixed:
- Hybrid embedding with `IB_NVLINK` as the `communication_type` of the `HybridEmbeddingParam` class is fixed in this release.
- Training performance is affected by a GPU routine that checks whether an input key can be out of the embedding table. If you can guarantee that the input keys work with the specified `workspace_size_per_gpu_in_mb`, you can disable the routine as a workaround by setting the environment variable `HUGECTR_DISABLE_OVERFLOW_CHECK=1`. The workaround restores the training performance.
- Engineering discovered and fixed a correctness issue with the Softmax layer.
- Engineering removed an inline profiler that was rarely used or updated. This change relates to GitHub issue 340.
Known Issues:
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.
Merlin: HugeCTR V3.8 (Merlin 22.07)
What's New in Version 3.8
Sample Notebook to Demonstrate 3G Embedding:
This release includes a sample notebook that introduces the Python API of the embedding collection and the key concepts for using 3G embedding. You can view HugeCTR Embedding Collection from the documentation or access the `embedding_collection.ipynb` file from the `notebooks` directory of the repository.
DLPack Python API for Hierarchical Parameter Server Lookup:
This release introduces support for embedding lookup from the Hierarchical Parameter Server (HPS) using the DLPack Python API. The new method is `lookup_fromdlpack()`. For sample usage, see the Lookup the Embedding Vector from DLPack heading in the "Hierarchical Parameter Server Demo" notebook.
Read Parquet Datasets from HDFS with the Python API:
This release enhances the `DataReaderParams` class with a `data_source_params` argument. You can use the argument to specify the data source configuration, such as the host name of the Hadoop NameNode and the NameNode port number, to read from HDFS.
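A hedged sketch of attaching the new argument to the reader: `data_source_params` itself is from the note, while the `DataSourceParams` field names and paths are assumptions for this release.

```python
import hugectr

# Hedged sketch: read Parquet data from HDFS. The DataSourceParams field names
# below are assumptions; data_source_params comes from the release note.
data_source_params = hugectr.DataSourceParams(
    use_hdfs=True,          # assumed field
    namenode="localhost",   # assumed: Hadoop NameNode host
    port=9000,              # assumed: NameNode port
)

reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["/user/hugectr/criteo/train/_file_list.txt"],   # hypothetical HDFS paths
    eval_source="/user/hugectr/criteo/val/_file_list.txt",
    check_type=hugectr.Check_t.Non,
    data_source_params=data_source_params,
)
```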
Logging Performance Improvements:
This release includes a performance enhancement that reduces the performance impact of logging.
Enhancements to Layer Classes:
- The `FullyConnected` layer now supports 3D inputs.
- The `MatrixMultiply` layer now supports 4D inputs.
Documentation Enhancements:
- An automatically generated table of contents is added to the top of most pages in the web documentation. The goal is to provide a better experience for navigating long pages such as the HugeCTR Layer Classes and Methods page.
- URLs to the Criteo 1TB click logs dataset are updated. For an example, see the HugeCTR Wide and Deep Model with Criteo notebook.
Issues Fixed:
- The data generator for the Parquet file type is fixed and produces consistent file names between the `_metadata.json` file and the actual dataset files. Previously, running the data generator to create synthetic data resulted in a core dump. This issue was first reported in GitHub issue 321.
- Fixed a memory crash that occurred during AUC warm-up when running a large model on multiple GPUs.
- Fixed the issue of keyset generation in the ETC notebook. Refer to GitHub issue 332 for more details.
- Fixed the inference build error that occurred when building in debug mode.
- Fixed the issue that multi-node training prints duplicate messages.
Known Issues:
- Hybrid embedding with `IB_NVLINK` as the `communication_type` of the `HybridEmbeddingParam` class does not work currently. We are working on fixing it. The other communication types have no issues.
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.