Intelligent Machine Learning (IML) aims to build a full-stack, high-performance, intelligent deep learning infrastructure for both offline and online scenarios, covering data processing, model training, model evaluation, and model inference. Its goal is to free users from engineering overhead and make deep learning accessible to AI-driven businesses.
- [2025/01] EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models, ICLR'25.
- [2024/06] DLRover-RM has been accepted by VLDB'24.
- [2024/04] Flash Checkpoint Supports HuggingFace transformers.Trainer to Asynchronously Persist Checkpoints.
- [2024/02] Flash Checkpoint Saves the Megatron-LM Checkpoint in Seconds.
- [2024/01] Flash Checkpoint to Recover Large Model Training From Failure in Seconds.
- [2023/11] ATorch, which supports efficient and easy-to-use model training, is released.
- [2023/10] AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference as Preconditioning Matrix, NeurIPS'23.
- [2023/09] Weighted Sharpness-Aware Minimization (WSAM) has been accepted by KDD'23.
- [2023/08] DLRover improves the stability of pre-trained model training over thousands of GPUs.
- [2023/04] DLRover auto-scales nodes of a DeepRec distributed training job.