LLM Papers We Recommend to Read

The past several years have marked the steady rise of large language models (LLMs), largely driven by advances in computational power, data availability, and algorithmic innovation. LLMs have profoundly shaped the research landscape, introducing new methodologies and paradigms that challenge traditional approaches.

We have also expanded our research interests to the field of LLMs. Below are some research papers related to LLMs. We highly recommend that beginners read and thoroughly understand these papers.

😄 We welcome and value any contributions.

Basic Architectures of LLMs

Title Link
Sequence to Sequence Learning with Neural Networks [paper]
Transformer: Attention Is All You Need [paper]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [paper]
GPT: Improving Language Understanding by Generative Pre-Training [paper]
GPT2: Language Models are Unsupervised Multitask Learners [paper]
GPT3: Language Models are Few-Shot Learners [paper]
GPT3.5: Fine-Tuning Language Models from Human Preferences [paper]
LLaMA: Open and Efficient Foundation Language Models [paper]
Llama 2: Open Foundation and Fine-Tuned Chat Models [paper]

Multimodal Large Language Models

Title Link
Efficient Multimodal Large Language Models: A Survey [paper]
CLIP: Learning Transferable Visual Models From Natural Language Supervision [paper]

Parallel Training Systems

Title Link
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [paper]
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [paper]
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning [paper]
ZeRO-Offload: Democratizing Billion-Scale Model Training [paper]
PipeDream: generalized pipeline parallelism for DNN training [paper]
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [paper]
TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [paper]
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [paper]
PanGu-$\Sigma$: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing [paper]
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [paper]
Accelerating Distributed MoE Training and Inference with Lina [paper]
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism [paper]
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [paper]

LLM Serving System

I think the work on LLM serving can be categorized into the following fields: systematic optimization (e.g., vLLM), scheduling optimization (e.g., DistServe, Llumnix), offloading (e.g., FlexGen), prefix-sharing, KV cache compression/eviction/selection, and speculative decoding.

I will sort the papers below into these categories in the future.

Title Link
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems [paper]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [paper]
Efficiently Scaling Transformer Inference [paper]
vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention [paper]
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale [paper]
Orca: A Distributed Serving System for Transformer-Based Generative Models [paper]
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [paper]
S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput [paper]
Splitwise: Efficient generative LLM inference using phase splitting [paper]
SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification [paper]
Petals: Collaborative Inference and Fine-tuning of Large Models [paper]
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU [paper]
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [paper]
LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism [paper]
Vidur: A Large-Scale Simulation Framework For LLM Inference [paper]
Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [paper]
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving [paper]
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills [paper]
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve [paper]
Llumnix: Dynamic Scheduling for Large Language Model Serving [paper]
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving [paper]
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [paper]
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models [paper]
Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs [paper]
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [paper]
EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving [paper]

Serving LLMs with Multiple LoRAs

Title Link
PetS: A Unified Framework for Parameter-Efficient Transformers Serving [paper]
Punica: Multi-Tenant LoRA Serving [paper]
S-LoRA: Serving Thousands of Concurrent LoRA Adapters [paper]
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving [paper]

Parameter-Efficient Fine-Tuning (PEFT)

Title Link
Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models [paper]
Parameter-Efficient Transfer Learning for NLP [paper]
Prefix-Tuning: Optimizing Continuous Prompts for Generation [paper]
LoRA: Low-Rank Adaptation of Large Language Models [paper]
Towards a Unified View of Parameter-Efficient Transfer Learning [paper]
Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [paper]
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method [paper]
Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning [paper]
DoRA: Weight-Decomposed Low-Rank Adaptation [paper]
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [paper]

Compression (Quantization, Sparsity)

Title Link
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [paper]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [paper]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [paper]
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [paper]
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time [paper]
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving [paper]
QLoRA: Efficient Finetuning of Quantized LLMs [paper]