Add open source community #21

Merged · 2 commits · Sep 25, 2024
README.md (37 additions, 12 deletions)
VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and maintain high accuracy.
* Lightweight Quantization Algorithm: only costs ~17 hours to quantize the 405B Llama-3.1 model
* Agile Quantization Inference: low decode overhead, best throughput, and low time to first token (TTFT)

## [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.

Read the tech report at [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf).
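To make the lookup-table idea concrete, here is a minimal, illustrative sketch of generic vector quantization, not the VPTQ algorithm itself; the shapes, the random codebook, and the helper names are assumptions chosen for illustration:

```python
# Illustrative sketch only: generic vector quantization of a weight matrix.
# This is NOT the VPTQ algorithm; shapes and the random codebook are assumptions.
import torch

def vq_quantize(weight: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each weight vector (row of `weight`) to the index of its nearest codebook centroid."""
    # weight: (num_vectors, vector_dim), codebook: (num_centroids, vector_dim)
    dists = torch.cdist(weight, codebook)  # pairwise L2 distances, (num_vectors, num_centroids)
    return dists.argmin(dim=1)             # one small integer index per vector

def vq_dequantize(indices: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate weight matrix by looking the indices up in the codebook."""
    return codebook[indices]

# Example: 4096 weight vectors of dimension 8 and a 256-entry codebook
# -> 8 bits per index, i.e. roughly 1 bit per weight (ignoring codebook storage).
weight = torch.randn(4096, 8)
codebook = torch.randn(256, 8)  # in practice learned, e.g. by k-means, rather than random
indices = vq_quantize(weight, codebook)
approx = vq_dequantize(indices, codebook)
```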

### Early Results from Tech Report
VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes with properly tuned parameters, especially in terms of model accuracy and inference speed.

<img src="assets/vptq.png" width="500">
| Model | bitwidth | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem(GB) | cost/h↓ |
|:-----------:|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19 |
| | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19 |


## Installation and Evaluation

### Dependencies

- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
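
A quick way to confirm the core dependencies are in place before building (a convenience check, not part of the repository):

```python
# Convenience check (not part of VPTQ): confirm Python, torch, and transformers versions.
import sys

import torch
import transformers

assert sys.version_info >= (3, 10), "Python 3.10+ is required"
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```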
Before installing, make sure the CUDA toolkit is on your `PATH`:

```bash
export PATH=/usr/local/cuda-12/bin/:$PATH  # adjust to your environment
```

*Installation takes several minutes because the CUDA kernels are compiled from source.*
```bash
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
```

### Models from Open Source Community

⚠️ This repository only provides the model quantization algorithm.

⚠️ The open-source community [VPTQ-community](https://huggingface.co/VPTQ-community) provides models based on the technical report and the quantization algorithm.

⚠️ This repository cannot guarantee the performance of those models.

| Model Series | Collections |
|:----------------------:|:-----------:|
| Llama 3.1 8B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-8b-instruct-without-finetune-66f2b70b1d002ceedef02d2e) |
| Llama 3.1 70B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-70b-instruct-without-finetune-66f2bf454d3dd78dfee2ff11) |
| Qwen 2.5 7B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-7b-instruct-without-finetune-66f3e9866d3167cc05ce954a) |
| Qwen 2.5 72B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-72b-instruct-without-finetune-66f3bf1b3757dfa1ecb481c0) |
| Llama 3.1 405B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-405b-instruct-without-finetune-66f4413f9ba55e1a9e52cfb0) |
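
To pre-download one of these community checkpoints (for example the ~1.875-bit Llama 3.1 70B model used in the examples below), here is a small sketch using the `huggingface_hub` package; the package and the explicit download step are assumptions, since the loaders in the examples below also download models on first use:

```python
# Optional: pre-fetch a community-provided VPTQ checkpoint with huggingface_hub.
# Convenience sketch only; the examples below also download on first use.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft",
)
print("Model files downloaded to:", local_dir)
```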

### Language Generation Example
To generate text with a pre-trained quantized model, run the following command.

The model [*VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft*](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft) (~1.875 bits) is provided by the open-source community. This repository cannot guarantee its performance.

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"
```

### Terminal Chatbot Example
Launch a chatbot in the terminal. Note that you must use a chat (instruction-tuned) model for this to work:

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft --chat
```

### Python API Example
Using the Python API:

```python
import vptq
import transformers
# Load the tokenizer and the community-provided VPTQ-quantized model
tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft", device_map='auto')

inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
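
For interactive use, token-by-token streaming can be layered on top of the same API. Below is a small sketch that reuses `tokenizer`, `m`, and `inputs` from the example above together with Hugging Face's `TextStreamer`; the streaming setup is an assumption for illustration, not part of this repository:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated, reusing `tokenizer`, `m`, and `inputs` from above.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
m.generate(**inputs, max_new_tokens=100, pad_token_id=2, streamer=streamer)
```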

### Gradio Web App Example
An environment variable controls whether a public share link is created:
`export SHARE_LINK=1`
```bash
python -m vptq.app
```


## Road Map
- [ ] Merge the quantization algorithm into the public repository.
- [ ] Submit the VPTQ method to various inference frameworks (e.g., vLLM, llama.cpp).
* We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

## Publication

EMNLP 2024 Main
```bibtex
@inproceedings{
  ...
}
```

## Limitations of VPTQ
* ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before using it in production.
* ⚠️ This repository only provides the model quantization algorithm. The open-source community may provide models based on the technical report and the quantization algorithm, but this repository cannot guarantee the performance of those models.
* ⚠️ VPTQ has not been tested across all potential applications and domains, and we cannot guarantee its accuracy and effectiveness in other tasks or scenarios.
* ⚠️ Our tests are all based on English texts; other languages are not covered by the current testing.

vptq/app.py (1 addition, 1 deletion)

```python
from vptq.app_utils import get_chat_loop_generator

chat_completion = get_chat_loop_generator("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft")
```

