Add open source community #21

Merged · 2 commits · Sep 25, 2024
README.md (37 additions, 12 deletions)
VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and maintain high accuracy.
* Lightweight Quantization Algorithm: only costs ~17 hours to quantize the 405B Llama-3.1 model
* Agile Quantization Inference: low decode overhead, best throughput, and low time to first token (TTFT)

## [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.

Read the tech report at [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf).
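To make the lookup-table idea concrete, here is a minimal, illustrative sketch of generic vector quantization, not the VPTQ algorithm itself; the shapes, the random codebook, and the helper names are assumptions chosen for illustration:

```python
# Illustrative sketch only: generic vector quantization of a weight matrix.
# This is NOT the VPTQ algorithm; shapes and the random codebook are assumptions.
import torch

def vq_quantize(weight: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each weight vector (row of `weight`) to the index of its nearest codebook centroid."""
    # weight: (num_vectors, vector_dim), codebook: (num_centroids, vector_dim)
    dists = torch.cdist(weight, codebook)  # pairwise L2 distances, (num_vectors, num_centroids)
    return dists.argmin(dim=1)             # one small integer index per vector

def vq_dequantize(indices: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate weight matrix by looking the indices up in the codebook."""
    return codebook[indices]

# Example: 4096 weight vectors of dimension 8 and a 256-entry codebook
# -> 8 bits per index, i.e. roughly 1 bit per weight (ignoring codebook storage).
weight = torch.randn(4096, 8)
codebook = torch.randn(256, 8)  # in practice learned, e.g. by k-means, rather than random
indices = vq_quantize(weight, codebook)
approx = vq_dequantize(indices, codebook)
```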

### Early Results from Tech Report
VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes with properly tuned parameters, especially in terms of model accuracy and inference speed.

<img src="assets/vptq.png" width="500">
| Model | bitwidth | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem(GB) | cost/h↓ |
|:-----------:|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19 |
| | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19 |


## Installation and Evaluation

### Dependencies

- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
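
A quick way to confirm the core dependencies are in place before building (a convenience check, not part of the repository):

```python
# Convenience check (not part of VPTQ): confirm Python, torch, and transformers versions.
import sys

import torch
import transformers

assert sys.version_info >= (3, 10), "Python 3.10+ is required"
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```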
Before installing, make sure the CUDA toolkit is on your `PATH`:

```bash
export PATH=/usr/local/cuda-12/bin/:$PATH  # adjust to your environment
```

*Installation takes several minutes because the CUDA kernels are compiled from source.*
```bash
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
```

### Models from Open Source Community

⚠️ This repository only provides the model quantization algorithm.

⚠️ The open-source community [VPTQ-community](https://huggingface.co/VPTQ-community) provides models based on the technical report and the quantization algorithm.

⚠️ This repository cannot guarantee the performance of those models.

| Model Series | Collections |
|:----------------------:|:-----------:|
| Llama 3.1 8B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-8b-instruct-without-finetune-66f2b70b1d002ceedef02d2e) |
| Llama 3.1 70B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-70b-instruct-without-finetune-66f2bf454d3dd78dfee2ff11) |
| Qwen 2.5 7B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-7b-instruct-without-finetune-66f3e9866d3167cc05ce954a) |
| Qwen 2.5 72B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-72b-instruct-without-finetune-66f3bf1b3757dfa1ecb481c0) |
| Llama 3.1 405B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-405b-instruct-without-finetune-66f4413f9ba55e1a9e52cfb0) |
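
To pre-download one of these community checkpoints (for example the ~1.875-bit Llama 3.1 70B model used in the examples below), here is a small sketch using the `huggingface_hub` package; the package and the explicit download step are assumptions, since the loaders in the examples below also download models on first use:

```python
# Optional: pre-fetch a community-provided VPTQ checkpoint with huggingface_hub.
# Convenience sketch only; the examples below also download on first use.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft",
)
print("Model files downloaded to:", local_dir)
```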

### Language Generation Example
To generate text with a pre-trained quantized model, run the following command.

The model [*VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft*](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft) (~1.875 bits) is provided by the open-source community. This repository cannot guarantee its performance.

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"
```

### Terminal Chatbot Example
Launch a chatbot in the terminal. Note that you must use a chat (instruction-tuned) model for this to work:

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft --chat
```

### Python API Example
Using the Python API:

```python
import vptq
import transformers
# Load the tokenizer and the community-provided VPTQ-quantized model
tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft", device_map='auto')

inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
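
For interactive use, token-by-token streaming can be layered on top of the same API. Below is a small sketch that reuses `tokenizer`, `m`, and `inputs` from the example above together with Hugging Face's `TextStreamer`; the streaming setup is an assumption for illustration, not part of this repository:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated, reusing `tokenizer`, `m`, and `inputs` from above.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
m.generate(**inputs, max_new_tokens=100, pad_token_id=2, streamer=streamer)
```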

### Gradio Web App Example
An environment variable controls whether a public share link is created:
`export SHARE_LINK=1`
```bash
python -m vptq.app
```


## Road Map
- [ ] Merge the quantization algorithm into the public repository.
- [ ] Submit the VPTQ method to various inference frameworks (e.g., vLLM, llama.cpp).
* We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

## Publication

EMNLP 2024 Main
```bibtex
@inproceedings{
  ...
}
```

## Limitations of VPTQ
* ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before using it in production.
* ⚠️ This repository only provides the model quantization algorithm. The open-source community may provide models based on the technical report and the quantization algorithm, but this repository cannot guarantee the performance of those models.
* ⚠️ VPTQ has not been tested across all potential applications and domains, and we cannot guarantee its accuracy and effectiveness in other tasks or scenarios.
* ⚠️ Our tests are all based on English texts; other languages are not covered by the current testing.

vptq/app.py (1 addition, 1 deletion)

```python
from vptq.app_utils import get_chat_loop_generator

chat_completion = get_chat_loop_generator("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft")
```

