ggerganov#7527
Apologies if you are already aware of this (you most likely are), but this PR allows the KV cache to be quantized down to as small as q4_0 when loading a model, which dramatically reduces the amount of VRAM needed to offload layers in large models. I don't think there is a flag in koboldcpp to use this feature. It would really help with loading large models on small GPUs.
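For a rough sense of the savings, here is a back-of-envelope comparison of KV-cache size at f16 versus q4_0. The model dimensions are illustrative assumptions (roughly a 70B-class model with grouped-query attention), not figures from the PR; the q4_0 size uses the ggml block layout of 18 bytes per 32 elements (4.5 bits each).

```python
# Back-of-envelope KV-cache size: f16 vs q4_0.
# All model dimensions below are assumptions for illustration only.
n_layers   = 80      # transformer layers (assumed)
n_kv_heads = 8       # KV heads (assumed, grouped-query attention)
head_dim   = 128     # dimension per head (assumed)
n_ctx      = 4096    # context length (assumed)

# K and V tensors together, one slot per layer/position/head/dim
elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim

f16_bytes  = elems * 2        # f16: 2 bytes per element
q4_0_bytes = elems * 18 / 32  # q4_0: 18-byte blocks of 32 elements

print(f"f16 : {f16_bytes / 2**30:.2f} GiB")   # 1.25 GiB
print(f"q4_0: {q4_0_bytes / 2**30:.2f} GiB")  # 0.35 GiB
print(f"reduction: {f16_bytes / q4_0_bytes:.2f}x")  # 3.56x
```

In recent llama.cpp builds this is exposed through the `-ctk`/`-ctv` (`--cache-type-k`/`--cache-type-v`) options; koboldcpp would presumably need to surface something equivalent.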