ggerganov#7527
Apologies if you are already aware of this (you most likely are), but this PR allows the KV cache to be quantized down to as small as q4_0 when loading a model, which dramatically reduces the amount of VRAM needed to offload layers in large models. I don't think there is a flag in koboldcpp to use this feature. It would really help with loading large models on small GPUs.
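For a rough sense of the savings, here is a back-of-envelope comparison of KV-cache size at f16 versus q4_0. The model dimensions are illustrative assumptions (roughly a 70B-class model with grouped-query attention), not figures from the PR; the q4_0 size uses the ggml block layout of 18 bytes per 32 elements (4.5 bits each).

```python
# Back-of-envelope KV-cache size: f16 vs q4_0.
# All model dimensions below are assumptions for illustration only.
n_layers   = 80      # transformer layers (assumed)
n_kv_heads = 8       # KV heads (assumed, grouped-query attention)
head_dim   = 128     # dimension per head (assumed)
n_ctx      = 4096    # context length (assumed)

# K and V tensors together, one slot per layer/position/head/dim
elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim

f16_bytes  = elems * 2        # f16: 2 bytes per element
q4_0_bytes = elems * 18 / 32  # q4_0: 18-byte blocks of 32 elements

print(f"f16 : {f16_bytes / 2**30:.2f} GiB")   # 1.25 GiB
print(f"q4_0: {q4_0_bytes / 2**30:.2f} GiB")  # 0.35 GiB
print(f"reduction: {f16_bytes / q4_0_bytes:.2f}x")  # 3.56x
```

In recent llama.cpp builds this is exposed through the `-ctk`/`-ctv` (`--cache-type-k`/`--cache-type-v`) options; koboldcpp would presumably need to surface something equivalent.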