Releases: LostRuins/koboldcpp
koboldcpp-1.42.1
- Added support for LLAMA GGUFv2 models, handled automatically. All older models will still continue to work normally.
- Fixed a problem with certain logit values that were causing segfaults when using the Typical sampler. Please let me know if it happens again.
- Merged ROCm support from @YellowRoseCx, so you should now be able to create AMD-compatible GPU builds with HIPBLAS, which should be faster than using CLBlast.
- Merged upstream support for GGUF Falcon models. Note that GPU layer offload for Falcon is unavailable with `--useclblast` but works with CUDA. Older pre-GGUF Falcon models are not supported.
- Added support for unbanning EOS tokens directly from the API, and by extension it can now be triggered from the Lite UI settings. Note: your command line `--unbantokens` flag will force override this.
- Added support for automatic RoPE scale calculations based on a model's training context (n_ctx_train). This triggers if you do not explicitly specify a `--ropeconfig`. For example, this means llama2 models will (by default) use a smaller rope scale compared to llama1 models for the same specified `--contextsize`. Setting `--ropeconfig` will override this; see the example after this list. (Reverted in 1.42.1 for now, as it was not set up correctly.)
- Updated Kobold Lite, now with Tavern-style portraits in Aesthetic Instruct mode.
- Pulled other fixes and improvements from upstream.
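For example, if you want to pin the RoPE scaling yourself rather than rely on any automatic calculation, you can pass `--ropeconfig` explicitly at launch. A minimal sketch with a hypothetical model filename; the two values are the RoPE frequency scale and frequency base (see the 1.36 notes further down):

```
# Force a 4x linear stretch so an older llama1 model (2K native) can reach 8K context
koboldcpp.exe --model mymodel.gguf --contextsize 8192 --ropeconfig 0.25 10000
```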
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.41 (beta)
It's been a while since the last release, and quite a lot has changed upstream under the hood, so consider this release a beta.
- Added support for LLAMA GGUF models, handled automatically. All older models will still continue to work normally. Note that GGUF format support for other non-llama architectures has not been added yet.
- Added a `--config` flag to load a `.kcpps` settings file when launching from the command line (Credits: @poppeman). These files can also be imported/exported from the GUI.
- Added a new endpoint `/api/extra/tokencount` which can be used to tokenize any string and accurately measure how many tokens it has (see the example after this list).
- Fix for bell characters occasionally causing the terminal to beep in debug mode.
- Fix for incorrect list of backends & missing backends displayed in the GUI.
- Set MMQ to be the default for CUDA when running from GUI.
- Updated Lite, and merged all the improvements and fixes from upstream.
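As a rough sketch of calling the new endpoint with curl (the request and response field names are an assumption based on the usual KoboldCpp API and may differ in your build):

```
# POST any string and get back its token count
curl -s -X POST http://localhost:5001/api/extra/tokencount \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Niko the kobold stalked carefully down the alley."}'
```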
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.40.1
This release is mostly bugfixes for the previous one, but enough small things have changed that I chose to make it a new version rather than a patch.
- Fixed a regression in format detection for LLAMA 70B.
- Converted the embedded horde worker into daemon mode, which hopefully solves the occasional exceptions.
- Fixed some OOMs for `--blasbatchsize 2048`, adjusted buffer sizes.
- Slight modification to the look-ahead (2 to 5%) for the CUDA pool malloc.
- Pulled some bugfixes from upstream
- Added a new field `idle` to the `/api/extra/perf` endpoint, which allows checking whether a generation is in progress without sending one (see the example after this list).
- Fixed cmake compilation for CUDA Toolkit 12.
- Updated Lite, includes option for aesthetic instruct UI (early beta by @Lyrcaxis, please send them your feedback)
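A quick sketch of checking that flag with curl (response fields other than `idle` may vary by build):

```
# The "idle" field reports whether the server is currently free (no generation in progress)
curl -s http://localhost:5001/api/extra/perf
```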
hotfix 1.40.1:
- handle stablecode-completion-alpha-3b
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.39.1
- Fix SSE streaming to handle headers correctly during abort (Credits: @duncannah)
- Bugfix for `--blasbatchsize -1` and `1024` (fixes an alloc blocks error).
- Added experimental support for `--blasbatchsize 2048` (note: buffers are doubled if it is selected, using much more memory).
- Added support for 12k and 16k `--contextsize` options; please let me know if you encounter issues (see the launch example after this list).
- Pulled upstream improvements, further CUDA speedups for MMQ mode for all quant types.
- Fix for some LLAMA 65B models being detected as LLAMA2 70B models.
- Reverted to the upstream approach for CUDA pool malloc (in 1.39.1, done only for MMQ).
- Updated Lite; includes support for importing Tavern V2 card formats with world info (character book), and clearer settings edit boxes.
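For instance, a launch combining the new options might look like the sketch below (model filename is hypothetical; 16k corresponds to `--contextsize 16384`):

```
# Large prompt-processing batches plus 16k context; note that 2048 doubles the BLAS buffers
koboldcpp.exe --model mymodel.bin --contextsize 16384 --blasbatchsize 2048
```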
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.38
- Added upstream support for Quantized MatMul (MMQ) prompt processing, a new option for CUDA (enabled by adding `--usecublas mmq` or via a toggle in the GUI). This uses slightly less memory, and is slightly faster for Q4_0 but slower for K-quants (see the launch example after this list).
- Fixed SSE streaming for multibyte characters (for Tavern compatibility).
- `--noavx2` mode no longer uses OpenBLAS (same as Failsafe); this is due to numerous compatibility complaints.
- GUI dropdown preset only displays built platforms (Credit: @YellowRoseCx)
- Added a Help button in the GUI
- Fixed an issue with mirostat not reading correct value from GUI
- Fixed an issue with context size slider being limited to 4096 in the GUI
- Displays a terminal warning if the received context exceeds the maximum context allocated by the launcher
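A minimal launch sketch enabling the new MMQ path (model filename is hypothetical):

```
# Quantized MatMul prompt processing: slightly less VRAM, faster for Q4_0, slower for K-quants
koboldcpp.exe --model mymodel.bin --usecublas mmq
```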
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.37.1
- NEW: KoboldCpp now comes with an embedded Horde Worker, which allows anyone to share their ggml models with the AI Horde without downloading additional dependencies. `--hordeconfig` now accepts 5 parameters, `[hordemodelname] [hordegenlength] [hordemaxctx] [hordeapikey] [hordeworkername]`; filling in all 5 will start a Horde worker for you that serves Horde requests automatically in the background (see the launch example after this list). For the previous behavior, exclude the last 2 parameters and continue using your own Horde worker (e.g. HaidraScribe/KAIHordeBridge). This feature can also be enabled via the GUI.
- Added support for LLAMA2 70B models. This should work automatically; GQA will be set to 8 if it's detected.
- Fixed a bug with mirostat v2 that was causing overly deterministic results. Please try it again. (Credit: @ycros)
- Added additional information to `/api/extra/perf` for the last generation, including the stopping reason as well as generated token counts.
- Exposed the parameter `--tensor_split`, which works exactly like it does upstream. Only for CUDA.
- Try to support Kepler as a target for CUDA as well, on henky's suggestion. Can't guarantee it will work as I don't have a K80, but it might.
- Retained support for `--blasbatchsize 1024` after it was removed upstream. Scratch & KV buffer sizes will be larger when using this.
- Minor bugfixes, pulled other upstream fixes and optimizations, updated Kobold Lite (chat mode improvements).
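A launch sketch for the embedded worker, with the five positional values in the order listed above (every value here is a placeholder; use your own AI Horde API key and worker name):

```
# --hordeconfig [hordemodelname] [hordegenlength] [hordemaxctx] [hordeapikey] [hordeworkername]
koboldcpp.exe --model mymodel.bin --hordeconfig MyModelName 256 2048 0000000000 MyWorkerName
```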
Hotfix 1.37.1
- Fixed clblast to work correctly for LLAMA2 70B
- Fixed sending Client-Agent for embedded horde worker in addition to Bridge Agent and User Agent
- Changed `rms_norm_eps` to `5e-6` for better results on both llama1 and llama2
- Fixed some streaming bugs in Lite
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.36
- Reverted an upstream change to `sched_yield()` that caused slowdowns for certain systems. This should fix speed regressions in 1.35. If you're still experiencing poorer speeds compared to earlier versions, please raise an issue with details.
- Reworked the command line args on RoPE for extended context to be similar to upstream. Thus, `--linearrope` has been removed. Instead, you can now use `--ropeconfig` to customize both the RoPE frequency scale (Linear) and RoPE frequency base (NTK-Aware) values, e.g. `--ropeconfig 0.5 10000` for a 2x linear scale. By default, long context NTK-Aware RoPE will be automatically configured based on your `--contextsize` parameter, similar to previously. If you're using LLAMA2 at 4K context, you'd probably want to use `--ropeconfig 1.0 10000` to take advantage of the native 4K tuning without scaling (see the launch example after this list). For ease of use, this can be set in the GUI too.
- Expose additional token counter information through the API `/api/extra/perf`
- The warning for poor sampler orders has been limited to show only once per session, and excludes mirostat. I've heard some people have issues with it, so please let me know if it's still causing problems, though it's only a text warning and should not affect actual operation.
- Model busy flag replaced by Thread Lock, credits @ycros.
- Tweaked scratch and KV buffer allocation sizes for extended context.
- Updated Kobold Lite, with better whitespace trim support and a new toggle for partial chat responses.
- Pulled other upstream fixes and optimizations.
- Downgraded CUDA windows libraries to 11.4 for smaller exe filesizes, same version previously tried by @henk717. Please do report any issues or regressions encountered with this version.
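Putting the LLAMA2 example above into a full launch line (model filename is hypothetical):

```
# Native 4K context on a llama2 model, with no RoPE scaling applied
koboldcpp.exe --model mymodel.bin --contextsize 4096 --ropeconfig 1.0 10000
```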
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.35
Note: This build adds significant changes for CUDA and may be less stable than normal - please report any performance regressions or bugs you encounter. It may be slower than usual. If that is the case, please use the previous version for now.
- Enabled the CUDA 8-bit MMV mode (see ggerganov#2067), now that it seems stable enough and works correctly. This approach uses quantized dot products instead of the traditional DMMV approach for the formats `q4_0`, `q4_1`, `q5_0` and `q5_1`. If you're able to do a full GPU offload, then CUDA for such models will likely be significantly faster than before. K-quants and CL are not affected.
- Exposed performance information through the API (prompt processing and generation timing); access it with `/api/extra/perf`
- Added support for linear RoPE as an alternative to NTK-Aware RoPE (similar to 1.33, but using 2048 as a base). This is triggered by the launcher parameter `--linearrope`. The RoPE scale is determined by the `--contextsize` parameter, thus for best results on SuperHOT models you should launch with `--linearrope --contextsize 8192`, which provides a `0.25` linear scale as the SuperHOT finetune suggests (see the launch example after this list). If `--linearrope` is not specified, then NTK-aware RoPE is used by default.
- Added a Save and Load settings option to the GUI launcher.
- Added the ability to select "All Devices" in the GUI for CUDA. You are still recommended to select a specific device - split GPU is usually slower.
- Displays a warning if poor sampler orders are used, as the default configuration will give much better results.
- Updated Kobold Lite, pulled other upstream fixes and optimizations.
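As a concrete launch line for the SuperHOT case described above (model filename is hypothetical):

```
# 8K context with a 0.25 linear RoPE scale, as the SuperHOT finetune suggests
koboldcpp.exe --model mymodel.bin --linearrope --contextsize 8192
```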
1.35.H Henk-Cuda Hotfix: This is an alternative version from Henk that you can try if you encounter speed reductions. Please let me know if it's better for you.
Henk may have newer versions at https://github.com/henk717/koboldcpp/releases/tag/1.35 please check that out for now. I will be able to upstream any fixes only in a few days.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.34.2
This is a BIG update. Changes:
- Added a brand new `customtkinter` GUI which contains many more configurable settings. To use this new UI, the python module `customtkinter` is required for Linux and OSX (already included with the Windows .exe builds). The old GUI is still available otherwise. (Thanks: @Vali-98)
- Switched to NTK-aware scaling for RoPE, set based on the `--contextsize` parameter, with support up to 8K context. This seems to perform much better than the previous dynamic linear method, even on untuned models. It still won't work perfectly for SuperHOT 8K, as that model requires a fixed 0.25 linear rope scale, but I think this approach is better in general. Note that the alpha value chosen is applied when you select the `--contextsize`, so for best results, only set a big `--contextsize` if you need it, since there will be minor perplexity loss otherwise.
- Enabled support for NTK-Aware scaled RoPE for GPT-NeoX and GPT-J too! Surprisingly, long context does work decently with older models, so you can enjoy something like Pyg6B or Pythia with 4K context if you like.
- Added `/generate` API support for sampler_order and mirostat/tau/eta parameters, which you can now set per-generation (see the example after this list). (Thanks: @ycros)
- Added `--bantokens`, which allows you to specify a list of token substrings that the AI cannot use. For example, `--bantokens [ a ooo` prevents the AI from using any left square brackets, the letter `a`, or any token containing `ooo`. This bans all instances of matching tokens!
- Added more granular context size options; now you can select 3k and 6k context sizes as well.
- Added the ability to select the main GPU to use when using CUDA. For example, `--usecublas lowvram 2` will use the third Nvidia GPU if it exists.
- Pulled updates from RWKV.cpp, minor speedup for prompt processing.
- Fixed build issues on certain older and OSX platforms; GCC 7 should now be supported. Please report any issues you find.
- Pulled fixes and updates from upstream, Updated Kobold Lite. Kobold Lite now allows you to view submitted contexts after each generation. Also includes two new scenarios and limited support for Tavern v2 cards.
- Adjusted scratch buffer sizes for big contexts, so unexpected segfaults/OOM errors should be less common (please report any you find). CUDA scratch buffers should also work better now (upstream fix).
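A sketch of a per-request override through the KoboldAI-compatible generate endpoint (the `/api/v1/generate` route, the exact field names, and the values shown are assumptions based on the standard KoboldAI API):

```
# Set sampler order and mirostat parameters for this single generation
curl -s -X POST http://localhost:5001/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Once upon a time,",
        "max_length": 80,
        "sampler_order": [6, 0, 1, 3, 4, 2, 5],
        "mirostat": 2,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1
      }'
```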
1.34.1a Hotfix (CUDA): CUDA was completely broken; did a quick revert to get it working. Will upload a proper build later.
1.34.2 Hotfix: CUDA kernels now updated to the latest version; used Python to handle the GPU selection instead.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
koboldcpp-1.33 Ultimate Edition
A.K.A The "We CUDA had it all edition"
- The KoboldCpp Ultimate edition is an All-In-One release with previously missing CUDA features added in, with options to support both CL and CUDA properly in a single distributable. You can now select CUDA mode with `--usecublas`, and optionally low VRAM mode using `--usecublas lowvram` (see the launch example after this list). This release also contains support for OpenBLAS, CLBlast (via `--useclblast`), and CPU-only (No BLAS) inference.
- Back-ported CUDA support for all prior versions of GGML file formats. CUDA mode now correctly supports every single earlier version of GGML files (earlier quants from GGML, GGMF, and GGJT v1, v2 and v3, with the respective feature sets at the time they were released, should load and work correctly).
- Ported over the memory optimizations I added for OpenCL to CUDA, now CUDA will use less VRAM, and you may be able to use even more layers than upstream in llama.cpp (testing needed).
- Ported over CUDA GPU acceleration via layer offloading for MPT, GPT-2, GPT-J and GPT-NeoX.
- Updated Lite, pulled updates from upstream, various minor bugfixes. Also, instruct mode now allows any number of newlines in the start and end tag, configurable by user.
- Added long context support using Scaled RoPE for LLAMA, which you can use by setting `--contextsize` greater than 2048. It is based off the PR here: ggerganov#2019, and should work reasonably well up to over 3k context, possibly higher.
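For example, a CUDA launch using the low VRAM option together with the new scaled-RoPE long context might look like this sketch (model filename is hypothetical):

```
# CUDA with reduced VRAM usage and a context size above the old 2048 limit
koboldcpp.exe --model mymodel.bin --usecublas lowvram --contextsize 4096
```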
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Then once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program with the `--help` flag.