Releases: LostRuins/koboldcpp
koboldcpp-1.53
koboldcpp-1.53
- Added support for SSL. You can now import your own SSL cert to use with KoboldCpp and serve it over HTTPS with
--ssl [cert.pem] [key.pem]
or via the GUI. The.pem
files must be unencrypted, you can also generate them with OpenSSL, eg.openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -config openssl.cnf -nodes
for your own self signed certificate. - Added support for presence penalty (alternative rep pen) over the KAI API and in Lite. If Presence Penalty is set over the OpenAI API, and
rep_pen
is not set, thenrep_pen
will be set to a default of 1.0 instead of 1.1. Both penalties can be used together, although this is probably not a good idea. - Added fixes for Broken Pipe error, thanks @mahou-shoujo.
- Added fixes for aborting ongoing connections while streaming in SillyTavern.
- Merged upstream support for Phi models and speedups for Mixtral
- The default non-blas batch size for GGUF models is now increased from 8 to 32.
- Merged HIPBlas fixes from @YellowRoseCx
- Fixed an issue with building convert tools in 1.52
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.
koboldcpp-1.52.2
koboldcpp-1.52.2
something old, something new edition
- NEW: Added a new bare-bones KoboldCpp NoScript WebUI, which does not require Javascript to work. It should be W3C HTML compliant and should run on every browser in the last 20 years, even text-based ones like Lynx (e.g. in the terminal over SSH). It is accessible by default at
/noscript
e.g. http://localhost:5001/noscript . This can be helpful when running KoboldCpp from systems which do not support a modern browser with Javascript. - Partial per-layer KV offloading is now merged for CUDA. Important: this means that the number of layers you can offload to GPU might be reduced, as each layer now takes up more space. To avoid per-layer KV offloading, use the
--usecublas lowvram
option (equivalent to-nkvo
in llama.cpp). Fully offloaded models should behave the same as before. - The
/api/extra/tokencount
endpoint now also returns an array of token ids in the response body from the tokenizer. - Merged support for QWEN and Mixtral from upstream. Note: Mixtral seems to perform large batch prompt processing extremely slowly. This is probably an implementation issue. For now, you might have better luck using
--noblas
or setting--blasbatchsize -1
when using Mixtral - Selecting a .kcpps in the GUI when choosing a model will load the model specified inside that config file instead.
- Added the Mamba Multitool script (from @henk717). This is a shell script that can be used in Linux to setup an environment with all dependencies required for building and running KoboldCpp on Linux.
- Improved KCPP Embedded Horde Worker fault tolerance, should now gracefully backoff for increasing durations whenever encountering errors polling from AI Horde, and will automatically recover from up to 24 hours of Horde downtime.
- Added a new parameter that shows number of Horde Worker errors in the
/api/extra/perf
endpoint, this can be used to monitor your embedded horde worker if it goes down. - Pulled other fixes and improvements from upstream, updated Kobold Lite, added asynchronous file autosaves (thanks @aleksusklim), various other improvements.
Hotfix 1.52.1: Fixed 'not enough memory' loading errors for large (20B+) models. See #563
NEW: Added Linux PyInstaller binaries
Hotfix 1.52.2: Merged fixes for Mixtral prompt processing
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.
koboldcpp-1.51.1
koboldcpp-1.51.1
all quiet on the kobold front edition
- Added a new flag
--quiet
which allows you to suppress input and outputs from appearing in the console. - When context shift is enabled, allocate a small amount (about 80 tokens) of reserved space to reduce the
Failed to predict
errors that occur due to running out of KV cache space caused by KV cache fragmentation when shifting. - Auto rope scaling will not be automatically applied if the model already overrides the RoPE freq scale with a value below 1.
- Increased the graph node limit for older models to fix AiDungeon GPT2 not working.
- Display the available endpoint KAI and OAI URLs in the terminal on startup.
- Updated some API examples in the documentation
--multiuser
now accepts an extra optional parameter that indicates how many concurrent requests to allow to queue. If unset, or set to 1, it defaults to the default value of 5.- Pulled fixed and improvements from upstream, updated Kobold Lite, fixed Chub imports, optimized for Firefox, added multiline input in aesthetic mode, various other improvements.
1.51.1 Hotfix:
- Reverted an upstream change that caused a CLBlast segfault that occurred when context size exceeded 2k.
- Stripped out the OAI SSE carriage return after end message that was causing issues in Janitor.
- Moved the 80 extra tokens allocated for handling KV fragmentation to be added on top of the specified max context length instead of subtracted from it at runtime, which could cause padding issues when counting tokens in Tavern. This means that loading
--contextsize 2048
will actually allocate a size of 2128 behind the scenes for example. - Changed the API url printouts to include the tunnel url when using
--remotetunnel
Added a linux test build provided by @henk717
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.
koboldcpp-1.50.1
koboldcpp-1.50.1
- Improved automatic GPU layer selection: In the GUI launcher with CuBLAS, it will now automatically select all layers to do a full GPU offload if it thinks you have enough VRAM to support it.
- Added a short delay to the Abort function in Lite, hopefully fixes the glitches with retry and abort.
- Fixed automatic RoPE values for Yi and Deepseek. If no
--ropeconfig
is set, the preconfigured rope values in the model now take priority over the automatic context rope scale. - The above fix should also allow YaRN RoPE scaled models to work correctly by default, assuming the model has been correctly converted. Note: Customized YaRN configurations flags are not yet available.
- The OpenAI compatible
/v1/completions
has been enhanced, adding extra unofficial parameters that Aphrodite uses, such as Min-P, Top-A and Mirostat. However, OpenAI does not support separatememory
fields or sampler order, so the Kobold API will still give better results there. - SSE streaming support has been added for OpenAI
/v1/completions
endpoint (tested working in SillyTavern) - Custom DALL-E endpoints are now supported, for use with OAI proxies.
- Pulled fixed and improvements from upstream, updated Kobold Lite
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Hotfix 1.50.1:
- Fixed a regression with older RWKV/GPT-2/GPT-J/GPT-NeoX models that caused a segfault.
- If ropeconfig is not set, apply auto linear rope scaling multiplier for rope-tuned models such as Yi when used outside their original context limit.
- Fixed another bug in Lite with the retry/abort button.
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.
koboldcpp-1.49
koboldcpp-1.49
- New API feature: Split Memory - The generation payload also supports a new field
memory
in addition to the usualprompt
field. If set, forcefully appends this string to the beginning of any submitted prompt text. If resulting context exceeds the limit, forcefully overwrites text from the beginning of the main prompt until it can fit. Useful to guarantee full memory insertion even when you cannot determine exact token count. Automatically used in Lite. - New API feature:
trim_stop
can be added to the generate payload. If true, removes detected stop_sequences from the output and truncates all text after them. Does not work with SSE streaming. - New API feature:
--preloadstory
now allows you to specify a json file (such as a story savefile) when launching the server. This file will be hosted on the server at/api/extra/preloadstory
, which frontends (such as Kobold Lite) can access over the API. - Pulled various improvements and fixes from upstream llama.cpp
- Updated Kobold Lite, added new TTS options and fixed some bugs with the Retry button when Aborting. Added support for World Info inject position, split memory and preloaded stories. Also added support for optional image generation using DALL-E 3 (OAI API).
- Fixed KoboldCpp colab prebuilts crashing on some older Colab CPUs. It should now also work on A100 and V100 GPUs in addition to the free tier T4s. If it fails, try enabling the ForceRebuild checkbox.
LLAMA_PORTABLE=1
makefile flag can now be used when making builds that target colab or Docker. - Various other minor fixes.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.
koboldcpp-1.48.1
koboldcpp-1.48.1
Harder Better Faster Stronger Edition
- NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext.
- Note: Context Shifting is enabled by default, and will override smartcontext if both are enabled. Context Shifting still needs more testing. Your outputs may be different with shifting enabled, but both seem equally coherent. To disable Context Shifting, use the flag
--noshift
. If you observe a bug, please report and issue or send a PR fix.
- Note: Context Shifting is enabled by default, and will override smartcontext if both are enabled. Context Shifting still needs more testing. Your outputs may be different with shifting enabled, but both seem equally coherent. To disable Context Shifting, use the flag
- 'Tensor Core' Changes: KoboldCpp now handles MMQ/Tensor Cores differently from upstream. Here's a breakdown:
- old approach (everybody): if mmq is enabled, just use mmq. If cublas is enabled, just use cublas. MMQ dimensions set to "FAVOR BIG"
- new approach (upstream llama.cpp): you cannot toggle mmq anymore. It is always enabled. MMQ dimensions set to "FAVOR SMALL". CuBLAS always kicks in if batch > 32.
- new approach (koboldcpp): you CAN toggle MMQ. It is always enabled, until batch > 32, then CuBLAS only kicks in if MMQ flag is false, otherwise it still uses MMQ for all batches. MMQ dimensions set to "FAVOR BIG".
- Added GPU Info Display and Auto GPU Layer Selection For Newbies - Uses a combination of
clinfo
andnvidia-smi
queries to automatically determine and display the user's GPU name in the GUI, and helps newbies suggest the GPU layers to use when first choosing a model, based on available VRAM and model filesizes. Not optimal, but it should give usable defaults and be even more newbie friendly now. You can thereafter edit the actual GPU layers to use. (Credit: Original concept adapted from @YellowRoseCx ) - Added Min-P sampler - It is now available over the API, and can also be set in Lite from the Advanced settings tab. (Credit: @kalomaze)
- Added
--remotetunnel
flag, which downloads and creates a TryCloudFlare remote tunnel, allowing you to access koboldcpp remotely over the internet even behind a firewall. Note: This downloads a tool calledCloudflared
to the same directory. - Added a new build target for Windows exe users
koboldcpp_clblast_noavx2
, now providing a "CLBlast NoAVX2 (Old CPU)" option that finally supports CLBlast acceleration for windows devices without AVX2 intrinsics. - Include
Content-Length
header in responses. - Fixed some crashes with other uncommon models in cuda mode.
- Retained support for GGUFv1, but you're encouraged to update as upstream has removed support.
- Minor tweaks and optimizations to streaming timings. Fixed segfault that happens when streaming in multiuser mode and aborting connection halfway.
freq_base_train
is now taken into account when setting automatic rope scale, that should handle codellama correctly now.- Updated Kobold Lite, added support for selecting Min-P and Sampler Seeds (for proper deterministic generation).
- Improved KoboldCpp Colab, now with prebuilt CUDA binaries. Time to load after launch is less than a minute, excluding model downloads. Added a few more default model options, you can also use any custom GGUF model URL. (Try it here!)
Hotfix 1.48.1 - Fixed issues with Multi-GPU setups. GUI defaults to CuBLAS if available. Other minor fixes
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.
koboldcpp-1.47.2
koboldcpp-1.47.2
- Added OpenAI optional adapter from #466 (thanks @lofcz) . This is an unofficial extension of the v1 OpenAI Chat Completions endpoint that allows customization of the instruct tags over the API. The Kobold API still provides better functionality and flexibility overall.
- Pulled upstream support for ChatML added token merges (they have to be from a correctly converted GGUF model though, overall ChatML is still an inferior prompt template compared to Alpaca/Vicuna/LLAMA2).
- Embedded Horde Worker improvements: Added auto-recovery pause timeout on too many errors, instead of halting the worker outright. The worker will still be halted if the total error count exceeds a high enough threshold.
- Bug fixes for a multiuser race condition in polled streaming and for Top-K values being clamped (thanks @raefu @kalomaze)
- Improved server CORS and content-type handling.
- Added GUI input for tensor_split fields (thanks @AAbushady)
- Fixed support for GGUFv1 Falcon models, which was broken due to the upstream rewrite of the BPE tokenizer.
- Pulled other fixes and optimizations from upstream
- Updated KoboldCpp Colab, now with the new Tiefighter model (try it here)
Hotfix 1.47.1 - Fixed a race condition with SSE streaming. Tavern streaming should be reliable now.
Hotfix 1.47.2 - Fixed an issue with older multilingual GGUFs needing an alternate BPE tokenizer.
Updates for Embedded Kobold Lite:
- SSE streaming for Kobold Lite has been implemented! It requires a relatively recent browser. Toggle it on in settings.
- Added Browser Storage Save Slots! You can now directly save stories within the browser session itself. This is intended to be a temporary storage allowing you to swap between and try multiple stories - the browser storage is wiped when the browser cache/history is cleared!
- Added World Info Search Depth
- Added Group Chat Management Panel (You can temporarily toggle the participants in a group chat)
- Added AUTOMATIC1111 integration! It's finally here, you can now generate images from a local A1111 install, as an alternative to Horde,
- Lots of miscellaneous fixes and improvements. If you encounter any issues, do report them here.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.
koboldcpp-1.46.1
koboldcpp-1.46.1
Important: Deprecation Notice for KoboldCpp 1.46
- The following command line arguments are deprecated and have been removed from this version on.
--psutil_set_threads - parameter will be removed as it's now generally unhelpful, the defaults are usually sufficient.
--stream - a Kobold Lite only parameter, which is now a toggle saved inside Lite's settings and thus no longer necessary.
--unbantokens - EOS unbans should only be set via the generate API, in the use_default_badwordsids json field.
--usemirostat - Mirostat values should only be set via the generate API, in the mirostat mirostat_tau and mirostat_eta json fields.
- Removed the original deprecated tkinter GUI, now only the new customtkinter GUI remains.
- Improved embedded horde worker, added even more session stats, job pulls and job submits are now done in parallel so it should run about 20% faster for horde requests.
- Changed the default model name from
concedo/koboldcpp
tokoboldcpp/[model_filename]
. This does prevent old "Kobold AI-Client" users from connecting via the API, so if you're still using that, either switch to a newer client or connect via the Basic/OpenAI API instead of the Kobold API. - Added proper API documentation, which can be found by navigating to
/api
or the web one at https://lite.koboldai.net/koboldcpp_api - Allow .kcpps files to be drag & dropped, as well as working via OpenWith in windows.
- Added a new OpenAI Chat Completions compatible endpoint at
/v1/chat/completions
(credit: @teddybear082) --onready
processes are now started with subprocess.run instead of Popen (#462)- Both
/check
and/abort
can now function together with multiuser mode, provided the correctgenkey
is used by the client (automatically handled in Lite). - Allow 64k
--contextsize
(for GGUF only, still 16k otherwise). - Minor UI fixes and enhancements.
- Updated Lite, pulled fixes and improvements from upstream.
v1.46.1 hotfix: fixed an issue where blasthreads was used for values between 1 and 32 tokens.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.
koboldcpp-1.45.2
koboldcpp-1.45.2
- Improved embedded horde worker: more responsive, and added Session Stats (Total Kudos Earned, EarnRate, Timings)
- Added a new parameter to grammar sampler API
grammar_retain_state
which lets you persist the grammar state across multiple requests. - Allow launching by picking a .kcpps file in the file selector GUI combined with
--skiplauncher
. That settings file must already have a model selected. (Similar to--config
, but that one doesn't use GUI at all.) - Added a new flag toggle
--foreground
for windows users. This sends the console terminal to the foreground every time a new prompt is generated, to avoid some idling slowdown issues. - Increased max support context with
--contextsize
to 32k, but only for GGUF models. It's still limited to 16k for older model versions. GGUF now actually has no hard limit to max context since it switched to using allocators, but it's not be compatible with older models. Additionally, models not trained with extended context are unlikely to work when RoPE scaled beyond 32k. - Added a simple OpenAI compatible completions API, which you can access at
/v1/completions
. You're still recommended to use the Kobold API as it has many more settings. - Increased stop_sequence limit to 16.
- Improved SSE streaming by batching pending tokens between events.
- Upgraded Lite polled-streaming to work even in multiuser mode. This works by sending a unique key for each request.
- Improved Makefile to reduce unnecessary builds, added flag for skipping K-quants.
- Enhanced Remote-Link.cmd to also work on Linux, simply run it to create a Cloudflare tunnel to access koboldcpp anywhere.
- Improved the default colab notebook to use mmq.
- Updated Lite and pulled other fixes and improvements from upstream llama.cpp.
Important: Deprecation Notice for KoboldCpp 1.45.1
The following command line arguments are considered deprecated and will be removed soon, in a future version.
--psutil_set_threads - parameter will be removed as it's now generally unhelpful, the defaults are usually sufficient.
--stream - a Kobold Lite only parameter, which is now a toggle saved inside Lite's settings and thus no longer necessary.
--unbantokens - EOS unbans should only be set via the generate API, in the use_default_badwordsids json field.
--usemirostat - Mirostat values should only be set via the generate API, in the mirostat mirostat_tau and mirostat_eta json fields.
Hotfix for 1.45.2 - Fixed a bug with reading thread counts in 1.45 and 1.45.1, also moved the OpenAI endpoint from /api/extra/oai/v1/completions
to just /v1/completions
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.
koboldcpp-1.44.2
koboldcpp-1.44.2
A.K.A The "Mom: we have SillyTavern at home edition"
- Added multi-user mode with
--multiuser
which allows up to 5 concurrent incoming/generate
requests from multiple clients to be queued up and processed in sequence, instead of rejecting other requests while busy. Note that the/check
and/abort
endpoints are inactive while multiple requests are in-queue, this is to prevent one user from accidentally reading or cancelling a different user's request. - Added a new launcher argument
--onready
which allows you to pass a terminal command (e.g. start a python script) to be executed after Koboldcpp has finished loading. This runs as a subprocess, and can be useful for starting cloudflare tunnels, displaying URLs etc. - Added Grammar Sampling for all architectures, which can be accessed via the web API (also in Lite). Older models are also supported.
- Added a new API endpoint
/api/extra/true_max_context_length
which allows fetching the true max context limit, separate from the horde-friendly value. - Added support for selecting from a 4th GPU from the UI and command line (was max 3 before).
- Tweaked automatic RoPE scaling
- Pulled other fixes and improvements from upstream.
- Note: Using
--usecublas
with the prebuilt Windows executables here are only intended for Nvidia devices. For AMD users, please check out @YellowRoseCx koboldcpp-rocm fork instead.
Major Update for Kobold Lite:
- Kobold Lite has undergone a massive overhaul, renamed and rearranged elements for a cleaner UI.
- Added Aesthetic UI for chat mode, which is now automatically selected when importing Tavern cards. You can easily switch between the different UIs for chat and instruct modes from the settings panel.
- Added Mirostat UI configs to settings panel.
- Allowed Idle Responses in all modes, it is now a global setting. Also fixed an idle response detection bug.
- Smarter group chats, mentioning a specific name when inside a group chat will cause that user to respond, instead of being random.
- Added support for automagically increasing the max context size slider limit, if a larger context is detected.
- Added scenario for importing characters from Chub.Ai
- Added a settings checkbox to enable streaming whenever applicable without requiring messing with URLs. Streaming can be easily toggled from the settings UI now, similar to EOS unbanning, although the
--stream
flag is still kept for compatibility. - Added a few Instruct Tag Presets in a dropdown.
- Supports instruct placeholders, allowing easy switching between instruct formats without rewriting the text. Added a toggle option to use "Raw Instruct Tags" (the old method) as an alternative to placeholder tags like
{{[INPUT]}}
and{{[OUTPUT]}}
- Added a toggle for "Newline After Memory" which can be set in the memory panel.
- Added a toggle for "Show Rename Save File" which shows a popup the user can use to rename the json save file before saving.
- You can specify a BNDF grammar string in settings to use when generating, this controls grammar sampling.
- Various minor bugfixes, also fixed stop_sequences still appearing in the AI outputs, they should be correctly truncated now.
v1.44.1 update - added queue number to perf endpoint, and updated lite to fix a few formatting bugs.
v1.44.2 update - fixed a speed regression from sched_yield again.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help
), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help
flag.