
llama server: explicitly set context to 32k #106

Merged 1 commit into fedora-copr:main from set-ctx on Jan 13, 2025

Conversation

@TomasTomecek (Collaborator) commented on Jan 13, 2025

Here are logs to prove the change works:

```diff
-llama_new_context_with_model: n_ctx      = 2048
+llama_new_context_with_model: n_ctx      = 32768
 llama_new_context_with_model: n_batch    = 512
 llama_new_context_with_model: n_ubatch   = 512
 llama_new_context_with_model: flash_attn = 0
 llama_new_context_with_model: freq_base  = 1000000.0
 llama_new_context_with_model: freq_scale = 1
-llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
-llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
-llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
-llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
-llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
+llama_kv_cache_init:      CUDA0 KV buffer size =  4096.00 MiB
+llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
+llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
+llama_new_context_with_model:      CUDA0 compute buffer size =  2144.00 MiB
+llama_new_context_with_model:  CUDA_Host compute buffer size =    72.01 MiB
 llama_new_context_with_model: graph nodes  = 1030
 llama_new_context_with_model: graph splits = 2
```
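
As a quick sanity check of the numbers above (assuming the f16 KV cache scales linearly with n_ctx), a 16x larger context should mean a 16x larger KV buffer:

```python
# Sanity check: the f16 KV cache should scale linearly with n_ctx.
old_ctx, old_kv_mib = 2048, 256.0          # values from the old log above
new_ctx = 32768

kv_per_token_mib = old_kv_mib / old_ctx    # 0.125 MiB (128 KiB) of KV cache per token
print(kv_per_token_mib * new_ctx)          # 4096.0 MiB, matching the new log
```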

The python3 process now occupies noticeably more GPU memory, consistent with the larger buffers above:

```
# nvidia-smi
Mon Jan 13 10:42:44 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   33C    P0             26W /   70W |   10283MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    421607      C   python3                                     10280MiB |
+-----------------------------------------------------------------------------------------+
```

Fixes #100

now the llama server logs say:
```
llama_new_context_with_model: n_ctx      = 32768
```

Signed-off-by: Tomas Tomecek <[email protected]>
```diff
@@ -6,7 +6,7 @@ services:
 context: .
 dockerfile: ./Containerfile.cuda
 hostname: "${LLAMA_CPP_HOST}"
-command: "python3 -m llama_cpp.server --model ${MODEL_FILEPATH} --host 0.0.0.0 --port ${LLAMA_CPP_SERVER_PORT} --n_gpu_layers ${LLM_NGPUS:-0}"
+command: "python3 -m llama_cpp.server --model ${MODEL_FILEPATH} --host 0.0.0.0 --port ${LLAMA_CPP_SERVER_PORT} --n_gpu_layers ${LLM_NGPUS:-0} --n_ctx 32768"
```
A Member left a comment on the changed line:
Hmm, while reviewing this (+1, np) I found that llama_cpp.server should have --parallel 2. See https://www.reddit.com/r/LocalLLaMA/comments/1be845y/multiple_concurrent_generations_with_llamacpp/
Is this what we need for parallel access?

@TomasTomecek (Collaborator, Author) replied:
wow, how could we miss this?!! very interesting, I'm gonna test this
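
If it helps with that test, here is a rough concurrency probe against the server's OpenAI-compatible completions endpoint (a sketch; the URL is a placeholder, and it only checks whether two requests overlap in wall-clock time, not how the server schedules them):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint; llama_cpp.server exposes an OpenAI-compatible API.
URL = "http://localhost:8000/v1/completions"

def one_request(i: int) -> float:
    """Send one completion request and return its duration in seconds."""
    start = time.monotonic()
    requests.post(
        URL,
        json={"prompt": f"Count to ten, attempt {i}.", "max_tokens": 64},
        timeout=300,
    )
    return time.monotonic() - start

wall_start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    durations = list(pool.map(one_request, range(2)))
total = time.monotonic() - wall_start

# If requests are serialized, total is roughly sum(durations);
# with real parallel slots it should be close to max(durations).
print(durations, total)
```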

@xsuchy (Member) left a comment:
+1

@TomasTomecek (Collaborator, Author) commented:
merging so I can deploy this

@TomasTomecek merged commit dac1e87 into fedora-copr:main on Jan 13, 2025 (9 checks passed).
@TomasTomecek deleted the set-ctx branch on January 13, 2025 at 13:49.
Successfully merging this pull request may close these issues.

ValueError: Requested tokens (2138) exceed context window of 2048