
llama server: explicitly set context to 32k #106

Merged 1 commit into fedora-copr:main from set-ctx on Jan 13, 2025

Conversation

@TomasTomecek (Collaborator) commented on Jan 13, 2025

Here are logs to prove the change works:

```diff
-llama_new_context_with_model: n_ctx      = 2048
+llama_new_context_with_model: n_ctx      = 32768
 llama_new_context_with_model: n_batch    = 512
 llama_new_context_with_model: n_ubatch   = 512
 llama_new_context_with_model: flash_attn = 0
 llama_new_context_with_model: freq_base  = 1000000.0
 llama_new_context_with_model: freq_scale = 1
-llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
-llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
-llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
-llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
-llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
+llama_kv_cache_init:      CUDA0 KV buffer size =  4096.00 MiB
+llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
+llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
+llama_new_context_with_model:      CUDA0 compute buffer size =  2144.00 MiB
+llama_new_context_with_model:  CUDA_Host compute buffer size =    72.01 MiB
 llama_new_context_with_model: graph nodes  = 1030
 llama_new_context_with_model: graph splits = 2
```
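
As a quick sanity check of the numbers above (assuming the f16 KV cache scales linearly with n_ctx), a 16x larger context should mean a 16x larger KV buffer:

```python
# Sanity check: the f16 KV cache should scale linearly with n_ctx.
old_ctx, old_kv_mib = 2048, 256.0          # values from the old log above
new_ctx = 32768

kv_per_token_mib = old_kv_mib / old_ctx    # 0.125 MiB (128 KiB) of KV cache per token
print(kv_per_token_mib * new_ctx)          # 4096.0 MiB, matching the new log
```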

The python3 process now occupies noticeably more GPU memory, consistent with the larger buffers above:

```
# nvidia-smi
Mon Jan 13 10:42:44 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   33C    P0             26W /   70W |   10283MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    421607      C   python3                                     10280MiB |
+-----------------------------------------------------------------------------------------+
```

Fixes #100

now the llama server logs say:
```
llama_new_context_with_model: n_ctx      = 32768
```

Signed-off-by: Tomas Tomecek <[email protected]>
```diff
@@ -6,7 +6,7 @@ services:
 context: .
 dockerfile: ./Containerfile.cuda
 hostname: "${LLAMA_CPP_HOST}"
-command: "python3 -m llama_cpp.server --model ${MODEL_FILEPATH} --host 0.0.0.0 --port ${LLAMA_CPP_SERVER_PORT} --n_gpu_layers ${LLM_NGPUS:-0}"
+command: "python3 -m llama_cpp.server --model ${MODEL_FILEPATH} --host 0.0.0.0 --port ${LLAMA_CPP_SERVER_PORT} --n_gpu_layers ${LLM_NGPUS:-0} --n_ctx 32768"
```
A Member left a comment on the changed line:
Hmm, while reviewing this (+1, np) I found that llama_cpp.server should have --parallel 2. See https://www.reddit.com/r/LocalLLaMA/comments/1be845y/multiple_concurrent_generations_with_llamacpp/
Is this what we need for parallel access?

@TomasTomecek (Collaborator, Author) replied:
wow, how could we miss this?!! very interesting, I'm gonna test this
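
If it helps with that test, here is a rough concurrency probe against the server's OpenAI-compatible completions endpoint (a sketch; the URL is a placeholder, and it only checks whether two requests overlap in wall-clock time, not how the server schedules them):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint; llama_cpp.server exposes an OpenAI-compatible API.
URL = "http://localhost:8000/v1/completions"

def one_request(i: int) -> float:
    """Send one completion request and return its duration in seconds."""
    start = time.monotonic()
    requests.post(
        URL,
        json={"prompt": f"Count to ten, attempt {i}.", "max_tokens": 64},
        timeout=300,
    )
    return time.monotonic() - start

wall_start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    durations = list(pool.map(one_request, range(2)))
total = time.monotonic() - wall_start

# If requests are serialized, total is roughly sum(durations);
# with real parallel slots it should be close to max(durations).
print(durations, total)
```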

@xsuchy (Member) left a comment:
+1

@TomasTomecek (Collaborator, Author) commented:
merging so I can deploy this

@TomasTomecek merged commit dac1e87 into fedora-copr:main on Jan 13, 2025 (9 checks passed).
@TomasTomecek deleted the set-ctx branch on January 13, 2025 at 13:49.
Successfully merging this pull request may close these issues.

ValueError: Requested tokens (2138) exceed context window of 2048