Integration for vllm=0.6.3.post1 #39

Open
wants to merge 1 commit into base: dev

Conversation

Oasis-Git

Here is the integration version of lmcache_vllm for vLLM 0.6.3.post1.

It fixes the following problems:

  • the judgement logic in lmcache_should_retrieve and lmcache_should_store
  • the compatibility issue with execute_model
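
For clarity, here is a rough sketch of the kind of judgement this commit touches. The function and parameter names below are illustrative only, not the actual adapter code:

```python
# Illustrative sketch only -- not the actual lmcache_vllm adapter code.
# The adapter must skip the cache during vLLM's profiling run, retrieve
# only on prefill, and store on decode only when decode caching is enabled.
def should_retrieve_sketch(is_prefill: bool, is_profile_run: bool) -> bool:
    if is_profile_run:
        return False   # never touch the cache while vLLM profiles memory
    return is_prefill  # corresponds to RetrieveStatus.PREFILL in the logs below


def should_store_sketch(is_prefill: bool, is_profile_run: bool,
                        save_decode_cache: bool) -> bool:
    if is_profile_run:
        return False
    # StoreStatus.PREFILL / SUFFIX_PREFILL during prefill,
    # StoreStatus.DECODE only when save_decode_cache is enabled.
    return is_prefill or save_decode_cache
```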

@Oasis-Git
Author

Here I show the full log output for offline inference:

$ python offline_inference.py 
INFO LMCache: Initializing lmcache_vllm version 0.6.2.2, supporting vllm versions: ['0.6.3.post1'] [2024-11-19 19:58:11,190] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/__init__.py:35
WARNING LMCache: No LMCache configuration file is set. Returning default config [2024-11-19 19:58:11,268] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:113
WARNING LMCache: Please set the configuration file through the environment variable: LMCACHE_CONFIG_FILE [2024-11-19 19:58:11,269] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:114
INFO 11-19 19:58:16 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1300, download_dir='/local/yuweia/HF/vllm/', load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 11-19 19:58:18 model_runner.py:1056] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 11-19 19:58:18 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:04,  1.48s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03,  1.59s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.26s/it]

INFO 11-19 19:58:24 model_runner.py:1067] Loading model weights took 14.9888 GB
INFO LMCache: Initializing local-only (cpu) backend [2024-11-19 19:58:24,287] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/__init__.py:37
INFO LMCache: Initializing cpu mem, is_pinned: True [2024-11-19 19:58:24,288] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/mem_pool/local_pool.py:64
DEBUG LMCache: Current storage backend type <class 'lmcache.storage_backend.local_backend.LMCLocalBackend'> [2024-11-19 19:58:29,359] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:34
INFO 11-19 19:58:30 gpu_executor.py:122] # GPU blocks: 9739, # CPU blocks: 2048
INFO 11-19 19:58:30 gpu_executor.py:126] Maximum concurrency for 1300 tokens per request: 119.86x
INFO 11-19 19:58:33 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-19 19:58:33 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-19 19:58:49 model_runner.py:1523] Graph capturing finished in 16 secs.
DEBUG LMCache: original done [2024-11-19 19:58:50,142] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:440
Processed prompts:   0%|                                                                           | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO LMCache: KV cache retrieving mode: RetrieveStatus.PREFILL [2024-11-19 19:58:50,164] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:49
INFO LMCache: Using default batched implementation of the get() method [2024-11-19 19:58:50,165] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:120
INFO LMCache: Retrieved 0 chunks [2024-11-19 19:58:50,165] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:381
DEBUG LMCache: Returning the original input! [2024-11-19 19:58:50,166] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:647
INFO LMCache: KV cache saving mode: [<StoreStatus.PREFILL: 1>] [2024-11-19 19:58:50,320] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 19:58:50,545] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 5 chunks, total time 0.19s, make chunks time 0.19s [2024-11-19 19:58:50,545] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 0 tokens and then stores 1086 tokens [2024-11-19 19:58:50,546] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.44s/it, est. speed input: 754.89 toks/s, output: 22.94 toks/s]


First request Time: 1.4499170444905758 seconds


Processed prompts:   0%|                                                                           | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO LMCache: KV cache retrieving mode: RetrieveStatus.PREFILL [2024-11-19 19:58:51,607] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:49
INFO LMCache: Using default batched implementation of the get() method [2024-11-19 19:58:51,608] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:120
INFO LMCache: Concatenated 5 chunks -- elapsed time 0.0003560855984687805 [2024-11-19 19:58:51,623] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:397
INFO LMCache: Retrieved 5 chunks (1086 tokens in total) --elapsed time 0.014969948679208755 [2024-11-19 19:58:51,623] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:407
DEBUG LMCache: Injected token number: 1085 [2024-11-19 19:58:51,623] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:599
DEBUG LMCache: Rebuilt the input! [2024-11-19 19:58:51,627] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:644
INFO LMCache: KV cache saving mode: [<StoreStatus.SUFFIX_PREFILL: 4>] [2024-11-19 19:58:51,656] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
DEBUG LMCache: Store skips 1086 tokens and then stores 0 tokens [2024-11-19 19:58:51,656] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.12it/s, est. speed input: 1218.62 toks/s, output: 31.42 toks/s]


Second request Time: 0.8962920755147934 seconds
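
The driver is essentially two identical generate() calls timed back to back, so the second prefill can reuse the KV chunks stored by the first. A minimal sketch of that pattern (not the exact offline_inference.py from the examples; the prompt and sampling parameters are placeholders, and it assumes the lmcache_vllm drop-in import):

```python
import time

# Drop-in replacement import from lmcache_vllm, as used in the LMCache examples.
from lmcache_vllm.vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=1300)
sampling_params = SamplingParams(temperature=0, max_tokens=32)

# Placeholder: the real script uses a ~1000-token shared context prompt.
long_prompt = "<long shared context>"

for label in ("First", "Second"):
    start = time.perf_counter()
    llm.generate([long_prompt], sampling_params)
    print(f"{label} request Time: {time.perf_counter() - start} seconds")
```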

@Oasis-Git
Author

Here I show the full log output for the LMCache/examples/save_decode_cache experiment:

$ LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=1 python offline_inference.py
INFO LMCache: Initializing lmcache_vllm version 0.6.2.2, supporting vllm versions: ['0.6.3.post1'] [2024-11-19 20:08:05,392] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/__init__.py:35
INFO LMCache: Loading LMCache config file example.yaml [2024-11-19 20:08:05,470] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:123
INFO 11-19 20:08:10 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/local/yuweia/HF/vllm/', load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 11-19 20:08:12 model_runner.py:1056] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 11-19 20:08:12 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.18s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.28s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.01s/it]

INFO 11-19 20:08:17 model_runner.py:1067] Loading model weights took 14.9888 GB
INFO LMCache: Initializing local-only (cpu) backend [2024-11-19 20:08:17,480] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/__init__.py:37
INFO LMCache: Initializing cpu mem, is_pinned: True [2024-11-19 20:08:17,480] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/mem_pool/local_pool.py:64
DEBUG LMCache: Current storage backend type <class 'lmcache.storage_backend.local_backend.LMCLocalBackend'> [2024-11-19 20:08:21,912] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:34
INFO 11-19 20:08:27 gpu_executor.py:122] # GPU blocks: 8386, # CPU blocks: 2048
INFO 11-19 20:08:27 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 4.09x
INFO 11-19 20:08:30 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-19 20:08:30 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-19 20:08:43 model_runner.py:1523] Graph capturing finished in 13 secs.
DEBUG LMCache: original done [2024-11-19 20:08:43,483] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:440
Processed prompts:   0%|                                                                           | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO LMCache: KV cache retrieving mode: RetrieveStatus.PREFILL [2024-11-19 20:08:43,505] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:49
INFO LMCache: Using default batched implementation of the get() method [2024-11-19 20:08:43,506] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:120
INFO LMCache: Retrieved 0 chunks [2024-11-19 20:08:43,506] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:381
DEBUG LMCache: Returning the original input! [2024-11-19 20:08:43,506] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:647
INFO LMCache: KV cache saving mode: [<StoreStatus.PREFILL: 1>] [2024-11-19 20:08:43,739] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:08:43,749] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:08:43,749] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 0 tokens and then stores 74 tokens [2024-11-19 20:08:43,749] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
INFO LMCache: KV cache saving mode: [<StoreStatus.DECODE: 3>] [2024-11-19 20:08:49,220] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:08:49,265] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:08:49,265] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 0 tokens and then stores 256 tokens [2024-11-19 20:08:49,266] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
INFO LMCache: KV cache saving mode: [<StoreStatus.DECODE: 3>] [2024-11-19 20:08:56,876] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:08:56,911] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:08:56,911] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 256 tokens and then stores 256 tokens [2024-11-19 20:08:56,911] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
INFO LMCache: KV cache saving mode: [<StoreStatus.DECODE: 3>] [2024-11-19 20:09:04,530] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:09:04,564] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:09:04,565] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 512 tokens and then stores 256 tokens [2024-11-19 20:09:04,565] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
INFO LMCache: KV cache saving mode: [<StoreStatus.DECODE: 3>] [2024-11-19 20:09:12,197] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:09:12,231] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:09:12,231] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 768 tokens and then stores 256 tokens [2024-11-19 20:09:12,231] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:30<00:00, 30.95s/it, est. speed input: 2.39 toks/s, output: 33.09 toks/s]


First request Time: 30.95908421650529 seconds


Processed prompts:   0%|                                                                           | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO LMCache: KV cache retrieving mode: RetrieveStatus.PREFILL [2024-11-19 20:09:14,464] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:49
INFO LMCache: Using default batched implementation of the get() method [2024-11-19 20:09:14,465] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:120
INFO LMCache: Concatenated 4 chunks -- elapsed time 0.001274898648262024 [2024-11-19 20:09:14,574] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:397
INFO LMCache: Retrieved 4 chunks (1024 tokens in total) --elapsed time 0.10927902162075043 [2024-11-19 20:09:14,574] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:407
DEBUG LMCache: Injected token number: 1024 [2024-11-19 20:09:14,574] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:599
DEBUG LMCache: Rebuilt the input! [2024-11-19 20:09:14,579] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:644
INFO LMCache: KV cache saving mode: [<StoreStatus.SUFFIX_PREFILL: 4>] [2024-11-19 20:09:14,615] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:09:14,620] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:09:14,620] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 1024 tokens and then stores 92 tokens [2024-11-19 20:09:14,620] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.08it/s, est. speed input: 2321.14 toks/s, output: 20.80 toks/s]


Second request Time: 0.4893089309334755 seconds


DEBUG LMCache: Closing LMCache Engine [2024-11-19 20:09:14,947] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:236
INFO LMCache: Closed the put worker in local backend [2024-11-19 20:09:14,947] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/local_backend.py:240
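
For reference, the 256-token decode chunks and the local-only CPU backend visible in this log correspond to a config roughly like the following. This is my best guess at the relevant fields, inferred from the log rather than copied from the example:

```yaml
chunk_size: 256          # matches the 256-token decode chunks stored above
local_device: "cpu"      # local-only CPU backend, as in the init log
save_decode_cache: true  # also store KV chunks produced during decode
```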

@YaoJiayi
Contributor

@Oasis-Git Thanks! Quick question: is it backward-compatible?

@Oasis-Git
Author

@YaoJiayi Hi Jiayi, I do not think it is backward-compatible. vLLM 0.6.3.post1 introduces new functions such as `def set_forward_context(context: Any):`, so with previous vLLM versions the compilation will fail.
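
A sketch of how that requirement could be made explicit (not part of this PR; just a hard version check so older vLLM fails with a clear message instead of at import time):

```python
import vllm

SUPPORTED_VLLM_VERSIONS = ["0.6.3.post1"]

if vllm.__version__ not in SUPPORTED_VLLM_VERSIONS:
    raise RuntimeError(
        "lmcache_vllm 0.6.2.2 relies on APIs introduced in vLLM 0.6.3.post1 "
        f"(e.g. set_forward_context); found vLLM {vllm.__version__}"
    )
```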
