Integration for vllm=0.6.3.post1 #39

Open
wants to merge 1 commit into base: dev

Conversation

Oasis-Git

Here is the integration version of lmcache_vllm for vLLM 0.6.3.post1.

It fixes the following problems:

  • the judgement logic in lmcache_should_retrieve and lmcache_should_store
  • the compatibility issue with execute_model
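
For clarity, here is a rough sketch of the kind of judgement this commit touches. The function and parameter names below are illustrative only, not the actual adapter code:

```python
# Illustrative sketch only -- not the actual lmcache_vllm adapter code.
# The adapter must skip the cache during vLLM's profiling run, retrieve
# only on prefill, and store on decode only when decode caching is enabled.
def should_retrieve_sketch(is_prefill: bool, is_profile_run: bool) -> bool:
    if is_profile_run:
        return False   # never touch the cache while vLLM profiles memory
    return is_prefill  # corresponds to RetrieveStatus.PREFILL in the logs below


def should_store_sketch(is_prefill: bool, is_profile_run: bool,
                        save_decode_cache: bool) -> bool:
    if is_profile_run:
        return False
    # StoreStatus.PREFILL / SUFFIX_PREFILL during prefill,
    # StoreStatus.DECODE only when save_decode_cache is enabled.
    return is_prefill or save_decode_cache
```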

@Oasis-Git
Author

Here I show the full log output for offline inference:

$ python offline_inference.py 
INFO LMCache: Initializing lmcache_vllm version 0.6.2.2, supporting vllm versions: ['0.6.3.post1'] [2024-11-19 19:58:11,190] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/__init__.py:35
WARNING LMCache: No LMCache configuration file is set. Returning default config [2024-11-19 19:58:11,268] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:113
WARNING LMCache: Please set the configuration file through the environment variable: LMCACHE_CONFIG_FILE [2024-11-19 19:58:11,269] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:114
INFO 11-19 19:58:16 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1300, download_dir='/local/yuweia/HF/vllm/', load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 11-19 19:58:18 model_runner.py:1056] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 11-19 19:58:18 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:04,  1.48s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03,  1.59s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.26s/it]

INFO 11-19 19:58:24 model_runner.py:1067] Loading model weights took 14.9888 GB
INFO LMCache: Initializing local-only (cpu) backend [2024-11-19 19:58:24,287] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/__init__.py:37
INFO LMCache: Initializing cpu mem, is_pinned: True [2024-11-19 19:58:24,288] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/mem_pool/local_pool.py:64
DEBUG LMCache: Current storage backend type <class 'lmcache.storage_backend.local_backend.LMCLocalBackend'> [2024-11-19 19:58:29,359] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:34
INFO 11-19 19:58:30 gpu_executor.py:122] # GPU blocks: 9739, # CPU blocks: 2048
INFO 11-19 19:58:30 gpu_executor.py:126] Maximum concurrency for 1300 tokens per request: 119.86x
INFO 11-19 19:58:33 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-19 19:58:33 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-19 19:58:49 model_runner.py:1523] Graph capturing finished in 16 secs.
DEBUG LMCache: original done [2024-11-19 19:58:50,142] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:440
Processed prompts:   0%|                                                                           | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO LMCache: KV cache retrieving mode: RetrieveStatus.PREFILL [2024-11-19 19:58:50,164] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:49
INFO LMCache: Using default batched implementation of the get() method [2024-11-19 19:58:50,165] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:120
INFO LMCache: Retrieved 0 chunks [2024-11-19 19:58:50,165] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:381
DEBUG LMCache: Returning the original input! [2024-11-19 19:58:50,166] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:647
INFO LMCache: KV cache saving mode: [<StoreStatus.PREFILL: 1>] [2024-11-19 19:58:50,320] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 19:58:50,545] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 5 chunks, total time 0.19s, make chunks time 0.19s [2024-11-19 19:58:50,545] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 0 tokens and then stores 1086 tokens [2024-11-19 19:58:50,546] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.44s/it, est. speed input: 754.89 toks/s, output: 22.94 toks/s]


First request Time: 1.4499170444905758 seconds


Processed prompts:   0%|                                                                           | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO LMCache: KV cache retrieving mode: RetrieveStatus.PREFILL [2024-11-19 19:58:51,607] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:49
INFO LMCache: Using default batched implementation of the get() method [2024-11-19 19:58:51,608] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:120
INFO LMCache: Concatenated 5 chunks -- elapsed time 0.0003560855984687805 [2024-11-19 19:58:51,623] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:397
INFO LMCache: Retrieved 5 chunks (1086 tokens in total) --elapsed time 0.014969948679208755 [2024-11-19 19:58:51,623] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:407
DEBUG LMCache: Injected token number: 1085 [2024-11-19 19:58:51,623] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:599
DEBUG LMCache: Rebuilt the input! [2024-11-19 19:58:51,627] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:644
INFO LMCache: KV cache saving mode: [<StoreStatus.SUFFIX_PREFILL: 4>] [2024-11-19 19:58:51,656] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
DEBUG LMCache: Store skips 1086 tokens and then stores 0 tokens [2024-11-19 19:58:51,656] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.12it/s, est. speed input: 1218.62 toks/s, output: 31.42 toks/s]


Second request Time: 0.8962920755147934 seconds
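
The driver is essentially two identical generate() calls timed back to back, so the second prefill can reuse the KV chunks stored by the first. A minimal sketch of that pattern (not the exact offline_inference.py from the examples; the prompt and sampling parameters are placeholders, and it assumes the lmcache_vllm drop-in import):

```python
import time

# Drop-in replacement import from lmcache_vllm, as used in the LMCache examples.
from lmcache_vllm.vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=1300)
sampling_params = SamplingParams(temperature=0, max_tokens=32)

# Placeholder: the real script uses a ~1000-token shared context prompt.
long_prompt = "<long shared context>"

for label in ("First", "Second"):
    start = time.perf_counter()
    llm.generate([long_prompt], sampling_params)
    print(f"{label} request Time: {time.perf_counter() - start} seconds")
```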

@Oasis-Git
Author

Here I show the full log output for the LMCache/examples/save_decode_cache experiment:

$ LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=1 python offline_inference.py
INFO LMCache: Initializing lmcache_vllm version 0.6.2.2, supporting vllm versions: ['0.6.3.post1'] [2024-11-19 20:08:05,392] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/__init__.py:35
INFO LMCache: Loading LMCache config file example.yaml [2024-11-19 20:08:05,470] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:123
INFO 11-19 20:08:10 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/local/yuweia/HF/vllm/', load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 11-19 20:08:12 model_runner.py:1056] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 11-19 20:08:12 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.18s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.28s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.01s/it]

INFO 11-19 20:08:17 model_runner.py:1067] Loading model weights took 14.9888 GB
INFO LMCache: Initializing local-only (cpu) backend [2024-11-19 20:08:17,480] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/__init__.py:37
INFO LMCache: Initializing cpu mem, is_pinned: True [2024-11-19 20:08:17,480] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/mem_pool/local_pool.py:64
DEBUG LMCache: Current storage backend type <class 'lmcache.storage_backend.local_backend.LMCLocalBackend'> [2024-11-19 20:08:21,912] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:34
INFO 11-19 20:08:27 gpu_executor.py:122] # GPU blocks: 8386, # CPU blocks: 2048
INFO 11-19 20:08:27 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 4.09x
INFO 11-19 20:08:30 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-19 20:08:30 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-19 20:08:43 model_runner.py:1523] Graph capturing finished in 13 secs.
DEBUG LMCache: original done [2024-11-19 20:08:43,483] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:440
Processed prompts:   0%|                                                                           | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO LMCache: KV cache retrieving mode: RetrieveStatus.PREFILL [2024-11-19 20:08:43,505] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:49
INFO LMCache: Using default batched implementation of the get() method [2024-11-19 20:08:43,506] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:120
INFO LMCache: Retrieved 0 chunks [2024-11-19 20:08:43,506] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:381
DEBUG LMCache: Returning the original input! [2024-11-19 20:08:43,506] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:647
INFO LMCache: KV cache saving mode: [<StoreStatus.PREFILL: 1>] [2024-11-19 20:08:43,739] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:08:43,749] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:08:43,749] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 0 tokens and then stores 74 tokens [2024-11-19 20:08:43,749] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
INFO LMCache: KV cache saving mode: [<StoreStatus.DECODE: 3>] [2024-11-19 20:08:49,220] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:08:49,265] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:08:49,265] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 0 tokens and then stores 256 tokens [2024-11-19 20:08:49,266] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
INFO LMCache: KV cache saving mode: [<StoreStatus.DECODE: 3>] [2024-11-19 20:08:56,876] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:08:56,911] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:08:56,911] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 256 tokens and then stores 256 tokens [2024-11-19 20:08:56,911] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
INFO LMCache: KV cache saving mode: [<StoreStatus.DECODE: 3>] [2024-11-19 20:09:04,530] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:09:04,564] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:09:04,565] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 512 tokens and then stores 256 tokens [2024-11-19 20:09:04,565] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
INFO LMCache: KV cache saving mode: [<StoreStatus.DECODE: 3>] [2024-11-19 20:09:12,197] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:09:12,231] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:09:12,231] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 768 tokens and then stores 256 tokens [2024-11-19 20:09:12,231] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:30<00:00, 30.95s/it, est. speed input: 2.39 toks/s, output: 33.09 toks/s]


First request Time: 30.95908421650529 seconds


Processed prompts:   0%|                                                                           | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO LMCache: KV cache retrieving mode: RetrieveStatus.PREFILL [2024-11-19 20:09:14,464] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:49
INFO LMCache: Using default batched implementation of the get() method [2024-11-19 20:09:14,465] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:120
INFO LMCache: Concatenated 4 chunks -- elapsed time 0.001274898648262024 [2024-11-19 20:09:14,574] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:397
INFO LMCache: Retrieved 4 chunks (1024 tokens in total) --elapsed time 0.10927902162075043 [2024-11-19 20:09:14,574] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:407
DEBUG LMCache: Injected token number: 1024 [2024-11-19 20:09:14,574] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:599
DEBUG LMCache: Rebuilt the input! [2024-11-19 20:09:14,579] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:644
INFO LMCache: KV cache saving mode: [<StoreStatus.SUFFIX_PREFILL: 4>] [2024-11-19 20:09:14,615] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_injection.py:129
INFO LMCache: Using default batched implementation of the put() method [2024-11-19 20:09:14,620] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/abstract_backend.py:99
INFO LMCache: Stored/updated 1 chunks, total time 0.00s, make chunks time 0.00s [2024-11-19 20:09:14,620] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/cache_engine.py:311
DEBUG LMCache: Store skips 1024 tokens and then stores 92 tokens [2024-11-19 20:09:14,620] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:473
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.08it/s, est. speed input: 2321.14 toks/s, output: 20.80 toks/s]


Second request Time: 0.4893089309334755 seconds


DEBUG LMCache: Closing LMCache Engine [2024-11-19 20:09:14,947] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache_vllm-0.6.2.2-py3.10.egg/lmcache_vllm/vllm_adapter.py:236
INFO LMCache: Closed the put worker in local backend [2024-11-19 20:09:14,947] -- /local/yuweia/anaconda3/envs/dev/lib/python3.10/site-packages/lmcache-0.1.3-py3.10.egg/lmcache/storage_backend/local_backend.py:240
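
For reference, the 256-token decode chunks and the local-only CPU backend visible in this log correspond to a config roughly like the following. This is my best guess at the relevant fields, inferred from the log rather than copied from the example:

```yaml
chunk_size: 256          # matches the 256-token decode chunks stored above
local_device: "cpu"      # local-only CPU backend, as in the init log
save_decode_cache: true  # also store KV chunks produced during decode
```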

@YaoJiayi
Contributor

@Oasis-Git Thanks! Quick question: is it backward-compatible?

@Oasis-Git
Author

@YaoJiayi Hi Jiayi, I do not think it is backward-compatible. vLLM 0.6.3.post1 introduces new functions such as `def set_forward_context(context: Any):`, so with previous vLLM versions the compilation will fail.
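
A sketch of how that requirement could be made explicit (not part of this PR; just a hard version check so older vLLM fails with a clear message instead of at import time):

```python
import vllm

SUPPORTED_VLLM_VERSIONS = ["0.6.3.post1"]

if vllm.__version__ not in SUPPORTED_VLLM_VERSIONS:
    raise RuntimeError(
        "lmcache_vllm 0.6.2.2 relies on APIs introduced in vLLM 0.6.3.post1 "
        f"(e.g. set_forward_context); found vLLM {vllm.__version__}"
    )
```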
