VLM Support via GPTQ Hooks and Data Pipelines #914

Open · wants to merge 344 commits into base: main
Conversation

@kylesayrs (Collaborator) commented Nov 13, 2024

Purpose

  • Enable oneshot quantization of vision-language models

[Figures: VLM banner; Llama-3.2-Vision model graph (Graphviz)]

Related Issues

Prerequisites

Changes

VLM Support

  • Add multimodal examples in examples/multimodal_vision
  • Modify custom_offload_device_map to support models which are not XForCausalLM
  • Add custom data collators for VLM models in src/llmcompressor/transformers/utils/data_collator.py
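As a rough illustration of the collator pattern (a sketch, not the PR's exact code; the function name is invented), calibration is assumed to run one sample at a time, so the collator simply tensorizes the preprocessed features for that sample:

```python
# Hypothetical sketch of a VLM data collator: calibration is assumed to use
# batch size one, so the collator tensorizes the preprocessed features
# (input_ids, attention_mask, pixel_values, ...) of a single sample.
import torch

def vlm_data_collator(batch):
    assert len(batch) == 1, "assumes calibration with batch size one"
    return {key: torch.tensor(value) for key, value in batch[0].items()}
```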

GPTQModifier

  • Implement hooks-based compression in GPTQModifier
    • This replaces layer-compressor, which made many assumptions about model architecture
    • This also enables finer-grained sequential compression such as true_sequential
    • Functions previously implemented in gptq_wrapper.py are now implemented in gptq_quantize.py
  • Implement offload_hessians parameter in GPTQModifier
  • Implement data-pipelines-based calibration in GPTQModifier
    • First an attempt will be made to trace the model and run the sequential pipeline
    • If that fails, assumptions will be made about the model architecture and an attempt will be made to run the layer_sequential pipeline
      • This ensures backwards compatibility with any previously supported models
    • If that fails, then the basic pipeline will be used, which is guaranteed to run but may require using offload_hessians (see the fallback sketch after this list)
  • Change hessian instability from a ValueError to a _LinAlgError so it can be ignored by the gptq pipeline fallback mechanism
  • Add support for conv2d as indicated by AutoGPTQ
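A minimal sketch of that fallback order, with the pipelines passed in as callables (run_sequential and the error types appear in the PR diff; the wrapper itself and the other names are illustrative assumptions):

```python
# Illustrative sketch of the GPTQ calibration fallback chain described above.
from typing import Callable
import torch

def calibrate_with_fallback(
    run_sequential: Callable[[], None],
    run_layer_sequential: Callable[[], None],
    run_basic: Callable[[], None],
):
    # errors that no fallback can fix are re-raised immediately
    unfixable_errors = (torch.OutOfMemoryError, torch._C._LinAlgError)
    try:
        run_sequential()  # requires a traceable model definition
    except unfixable_errors:
        raise
    except Exception:
        try:
            run_layer_sequential()  # assumes a standard decoder-layer layout
        except unfixable_errors:
            raise
        except Exception:
            run_basic()  # always runs, but may need offload_hessians=True
```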

Data Pipelines

  • Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
  • Basic Pipeline
    • Performs standard forward passes through the model with provided dataloader
    • Used as fallback, as well as in the future for basic calibration passes
  • Layer Sequential Pipeline
    • Refactor of LayerCompressor as a straightforward data pipeline
    • Uses IntermediatesCache to handle activation offloading
  • Sequential Pipeline
    • Uses torch.fx graph tracing to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
    • Implements BFS algorithm to assign nodes to partitions
      • An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (node.op == get_attr)
    • Each partition (Subgraph) is compiled as an executable python function with the proper inputs and outputs
    • Uses IntermediatesCache to handle activation offloading
  • Implement IntermediatesCache which automagically handles the offloading and onloading of activations from batches
    • This class is capable of offloading many non-standard activation types such as Tuples and dataclasses such as BaseModelOutputWithPast
    • For convenience, the class also handles masking of padding
    • The class is tested in tests/llmcompressor/pipelines/test_cache.py
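A simplified sketch of the IntermediatesCache idea (method names and fields here are assumptions; the real class additionally handles tuples, dataclasses, and padding masks):

```python
# Simplified sketch of an activation cache that offloads values to CPU between
# pipeline stages and onloads them back to the execution device when fetched.
from collections import defaultdict
import torch

class TinyIntermediatesCache:
    def __init__(self, offload_device="cpu", onload_device="cuda"):
        self.offload_device = offload_device
        self.onload_device = onload_device
        self.batches = defaultdict(dict)  # batch_index -> {name: offloaded value}

    def update(self, batch_index: int, values: dict):
        # store outputs of a subgraph, moving tensors off the execution device
        for name, value in values.items():
            if isinstance(value, torch.Tensor):
                value = value.to(self.offload_device)
            self.batches[batch_index][name] = value

    def fetch(self, batch_index: int, names: list) -> dict:
        # load the inputs required by the next subgraph back onto the device
        fetched = {}
        for name in names:
            value = self.batches[batch_index][name]
            if isinstance(value, torch.Tensor):
                value = value.to(self.onload_device)
            fetched[name] = value
        return fetched

    def delete(self, batch_index: int, consumed_names: list):
        # evict values that no later subgraph will use, freeing CPU memory
        for name in consumed_names:
            self.batches[batch_index].pop(name, None)
```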

Tracing

  • In order to support sequential quantization of the large variety of different multimodal model architectures, some model definitions have to be altered to support tracing
    • If the calibration dataset is text only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make the model traceable
    • For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimally?) increases memory usage, which leaves the door open for future support for quantizing modules in the vision tower
  • Add traceable model definitions for llava, mistral, mllama, and glm (see the illustrative pattern after this list)
  • All copyright licenses allow for alteration and redistribution; the line # vllm-project: no copyright was added in a similar style to text_generation.py
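The alterations typically follow a small set of patterns; one common pattern is to wrap a shape- or data-dependent helper so that torch.fx records it as a single call_function node rather than tracing through its control flow. A minimal illustration (the helper below is invented, not taken from the PR):

```python
# Illustrative pattern only: torch.fx.wrap marks a module-level function as a
# leaf, so symbolic tracing does not descend into its (untraceable) body.
import torch
import torch.fx

@torch.fx.wrap
def _make_causal_mask(seq_len: int, dtype: torch.dtype) -> torch.Tensor:
    # data-dependent logic like this would normally break symbolic tracing
    mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min)
    return torch.triu(mask, diagonal=1)
```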

Future Work / Follow-ups

Winogrande Evaluations

| Model | Dataset | Scheme | Runtime | Winogrande |
| --- | --- | --- | --- | --- |
| Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545 |
| Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216 |
| Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200 |
| openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701 |
| Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188 |
| Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093 |
| Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837 |
| Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924 |
| llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214 |
| Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151 |

```bash
lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32
lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1
```

MMMU Evaluations

Credit to @shubhra

| Model | Dataset | Scheme | MMMU |
| --- | --- | --- | --- |
| Llama-3.2-11B-Vision | N/A | Dense | 0.4144 |
| Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300 |
| Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377 |
| Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211 |

| Model | Dataset | Scheme | MMMU |
| --- | --- | --- | --- |
| Llama-3.2-90B-Vision | N/A | Dense | 0.5388 |
| Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278 |
| Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111 |
| Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477 |

| Model | Dataset | Scheme | MMMU |
| --- | --- | --- | --- |
| Pixtral-12B-2409 | N/A | Dense | 0.5022 |
| Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322 |
| Pixtral-12B-2409 | flickr | W4A16 | 0.4500 |
| Pixtral-12B-2409 | flickr | W4A16-group | 0.4689 |

Testing

@kylesayrs requested a review from dsikka on January 5, 2025
```python
input_names = state.data.calib.dataset.column_names
unfixable_errors = (torch.OutOfMemoryError, torch._C._LinAlgError)
try:
    run_sequential(
```
Collaborator:
Could we do "Layer Sequential" and "Subgraph Sequential"? Sequential being indicative of the data/error propagation, while using "layer" and "subgraph" to differentiate between data structures?

rahul-tuli previously approved these changes Jan 6, 2025

@rahul-tuli (Collaborator) left a comment:
Really like the IntermediatesCache implementation, good job!

@kylesayrs requested review from rahul-tuli and dsikka on January 7, 2025
Collaborator (follow-up on the same snippet):
Hm, let me think of other descriptors. I think we just want each of the pipelines beyond the basic pipeline to be a little more verbose in its name.

```python
) -> HFTracer:
    """
    Get a tracer specialized for the given model. The resulting tracer will not trace
    inside of sequential targets, ignored targets, or offloaded modules.
```
Collaborator:
Trying to understand this comment. If the resulting tracer does not trace offloaded modules, how does this work for cases when we have parts of the model offloaded?

Member:
What does the tracer actually trace?

Collaborator (author):
Tracing within sequential targets and ignored targets is unnecessary, and tracing within offloaded modules may result in meta tensors being added to the model graph.

When a module is "not traced", this means that the internals of the module are not traced, but the module still appears in the graph as a call_module node.
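A minimal sketch of how a tracer can skip module internals while keeping the module as a call_module node, assuming the standard torch.fx is_leaf_module hook (the class name below is invented and the real implementation may differ):

```python
# Sketch: treat sequential targets, ignored targets, and offloaded modules as
# leaves; their internals are not traced, but each still appears in the graph
# as a single call_module node.
from torch import nn
from transformers.utils.fx import HFTracer

class LeafSkippingTracer(HFTracer):
    def __init__(self, skip_trace_modules: set):
        super().__init__()
        self.skip_trace_modules = skip_trace_modules  # set of module instances

    def is_leaf_module(self, module: nn.Module, module_qualified_name: str) -> bool:
        return module in self.skip_trace_modules or super().is_leaf_module(
            module, module_qualified_name
        )
```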

Collaborator (author):
For example, in this model graph, the internals of language_model_model_layers_39 are not traced (we don't see individual nodes for the attention mechanism, FFN, etc.), but all the operations after the module are traced with high granularity (we see individual pow, .to, and .dtype functions).

[Screenshot: traced model graph]

Collaborator (author):
As a side note: even if a module contains untraceable code internally, as long as its internals skip tracing via ignore, sequential_targets, or has_offloaded_params, the model graph as a whole will still be traceable, just with less granularity.

Collaborator (author):
@mgoin The tracer traces all of the objects and operations that are needed to perform a forward pass of the model.

```python
skip_trace_modules = sequential_targets | offloaded_modules | ignore
```

```python
class SequentialTracer(HFTracer):
    def create_arg(self, a: Any) -> Argument:
```
@dsikka (Collaborator), Jan 7, 2025:
can you explain why create_arg is needed?

Collaborator (author):
> special extension allows models which depend on config values to be traced

I override this function to insert my own definition for creating an argument which is of type PretrainedConfig. Many models use values from the config during execution, but torch.fx is only capable of "baking" a limited set of class types into the model graph.

This code says, whenever the graph would try to reference an instance of a PretrainedConfig (for example, to get an attribute like config.max_sequence_length), instead just create a PretrainedConfig on the fly and initialize it with all of the args from the original config.
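A sketch of what that extension could look like (the conditional body is an assumption based on the explanation above, not the PR's exact code):

```python
# Sketch: when a PretrainedConfig would be baked into the graph, emit a
# call_function node that reconstructs the config from its original kwargs.
from typing import Any
from torch.fx.node import Argument
from transformers import PretrainedConfig
from transformers.utils.fx import HFTracer

class SequentialTracer(HFTracer):
    def create_arg(self, a: Any) -> Argument:
        if isinstance(a, PretrainedConfig):
            # rebuild the config on the fly so traced code can still read
            # attributes such as config.max_sequence_length
            kwargs = {key: self.create_arg(value) for key, value in a.to_dict().items()}
            return self.create_node("call_function", a.__class__, (), kwargs)
        return super().create_arg(a)
```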

Collaborator (author):
There are a few things in this file that we should consider upstreaming to HF; this might be one of them.


```python
    :param subgraphs: list of subgraphs with empty `consumed_names` attributes
    """
    # populate consumed_names according to when inputs are last used
```
Collaborator:
what does this mean by "last used"?

@kylesayrs (author), Jan 7, 2025:
Some input names are used by multiple subgraphs in the model (for example, the cross-attention output is used by every text decoder layer in mllama), while other input names are only used once (for example, the output of a text decoder layer is only used as the input to the next text decoder layer).

All subgraph outputs are stored in the IntermediatesCache. However, we only want to keep outputs which will be used as inputs to later subgraphs and vacate outputs which are never used again (this is really only to reduce CPU memory usage). Therefore, for each name, we need to find the index of the subgraph which is the last user of that name. After that subgraph runs, we can vacate that output from the cache.

Side note: outputs which do not lead to inputs are automatically pruned by the instantiation of GraphModule, and this is validated by check_assumption.
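A sketch of that bookkeeping (the helper name and the Subgraph fields used here are assumptions):

```python
# Sketch: for every input name, find the last subgraph that consumes it and
# record the name in that subgraph's consumed_names, so the cache can vacate
# the value once that subgraph has run.
from typing import List

def populate_consumed_names(subgraphs: List) -> None:
    all_names = set().union(*(subgraph.input_names for subgraph in subgraphs))
    for name in all_names:
        last_user = max(
            index
            for index, subgraph in enumerate(subgraphs)
            if name in subgraph.input_names
        )
        subgraphs[last_user].consumed_names.add(name)
```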

@mgoin (Member) left a comment:
I think these Traceable model definitions have very opaque changes compared to the reference model definitions. This architecture seems like an intensive blocker to adding support for a new model, as it requires a lot of knowledge of tracing limitations. However, I understand the need - I'll look in more detail tomorrow.

Comment on lines +85 to +86:

```python
# bug in trace throws an error for variadic
# args and kwargs in function signature
```
Member:
Is this just explaining that you couldn't pass *args, **kwargs here?

Collaborator (author):
This is explaining why I need to write my own populate_concrete_args function rather than rely on the one provided by transformers.
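A rough sketch of what such a helper might do (the signature and behavior are assumptions based on the comment above): build concrete_args only for named parameters that the sample does not provide, and skip variadic parameters entirely.

```python
# Sketch: derive concrete_args for symbolic tracing from the model's forward
# signature, skipping *args/**kwargs, which the stock helper trips over.
import inspect
from typing import Any, Dict

def populate_concrete_args(model, sample_input: Dict[str, Any]) -> Dict[str, Any]:
    signature = inspect.signature(model.forward)
    concrete_args = {}
    for name, param in signature.parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue  # variadic parameters are excluded from concrete_args
        if name not in sample_input:
            concrete_args[name] = param.default  # baked into the traced graph
    return concrete_args
```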


Comment on lines 198 to 208:

```python
# TRACING: Must use MistralModel
class MistralForCausalLM(MistralForCausalLM):
    def __init__(self, config):
        super(MistralPreTrainedModel, self).__init__(config)
        # TRACING: Must use MistralModel
        self.model = MistralModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()
```
Member:
This looks to be the same as the modeling definition https://github.com/huggingface/transformers/blob/12ba96aa3cb3e4ed2a3ffb77b59f53f8ce9ac1fa/src/transformers/models/mistral/modeling_mistral.py#L752-L763
What is the purpose of this comment and code?

@kylesayrs (author), Jan 7, 2025:
That comment should say:

```python
# TRACING: Must use MistralModel with wrapped _prepare_4d_causal_attention_mask_with_cache_position function
```

Note that we define a version of the MistralModel which wraps the problematic function, and it is this definition that is used by MistralForCausalLM.

https://github.com/vllm-project/llm-compressor/pull/914/files/153a4fa3c4d4831e01219b8d7a901366555bd960#diff-75d8b6c11cbd2f8c8efbb720ba0566c4c895c4bbadeee774382f7fe7a4f6a3baR105-R106

@kylesayrs (author):
@mgoin I think the Tracing Guide will clarify how and why to make changes to your model to make it traceable and why tracing is the best and least invasive solution currently available.

Also note that

  1. Unlike vllm, custom model definitions are not needed for every model. For the vast majority of text models, custom definitions are not required. Most vision models, when calibrated with text datasets, also do not require custom tracing. Custom definitions are mostly required for vision models calibrated with vision datasets, and even then some models like phi3_vision do not require any changes.
  2. Even if a text model is not traceable, gptq falls back to the layer_sequential pipeline, which is equivalent to what is currently on main. Therefore these changes only extend what is possible with llm-compressor now.

@kylesayrs requested review from mgoin and dsikka on January 7, 2025