[WIP] Add LoRA multihead attention module #1324
For now, only works with _qkv_same_embed_dim=True.
This is no longer necessary when unloading the model because the base_layer is already the original layer. This is just a leftover from before we adopted the base_layer pattern.
There was a bug because the removal of the parameter resulted in it no longer appearing in the state_dict and named_parameters. This commit fixes this bug. The bug also exists in the referenced lora-torch library.
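The effect described in this commit message can be reproduced in isolation. The following is a minimal sketch that uses a plain `nn.Linear` as a stand-in (an assumption made for brevity; the PR itself deals with `nn.MultiheadAttention`). It shows that a deleted parameter disappears from `state_dict` and `named_parameters`, and that re-registering it brings it back.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)  # stand-in layer, not the layer used in the PR
original = layer.weight

# Deleting the attribute unregisters the parameter from the module.
del layer.weight
# Assigning a plain tensor now creates an ordinary attribute, not a parameter.
layer.weight = original.detach().clone()

keys_while_removed = list(layer.state_dict().keys())
names_while_removed = [name for name, _ in layer.named_parameters()]

# Re-register: drop the plain attribute first, then register a Parameter again.
plain = layer.weight
del layer.weight
layer.register_parameter(
    "weight", nn.Parameter(plain, requires_grad=original.requires_grad)
)

keys_after_restore = list(layer.state_dict().keys())
print(keys_while_removed, keys_after_restore)
```

The second `del` is needed because `register_parameter` refuses to register a name that already exists as a non-parameter attribute.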
Nice work! I left a few preliminary comments. I think we can go for the _restore_weights approach for now, as I don't see any other alternative.
src/peft/tuners/lora/layer.py
Outdated
lora_alpha: int = 1,
lora_dropout: float = 0.0,
fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
is_target_conv_1d_layer: bool = False,
I don't think this is used?
src/peft/tuners/lora/layer.py
Outdated

self._active_adapter = adapter_name
self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights, use_rslora)
self.is_target_conv_1d_layer = is_target_conv_1d_layer
We can also just hard-code it to False
        self._restore_weights()
        return super().state_dict(*args, **kwargs)

    def named_modules(self, *args, **kwargs):
Do we also need to overwrite the modules() method?
Not needed, as modules calls named_modules under the hood. I added a comment to that effect.
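The claim that `modules` goes through `named_modules` can be checked with a small sketch. `CountingModule` is a hypothetical class written only for this illustration; it counts how often its overridden `named_modules` is entered.

```python
import torch.nn as nn

class CountingModule(nn.Module):
    """Hypothetical module that counts calls to its named_modules override."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        self.named_modules_calls = 0

    def named_modules(self, *args, **kwargs):
        self.named_modules_calls += 1
        return super().named_modules(*args, **kwargs)

module = CountingModule()
found = list(module.modules())  # never calls named_modules explicitly
print(module.named_modules_calls, len(found))
```

Because the counter increases even though only `modules()` was called, overriding `named_modules` alone appears to be sufficient here.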
@@ -193,11 +193,6 @@ def _replace_module(self, parent, child_name, new_module, child):
        if hasattr(child, "base_layer"):
            child = child.base_layer

        if not hasattr(new_module, "base_layer"):
Why has this been removed?
Sorry, forgot to put this into the description of the PR.
These lines have been obsolete for some time now. They only apply when we unload the model (otherwise, the if does not match). Remember that when we made the base_layer switch, we ensured that when unloading, we simply return the base_layer. There is no more need to create a new layer (say, a new nn.Linear when using lora.Linear) and replace the new layer's weight by the parent layer's weight. The base_layer already has the original weight. Therefore, these lines are unnecessary.
I removed them now because they were annoying with MultiheadAttention: that layer has no weight attribute, so this line would fail.
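A rough sketch of the base_layer pattern described above, under the assumption of a much simplified wrapper: `WrappedLinear` and `unload` are hypothetical names for this illustration and are not PEFT's actual classes or functions. The point is that unloading only has to hand back the stored original layer, so no `weight` attribute is ever touched.

```python
import torch.nn as nn

class WrappedLinear(nn.Module):
    """Hypothetical adapter wrapper that keeps the original layer as base_layer."""

    def __init__(self, base_layer):
        super().__init__()
        self.base_layer = base_layer

    def forward(self, x):
        return self.base_layer(x)

def unload(parent, child_name):
    """Swap a wrapped child for its untouched base layer (illustrative only)."""
    child = getattr(parent, child_name)
    if hasattr(child, "base_layer"):
        setattr(parent, child_name, child.base_layer)

original = nn.Linear(3, 3)
model = nn.Sequential()
model.add_module("proj", WrappedLinear(original))

unload(model, "proj")
print(model.proj is original)
```

Since nothing is copied, the same approach would work for a layer without a `weight` attribute, which is the situation with `nn.MultiheadAttention`.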
- Some clarifying comments
- Remove fan_in_fan_out

Also:
- Raise proper error instead of assert
Thank you Benjamin for adding support for torch MHA layer in LoRA, interesting way to use merge, forward and unmerge logic!
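For readers unfamiliar with the merge, forward, unmerge idea mentioned in this comment, here is a simplified sketch. `MergedForwardMHA`, its `lora_A`/`lora_B` shapes, and `scaling` are assumptions for illustration and do not reproduce the PR's `lora.MultiheadAttention`; only `in_proj_weight` is handled, and `out_proj` is ignored.

```python
import torch
import torch.nn as nn

class MergedForwardMHA(nn.Module):
    """Hypothetical wrapper: merge the LoRA delta, run the base forward, restore."""

    def __init__(self, base_layer, r=4, scaling=1.0):
        super().__init__()
        self.base_layer = base_layer
        embed_dim = base_layer.embed_dim
        self.lora_A = nn.Parameter(torch.randn(r, embed_dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(3 * embed_dim, r))  # zero init: no-op at start
        self.scaling = scaling

    def get_delta_weight(self):
        return (self.lora_B @ self.lora_A) * self.scaling

    def forward(self, query, key, value):
        base = self.base_layer
        orig = base.in_proj_weight
        # Unregister the parameter, then place the merged (non-leaf) tensor in its slot.
        del base.in_proj_weight
        base.in_proj_weight = orig.detach() + self.get_delta_weight()
        try:
            return base(query, key, value)
        finally:
            # Put the original parameter back so state_dict() sees it again.
            del base.in_proj_weight
            base.register_parameter("in_proj_weight", orig)

torch.manual_seed(0)
base = nn.MultiheadAttention(embed_dim=8, num_heads=2)
wrapped = MergedForwardMHA(base)
x = torch.randn(5, 1, 8)  # (sequence, batch, embed_dim)
expected, _ = base(x, x, x)
out, _ = wrapped(x, x, x)
out.sum().backward()
```

Gradients reach `lora_A` and `lora_B` because the merged tensor is built from them before the base layer's forward runs, while the original weight stays detached.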
@younesbelkada Could I address all your concerns? I pinged the user who wanted to test it on their case. When it comes to docs, I didn't really find a place where we list all supported layers, so no update needed really.
Before, LoRA was applied only to the in_proj. Now it is also applied to the out_proj.

Unfortunately, there is no easy way to just apply a normal lora.Linear to the out_proj by targeting it with target_modules. If that worked, it would be much nicer to do that, so that users can decide for themselves if they want to apply LoRA to the out_proj or not.

The reason why it doesn't work is twofold:

1. We cannot really control the order in which LoRA is applied, so when the LoRA adapter is injected to out_proj, the whole MHA layer may already be wrapped by lora.MultiheadAttention.
2. Even if we successfully applied a normal lora.Linear to the out_proj, it would not work correctly. This is because the forward method of out_proj is not used at all by nn.MultiheadAttention. Instead, it just passes the weight and bias to F.multi_head_attention_forward. Therefore, we must ensure that the weights are merged and unmerged correctly, same as for in_proj, and we cannot do that if we use a normal lora.Linear.

Note that the test test_merge_layers for MHA fails. This is most likely because of an existing bug in how merging is implemented, see PR huggingface#1355. Once that is merged, the test should pass.
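The second point can be observed with a forward hook. This is a small sketch, assuming a default `nn.MultiheadAttention` in training mode; the hook and counter names are made up for the example.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)
calls = {"out_proj": 0}

def count_call(module, inputs, output):
    # Only runs if out_proj itself is called as a module.
    calls["out_proj"] += 1

mha.out_proj.register_forward_hook(count_call)

x = torch.randn(5, 1, 8)  # (sequence, batch, embed_dim)
output, _ = mha(x, x, x)
print(output.shape, calls["out_proj"])
```

The hook is never triggered, which suggests that any wrapper relying on `out_proj.forward` would be bypassed as well.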
Note: The test |
just wanted to bump this one because it's really the only way for tuning CLIP models after they are released.
@bghira Do you happen to have a use case where you could test if this PR works and is working well enough speed-wise? I think the implementation could be ready to be merged but ideally we'd have someone with a real use case give it a try.
i do and i may be able to test it. stupid question but is the code example above complete? i dont see the hinge loss function
You mean the code right at the top? No, it's not complete at all, just a quick test to show that MHA is applied and the backward pass does not fail. This is not proper nor complete training code.
Extend the functionality of having different adapters in the same batch to also work with `modules_to_save`.
There was a bug in BOFT that made it impossible in some circumstances to load more than one adapter (creating more than 1 adapter was possible though). This was because a code path that adjusts boft_n_butterfly_factor was only visited when creating a fresh adapter, but not when updating with the 2nd adapter. This was fixed by moving this code path from the BOFT layer's __init__ method to update_layer. A test for loading multiple adapters was added. Since this was a gap in our test suite, this test will be applied to all appropriate PEFT methods, not only BOFT, but the others methods are all passing without needing further changes. For good measure, I also added BOFT to the test suite that checks multiple active adapters. These tests would have also passed without the fix in this PR, since these tests do not load multiple adapters but instead create them, which always worked. Still it's better to have these tests as well.
Eetq/hqq/aqlm don't support XPU yet.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
@BenjaminBossan
lora_config = LoraConfig(
    r=16,
    target_modules=["in_proj_weight"],
    lora_alpha=32,
    lora_dropout=0.05
)
An error occurs as By the way, I download
(I report the same issue here) |
Params need to be re-registered to appear in state dict.
Had to port some accelerate functions to peft and modify them for this to work.
not stale |
@BenjaminBossan I tried using the code from this PR, but I found that the
import torch
import torch.nn as nn
from copy import deepcopy
from peft import LoraConfig, get_peft_model

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.mha = nn.MultiheadAttention(1024, 8)

net = Net()
lora_config = LoraConfig(
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=['mha'],
)
lora_net = get_peft_model(deepcopy(net), lora_config)
print(hasattr(net.mha, 'batch_first'))
print(hasattr(lora_net.mha, 'batch_first'))
Output:
As you can see, Is this an expected behavior, or is there a way to preserve the |
@coding-kuku thanks for the question.
Generally, when PEFT wraps a layer, you can access the original layer by inserting the |
This looks good in general but I had some questions / comments.
# TODO: work with separate weights
weight_merged = base_layer.in_proj_weight.data.detach() + self.get_delta_weight(active_adapter)
del base_layer.in_proj_weight
base_layer.in_proj_weight = weight_merged
Shouldn't this throw an exception? AFAICS we're assigning a tensor to a parameter value:
foo = torch.nn.Linear(10, 100)
foo.weight = foo.weight.detach() # raises
What am I missing?
It's true that we change the type here; I guess you could consider this part of the hack to make this work. At the end, through _restore_weights, the correct type is restored.
Ah, yes. I missed the del statement, which unregisters the parameter and thus removes the setattr constraint. WDYT about something along the lines of
# unregister parameter implicitly and overwrite using merged weights; gradients are computed
# after forward and, thus, after unmerging (see forward()), therefore this is safe to do.
del base_layer.in_proj_weight
base_layer.in_proj_weight = orig_weights_in
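The exchange above can be condensed into a short sketch. It uses the reviewer's `nn.Linear` example rather than the PR's layer: direct assignment of a plain tensor to a registered parameter raises a `TypeError`, while deleting the attribute first lifts that constraint.

```python
import torch
import torch.nn as nn

foo = nn.Linear(10, 100)

# Direct assignment: the name is still a registered parameter, so this is rejected.
try:
    foo.weight = foo.weight.detach()
    raised = False
except TypeError:
    raised = True

# After del, the name is no longer registered and a plain tensor is accepted.
del foo.weight
foo.weight = torch.zeros(100, 10)
is_parameter = isinstance(foo.weight, nn.Parameter)

print(raised, is_parameter)
```

The layer remains usable in this state because its forward only reads `self.weight`, but the tensor is absent from `state_dict` until it is registered as a parameter again.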
base_layer = self.get_base_layer()
weight = base_layer.in_proj_weight
del base_layer.in_proj_weight
base_layer.register_parameter("in_proj_weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))

# out_proj
base_layer = base_layer.out_proj.get_base_layer()
weight = base_layer.weight
del base_layer.weight
base_layer.register_parameter("weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))
This is equivalent to the register_parameter calls in unregister except for the weight content, right? Maybe refactor this into a function for brevity?
Not sure what you're referring to, where is unregister?
Sorry, I meant unmerge :)
I see. There is similar code in unmerge, unload_and_optionally_merge_module, and _restore_weights, true. However, it is not quite identical and we would need two new methods, one for each weight. I think at this point, there is not much to be gained from refactoring this, WDYT?
Maybe I'm missing something but I don't think you'd need one for each - the whole section is pretty much identical, no?
But it is absolutely not crucial to change this.
def restore_parameters(base_layer, in_proj_weight, in_req_grad, out_proj_weight, out_req_grad):
    # Re-register both weights as parameters; del first so register_parameter accepts the name.
    del base_layer.in_proj_weight
    base_layer.register_parameter(
        "in_proj_weight",
        nn.Parameter(in_proj_weight.data, requires_grad=in_req_grad)
    )
    out_proj_base_layer = base_layer.out_proj.get_base_layer()
    del out_proj_base_layer.weight
    out_proj_base_layer.register_parameter(
        "weight",
        nn.Parameter(out_proj_weight.data, requires_grad=out_req_grad),
    )

"""
base_layer = self.get_base_layer()
weight = base_layer.in_proj_weight
del base_layer.in_proj_weight
base_layer.register_parameter("in_proj_weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))
# out_proj
base_layer = base_layer.out_proj.get_base_layer()
weight = base_layer.weight
del base_layer.weight
base_layer.register_parameter("weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))
"""
base_layer = self.get_base_layer()
out_proj_weight = base_layer.out_proj.get_base_layer().weight
restore_parameters(
    base_layer,
    base_layer.in_proj_weight,
    base_layer.in_proj_weight.requires_grad,
    out_proj_weight,
    out_proj_weight.requires_grad,
)

"""
# in_proj
old_weight = base_layer.in_proj_weight.data - self.get_delta_weight(active_adapter)
del base_layer.in_proj_weight
base_layer.register_parameter("in_proj_weight", nn.Parameter(old_weight, requires_grad=False))
# out_proj
old_weight = base_layer.out_proj.base_layer.weight.data - base_layer.out_proj.get_delta_weight(
    active_adapter
)
del base_layer.out_proj.base_layer.weight
base_layer.out_proj.base_layer.register_parameter(
    "weight", nn.Parameter(old_weight, requires_grad=False)
)
"""
restore_parameters(
    base_layer,
    base_layer.in_proj_weight.data - self.get_delta_weight(active_adapter),
    False,
    base_layer.out_proj.base_layer.weight.data - base_layer.out_proj.get_delta_weight(active_adapter),
    False,
)

"""
# extra steps: re-register weights, take care of out_proj layer
# in_proj
weight = base_layer.in_proj_weight
del base_layer.in_proj_weight
base_layer.register_parameter("in_proj_weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))
# out_proj
out_proj_layer = base_layer.out_proj.get_base_layer()
weight = out_proj_layer.weight
del out_proj_layer.weight
out_proj_layer.register_parameter("weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))
"""
out_proj_weight = base_layer.out_proj.get_base_layer().weight
restore_parameters(
    base_layer,
    base_layer.in_proj_weight,
    base_layer.in_proj_weight.requires_grad,
    out_proj_weight,
    out_proj_weight.requires_grad,
)
Co-authored-by: githubnemo <[email protected]>
Thanks for the review @githubnemo, I committed your suggestions and replied to your comments.
Thanks for the clarifications, some comments left.
elif getattr(child, "q_proj_weight", None) is not None:  # MHA
    weight = child.q_proj_weight
In the case we support, this is never not None, right?
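The question above can be checked directly on `nn.MultiheadAttention`. The sketch below compares a layer with identical query, key, and value dimensions against one with differing `kdim` and `vdim` (the concrete sizes are arbitrary choices for the example).

```python
import torch.nn as nn

same = nn.MultiheadAttention(embed_dim=8, num_heads=2)
different = nn.MultiheadAttention(embed_dim=8, num_heads=2, kdim=4, vdim=6)

summary = {}
for name, mha in [("same", same), ("different", different)]:
    summary[name] = {
        "_qkv_same_embed_dim": mha._qkv_same_embed_dim,
        "has_in_proj_weight": getattr(mha, "in_proj_weight", None) is not None,
        "has_q_proj_weight": getattr(mha, "q_proj_weight", None) is not None,
    }
print(summary)
```

With `_qkv_same_embed_dim=True`, `q_proj_weight` is registered as `None` and only the fused `in_proj_weight` exists, so the `q_proj_weight` branch would only be reached in the configuration that the PR leaves out of scope.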
First stab at adding LoRA support for nn.MultiheadAttention. See #761.

Todos:
- For now, only works with _qkv_same_embed_dim=True -- make it work with False too. _qkv_same_embed_dim=False is out of scope for this PR and can be added in a later PR if needed.
- Docs: Apart from docstrings, I don't think anything else needs to be added.

Update: I now also included the out_proj to apply LoRA to.

This is a simple test that I ran successfully with the PR in its current state: