fairscale/fairscale/nn/data_parallel/fully_sharded_data_parallel.py, line 1424 (commit 5f484b3)
From this line, you can see that when mixed_precision is enabled, FSDP casts the input to the appropriate precision inside a no_grad context.

If the input is a tensor that requires grad, for example the output of another learnable module feeding into this FSDP module, the cast detaches it from the autograd graph. Gradients then cannot be backpropagated into that upstream module, so it receives no update during optimizer.step().

Why is this cast done under no_grad? Is it safe to turn no_grad off (set it to False)?

For now, I have set no_grad to False to work around the problem.
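To make the failure mode concrete, here is a minimal sketch (not fairscale's actual code; the `upstream`/`downstream` modules are illustrative only) showing how casting an input inside torch.no_grad() severs the autograd link to the module that produced it:

```python
import torch
import torch.nn as nn

upstream = nn.Linear(4, 4)      # learnable module producing the input
downstream = nn.Linear(4, 1)    # stands in for the wrapped FSDP module

x = torch.randn(2, 4)
hidden = upstream(x)            # grad-requiring output of the upstream module

# Cast under no_grad, as the FSDP forward does when mixed_precision is on:
with torch.no_grad():
    hidden_fp16 = hidden.half() # this tensor is detached from the graph

out = downstream(hidden_fp16.float()).sum()
out.backward()

print(upstream.weight.grad)     # None: no gradient reached the upstream module
print(downstream.weight.grad)   # populated: downstream still receives gradients
```

Because `hidden_fp16` is created under no_grad, backward() stops there: `downstream` trains normally, but `upstream.weight.grad` stays None, which matches the symptom described above.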