fairscale/fairscale/nn/data_parallel/fully_sharded_data_parallel.py, line 1424 (commit 5f484b3)
From this line, you can see that when mixed_precision is enabled, FSDP casts the input to the appropriate precision inside a no_grad context.

If the input is a tensor that requires grad, for example the output of another learnable module feeding into this FSDP module, the cast detaches it from the autograd graph. Gradients then cannot be backpropagated into that upstream module, so it receives no update during optimizer.step().

Why is this cast done under no_grad? Is it safe to turn no_grad off (set it to False)?

For now, I have set no_grad to False to work around the problem.
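To make the failure mode concrete, here is a minimal sketch (not fairscale's actual code; the `upstream`/`downstream` modules are illustrative only) showing how casting an input inside torch.no_grad() severs the autograd link to the module that produced it:

```python
import torch
import torch.nn as nn

upstream = nn.Linear(4, 4)      # learnable module producing the input
downstream = nn.Linear(4, 1)    # stands in for the wrapped FSDP module

x = torch.randn(2, 4)
hidden = upstream(x)            # grad-requiring output of the upstream module

# Cast under no_grad, as the FSDP forward does when mixed_precision is on:
with torch.no_grad():
    hidden_fp16 = hidden.half() # this tensor is detached from the graph

out = downstream(hidden_fp16.float()).sum()
out.backward()

print(upstream.weight.grad)     # None: no gradient reached the upstream module
print(downstream.weight.grad)   # populated: downstream still receives gradients
```

Because `hidden_fp16` is created under no_grad, backward() stops there: `downstream` trains normally, but `upstream.weight.grad` stays None, which matches the symptom described above.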