
Question about Ulysses and loss aggregation #6841

Open
pavelgein opened this issue Dec 9, 2024 · 7 comments

Comments

@pavelgein

Hi,
I am using Ulysses attention and the DeepSpeed Zero3 optimizer for DPO training.
My question is: what is the right way to aggregate the loss?
When one trains a model with CrossEntropyLoss, each rank yields a loss for its own sequence shard.
But when one trains a model with the DPO loss, there is only a single loss for the whole example.
What is the right way to handle this?

@samadejacobs
Contributor

Please take a look at sequence-parallel-aware cross entropy here.

@pavelgein
Author

Yes, I have seen this before. As far as I understand, in this approach all the logits are stored on each rank, and on the backward pass only the required parts of the gradients are taken into account.

I was trying an approach where some reduction is done before the communication inside the sequence parallel group (to reduce the communication load), and I am going to try to build something along those lines.
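Roughly the shape of what I have in mind (just a sketch of the idea, not DeepSpeed code; sp_group, beta and the per-token log-prob shards are placeholders from my setup), reducing scalars across the group instead of gathering logits:

import torch.nn.functional as F
import torch.distributed.nn as dist_nn  # autograd-aware collectives

def dpo_loss_sequence_parallel(policy_chosen_shard, policy_rejected_shard,
                               ref_chosen_shard, ref_rejected_shard,
                               sp_group, beta=0.1):
    # Each rank holds per-token log-probs only for its own sequence shard.
    # Reduce the scalar partial sums across the SP group instead of the logits.
    chosen = dist_nn.all_reduce(policy_chosen_shard.sum(), group=sp_group)
    rejected = dist_nn.all_reduce(policy_rejected_shard.sum(), group=sp_group)
    ref_chosen = dist_nn.all_reduce(ref_chosen_shard.sum(), group=sp_group)
    ref_rejected = dist_nn.all_reduce(ref_rejected_shard.sum(), group=sp_group)
    # Standard DPO objective on the reconstructed full-sequence log-probs;
    # every rank in the SP group now sees the same loss value.
    logits = beta * ((chosen - ref_chosen) - (rejected - ref_rejected))
    return -F.logsigmoid(logits)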

@pavelgein
Author

I see a gradient difference.
I have two ranks in the sequence parallel group (SPG). I compute the gradients of the loss and slice them with respect to the rank in the SPG, so I end up with two gradients, g_1 and g_2, one on each rank.

When I run the same setup without sequence parallelism, I get a single gradient g, and for all layers g is approximately equal to g_1 + g_2.
So my question is whether the DeepSpeed Zero3 optimizer handles this case correctly, or whether it just takes the average across all ranks.

I also see a difference in the gradient norms: around 3.24 on each rank with sequence parallelism and around 9.11 without sequence parallelism.
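For reference, this is roughly how I compare them (a sketch; the gradient dicts are collected from my two runs and keyed by parameter name):

import torch

def compare_grads(baseline_grads, sp_grads_rank0, sp_grads_rank1, atol=1e-5):
    # Expect the baseline gradient g to match the *sum* of the per-rank SP gradients.
    for name, g in baseline_grads.items():
        summed = sp_grads_rank0[name] + sp_grads_rank1[name]
        max_err = (g - summed).abs().max().item()
        ok = torch.allclose(g, summed, atol=atol)
        print(f"{name}: max |g - (g_1 + g_2)| = {max_err:.3e}, allclose={ok}")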

@ronald-d-rogers

@pavelgein, it does not do either. The way it is currently implemented, it just returns the loss without any reduction. You are meant to do loss.mean() or loss.sum() yourself.

I am going to start testing the loss outputs soon and was wondering if this fixes your issue.

@pavelgein
Author

@ronald-d-rogers I think I didn't use the right words in my question.

When we do DDP, we split the dataset across workers, compute the gradient of the loss on every worker, and then average the gradients across all workers. Since each worker's output depends only on its own input, this gives us a valid gradient estimate.

Now, when we use sequence parallelism, a worker's output is no longer independent of the other workers' inputs.
If we consider the sequence parallel group as one worker, then we should take the sum of the gradients inside this group, not the average.
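In code, the reduction I would expect looks roughly like this (a sketch; sp_group and dp_group are the process groups for sequence and data parallelism, and dp_size is the number of data parallel replicas):

import torch.distributed as dist

def reduce_gradient(grad, sp_group, dp_group, dp_size):
    # Sum inside the sequence parallel group: the per-rank gradients are
    # pieces of the gradient for one and the same example.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=sp_group)
    # Average across data parallel replicas, as in plain DDP.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=dp_group)
    grad.div_(dp_size)
    return grad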

@pavelgein
Author

Here the Zero3 optimizer reduces the effective number of workers by the factor of the sequence parallel group size:

buffer_to_reduce.div_(world_sz / float(self.sequence_parallel_size))
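If I read that correctly, the bookkeeping works out like this (a small worked example, not DeepSpeed code):

world_sz = 8                      # e.g. 4 data parallel replicas x 2 SP ranks
sequence_parallel_size = 2
divisor = world_sz / float(sequence_parallel_size)  # = 4.0, the number of replicas
# A global SUM all-reduce adds up all 8 per-rank gradients; dividing by 4
# therefore sums the two gradients inside each SP group and averages over
# the 4 data parallel replicas, i.e. the aggregation described above.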

@ronald-d-rogers

ronald-d-rogers commented Jan 2, 2025

Ah, I think I understand now. I actually tried the same thing as you did -- doing the reduction in the sequence parallel group -- but gave up. I tried to modify the method he provided (VocabSequenceParallelCrossEntropy) to accept reduction as an argument, both passing it to nll_loss and returning loss_all.sum() / loss_all.mean() at the end. I think it is possible, but I ran into issues.
