Compared with the native Torch cross-entropy, the gradient differences of the classifier are very large. #14
Probably related to this, I observe that the loss increases during training when using it.
This is something we've noticed too, specifically when training models from scratch. @BangguWu, is the loss gap on train or val? I have seldom seen the train loss be different (except in the case of Triton bugs that we haven't worked around yet), but I have seen the val loss be different when training from scratch and the validation set has tokens that aren't present in the train set.
We have been working on some updates on this branch; it adds two options.
@BangguWu @zhixuan-lin If you are up for some beta testing, feel free to try these out and let me know how it goes.
I have implemented a toy script,
and the output is:
the gradient of the classifier looks very large.
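A minimal sketch of this kind of gradient comparison is below; the `cut_cross_entropy` import path and the `linear_cross_entropy(embeddings, classifier, targets)` signature are assumptions for illustration and may not match this repo's actual API.

```python
import torch
import torch.nn.functional as F

# Hypothetical import of the fused loss under discussion; adjust to the actual package.
try:
    from cut_cross_entropy import linear_cross_entropy
except ImportError:
    linear_cross_entropy = None

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
N, D, V = 1024, 512, 32000  # tokens, hidden size, vocab size

embeddings = torch.randn(N, D, device=device, dtype=torch.bfloat16, requires_grad=True)
classifier = torch.randn(V, D, device=device, dtype=torch.bfloat16, requires_grad=True)
targets = torch.randint(0, V, (N,), device=device)

# Reference path: materialize the full logit matrix and use native torch cross-entropy.
ref_e = embeddings.detach().clone().requires_grad_(True)
ref_c = classifier.detach().clone().requires_grad_(True)
ref_loss = F.cross_entropy((ref_e @ ref_c.T).float(), targets)
ref_loss.backward()

# Fused path: same inputs and labels, gradients taken through the fused kernel.
if linear_cross_entropy is not None and device == "cuda":
    fused_loss = linear_cross_entropy(embeddings, classifier, targets)
    fused_loss.backward()
    print("loss diff:           ", abs(fused_loss.item() - ref_loss.item()))
    print("classifier grad diff:", (classifier.grad - ref_c.grad).abs().max().item())
    print("embedding grad diff: ", (embeddings.grad - ref_e.grad).abs().max().item())
```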
Also, I have tried to train an LLM with a GPT-2 architecture, and the loss gap is about 0.06 when training on 100B tokens.
Is there anything wrong with my usage?