Hello,
I have been working on implementing caseHOLD within the lm-evaluation-harness framework. While I managed to create the necessary files (`casehold.yaml` and `utils.py`) to preprocess and evaluate the dataset, I encountered significant challenges during evaluation. Specifically, all models I tested consistently performed at or below random chance, which indicates an issue I have been unable to resolve.
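To make the request concrete, here is a minimal sketch of how I understand the task wiring. Everything below is illustrative rather than a copy of the files in this PR: it assumes the `lex_glue` dataset with the `case_hold` config on the Hugging Face Hub (fields `context`, `endings`, and `label`) and the v0.4-style YAML task format of lm-evaluation-harness.

```yaml
# casehold.yaml — illustrative sketch, assuming the lex_glue/case_hold layout.
task: casehold
dataset_path: lex_glue
dataset_name: case_hold
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.doc_to_text
doc_to_choice: endings        # list of five candidate holdings per example
doc_to_target: "{{label}}"    # integer index of the correct holding
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
```

And the matching helper, assuming each `context` embeds a `(<HOLDING>)` placeholder that the candidate endings are meant to fill:

```python
# utils.py — illustrative sketch of the helper referenced above.

def doc_to_text(doc: dict) -> str:
    """Build the prompt from the citing context.

    Truncates the context at the "(<HOLDING>)" placeholder so each
    candidate in `endings` can be scored as a continuation of the
    prompt. The exact placeholder string is an assumption about the
    dataset's formatting.
    """
    return doc["context"].split("(<HOLDING>)")[0].rstrip()
```

One detail I suspect could explain at-or-below-chance scores: with `output_type: multiple_choice`, the harness joins the prompt and each choice with `target_delimiter` (a single space by default), so a stray trailing space in the prompt, or choices that already start with a space, can silently skew the log-likelihoods. I have not confirmed this is the cause here.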
I’ve opened this pull request to share my implementation and to kindly request assistance in debugging or refining it. I’ve also opened a linked issue to facilitate further discussion.
For context, here are the resources relevant to the task:
- The caseHOLD dataset
- Paper on caseHOLD
Any guidance or feedback would be greatly appreciated. Thank you in advance for your support!
Best regards,
David