
# LingOly

### Paper

Title: LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages

Abstract: https://arxiv.org/abs/2406.06196

In this paper, we present LingOly, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs demonstrate that the benchmark is challenging, and models perform poorly on the higher-difficulty problems. On harder problems, even the top model achieved only 38.7% accuracy, a 24.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, the higher-resource the language, the better the scores. These results indicate that, in the absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.
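For intuition, the no-context comparison amounts to a simple delta between two exact-match accuracies. The minimal Python sketch below illustrates the idea; the function names and the normalisation step are illustrative assumptions, not the benchmark's actual implementation:

```python
def exact_match(prediction: str, reference: str) -> bool:
    # Illustrative normalisation only; the benchmark's own matching rules may differ.
    return prediction.strip().lower() == reference.strip().lower()

def delta_score(context_preds, nocontext_preds, references):
    """Accuracy with the puzzle context minus accuracy without it.

    A large delta suggests genuine in-context reasoning; a high
    no-context score alone suggests the answers were memorised.
    """
    n = len(references)
    with_ctx = sum(exact_match(p, r) for p, r in zip(context_preds, references))
    without_ctx = sum(exact_match(p, r) for p, r in zip(nocontext_preds, references))
    return with_ctx / n - without_ctx / n
```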

Homepage: https://github.com/am-bean/lingOly

### Citation

```bibtex
@article{beanLINGOLYBenchmarkOlympiadLevel2024,
  title = {{LINGOLY}: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages},
  shorttitle = {{LINGOLY}},
  url = {http://arxiv.org/abs/2406.06196},
  author = {Bean, Andrew M. and Hellsten, Simi and Mayne, Harry and Magomere, Jabez and Chi, Ethan A. and Chi, Ryan and Hale, Scott A. and Kirk, Hannah Rose},
  month = jun,
  year = {2024},
  keywords = {Computer Science - Computation and Language}
}
```

### Tasks

* `lingoly`: runs both `lingoly_context` and `lingoly_nocontext` and reports the difference between their scores (see the usage sketch after this list)
* `lingoly_context`: exact match of model generations against the reference answers
* `lingoly_nocontext`: exact match of model generations against the reference answers, but with the puzzle context removed from the prompt
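Assuming these tasks live in the lm-evaluation-harness task library (as the checklist below suggests), they could be run roughly as follows; the model backend and model name are placeholders, not a recommended setup:

```python
# Hedged usage sketch via the lm-evaluation-harness Python API.
# The task name comes from the list above; everything else is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder model
    tasks=["lingoly"],                               # runs _context and _nocontext
)
print(results["results"]["lingoly"])
```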

### Checklist

For adding novel benchmarks/datasets to the library:

* Is the task an existing benchmark in the literature?
  * Have you referenced the original paper that introduced the task?
  * If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:

  • Is the "Main" variant of this task clearly denoted?
  • Have you provided a short sentence in a README on what each new variant adds / evaluates?
  • Have you noted which, if any, published evaluation setups are matched by this variant?