mlx Model (loglikelihood & generate_until) #1902
base: main
Conversation
I'm getting the following traceback running the evaluation this way (in an environment with mlx and mlx-lm):

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \
    --tasks medqa_4options \
    --batch_size 64

Traceback:

2024-05-29:13:18:14,114 INFO [__main__.py:254] Verbosity set to INFO
2024-05-29:13:18:16,354 INFO [__main__.py:341] Selected Tasks: ['medqa_4options']
2024-05-29:13:18:16,355 INFO [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-29:13:18:16,355 INFO [evaluator.py:178] Initializing mlx model, with arguments: {'model': 'internistai/base-7b-v0.2'}
Fetching 9 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 32968.33it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-05-29:13:18:20,863 INFO [mlx_llms.py:28] Model type is '<class 'mlx_lm.models.llama.Model'>
2024-05-29:13:18:22,781 INFO [task.py:398] Building contexts for medqa_4options on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1273/1273 [00:00<00:00, 198223.53it/s]
2024-05-29:13:18:22,818 INFO [evaluator.py:395] Running loglikelihood requests
Running loglikelihood requests (79 batches): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [26:40<00:00, 20.26s/it]
[..snip..]
Traceback (most recent call last):
File "/path/to/bin/lm_eval", line 8, in <module>
sys.exit(cli_evaluate())
^^^^^^^^^^^^^^
File "/path/to/lm_eval/__main__.py", line 347, in cli_evaluate
results = evaluator.simple_evaluate(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/lm_eval/utils.py", line 321, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/path/to/lm_eval/evaluator.py", line 256, in simple_evaluate
results = evaluate(
^^^^^^^^^
File "/path/to/lm_eval/utils.py", line 321, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/path/to/lm_eval/evaluator.py", line 421, in evaluate
task.apply_filters()
File "/path/to/lm_eval/api/task.py", line 1000, in apply_filters
f.apply(self._instances)
File "/path/to/lm_eval/api/filter.py", line 55, in apply
for inst, resp in zip(instances, resps):
File "/path/to/lm_eval/filters/selection.py", line 23, in <lambda>
return map(lambda r: r[0], resps)

The implemented loglikelihood function returns a list of 5,056 (log-likelihood, boolean) pairs. However, for some reason, the TakeFirstFilter.apply method receives a resps parameter with 5,092 responses, the last of which are empty lists, which seems to be causing the traceback. Any help would be greatly appreciated.
However, I was able to run it against mmlu_professional_medicine:

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \
    --tasks mmlu_professional_medicine \
    --batch_size 64
[..snip..]
mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|---------------------|------:|------|-----:|------|-----:|---|-----:|
|professional_medicine| 0|none | 0|acc |0.1838|± |0.0235|
Oddly enough, I can get a clean eval of internistai/base-7b-v0.2 against the mmlu_professional_medicine task on MLX and then HF, but I still get the issue above when running against the medqa_4options task:

% time lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \
    --tasks mmlu_professional_medicine \
    --batch_size 64
2024-05-31:15:31:05,832 INFO [evaluator.py:395] Running loglikelihood requests
Running loglikelihood requests (17 batches): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [04:55<00:00, 17.36s/it]
mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|---------------------|------:|------|-----:|------|-----:|---|-----:|
|professional_medicine| 0|none | 0|acc |0.7647|± |0.0258|
lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 --tasks  64  7.96s user 35.39s system 13% cpu 5:10.00 total

Hugging Face run on the same model:

% time lm_eval --model hf --model_args pretrained=internistai/base-7b-v0.2,dtype="float" --tasks mmlu_professional_medicine --device mps --batch_size 64
hf (pretrained=internistai/base-7b-v0.2,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|---------------------|------:|------|-----:|------|-----:|---|-----:|
|professional_medicine| 0|none | 0|acc |0.7647|± |0.0258|
lm_eval --model hf --model_args --tasks mmlu_professional_medicine --device  28.83s user 117.90s system 63% cpu 3:49.41 total
I fixed some handling of batch remainders, and it looks good; running comparisons against HF/MPS/PyTorch for medqa and some related subsets of MMLU.
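For context on what "handling of batch remainders" typically involves here, a minimal sketch (with a hypothetical score_batch helper, not the PR's actual code): when requests are chunked into fixed-size batches, the final short batch may be padded up to batch_size, and the scores produced for the padding must be dropped so that the number of responses matches the number of requests.

```python
# Hypothetical sketch: `score_batch` stands in for the real per-batch scoring code.
def run_in_batches(requests, batch_size, score_batch):
    results = []
    for start in range(0, len(requests), batch_size):
        batch = requests[start:start + batch_size]
        pad = batch_size - len(batch)        # non-zero only for the final remainder batch
        padded = batch + [batch[-1]] * pad   # pad by repeating the last request
        scores = score_batch(padded)         # one (logprob, is_greedy) pair per padded item
        results.extend(scores[:len(batch)])  # discard scores produced for the padding
    return results
```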
Could you add installation dependencies (e.g., an lm_eval[mlx] extra; see pyproject.toml) and a way to check whether the library is installed when the model is called (see lm_eval/models/anthropic_llms.py)?
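A minimal sketch of what that guard could look like, modeled loosely on the pattern in lm_eval/models/anthropic_llms.py (exact wording and placement are up to the author; the pyproject.toml side would just add an mlx extra listing mlx and mlx-lm):

```python
# Sketch only: attempt the optional imports when the mlx model type is used,
# and point the user at the extra if they are missing.
try:
    import mlx.core as mx  # noqa: F401
    import mlx_lm  # noqa: F401
except ModuleNotFoundError as e:
    raise ModuleNotFoundError(
        "attempted to use the 'mlx' model type, but the `mlx` / `mlx-lm` packages are "
        "not installed. Please install them via `pip install lm_eval[mlx]` (once the "
        "extra is added to pyproject.toml) or `pip install mlx mlx-lm`."
    ) from e
```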
@haileyschoelkopf bringing this to your attention as well.
…en continuation caching, but getting a KeyError in re_ord.get_cache(..)
Fixed logic and included the mask in what is returned from _preserve_last_target_len_logits. Fixed padding and cont_tok tracking, and removed one-token continuation caching.
@baberabb I've removed all dependencies on the caching, and I'm able to get similar answer log-prob and greedy == continuation values for a handful of questions I probed. However, the final top-level figures still don't match; I have run out of ideas why and wonder if the issue is at the level above _loglikelihood_tokens.
Update _preserve_last_target_len_logits to fix identification of target sequences. Moved the calculation of log-probs at the corresponding continuation token indices to be done entirely as array manipulation for efficiency. Fixed returning of results.
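For reference, the "entirely as array manipulation" approach described in that commit message can be illustrated roughly as follows with NumPy (the PR itself operates on mlx arrays; the shapes and names here are illustrative only):

```python
import numpy as np

def continuation_scores(logits, cont_tokens):
    # logits: (batch, cont_len, vocab) -- outputs at the positions that predict each
    #   continuation token
    # cont_tokens: (batch, cont_len) -- the reference continuation token ids
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # log-softmax
    # log-prob of each reference token, gathered without a per-token Python loop
    picked = np.take_along_axis(log_probs, cont_tokens[..., None], axis=-1).squeeze(-1)
    # was the reference token also the greedy (argmax) choice at every position?
    greedy = (log_probs.argmax(axis=-1) == cont_tokens).all(axis=-1)
    return picked.sum(axis=-1), greedy  # per-request (log-likelihood, is_greedy)
```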
I have made many updates and now have figures that seem reasonably close to those of the HF model. I reviewed log prob scores (via --log_samples) for individual answers between the two, and they were comparable as well. Prefix prompt caching was also added, and generate_until support was removed (I can add a more robust implementation in a subsequent PR).
This adds a new model type for mlx models. In particular, it implements the loglikelihood and generate_until interfaces. It works with the current versions of mlx and mlx-lm.
The new model type is mlx, so the harness can be run this way to evaluate against a local mlx model:
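For example, using the invocation from the discussion above:

```shell
lm_eval --model mlx \
    --model_args model=internistai/base-7b-v0.2 \
    --tasks mmlu_professional_medicine \
    --batch_size 64
```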
The expected model args include model (a Hugging Face repo ID or local path to an MLX model), as shown in the example above.