mlx Model (loglikelihood & generate_until) #1902

chimezie · 2024-05-29T21:00:33Z

This adds a new model type for mlx models. In particular, it implements the loglikelihood and generate_until interfaces. It works with the current versions of mlx and mlx-lm.

The new model type is mlx, so the harness can be run this way to evaluate against a local mlx model:

lm_eval --model mlx --model_args model=.. model name or path ..   --tasks medqa_4options

The expected model args are:

model (huggingface model or local path to mlx model)
adapter_path (path to a LoRA adapter to apply to the model)
trust_remote_code
eos_token
top_p (defaults to 1)
max_tokens (defaults to 2048)
batch_size (defaults to 4)
max_gen_tokens (defaults to 256)
ensure_bos_token (defaults to False) : Whether or not to ensure the first token is a defined BOS token
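As a rough illustration of how the comma-separated `--model_args` string maps onto the arguments listed above, here is a minimal sketch. The helper `parse_model_args` and the `MLX_DEFAULTS` table are hypothetical names for this example only; they are not the harness's own parser or the PR's code. Arguments whose default is not stated in the description are set to `None`, which is an assumption.

```python
# Defaults as listed in the PR description. None marks arguments whose
# default the description does not state (an assumption of this sketch).
MLX_DEFAULTS = {
    "model": None,
    "adapter_path": None,
    "trust_remote_code": None,
    "eos_token": None,
    "top_p": 1.0,
    "max_tokens": 2048,
    "batch_size": 4,
    "max_gen_tokens": 256,
    "ensure_bos_token": False,
}


def parse_model_args(arg_string: str) -> dict:
    """Hypothetical helper: turn 'k1=v1,k2=v2' into a dict merged over the defaults."""
    args = dict(MLX_DEFAULTS)
    for pair in filter(None, arg_string.split(",")):
        key, _, raw = pair.partition("=")
        key, raw = key.strip(), raw.strip()
        if key not in args:
            raise KeyError(f"unexpected model arg: {key!r}")
        default = MLX_DEFAULTS[key]
        # bool must be checked before int because bool is a subclass of int.
        if isinstance(default, bool):
            args[key] = raw.lower() in ("1", "true", "yes")
        elif isinstance(default, int):
            args[key] = int(raw)
        elif isinstance(default, float):
            args[key] = float(raw)
        else:
            args[key] = raw
    return args


parsed = parse_model_args("model=internistai/base-7b-v0.2,ensure_bos_token=True")
print(parsed["model"], parsed["ensure_bos_token"], parsed["batch_size"])
```

Anything not passed on the command line falls back to the listed default, so the call above leaves batch_size at 4.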

CLAassistant · 2024-05-29T21:00:39Z

All committers have signed the CLA.

chimezie · 2024-05-29T21:10:03Z

I'm getting the following traceback running the evaluation this way (in an environment with mlx and mlx-lm):

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \
    --tasks medqa_4options \
    --batch_size 64

Traceback:

2024-05-29:13:18:14,114 INFO     [__main__.py:254] Verbosity set to INFO
2024-05-29:13:18:16,354 INFO     [__main__.py:341] Selected Tasks: ['medqa_4options']
2024-05-29:13:18:16,355 INFO     [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-29:13:18:16,355 INFO     [evaluator.py:178] Initializing mlx model, with arguments: {'model': 'internistai/base-7b-v0.2'}
Fetching 9 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 32968.33it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-05-29:13:18:20,863 INFO     [mlx_llms.py:28] Model type is '<class 'mlx_lm.models.llama.Model'>
2024-05-29:13:18:22,781 INFO     [task.py:398] Building contexts for medqa_4options on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1273/1273 [00:00<00:00, 198223.53it/s]
2024-05-29:13:18:22,818 INFO     [evaluator.py:395] Running loglikelihood requests
Running loglikelihood requests (79 batches):  37%|███████▍            | 29/79 [10:13<15:22, 18.46s/it]
Running loglikelihood requests (79 batches): 100%|████████████████████| 79/79 [26:40<00:00, 20.26s/it]
[..snip..]
Traceback (most recent call last):
  File "/path/to/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/path/to/lm_eval/__main__.py", line 347, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lm_eval/utils.py", line 321, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/path/to/lm_eval/evaluator.py", line 256, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/path/to/lm_eval/utils.py", line 321, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/path/to/lm_eval/evaluator.py", line 421, in evaluate
    task.apply_filters()
  File "/path/to/lm_eval/api/task.py", line 1000, in apply_filters
    f.apply(self._instances)
  File "/path/to/lm_eval/api/filter.py", line 55, in apply
    for inst, resp in zip(instances, resps):
  File "/path/to/lm_eval/filters/selection.py", line 23, in <lambda>
    return map(lambda r: r[0], resps)

The implemented loglikelihood function returns a list of 5,056 (log-likelihood, boolean) pairs. However, the TakeFirstFilter.apply method receives a resps parameter with 5,092 entries, the last of which are empty lists, and this seems to be what causes the traceback.

Any help would be greatly appreciated.
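The two counts reported above are consistent with the final partial batch being dropped. The sketch below works through the arithmetic, assuming one loglikelihood request per answer option (4 per document) and assuming the snipped exception is an IndexError raised when the filter indexes an empty response list. Neither assumption is confirmed by the log itself.

```python
import math

num_docs = 1273      # from the "Building contexts" progress bar
num_choices = 4      # assumption: one request per answer option
batch_size = 64      # from --batch_size

num_requests = num_docs * num_choices                    # 5092
full_batches = num_requests // batch_size                # 79, as in the log
needed_batches = math.ceil(num_requests / batch_size)    # 80
answered = full_batches * batch_size                     # 5056
unanswered = num_requests - answered                     # 36

# Requests that never receive a result keep an empty response list.
resps = [[(-1.0, False)]] * answered + [[]] * unanswered


def take_first(resps):
    # Same shape as the lambda shown in the traceback.
    return map(lambda r: r[0], resps)


try:
    list(take_first(resps))
    failed = False
except IndexError:
    failed = True

print(num_requests, answered, unanswered, failed)
```

Under these assumptions, 79 full batches cover 5,056 requests and leave 36 without a response, which matches the gap between 5,056 and 5,092.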

chimezie · 2024-05-30T00:54:43Z

However, I was able to run it against mmlu_professional_medicine:

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \
    --tasks mmlu_professional_medicine \
    --batch_size 64
[..snip..]
mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|        Tasks        |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------|------:|------|-----:|------|-----:|---|-----:|
|professional_medicine|      0|none  |     0|acc   |0.1838|±  |0.0235|

chimezie · 2024-06-01T03:15:24Z

Oddly enough, I can get a clean eval of internistai/base-7b-v0.2 against mmlu_professional_medicine tasks on MLX and then HF but still get the issue above when run against the medqa_4options task:

% time lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \                                              
    --tasks mmlu_professional_medicine \
    --batch_size 64 
2024-05-31:15:31:05,832 INFO     [evaluator.py:395] Running loglikelihood requests
Running loglikelihood requests (17 batches): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [04:55<00:00, 17.36s/it]
mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|        Tasks        |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------|------:|------|-----:|------|-----:|---|-----:|
|professional_medicine|      0|none  |     0|acc   |0.7647|±  |0.0258|

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 --tasks   64  7.96s user 35.39s system 13% cpu 5:10.00 total

Hugging Face run on the same model:

% time lm_eval --model hf --model_args pretrained=internistai/base-7b-v0.2,dtype="float" --tasks mmlu_professional_medicine --device mps  --batch_size 64
hf (pretrained=internistai/base-7b-v0.2,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|        Tasks        |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------|------:|------|-----:|------|-----:|---|-----:|
|professional_medicine|      0|none  |     0|acc   |0.7647|±  |0.0258|

lm_eval --model hf --model_args  --tasks mmlu_professional_medicine --device   28.83s user 117.90s system 63% cpu 3:49.41 total

chimezie · 2024-06-18T01:40:34Z

I fixed the handling of batch remainders, and the results now look good. I am running comparisons against HF/MPS/PyTorch for medqa and some related subsets of MMLU.
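For reference, a minimal sketch of batching that keeps the final remainder. The function name `batched` is hypothetical and this is not the code from the PR; it only illustrates the general technique of stepping through the requests by start index so the last, shorter slice is still yielded.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches; the last one may be shorter than batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Stepping by start index means the final slice simply comes up short,
    # so no request is dropped.
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


batches = list(batched(range(5092), 64))
print(len(batches), len(batches[-1]))
```

With 5,092 requests and a batch size of 64, this yields 80 batches, the last of which holds the 36 remaining requests.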

lintangsutawika

Could add installation dependencies (like lm_eval[mlx]; see pyproject.toml) and a way to check whether the library is installed when called (see lm_eval/models/anthropic_llms.py).
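One way such an install check could look is sketched below. The helper `require_packages` is a hypothetical name, the `mlx` extra is the one proposed in this review rather than something known to exist, and this is not a copy of what lm_eval/models/anthropic_llms.py does.

```python
import importlib.util


def require_packages(*names: str, extra: str) -> None:
    """Raise a helpful error if any top-level package in `names` is not importable."""
    # find_spec returns None for a missing top-level package without importing it.
    missing = [name for name in names if importlib.util.find_spec(name) is None]
    if missing:
        raise ModuleNotFoundError(
            f"Missing package(s): {', '.join(missing)}. "
            f"Install them with: pip install 'lm_eval[{extra}]'"
        )


# Intended use at model construction time (package names as used in this PR):
# require_packages("mlx", "mlx_lm", extra="mlx")
```

Calling the check when the model is constructed, rather than at import time, keeps the harness importable on machines without mlx.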

lintangsutawika · 2024-07-12T07:33:37Z

@haileyschoelkopf bringing this to your attention as well.

chimezie · 2024-12-01T00:49:17Z

@baberabb I've removed all dependencies on the caching and I'm able to get similar answer log prob and greedy = continuation values for a handful of questions I probed. However, the final top-level figures still don't match, and I have run out of ideas why and wonder if the issue is at the level above _loglikelihood_tokens:

% lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 --tasks mmlusr_question_and_answer_clinical_knowledge --batch_size 56
[..snip..]
mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge|      1|none  |     0|acc   |↑  |0.2302|±  |0.0259|

% lm_eval --model hf --model_args pretrained=internistai/base-7b-v0.2,dtype="float32" --tasks mmlusr_question_and_answer_clinical_knowledge --batch_size 56 --device mps
[..snip..]
hf (pretrained=internistai/base-7b-v0.2,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge|      1|none  |     0|acc   |↑  |0.5132|±  |0.0308|
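The two per-request quantities compared in this comment, the continuation log-probability and the greedy-match flag, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the PR's `_loglikelihood_tokens`: `score_continuation` is a hypothetical name, and it assumes the logits have already been sliced so that row i is the distribution used to predict continuation token i.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def score_continuation(
    logits_per_position: Sequence[Sequence[float]],
    continuation: Sequence[int],
) -> Tuple[float, bool]:
    """Return (summed log-prob of the continuation, whether greedy decoding matches it)."""
    if len(logits_per_position) != len(continuation):
        raise ValueError("need exactly one logits row per continuation token")
    total = 0.0
    is_greedy = True
    for logits, tok in zip(logits_per_position, continuation):
        total += log_softmax(logits)[tok]
        # argmax over raw logits; log-softmax is monotonic, so the argmax over
        # log-probs would be the same token.
        if max(range(len(logits)), key=logits.__getitem__) != tok:
            is_greedy = False
    return total, is_greedy


rows = [[0.0, 0.0], [0.0, math.log(3.0)]]
print(score_continuation(rows, [0, 1]))
```

An off-by-one in which logits row is paired with which continuation token leaves the code running but changes every score, so it is one place worth checking when individual probes agree but aggregate accuracy does not.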

chimezie · 2024-12-05T16:40:01Z

I have made many updates and now have figures that seem reasonably close to those of the HF model. I reviewed log prob scores (via --log_samples) for individual answers between the two, and they were comparable as well. Prefix prompt caching was also added, and generate_until support was removed (I can add a more robust implementation in a subsequent PR).

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \
               --tasks mmlusr_question_and_answer_clinical_knowledge --batch_size 56

mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge|      1|none  |     0|acc   |↑  |0.4566|±  |0.0307|

lm_eval --model hf --model_args pretrained=internistai/base-7b-v0.2,dtype="float32" \
              --tasks mmlusr_question_and_answer_clinical_knowledge --batch_size 56 --device mps

hf (pretrained=internistai/base-7b-v0.2,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge|      1|none  |     0|acc   |↑  |0.5132|±  |0.0308|

% lm_eval --model mlx --model_args model=m42-health/Llama3-Med42-8B \
                   --tasks mmlu_clinical_knowledge

mlx (model=m42-health/Llama3-Med42-8B), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical_knowledge|      1|none  |     0|acc   |↑  |0.7245|±  |0.0275|

% lm_eval --model hf --model_args pretrained=m42-health/Llama3-Med42-8B,dtype="float32" \
                  --tasks mmlu_clinical_knowledge --batch_size 56 --device mps

hf (pretrained=m42-health/Llama3-Med42-8B,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical_knowledge|      1|none  |     0|acc   |↑  |0.7547|±  |0.0265|
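To put "reasonably close" on a rough numerical footing, the sketch below compares each MLX/HF pair of accuracies against their reported standard errors. It treats the two runs as independent samples, which is a simplifying assumption: both backends score the same questions, so a paired comparison over the --log_samples output would be more informative. The function name `z_score` is only for this example.

```python
import math


def z_score(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """Absolute accuracy gap divided by the combined standard error."""
    return abs(acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# (mlx acc, mlx stderr, hf acc, hf stderr), copied from the tables above.
runs = {
    "internistai/base-7b-v0.2, mmlusr clinical knowledge": (0.4566, 0.0307, 0.5132, 0.0308),
    "m42-health/Llama3-Med42-8B, mmlu clinical_knowledge": (0.7245, 0.0275, 0.7547, 0.0265),
}

for name, values in runs.items():
    print(f"{name}: z = {z_score(*values):.2f}")
```

Both gaps come out below the conventional 1.96 threshold under this rough test, although the MLX figure is lower than the HF one in both cases.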

chimezie added 6 commits May 20, 2024 13:35

Initial implementation (no caching)

6121d1b

Fix model registration

8319638

Merge branch 'EleutherAI:main' into main

75bc3f9

Self-standing loglikelihood implementation.

01d80ca

Merge remote-tracking branch 'origin/main'

87979f0

pre-commit ruff issues

bcfea6c

chimezie requested review from haileyschoelkopf and lintangsutawika as code owners May 29, 2024 21:00

chimezie mentioned this pull request May 30, 2024

Logprobs info to completion API ml-explore/mlx-examples#806

Merged

chimezie added 2 commits May 31, 2024 18:37

Merge branch 'EleutherAI:main' into mlx

08a0593

Various fixes

9d50820

chimezie added 5 commits May 31, 2024 23:27

Typo

9b575f7

Merge branch 'EleutherAI:main' into mlx

71b2483

Merge branch 'EleutherAI:main' into mlx

80623e4

Merge remote-tracking branch 'upstream/main' into mlx

f496a47

Fix handling of final remainder batch

1214693

chimezie added 6 commits June 20, 2024 22:24

Merge branch 'EleutherAI:main' into mlx

e1d1d71

Merge branch 'EleutherAI:main' into mlx

363024a

Merge branch 'EleutherAI:main' into mlx

eba4fb2

Merge branch 'EleutherAI:main' into mlx

24f1665

Merge branch 'EleutherAI:main' into mlx

ab89e53

Merge branch 'EleutherAI:main' into mlx

e4376d2

lintangsutawika reviewed Jul 12, 2024

Merge branch 'EleutherAI:main' into mlx

7a1b419

Removed addition of EOS token

3922864

chimezie requested a review from baberabb November 23, 2024 01:36

chimezie and others added 7 commits November 25, 2024 04:49

Merge branch 'EleutherAI:main' into mlx

5bdf50c

fixup

f478fbc

nit

9df463f

add typehints

7f4aab5

Merge branch 'EleutherAI:main' into mlx

d770703

Attempt to better mimic HF model loglikelihood_tokens impl of one-token continuation caching, but getting a Key Error in re_ord.get_cache(..)

9e2d8c9

Merge branch 'EleutherAI:main' into mlx

a67baf0

chimezie added 6 commits November 30, 2024 08:48

Various fixes toward parity with HF impl

f6e2e20

Fixed logic and include mask in what is returned from _preserve_last_target_len_logits. Fixed padding, cont_tok tracking, and removed one-token continuation caching

Merge remote-tracking branch 'origin/mlx' into mlx

4349b50

Merge branch 'EleutherAI:main' into mlx

853db9f

Calculate greedy tokens off logits not log probs

78fc03e

Swapped back for no other reason than because that is how it is in HF impl

0df42c1

Merge and added context_prefix_cache to mlx model

9e6120a

chimezie added 5 commits December 3, 2024 10:53

Merge branch 'EleutherAI:main' into mlx

9d8a95e

Update various score calculation logic

425543c

Update _preserve_last_target_len_logits to fix identification of target sequences. Moved calculation of log-probs at the corresponding continuation token indices to be done entirely as array manipulation for efficiency Fixed returning of results

Update documentation and remove unused bits

84aceb4

Merge branch 'EleutherAI:main' into mlx

94e207a

Merge branch 'prefix_caching' into mlx

253ee13

chimezie added 3 commits December 5, 2024 14:38

More efficient impl of _preserve_last_target_len_scores

92aa52b

Needed by inherited methods

4c8d1bf

Remove unused method and responses

428df75

chimezie mentioned this pull request Dec 7, 2024

mlx_lm.evaluate ml-explore/mlx-examples#1140

Merged

chimezie added 3 commits December 10, 2024 13:14

Merge branch 'EleutherAI:main' into mlx

c6d8b9f

Merge branch 'EleutherAI:main' into mlx

e85cfd4

Merge branch 'EleutherAI:main' into mlx

da8d872
