Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't use implicitly elapsed_time in autotuner #3036

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

anmyachev
Copy link
Contributor

@anmyachev anmyachev commented Dec 17, 2024

The main idea of ​​this pull request is not to use elapsed_time that enable profiling mode for sycl queues, as this is not needed for profiling with PyTorch and PTI.

CI runs:

@anmyachev anmyachev marked this pull request as ready for review December 18, 2024 13:51
anmyachev added a commit that referenced this pull request Dec 18, 2024
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev
Copy link
Contributor Author

@whitneywhtsang we can try the changes in #2484 on DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch

@whitneywhtsang
Copy link
Contributor

@whitneywhtsang we can try the changes in #2484 on DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch

Let's cherry-pick this PR to ptdb-dle-runner.

@anmyachev
Copy link
Contributor Author

anmyachev commented Dec 18, 2024

@whitneywhtsang we can try the changes in #2484 on DLE runner, but we need to cherry-pick 2a4b818 into Pavel's branch

Let's cherry-pick this PR to ptdb-dle-runner.

ok, but let's use 2a4b818 (last commit in #2484) which compatible with changes on Pavel's branch

whitneywhtsang pushed a commit that referenced this pull request Dec 18, 2024
Signed-off-by: Anatoly Myachev <[email protected]>
anmyachev added a commit that referenced this pull request Dec 18, 2024
@whitneywhtsang
Copy link
Contributor

Please rebase this PR.

@anmyachev
Copy link
Contributor Author

Please rebase this PR.

done

@whitneywhtsang
Copy link
Contributor

whitneywhtsang commented Dec 30, 2024

I like the idea of this PR, but it looks like performance is not better:
Screenshot 2024-12-30 091157
Let's rerun after agama update to see if there will be any performance difference.

@anmyachev
Copy link
Contributor Author

I like the idea of this PR, but it looks like performance is not better

This performance difference may be due to the different number of warm-up runs of the function. I use the interface of our functions that warm up a certain number of times (10), instead of running only 10 milliseconds, as is the default in Triton:
return do_bench(*args, n_warmup=10, n_repeat=10, **kwargs)

@whitneywhtsang
Copy link
Contributor

This performance difference may be due to the different number of warm-up runs of the function.

We could do a run with upstream do_bench changed to use exact number of runs (without this PR), and see if there are any performance differences, to isolate the reason. (We could do that after Agama update.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't use implicitly elapsed_time in autotuner when profiling with PyTorch and PTI
2 participants