comm_replay encounters issue #192

Open
x41lakazam opened this issue Dec 15, 2024 · 1 comment
Comments

@x41lakazam

Describe the Bug

When running comm_replay on ET traces I get the following error:

$ comm_replay --enable-profiler --trace-type et --trace-path /workspace/traces --num-replays 1

 0: [rank0]: Traceback (most recent call last):
 0: [rank0]:   File "/usr/local/bin/comm_replay", line 8, in <module>
 0: [rank0]:     sys.exit(main())
 0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 1671, in main
 0: [rank0]:     traceBench.runBench(commsParams)
 0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 1324, in runBench
 0: [rank0]:     self.benchTime(commsParams)
 0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 1236, in benchTime
 0: [rank0]:     self.replayTrace(commsParams=commsParams, warmup=True)
 0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 1063, in replayTrace
 0: [rank0]:     (latency, global_latency) = self.runComms(
 0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 820, in runComms
 0: [rank0]:     self.collectiveArgs.waitObjIds[curComm.req] = retObj
 0: [rank0]: TypeError: unhashable type: 'list'
56: [rank56]: Traceback (most recent call last):
56: [rank56]:   File "/usr/local/bin/comm_replay", line 8, in <module>
56: [rank56]:     sys.exit(main())
56: [rank56]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 1671, in main
56: [rank56]:     traceBench.runBench(commsParams)
56: [rank56]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 1324, in runBench
56: [rank56]:     self.benchTime(commsParams)
56: [rank56]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 1236, in benchTime
56: [rank56]:     self.replayTrace(commsParams=commsParams, warmup=True)
56: [rank56]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 1063, in replayTrace
56: [rank56]:     (latency, global_latency) = self.runComms(
56: [rank56]:   File "/usr/local/lib/python3.10/dist-packages/et_replay/tools/comm_replay.py", line 820, in runComms
56: [rank56]:     self.collectiveArgs.waitObjIds[curComm.req] = retObj
56: [rank56]: TypeError: unhashable type: 'list'

The Chakra schema version is 1.1.1-chakra.0.0.4.
I've tried with param@main and with param@7b19f58, as the Chakra user guide recommends.
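
For reference, the failure at the bottom of the traceback can be reproduced outside et_replay. Python raises this TypeError whenever a list is used as a dictionary key, so one possible reading is that `curComm.req` was parsed from the trace as a list rather than a scalar. The sketch below is only an illustration under that assumption: `wait_obj_ids`, `req_from_trace`, and `to_hashable` are hypothetical names, not part of et_replay, and the tuple conversion is a generic workaround, not a confirmed fix.

```python
# Stand-in for collectiveArgs.waitObjIds (hypothetical, not the et_replay object).
wait_obj_ids = {}

# Assumption: the request id came out of the trace as a list instead of an int.
req_from_trace = [0, 1]

error_message = None
try:
    # Lists are mutable and define no hash, so they cannot be dict keys.
    wait_obj_ids[req_from_trace] = "work-handle"
except TypeError as exc:
    error_message = str(exc)

print(error_message)


def to_hashable(req):
    """Recursively convert lists to tuples so the value can serve as a dict key."""
    if isinstance(req, list):
        return tuple(to_hashable(item) for item in req)
    return req


# With the key converted to a tuple, the assignment succeeds.
wait_obj_ids[to_hashable(req_from_trace)] = "work-handle"
```

Whether a conversion like this is appropriate depends on what `req` actually holds in the trace, which is why the content of a `record_param_comms` node matters here.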

@shengfukevin
Contributor

Are you able to share the trace file, or at least one node of type "record_param_comms"? Thanks.
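
If sharing the whole trace is not possible, a single node can be pulled out with a few lines of Python. This is a minimal sketch that assumes the trace is a JSON document with a top-level "nodes" list whose entries carry a "name" field; that layout, the helper `find_nodes_by_name`, and the miniature `sample_trace` are illustrative assumptions, not taken from the reporter's files.

```python
import json


def find_nodes_by_name(trace, name="record_param_comms"):
    """Return every node whose "name" field equals `name`.

    Assumes `trace` is a dict with a top-level "nodes" list (an assumption
    about the trace layout, not verified against this issue's traces).
    """
    return [node for node in trace.get("nodes", []) if node.get("name") == name]


# Hypothetical miniature trace used only to exercise the helper.
# For a real file, load it first with json.load() on the opened trace file.
sample_trace = {
    "schema": "1.1.1-chakra.0.0.4",
    "nodes": [
        {"id": 1, "name": "aten::add"},
        {"id": 2, "name": "record_param_comms", "inputs": {"values": []}},
    ],
}

matches = find_nodes_by_name(sample_trace)

# Pretty-print the first match so it can be pasted into a comment.
print(json.dumps(matches[0], indent=2))
```

Posting one such node is usually enough to see what type the request field has in the trace.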
