The tapered guide currently has each thread follow a neutron through its whole journey, so threads whose neutrons do not get far through the instrument sit idle for the rest of the computation.
Could the computation be split by component, so that later components consume fewer threads (or other resources) in proportion to the fewer neutrons they have to handle? Would that improve efficiency? And how would we filter out the neutrons that have already been lost?
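One common way to filter out lost neutrons between components is stream compaction. After a component runs, each thread writes a survival flag, an exclusive prefix sum over the flags gives every surviving neutron its slot in a packed array, and the next component is launched with only as many threads as there are survivors. The sketch below is plain Python standing in for that GPU pattern, not our instrument code. `run_staged`, `exclusive_scan` and the `slit` component are hypothetical names made up for illustration, and a component is assumed to return `None` when it absorbs a neutron.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical, simplified neutron state: (x, y, z) only.
Neutron = Tuple[float, float, float]
Component = Callable[[Neutron], Optional[Neutron]]


def exclusive_scan(flags: List[bool]) -> List[int]:
    """Exclusive prefix sum over survival flags: the packed output slot of each
    survivor. A GPU compaction would compute this scan in parallel."""
    slots, total = [], 0
    for alive in flags:
        slots.append(total)
        total += int(alive)
    return slots


def run_staged(neutrons: List[Neutron], components: List[Component]):
    """Apply each component only to neutrons that survived the previous one.
    Returns the final survivors and the number of work items per stage."""
    launched = []
    for component in components:
        launched.append(len(neutrons))  # one "thread" per live neutron
        results = [component(n) for n in neutrons]
        flags = [r is not None for r in results]
        slots = exclusive_scan(flags)
        survivors: List[Neutron] = [None] * sum(flags)  # type: ignore[list-item]
        for r, alive, slot in zip(results, flags, slots):
            if alive:
                survivors[slot] = r  # scatter into the packed array
        neutrons = survivors
    return neutrons, launched


def slit(half_width: float) -> Component:
    """Hypothetical absorbing slit: keeps neutrons with |x| < half_width."""
    def apply(n: Neutron) -> Optional[Neutron]:
        return n if abs(n[0]) < half_width else None
    return apply


# Usage: 21 neutrons, three progressively narrower slits.
beam = [(i / 10.0, 0.0, 0.0) for i in range(-10, 11)]
survivors, launched = run_staged(beam, [slit(0.8), slit(0.5), slit(0.2)])
```

Here `launched` shrinks from stage to stage (21, 15, then 9 work items) instead of staying at 21. On a GPU the scan and scatter would themselves run in parallel (libraries such as CUB provide them), so the trade-off to measure is the cost of compaction and the extra kernel launches against the idle threads it saves.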
Also, what other approaches might improve efficiency? For instance, could we consider a multi-GPU deployment in which different GPUs handle different parts of the instrument?
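A multi-GPU split could be organised as a pipeline. Each device owns a contiguous range of components, and batches of surviving neutrons are passed downstream. The sketch below assumes exactly that and uses Python threads and queues purely as stand-ins for devices and device-to-device transfers. `stage_worker`, `run_pipeline` and `slit` are hypothetical names.

```python
import queue
import threading

SENTINEL = None  # marks the end of the batch stream


def slit(half_width):
    """Hypothetical absorbing slit: keeps neutrons with |x| < half_width."""
    def apply(n):
        return n if abs(n[0]) < half_width else None
    return apply


def stage_worker(components, inbox, outbox):
    """Stand-in for one device: applies its share of the components to each
    batch, drops absorbed neutrons and forwards the smaller batch downstream."""
    while True:
        batch = inbox.get()
        if batch is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        for component in components:
            batch = [out for out in map(component, batch) if out is not None]
        if batch:  # nothing to transfer if every neutron was absorbed
            outbox.put(batch)


def run_pipeline(batches, component_groups):
    """Chain one worker per component group via FIFO queues."""
    queues = [queue.Queue() for _ in range(len(component_groups) + 1)]
    workers = [
        threading.Thread(target=stage_worker, args=(group, queues[i], queues[i + 1]))
        for i, group in enumerate(component_groups)
    ]
    for w in workers:
        w.start()
    for batch in batches:
        queues[0].put(batch)
    queues[0].put(SENTINEL)

    results = []
    while True:
        batch = queues[-1].get()
        if batch is SENTINEL:
            break
        results.extend(batch)
    for w in workers:
        w.join()
    return results


# Usage: "device 0" runs the first slit, "device 1" runs the other two.
beam = [(i / 10.0, 0.0, 0.0) for i in range(-10, 11)]
batches = [beam[0:7], beam[7:14], beam[14:21]]
out = run_pipeline(batches, [[slit(0.8)], [slit(0.5), slit(0.2)]])
```

Because neutrons are lost along the way, giving each device the same number of components would likely leave downstream devices underloaded. A real split would probably need to balance expected work (neutrons reaching a component times its cost) rather than component count, and to account for the cost of transfers between devices.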
Rather than speculatively refactoring our instrument construction and testing, it may be possible to build a more ad hoc proof-of-principle test in which we manually create a component that is actually composite and splits threads unevenly, so that parts nearer the exit get fewer.
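For such a proof of principle, one simple way to decide how unevenly to split the threads is to budget each segment of the composite component by its expected number of surviving neutrons. The sketch assumes, purely for illustration, a constant transmission probability per segment, and rounds each budget up to whole warps of 32 threads. `threads_per_segment` is a hypothetical helper.

```python
import math

WARP_SIZE = 32  # NVIDIA GPUs schedule threads in warps of 32


def threads_per_segment(n_neutrons, n_segments, transmission):
    """Thread budget for each segment of a hypothetical composite guide,
    assuming a constant per-segment transmission probability so that
    segment k expects n_neutrons * transmission**k neutrons."""
    budgets = []
    for k in range(n_segments):
        expected = n_neutrons * transmission ** k
        # Round up to a whole number of warps, and launch at least one warp.
        budgets.append(max(WARP_SIZE, math.ceil(expected / WARP_SIZE) * WARP_SIZE))
    return budgets


# Usage: 1024 neutrons, 4 segments, half the neutrons lost in each segment.
print(threads_per_segment(1024, 4, 0.5))
```

Since the actual number of survivors fluctuates around the expectation, a fixed budget would need a fallback such as a grid-stride loop, so that a segment with more survivors than threads still processes all of them.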
@mtbc @ckendrick can we run a profiler to check whether memory is the bottleneck? For the tapered guide, every thread needs to read the array holding the guide profile (width and height vs. z along the beam). Could that be a problem?
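For reference, the per-step work each thread does against the profile array is roughly the lookup below: find the z interval, then interpolate width and height. This is a hypothetical sketch (`make_profile` and `profile_at` are made-up names; linear interpolation and clamping at the ends are assumptions, not necessarily what the tapered guide does), but it shows that each lookup costs a binary search of about log2(n) reads into one small, read-only table shared by all threads.

```python
from bisect import bisect_right


def make_profile(zs, widths, heights):
    """Hypothetical guide profile table: sorted z positions with the guide
    width and height at each position (arbitrary units)."""
    assert list(zs) == sorted(zs) and len(zs) == len(widths) == len(heights)
    return list(zs), list(widths), list(heights)


def profile_at(profile, z):
    """Linearly interpolate (width, height) at z, clamping outside the table."""
    zs, ws, hs = profile
    if z <= zs[0]:
        return ws[0], hs[0]
    if z >= zs[-1]:
        return ws[-1], hs[-1]
    i = bisect_right(zs, z) - 1  # z lies in [zs[i], zs[i + 1])
    t = (z - zs[i]) / (zs[i + 1] - zs[i])
    return ws[i] + t * (ws[i + 1] - ws[i]), hs[i] + t * (hs[i + 1] - hs[i])


# Usage: a guide narrowing along z.
p = make_profile([0.0, 1.0, 2.0], [0.10, 0.08, 0.04], [0.20, 0.15, 0.10])
w, h = profile_at(p, 1.5)
```

Because neutrons in the same warp are generally at different z, they read different table entries. Whether that is a problem (and whether shared memory or the read-only data cache would help) is exactly what a profiler run, e.g. with Nsight Compute, should tell us.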
@ckendrick points out that https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g504b94170f83285c71031be6d5d15f73 may be helpful.