Various other rough language features
Here is a collection of additional 'in progress' ideas with some implementation approaches sketched...
The built-in print() statement takes a format string in which the nth '%' is replaced with the nth value in the variable argument list. It would probably be nice to support the full range of C-style printf() formatting operations instead.
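To make the current positional behavior concrete, here is a minimal C sketch of "replace the nth '%' with the nth argument". It is only an illustration under simplifying assumptions: `format_percent` is a hypothetical helper, it handles integer arguments only, and it is not how ispc implements print().

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of ispc): copies `fmt` into `out`,
 * replacing the nth '%' with the nth entry of `args`. Integer-only for
 * brevity. A '%' with no remaining argument is copied through unchanged.
 * Returns the number of characters written, excluding the final NUL. */
size_t format_percent(char *out, size_t cap, const char *fmt,
                      const int *args, size_t nargs) {
    size_t len = 0, next = 0;
    if (cap == 0) return 0;
    for (const char *p = fmt; *p != '\0' && len + 1 < cap; ++p) {
        if (*p == '%' && next < nargs) {
            int n = snprintf(out + len, cap - len, "%d", args[next++]);
            if (n < 0) break;
            /* snprintf reports the untruncated length; clamp to the buffer. */
            len += ((size_t)n < cap - len) ? (size_t)n : cap - len - 1;
        } else {
            out[len++] = *p;
        }
    }
    out[len] = '\0';
    return len;
}
```

Supporting printf()-style conversions would mean parsing what follows each '%' (width, precision, conversion character) rather than treating '%' as a bare placeholder.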
For control flow, we currently have both compile-time and run-time mechanisms to determine when things are coherent across the program instances. (Uniform test expressions give a compile-time indication, and constructs like 'cif', which emit code to check for mask coherence, give it at run-time.)
Is there value in run-time checks of this kind for more general values? For example, the code generated for a scatter could have a run-time test to see whether the lvalues refer to sequential locations in memory and, if so, take a path that does an unaligned vector store instead. Gather code could do the analogous thing, and could also test whether the lvalues are all the same and issue a scalar load if so.
It's a little unclear whether these would be wins: the cross-lane equality tests aren't cheap, and they mean extra branches, more generated code, etc. This probably merits some investigation at some point, though.
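As a rough illustration of what such run-time checks might look like, here is a scalar C emulation. Everything in it is an assumption for the sake of the sketch: `LANES` stands in for programCount, all lanes are taken to be active (no execution mask), and `memcpy` stands in for a single unaligned vector load or store. It is not code that ispc generates.

```c
#include <stdbool.h>
#include <string.h>

#define LANES 8 /* assumed gang size, standing in for programCount */

/* True if each lane's index is exactly one more than the previous lane's. */
static bool indices_consecutive(const int idx[LANES]) {
    for (int i = 1; i < LANES; ++i)
        if (idx[i] != idx[0] + i) return false;
    return true;
}

/* True if all lanes hold the same index. */
static bool indices_equal(const int idx[LANES]) {
    for (int i = 1; i < LANES; ++i)
        if (idx[i] != idx[0]) return false;
    return true;
}

/* Scatter with a run-time fast path; returns true if the fast path ran. */
bool scatter_checked(float *base, const int idx[LANES], const float val[LANES]) {
    if (indices_consecutive(idx)) {
        /* Stands in for one unaligned vector store. */
        memcpy(base + idx[0], val, sizeof(float) * LANES);
        return true;
    }
    for (int i = 0; i < LANES; ++i) base[idx[i]] = val[i];
    return false;
}

/* Gather with two run-time fast paths; returns true if either one ran. */
bool gather_checked(const float *base, const int idx[LANES], float out[LANES]) {
    if (indices_equal(idx)) {
        float v = base[idx[0]]; /* one scalar load, then broadcast */
        for (int i = 0; i < LANES; ++i) out[i] = v;
        return true;
    }
    if (indices_consecutive(idx)) {
        /* Stands in for one unaligned vector load. */
        memcpy(out, base + idx[0], sizeof(float) * LANES);
        return true;
    }
    for (int i = 0; i < LANES; ++i) out[i] = base[idx[i]];
    return false;
}
```

The two index tests are exactly the cross-lane comparisons whose cost is in question: they run on every gather or scatter, whether or not the fast path ends up being taken.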
With a little dataflow analysis, we should be able to determine if a 'varying' variable could actually be represented as a 'uniform'.
(This is discussed further on this page: Additional IR step between AST and LLVM IR)
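One way such an analysis could work is sketched below in C. It is a toy under stated assumptions, not ispc's implementation: the `Value` struct is a hypothetical miniature IR, and control dependence is modeled by simply listing the controlling condition as an extra operand.

```c
#include <stdbool.h>

#define MAX_OPS 3

/* Hypothetical miniature IR: each value is either a known-varying source
 * (e.g. programIndex) or is computed from up to MAX_OPS other values.
 * A controlling branch condition can be listed as an operand as well,
 * since an assignment under varying control flow gives a varying result. */
typedef struct {
    bool varying_source;
    int  num_ops;
    int  ops[MAX_OPS];
} Value;

/* Optimistically assume every value is uniform, then mark a value as
 * varying whenever one of its operands is varying. Repeat until nothing
 * changes. Whatever is still unmarked could be stored as 'uniform'. */
void infer_varying(const Value *vals, int n, bool *is_varying) {
    for (int i = 0; i < n; ++i) is_varying[i] = vals[i].varying_source;
    bool changed = true;
    while (changed) {
        changed = false;
        for (int i = 0; i < n; ++i) {
            if (is_varying[i]) continue;
            for (int k = 0; k < vals[i].num_ops; ++k) {
                if (is_varying[vals[i].ops[k]]) {
                    is_varying[i] = true;
                    changed = true;
                    break;
                }
            }
        }
    }
}
```

Starting from the optimistic assumption matters for loops: a loop-carried value that depends only on itself and on uniform inputs stays uniform, where a pessimistic start would leave it varying.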
Is there any value in issuing prefetch instructions? Probably not on modern CPUs with complex HW prefetchers.
It would also be interesting to try some experiments with fibering in the compiler: when an operation with expected long memory latency is encountered, issue prefetches for the values needed and then switch to a different set of program instances by just swapping context, not doing a HW context switch. It seems unlikely that this would be a win given the costs of saving/restoring registers, but it would be interesting to try with some workloads with extremely incoherent memory accesses.
Is there value in language syntax that indicates data that will be streamed over? This could be used to drive adding "non-temporal" hints to loads/stores. (Need to quantify how much of a win these are in practice.)
Cilk's reducers ([www.fftw.org/~athena/papers/hyper.pdf]) are very nice; should we have some kind of in-language support for this sort of construct? It's a nice way to abstract away the high-performance idiom of "do work in parallel across cores, accumulating values locally, and then merge those into a final result at the end". Because ispc + tasks has a small two-level hierarchy of parallelism, this idiom has advantages both for accumulating results locally across program instances on one core and for accumulating them across all of the cores.
CPU performance counters can measure a lot of interesting things. Can we make it easy to learn interesting things about programs through support for inserting calls to them at appropriate places in the generated code? The main advantage of having the compiler do this is that it has higher-level insight, such as "this block of code is doing the gather we need for line 22 of file foo.ispc".
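The two-level idiom can be sketched in plain C as follows. This is a sequential emulation under assumptions: `LANES` and `TASKS` are made-up sizes standing in for the gang width and the number of tasks, the reduction is a simple sum, and none of the names correspond to an existing ispc or Cilk API.

```c
#define LANES 4 /* assumed gang size */
#define TASKS 3 /* assumed number of tasks */

/* One "task": accumulate per-lane partial sums over a chunk, then merge
 * them across the program instances of that task. */
static float task_sum(const float *data, int begin, int end) {
    float lane[LANES] = {0};
    for (int i = begin; i < end; ++i)
        lane[(i - begin) % LANES] += data[i];
    float s = 0.0f;
    for (int l = 0; l < LANES; ++l) s += lane[l]; /* first-level merge */
    return s;
}

/* Split the input across tasks, then merge the per-task results. */
float two_level_sum(const float *data, int n) {
    float partial[TASKS];
    for (int t = 0; t < TASKS; ++t) { /* would run in parallel with a task system */
        int begin = (int)((long)n * t / TASKS);
        int end   = (int)((long)n * (t + 1) / TASKS);
        partial[t] = task_sum(data, begin, end);
    }
    float total = 0.0f;
    for (int t = 0; t < TASKS; ++t) total += partial[t]; /* second-level merge */
    return total;
}
```

A reducer construct would let the programmer write only the accumulation and the merge operation, leaving the per-lane and per-task storage to the compiler and runtime.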
A pattern that comes up frequently is things along the lines of:
    foreach light {
        // do some computation in parallel, e.g. determine if the light potentially
        // illuminates a collection of objects
        if (the light does illuminate them) {
            foreach object {
                // compute light's illumination in parallel for multiple objects
            }
        }
    }
i.e. nested data parallelism.
The best way to implement this sort of thing currently in ispc is using the packed_store_active() stdlib function, along the lines of:
    uniform int relevantLights[N_LIGHTS];
    uniform int numRelevantLights = 0;
    for (uniform int i = 0; i < N_LIGHTS; i += programCount) {
        int lightNum = i + programIndex;
        if (lights[lightNum] illuminates the objects)
            numRelevantLights += packed_store_active(relevantLights, numRelevantLights, lightNum);
    }
    for (uniform int i = 0; i < numRelevantLights; ++i) {
        uniform int lightNum = relevantLights[i];
        for (uniform int j = 0; j < numObjects; j += programCount) {
            // do processing for lights[lightNum] for multiple objects,
            // i.e. objects[j+programIndex]...
        }
    }
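For readers unfamiliar with the packed-store step, here is a scalar C emulation of the compaction pass. It is a sketch under assumptions: `LANES` stands in for programCount, the `keep` array is a hypothetical stand-in for the "does this light illuminate the objects" test, and `packed_store_active_emul` only mimics the behavior described above rather than reproducing the stdlib function.

```c
#include <stdbool.h>

#define LANES 4 /* assumed gang size, standing in for programCount */

/* Scalar emulation of a packed store: writes the values of the active
 * lanes contiguously to dst and returns how many were written. */
int packed_store_active_emul(int *dst, const int vals[LANES],
                             const bool active[LANES]) {
    int count = 0;
    for (int l = 0; l < LANES; ++l)
        if (active[l]) dst[count++] = vals[l];
    return count;
}

/* Compacts the indices i in [0, n) with keep[i] set into `out`, one gang
 * of LANES candidates at a time. Returns the number of indices kept. */
int compact_relevant(const bool *keep, int n, int *out) {
    int num = 0;
    for (int i = 0; i < n; i += LANES) {
        int vals[LANES];
        bool active[LANES];
        for (int l = 0; l < LANES; ++l) {
            int idx = i + l; /* plays the role of i + programIndex */
            vals[l] = idx;
            active[l] = idx < n && keep[idx];
        }
        num += packed_store_active_emul(out + num, vals, active);
    }
    return num;
}
```

After this pass, `out` holds a dense list of relevant indices, so the second loop can go through them one at a time and spend the full gang width on the inner per-object work.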
Is there a syntactic construct that would make it easy / clean to switch over to the 'foreach object, for this light' stuff, in the middle of the 'foreach lights' loop?
For writing library code, it would be nice to write functions parameterized by type. (There is already a lot of C preprocessor ugliness in stdlib.ispc to work around this issue.) It would also be nice to be able to do this with something simpler/more straightforward than the full complexity of e.g. C++ templates.
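The preprocessor workaround typically looks something like the following C sketch, where a macro stamps out one copy of a function per type. `DEFINE_CLAMP` and the `clamp_*` names are hypothetical and chosen for illustration; they are not the actual macros in stdlib.ispc.

```c
/* Stamp out one clamp function per element type. The body is written
 * once, but every instantiation needs its own explicitly chosen name. */
#define DEFINE_CLAMP(T, NAME)                       \
    static inline T NAME(T v, T lo, T hi) {         \
        return v < lo ? lo : (v > hi ? hi : v);     \
    }

DEFINE_CLAMP(int, clamp_int)
DEFINE_CLAMP(float, clamp_float)
DEFINE_CLAMP(double, clamp_double)
```

This works, but the function body lives inside a macro, error messages point at the expansion, and in ispc the list of instantiations grows further once uniform and varying variants are needed. A lightweight form of type parameterization would remove that duplication.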