Releases: ahrefs/ocannl
The "device memory" concept for multicore
Treats the C function stack of the monolithic update step as "device memory". There is no explicit synchronization; instead, we implement "update on host" where needed: updates that would affect other tasks are applied directly to the host's value of a tensor cell (e.g. adding to it) rather than to the task-local copy, which might be stale.
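A minimal sketch of the "update on host" idea, with hypothetical names (`host`, `make_local`, `update_on_host`) that are not OCANNL's actual API: reads go through a possibly stale task-local copy, while cross-task updates are applied directly to the host value under a lock.

```ocaml
(* Hypothetical illustration, not OCANNL's implementation. *)
let host = Array.make 4 0.0
let host_mutex = Mutex.create ()

(* Task-local snapshot: may go stale; used for local reads only. *)
let make_local () = Array.copy host

(* An update other tasks must observe: apply it on the host directly. *)
let update_on_host i delta =
  Mutex.lock host_mutex;
  host.(i) <- host.(i) +. delta;
  Mutex.unlock host_mutex

let local = make_local ()

let () =
  update_on_host 0 1.5;
  update_on_host 0 2.5;
  (* The host sees both updates; the stale local copy sees neither. *)
  assert (host.(0) = 4.0);
  assert (local.(0) = 0.0);
  print_endline "ok"
```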
Parallel computations (multicore SGD)
An attempt at parallelizing for multicore; it failed in that the Gccjit backend computations are bottlenecked by memory accesses. Further work in this direction would need to, e.g., copy the relevant sub-tensors for each of the parallel tasks.
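The suggested follow-up can be sketched as below: give each task a private copy of its sub-tensor so the hot loop never touches shared memory, writing back only once at the end. This is a hypothetical illustration using OCaml 5 domains, not OCANNL code.

```ocaml
(* Hypothetical sketch: per-task sub-tensor copies to avoid memory
   contention between parallel tasks. Requires OCaml 5 (Domain). *)
let n_tasks = 2
let host = Array.init 8 float_of_int

let run_task task_id =
  let chunk = Array.length host / n_tasks in
  let lo = task_id * chunk in
  (* Private copy of this task's sub-tensor: no sharing while computing. *)
  let local = Array.sub host lo chunk in
  for i = 0 to chunk - 1 do
    local.(i) <- local.(i) *. 2.0
  done;
  (* Single write-back at the end of the task. *)
  Array.blit local 0 host lo chunk

let () =
  List.init n_tasks (fun id -> Domain.spawn (fun () -> run_task id))
  |> List.iter Domain.join;
  assert (host.(0) = 0.0 && host.(7) = 14.0);
  print_endline "ok"
```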
Virtual nodes and constants inlining
CPU single thread with inlining optimizations.

- Operators: arithmetic, power (non-differentiable exponent), ReLU.
- Shape inference: pointwise; transpose; compose; extended einsum (arbitrary permuting and summing-out of individual or matched axes, pointwise ellipsis, broadcasting); dynamic indexing with inner-product-like (pointwise) and outer-product-like variants.
- Backends: interpreter with tracing; compiled by ocamlopt; compiled in-process by gccjit.
- Optimizations: virtual nodes -- when the cells of a tensor are not "recurrent" (accessed across steps) and are not accessed too many times, the defining computation is inlined and the tensor is not materialized; scalar constant subexpression elimination -- for 1D constant tensors, the subexpression is computed at compile time and the value is inlined.
- Text-based visualization: tensors with up to 5 varying axes (other axes fixed), computation graphs with side-by-side subtree layout, plotting "line" graphs and decision boundaries, benchmark tables.