Half precision, mixed precision, CUDA virtual devices
Release 0.4.1 offers half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.
From the CHANGELOG:
Added
- Implemented the previously-mocked support for half precision (FP16).
  - We work around the missing Ctypes coverage by not using `Ctypes.bigarray_start`.
  - We check FP16 constants for overflow.
  - We output half precision specific code from the CUDA backend.
- Finally proper support for mixed precision! Lazy precision defaults and delayed precision setting via `Tnode.update_prec`.
- A placeholder `nn_blocks.ml` hinting at an intended design pattern for model components.
- A memory model for the multiple virtual devices per physical device setup, implemented in the CUDA backend. It fixes the CUDA backend behavior in the data parallelism benchmark.
- Slides for the Fun OCaml meetup: docs/Fun OCaml.
- New syntax: inline tensor declarations with a literal float as initial value.
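
To illustrate what checking FP16 constants for overflow can look like, here is a minimal sketch in plain OCaml. The names `fp16_max`, `overflows_fp16` and `check_fp16_constant` are hypothetical and are not taken from the OCANNL code base; the only hard fact used is that the largest finite IEEE 754 half-precision value is 65504. The check is deliberately conservative: it flags every finite constant whose magnitude exceeds that limit, including values slightly above it that rounding would map back to 65504.

```ocaml
(* Largest finite IEEE 754 binary16 value: (2 - 2^-10) * 2^15. *)
let fp16_max = 65504.0

(* Hypothetical helper (not OCANNL's API): true when a finite constant
   has a magnitude beyond the largest finite half-precision value. *)
let overflows_fp16 (x : float) : bool =
  Float.is_finite x && Float.abs x > fp16_max

(* Hypothetical helper: accept the constant or report the overflow. *)
let check_fp16_constant (x : float) : (float, string) result =
  if overflows_fp16 x then
    Error
      (Printf.sprintf "constant %g overflows half precision (max %g)" x
         fp16_max)
  else Ok x
```

Returning a `result` rather than raising keeps the decision about how to report the problem with the caller; the actual implementation may well handle this differently.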
Changed
- Removed the `pipes_cc`, `pipes_gccjit` backends (`Pipes_multicore_backend`) -- I had fixed `Pipes_multicore_backend` by using the `poll` library instead of `Unix.select`, but it turns out to be very very slow.
- Changed the `%cd` block comment syntax `~~` to allow detailed structuring. Rewrote `Train.grad_update` to use the `%cd` syntax.
- Made `Train.sgd_one` slightly more thrifty: `p =- learning_rate *. sgd_delta` --> `p =- learning_rate * sgd_delta ~logic:"."` without the inline tensor expression.
Fixed
- Log levels related de-confusion:
  - Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
  - Properly restore `log_level` and inform about its setting.
  - By default do not log from tests.
  - `debug_log_from_routines` should only happen when `log_level > 1`.
- Bugs in `Multicore_backend`: `await` was not checking queue emptiness, `worker`'s `Condition.broadcast` was non-atomically guarded (doesn't need to be), possible deadloop due to the lockfree queue -- now replaced with `saturn_lockfree`.
- Reduced busy-waiting inside `c_compile_and_load`, propagating compilation errors now instead of infinite loop on error.
- Fixed loss of significant digits for small numbers when outputting files.
- Added missing mixed-precision conversions in the `C_syntax` backend builder.
- Restored the functionality of debug logging from the CUDA backend.
- Always reinitialize global state at the beginning of `let%expect_test`, to make them more deterministic.
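
The loss of significant digits for small numbers is a common pitfall when floats are written as text. The changelog does not say how the fix was implemented, so the following is only a sketch of the general failure mode in plain OCaml, with hypothetical helper names `print_fixed` and `print_general`: fixed-point formatting with `%f` keeps six digits after the decimal point, which truncates small magnitudes to zero, while `%.17g` keeps enough significant digits to read the value back unchanged.

```ocaml
(* Hypothetical helpers, not OCANNL's code. *)

(* Fixed-point formatting: six digits after the decimal point, so the
   significant digits of small magnitudes are lost. *)
let print_fixed (x : float) : string = Printf.sprintf "%f" x

(* General formatting with 17 significant digits: enough for a 64-bit
   float to survive a round trip through text. *)
let print_general (x : float) : string = Printf.sprintf "%.17g" x

let () =
  let small = 1.5e-9 in
  (* The fixed-point rendering reads back as zero. *)
  assert (float_of_string (print_fixed small) = 0.0);
  (* The general rendering reads back as the original value. *)
  assert (float_of_string (print_general small) = small)
```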