
README update for v0.2.1.
lukstafi committed Jul 21, 2023
1 parent cbec799 commit 5f384dc
Showing 1 changed file with 8 additions and 8 deletions.
README.md
@@ -30,15 +30,14 @@ Warning disclaimer: this project is still "not announced". The features describe

## Future milestones

-* **v0.2.1 cuda**: a super-naive no-model-parallelism CUDA backend.
-* **v0.2.2 tiling**: a slightly-less-naive CUDA backend with some model parallelism, and maybe tiling.
-* **v0.3 LLVM and Triton**:
-  * **v0.3.1 triton-C**: a Triton backend.
-  * **v0.3.2 llvm**: an LLVM backend as an alternative to the GCCJIT backend. This one would be a more CPU-centric release. Blocked by LLVM 17 being released (i.e. aiming to start work around end of July).
-  * **v0.3.2 triton-llvm**: an LLVM-based Triton backend.
-* **v0.4 usability**: examples covering most of Andrej Karpathy's "Neural Networks Zero to Hero" series; data loading; checkpointing.
+* **v0.2.2 reliable speedups**: remove param update races; for Cuda, apply within-block parallelism on a per-loop basis (see the sketch after this list).
+* **v0.3 usability**: examples covering most of Andrej Karpathy's "Neural Networks Zero to Hero" series; data loading; checkpointing.
+* **v0.4 LLVM and Triton?**:
+  * **v0.4.1 triton-C**: a Triton backend.
+  * **v0.4.2 llvm**: an LLVM backend as an alternative to the GCCJIT backend; a more CPU-centric release. Blocked on the LLVM 17 release (i.e. aiming to start work around the end of July).
+  * **v0.4.3 triton-llvm**: an LLVM-based Triton backend.
* **v0.5 documentation**: `.mli` files and maybe more documentation.
-* **v0.6 scale**: basic distributed computation; runtime-autotuning optimization settings; data ingestion.
+* **v0.6 scale**: model parallelism; data ingestion; maybe basic distributed computation; maybe autotuning optimization settings.
* **v1 completeness**: whatever not-yet-implemented features that still seem needed and impact the framework design. (E.g. at the time of v0.1.X, convolutions, reshaping, concatenation are not easily expressible.)
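
A minimal sketch of the race that the **v0.2.2 reliable speedups** milestone targets, under assumptions not taken from the README (hypothetical kernel names, a flat parameter layout, one Cuda block per "task id"); this is not OCANNL's generated code:

```cuda
// Racy version: every block handles one "task id", but all blocks update the
// same parameters. The read-modify-write hidden in `-=` can interleave across
// blocks, silently dropping gradient contributions.
__global__ void update_params_racy(float *params, const float *grads,
                                   float lr, int n_params) {
  // Within-block parallelism applied per loop: threads stride over params.
  for (int p = threadIdx.x; p < n_params; p += blockDim.x)
    params[p] -= lr * grads[blockIdx.x * n_params + p];  // races across blocks
}

// One conventional fix: accumulate per-task gradients atomically first, then
// apply the parameter step in a separate, race-free kernel afterwards.
__global__ void accumulate_grads(float *total_grad, const float *grads,
                                 int n_params) {
  for (int p = threadIdx.x; p < n_params; p += blockDim.x)
    atomicAdd(&total_grad[p], grads[blockIdx.x * n_params + p]);
}
```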

### Releases
@@ -47,6 +46,7 @@ For details, see [CHANGES](CHANGES.md).

* **v0.2 inching toward GPU**:
  * **v0.2.0 stack-as-device**: for multicore CPU, improve cache locality and reduce cache contention by treating the C function stack as the "device memory" (see the first sketch after this list).
+  * **v0.2.1 naive-cuda**: a Cuda backend where "task id" becomes parallelization over blocks, and a new dedicated axis "sample num" becomes parallelization over threads in a block (see the second sketch after this list).
* **v0.1 GCCJIT backend**:
  * **v0.1.2**: multicore computations using a thread-local "task id" index.
  * **v0.1.1**: inlining scalar constants, improved inlining for virtual nodes.
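
A minimal sketch of the **v0.2.0 stack-as-device** idea, assuming a hypothetical function name and a toy computation (not OCANNL's generated code): each task's scratch tensors live as local arrays inside the generated C function, so distinct cores work in distinct stack frames instead of contending on shared heap buffers.

```cuda
// Plain C in the style a GCCJIT-like backend might emit (also compiles as
// CUDA host code). The "device memory" here is the function's own stack frame.
#define HIDDEN 64
float task_step(const float *input, int n_inputs, const float *weights) {
  float hidden[HIDDEN];  // per-task scratch: lives on this task's stack
  for (int h = 0; h < HIDDEN; ++h) {
    hidden[h] = 0.0f;
    for (int i = 0; i < n_inputs; ++i)
      hidden[h] += weights[h * n_inputs + i] * input[i];
  }
  float out = 0.0f;  // reduce locally; nothing here is shared across tasks
  for (int h = 0; h < HIDDEN; ++h)
    out += hidden[h];
  return out;
}
```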
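
A minimal sketch of the **v0.2.1 naive-cuda** mapping, with hypothetical kernel and parameter names (not OCANNL's generated code): the former multicore "task id" index selects the Cuda block, and the dedicated "sample num" axis selects the thread within that block.

```cuda
// Toy per-sample computation standing in for a generated tensor operation.
__global__ void per_sample_step(const float *inputs, float *outputs,
                                int samples_per_task, int n_features) {
  int task_id = blockIdx.x;      // "task id": parallelization over blocks
  int sample_num = threadIdx.x;  // "sample num": threads within a block
  if (sample_num >= samples_per_task) return;
  int sample = task_id * samples_per_task + sample_num;
  float acc = 0.0f;
  for (int f = 0; f < n_features; ++f)
    acc += inputs[sample * n_features + f];
  outputs[sample] = acc;
}
```

Launched as `per_sample_step<<<n_tasks, samples_per_task>>>(...)`, the grid dimension plays the role of the old multicore task pool, while the block dimension covers the new "sample num" axis.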
