#1BRC https://github.com/gunnarmorling/1brc
1 Billion rows 13 GB on Disk
- JDK21
- No native Code
- GraalVM allowed
- No third-party libraries
We need to
- process all lines
- Parse line into Station / Value
- Aggreagte Value / Store with Station
- Sort by Station
Records for Measurement/ResultRow Custom Collector Files#lines (...) Streams / Grouping TreeMap for Sorting
time ./calculate_average_baseline.sh Runtime: 4:07*
*) on my Machine
CalculateAverage_alesj
28 Lines!
- Use built-in DoubleSummaryStatistics
- Parallel Streams
time ./calculate_average_alesj.sh Runtime: 2:10*
CalculateAverage_khmarbaise
- Parallel Streams
- Stream Pipeline
- Sorting in Stream
- Double summarize statistics
time ./calculate_average_khmarbaise.sh Runtime: 1:45*
- Load data into Memory (max chunck size 2GB!) -> int index
- Process data chunks in parallel
How:
- FileChannel
- via RandomAccessFile#getChannel(..)
- FileChannel.open(..)
- FileInputStream#getChannel)
- File MappedByteBuffer mappedByteBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, start, Math.min(CHUNK_SIZE, size - start));
CalculateAverage_bjhara
Runtime: 0:40*
Can we do better? Flame Graph
jbang --javaagent=ap-loader@jvm-profiling-tools/ap-loader=start,event=cpu,file=profile.html src/main/java/dev/morling/onebrc/CalculateAverage_bjhara.java
open profile.html
How about tuning Double.parseDouble(...) ?
time ./calculate_average_bjhara.sh Runtime: 0:32*
JEP 454: Foreign Function & Memory API
import java.lang.foreign.Arena; import java.lang.foreign.MemorySegment; import java.lang.foreign.ValueLayout;
CalculateAverage_anitasv
Map whole File into memory! -> no 2GB limit anymore yay!
MemorySegment mmapMemory = fileChannel.map( FileChannel.MapMode.READ_ONLY, 0, fileSize, Arena.global());
Specialized custom HashMap -> FastHashMap
Parse "double" from raw bytes encoded as long into an int
time ./calculate_average_anitasv.sh Runtime: 0:12*
CalculateAverage_arjenw
- Custom HashMap
- Custom double parsing as int from bytes
Short solution, very readable, no UNSAFE so far!
time ./calculate_average_arjenw.sh Runtime 0:09*
"You don't have to be an engineer to be be a racing driver, but you do have to have Mechanical Sympathy." Jackie Stewart, racing driver
The "Mechanical Sympathy" movement calls for developing this understanding before it's too late and to upfront design the applications taking into account how the machine is going to work on it.
Mechanical Sympathy: Understanding the Hardware Makes You a Better Developer
Tool to Analyze Machine Code generated by the JIT compiler
java --add-modules jdk.incubator.vector --enable-preview -jar ~/dev/tools/jitwatch-ui-1.4.9-shaded-linux-x64.jar
Sandbox simple inling
- SIMD
- ILP
- Branch Prediction
Single Instruction Multiple Data
JEP 448: Vector API (Sixth Incubator)
Introduce an API to express vector computations that reliably compile at runtime to optimal vector instructions on supported CPU architectures, thus achieving performance superior to equivalent scalar computations.
SIMD operations
A vector computation consists of a sequence of operations on vectors. A vector comprises a (usually) fixed sequence of scalar values, where the scalar values correspond to the number of hardware-defined vector lanes.
Vectorized assembly instructions movq -> vmovq
SIMD Operations via Java Vector API
CalculateAverage_chrisbellew
Show vector operations:
java --add-modules jdk.incubator.vector --enable-preview -jar ~/dev/tools/jitwatch-ui-1.4.9-shaded-linux-x64.jar
Runtime: 0:07*
time ./calculate_average_ChrisBellew.sh
Instruction Level Parallelism
Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.
ILP must not be confused with concurrency. In ILP there is a single specific thread of execution of a process. On the other hand, concurrency involves the assignment of multiple threads to a CPU's core in a strict alternation, or in true parallelism if there are enough CPU cores, ideally one core for each runnable thread.
Example: https://en.wikipedia.org/wiki/Instruction-level_parallelism
dev.morling.onebrc.CalculateAverage_ianopolousfast#parseStats
lineSize1, lineSize2 ...
time ./calculate_average_ianopolousfast.sh
Runtime: 0:05*
- Helping Branch Prediction -> Brancheless code
dev.morling.onebrc.CalculateAverage_SamuelYvon#branchlessMax dev.morling.onebrc.CalculateAverage_SamuelYvon#branchlessMin
- Finding the ";" splitter
String line = "Los Angeles;14.0"; String[] items = line.split(";"); String station = items[0]; double value = Double.parseDouble(items[1]);
Can we do better?
- Find character bytes within long byte sequence dev.morling.onebrc.CalculateAverage_gonix.Aggregator#valueSepMark dev.morling.onebrc.CalculateAverage_abeobk#getSemiCode dev.morling.onebrc.CalculateAverage_abeobk#getLFCode
- Parse number as long from raw long bytes (originally from merykitty) dev.morling.onebrc.CalculateAverage_abeobk#num
time ./calculate_average_abeobk.sh
Runtime: 0:03*
- Unsafe
- Vector API
- Foreign Memory API
- Bit level hacks
Unsafe Mechanics
private static final Unsafe UNSAFE;
static {
try {
Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
theUnsafe.setAccessible(true);
UNSAFE = (Unsafe) theUnsafe.get(Unsafe.class);
}
catch (NoSuchFieldException | IllegalAccessException e) {
throw new RuntimeException(e);
}
}
time ./calculate_average_merykittyunsafe.sh Runtime: 0.02*
./prepare_thomaswue.sh
Runtime: 0:01*
- Alternative Implementations
- AWK gunnarmorling/1brc#171
- Go gunnarmorling/1brc#67
- Oracle Database (gunnarmorling/1brc#707)
- Cobol: gunnarmorling/1brc#481
- Implementation in C with AVX2 instructions (Fastest solution! 0.5s!!!) gunnarmorling/1brc#710
- You can already quite far with clean idiomatic Java Code
- New APIs open up more possibilities Foreign Memory API, Vector API
- Measure, don't guess!
- Strive the balance between optimization and maintainability
- Java is not slow!