A C/C++ header file that converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics.
sse2neon is a translator of Intel SSE (Streaming SIMD Extensions) intrinsics to Arm NEON. It shortens the time needed to get a working program on Arm, which can then be used to extract profiles and to identify hot paths in the code.
The header file sse2neon.h contains several of the functions provided by Intel intrinsic headers such as <xmmintrin.h>, implemented with NEON-based counterparts that reproduce the exact semantics of the intrinsics.
Header file | Extension |
---|---|
<mmintrin.h> | MMX |
<xmmintrin.h> | SSE |
<emmintrin.h> | SSE2 |
<pmmintrin.h> | SSE3 |
<tmmintrin.h> | SSSE3 |
<smmintrin.h> | SSE4.1 |
<nmmintrin.h> | SSE4.2 |
<wmmintrin.h> | AES |
sse2neon aims to support the SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and AES extensions.
To deliver NEON-equivalent intrinsics for all widely used SSE intrinsics, sse2neon relies on two kinds of mapping. Some SSE intrinsics have a direct mapping to a single NEON intrinsic. Others lack a 1-to-1 mapping, which means their equivalents are implemented with several NEON intrinsics.
For example, the SSE intrinsic _mm_loadu_si128 has a direct NEON mapping (vld1q_s32), but the SSE intrinsic _mm_maddubs_epi16 has to be implemented with 13+ NEON instructions.
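To make the second case concrete, here is a scalar C model of what _mm_maddubs_epi16 computes, following the semantics documented for that intrinsic: each unsigned byte of the first operand is multiplied by the matching signed byte of the second, and adjacent products are added with signed 16-bit saturation. This is only an illustrative sketch, not the code in sse2neon.h, and the names maddubs_epi16_ref and saturate_i16 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate value to the signed 16-bit range. */
static int16_t saturate_i16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t) v;
}

/* Scalar model of _mm_maddubs_epi16 (hypothetical helper name).
 * a:   16 unsigned 8-bit lanes
 * b:   16 signed 8-bit lanes
 * out: 8 signed 16-bit lanes, each the saturated sum of two adjacent products
 */
static void maddubs_epi16_ref(const uint8_t a[16], const int8_t b[16],
                              int16_t out[8])
{
    assert(a && b && out);
    for (int i = 0; i < 8; i++) {
        /* Widen to 32 bits so neither the products nor their sum can overflow. */
        int32_t even = (int32_t) a[2 * i] * (int32_t) b[2 * i];
        int32_t odd = (int32_t) a[2 * i + 1] * (int32_t) b[2 * i + 1];
        out[i] = saturate_i16(even + odd);
    }
}
```

The widening multiply, the pairwise add, and the saturation are separate steps in this model, which is one way to see why a single NEON intrinsic does not cover the operation.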
- Put the file sse2neon.h into your source code directory.
- Locate the following SSE header files included in the code:
    #include <xmmintrin.h>
    #include <emmintrin.h>
  {p,t,s,n,w}mmintrin.h should be replaceable, but the coverage of these extensions might be limited.
- Replace them with:
    #include "sse2neon.h"
- Explicitly specify platform-specific options to gcc/clang compilers.
  - On ARMv8-A targets, you should specify the following compiler option (remove crypto and/or crc if your architecture does not support the cryptographic and/or CRC32 extensions):
      -march=armv8-a+fp+simd+crypto+crc
  - On ARMv7-A targets, you need to append the following compiler option:
      -mfpu=neon
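
One way to apply these steps while keeping the x86 build intact is to select the header by target architecture. The sketch below assumes sse2neon.h sits next to the source file, as described in the first step. The helper add4_i32 is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Use sse2neon.h on Arm targets and the native SSE headers elsewhere. */
#if defined(__arm__) || defined(__aarch64__) || defined(_M_ARM64)
#include "sse2neon.h"
#else
#include <xmmintrin.h>
#include <emmintrin.h>
#endif

/* Hypothetical helper: add four 32-bit integers lane by lane.
 * The same SSE2 intrinsics are used on both architectures. */
static void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *) a);
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) out, _mm_add_epi32(va, vb));
}
```

On an Arm target this file would be compiled with the options listed above, and the calling code needs no changes.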
Considering the balance between correctness and performance, sse2neon recognizes the following compile-time configurations:
- SSE2NEON_PRECISE_MINMAX: Enable precise implementation of _mm_min_ps and _mm_max_ps. If you need consistent results such as NaN special cases, enable it.
- SSE2NEON_PRECISE_DIV: Enable precise implementation of _mm_rcp_ps and _mm_div_ps by additional Newton-Raphson iteration for accuracy.
- SSE2NEON_PRECISE_SQRT: Enable precise implementation of _mm_sqrt_ps and _mm_rsqrt_ps by additional Newton-Raphson iteration for accuracy.
- SSE2NEON_PRECISE_DP: Enable precise implementation of _mm_dp_pd. When the conditional bit is not set, the corresponding multiplication would not be executed.

The above are turned off by default, and you should define the corresponding macro(s) as 1 before including sse2neon.h if you need the precise implementations.
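
The Newton-Raphson iteration mentioned for SSE2NEON_PRECISE_DIV can be sketched in scalar C. Starting from a coarse estimate x of 1/d, each step computes x * (2 - d * x), which roughly squares the relative error. This is a minimal model of the idea only: it is not the NEON code in sse2neon.h, and the names refine_reciprocal and reciprocal_nr are hypothetical.

```c
#include <assert.h>

/* One Newton-Raphson step for the reciprocal of d: x' = x * (2 - d * x). */
static float refine_reciprocal(float d, float x)
{
    return x * (2.0f - d * x);
}

/* Hypothetical helper: refine a coarse estimate of 1/d a given number of times.
 * More steps cost more instructions but give a more accurate result, which is
 * the trade-off the PRECISE macros expose. */
static float reciprocal_nr(float d, float estimate, int steps)
{
    assert(steps >= 0);
    float x = estimate;
    for (int i = 0; i < steps; i++) {
        x = refine_reciprocal(d, x);
    }
    return x;
}
```

For example, with d = 3 and a starting estimate of 0.3, one step gives 0.33 and a second step gives about 0.3333, so each extra iteration buys accuracy at the cost of a few more operations.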
sse2neon provides a unified interface for developing test cases. These test cases are located in the tests directory, and the input data is specified at runtime. Use the following command to run the test cases:
$ make check
You can specify a GNU toolchain for cross compilation as well. QEMU should be installed in advance.
$ make CROSS_COMPILE=aarch64-linux-gnu- check # ARMv8-A
or
$ make CROSS_COMPILE=arm-linux-gnueabihf- check # ARMv7-A
Check the details via Test Suite for SSE2NEON.
Here is a partial list of open source projects that have adopted sse2neon for Arm/Aarch64 support.
- Aaru Data Preservation Suite is a fully-featured software package to preserve all storage media from the very old to the cutting edge, as well as to give detailed information about any supported image file (whether from Aaru or not) and to extract the files from those images.
- aether-game-utils is a collection of cross platform utilities for quickly creating small game prototypes in C++.
- Apache Doris is a Massively Parallel Processing (MPP) based interactive SQL data warehouse for reporting and analysis.
- Apache Impala provides lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.
- Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
- ART is an implementation in OCaml of Adaptive Radix Tree (ART).
- Async is a set of C++ primitives that allows efficient and rapid development in C++17 on GNU/Linux systems.
- avec is a little library for using SIMD instructions on both x86 and Arm.
- bipartite_motif_finder, also known as BMF (Bipartite Motif Finder), is an open source tool for finding co-occurrences of sequence motifs in genomic sequences.
- Blender is the free and open source 3D creation suite, supporting the entirety of the 3D pipeline.
- Boo is a cross-platform windowing and event manager similar to SDL or SFML, with additional 3D rendering functionality.
- CARTA is a new visualization tool designed for viewing radio astronomy images in CASA, FITS, MIRIAD, and HDF5 formats (using the IDIA custom schema for HDF5).
- Catcoon is a feedforward neural network implementation in C.
- compute-runtime, the Intel Graphics Compute Runtime for oneAPI Level Zero and OpenCL Driver, provides compute API support (Level Zero, OpenCL) for Intel graphics hardware architectures (HD Graphics, Xe).
- dab-cmdline provides entries for the functionality to handle Digital audio broadcasting (DAB)/DAB+ through some simple calls.
- DISTRHO is an open-source project for Cross-Platform Audio Plugins.
- EDGE is an advanced OpenGL source port spawned from the DOOM engine, with focus on easy development and expansion for modders and end-users.
- Embree is a collection of high-performance ray tracing kernels. Its target users are graphics application engineers who want to improve the performance of their photo-realistic rendering application by leveraging Embree's performance-optimized ray tracing kernels.
- emp-tool aims to provide a benchmark for secure computation and to allow other researchers to experiment and extend.
- FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers.
- iqtree_arm_neon is the Arm NEON port of IQ-TREE, a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood.
- kram is a wrapper to several popular encoders to and from PNG/KTX files with LDR/HDR and BC/ASTC/ETC2.
- libscapi stands for the "Secure Computation API", providing reliable, efficient, and highly flexible cryptographic infrastructure.
- libmatoya is a cross-platform application development library, providing various features such as common cryptography tasks.
- Madronalib enables efficient audio DSP on SIMD processors with readable and brief C++ code.
- minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.
- MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets.
- MRIcroGL is a cross-platform tool for viewing NIfTI, DICOM, MGH, MHD, NRRD, AFNI format medical images.
- N2 is an approximate nearest neighbor algorithm library written in C++, providing a much faster search speed than other implementations when modeling large datasets.
- niimath is a general image calculator with superior performance.
- OBS Studio is software designed for capturing, compositing, encoding, recording, and streaming video content, efficiently.
- OGRE is a scene-oriented, flexible 3D engine written in C++ designed to make it easier and more intuitive for developers to produce games and demos utilising 3D hardware.
- OpenXRay is an improved version of the X-Ray engine, used in world famous S.T.A.L.K.E.R. game series by GSC Game World.
- parallel-n64 is an optimized/rewritten Nintendo 64 emulator made specifically for Libretro.
- PFFFT does 1D Fast Fourier Transforms, of single precision real and complex vectors.
- PlutoSDR Firmware is the customized firmware for the PlutoSDR that can be used to introduce fundamentals of Software Defined Radio (SDR) or Radio Frequency (RF) or Communications as advanced topics in electrical engineering in a self or instructor-led setting.
- Pygame is cross-platform and designed to make it easy to write multimedia software, such as games, in Python.
- simd_utils is a header-only library implementing common mathematical functions using SIMD intrinsics.
- SMhasher provides comprehensive Hash function quality and speed tests.
- Spack is a multi-platform package manager that builds and installs multiple versions and configurations of software.
- srsLTE is an open source SDR LTE software suite.
- Surge is an open source digital synthesizer.
- XMRig is an open source CPU miner for Monero cryptocurrency.
- SIMDe: fast and portable implementations of SIMD intrinsics on hardware which doesn't natively support them, such as calling SSE functions on ARM.
- CatBoost's sse2neon
- ARM_NEON_2_x86_SSE
- AvxToNeon
- sse2rvv: C header file that converts Intel SSE intrinsics to RISC-V Vector intrinsics.
- POWER/PowerPC support for GCC contains a series of headers simplifying porting x86_64 code that makes explicit use of Intel intrinsics to powerpc64le (pure little-endian mode that has been introduced with the POWER8).
  - implementation: xmmintrin.h, emmintrin.h, pmmintrin.h, tmmintrin.h, smmintrin.h
- Intel Intrinsics Guide
- Arm Neon Intrinsics Reference
- Neon Programmer's Guide for Armv8-A
- NEON Programmer's Guide
- qemu/target/i386/ops_sse.h: Comprehensive SSE instruction emulation in C. Ideal for semantic checks.
- Porting Takua Renderer to 64-bit ARM- Part 1
- Porting Takua Renderer to 64-bit ARM- Part 2
- Comparing SIMD on x86-64 and arm64
sse2neon is freely redistributable under the MIT License.