NEON, also known as the Media Processing Engine (MPE) or Advanced SIMD, is ARM's second-generation general-purpose SIMD (Single Instruction, Multiple Data) vector processing extension. It is available in all Allwinner A1x, A2x, and A3x SoCs.
NEON is an extension of the VFP (Vector Floating Point) unit and is often preferable to plain VFP because it allows more efficient manipulation of matrices and of vector data in general. This is notably useful for processing audio and video data, and it can also be used for high-speed memory copies (128 bits at a time).
ARM's Advanced SIMD extension (aka NEON or "MPE" Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardised acceleration for media and signal processing applications. NEON is included in all Cortex-A8 devices but is optional in Cortex-A9 devices. NEON can execute MP3 audio decoding on CPUs running at 10 MHz and can run the GSM adaptive multi-rate compression (AMR) speech codec at no more than 13 MHz. It features a comprehensive instruction set, separate register files and independent execution hardware. NEON supports 8-, 16-, 32- and 64-bit integer and single-precision (32-bit) floating-point data and SIMD operations for handling audio and video processing as well as graphics and gaming processing. In NEON, the SIMD supports up to 16 operations at the same time. The NEON hardware shares the same floating-point registers as used in VFP. Devices such as the ARM Cortex-A8 and Cortex-A9 support 128-bit vectors but will execute with just 64 bits at a time, whereas newer Cortex-A15 devices can execute 128 bits at once.
Note! Most SIMD operations are better performed with ARM's NEON Advanced SIMD extension than with standard VFPv3, even though all Allwinner SoCs support both.
The ARM® NEON™ general-purpose SIMD engine efficiently processes current and future multimedia formats, enhancing the user experience.
NEON technology can accelerate multimedia and signal processing algorithms such as video encode/decode, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound synthesis by at least 3x the performance of ARMv5 and at least 2x the performance of ARMv6 SIMD. Cleanly architected NEON technology works seamlessly with its own independent pipeline and register file.
NEON technology is a 128-bit SIMD (Single Instruction, Multiple Data) architecture extension for the ARM Cortex™-A series processors, designed to provide flexible and powerful acceleration for consumer multimedia applications, delivering a significantly enhanced user experience. It has 32 registers, each 64 bits wide (with a dual view as 16 registers, each 128 bits wide).
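The dual view means that each 128-bit Q register overlays a pair of 64-bit D registers: Qn occupies D(2n) and D(2n+1). The following is a minimal sketch in C that models this aliasing in plain memory. The type `neon_regfile` and the helpers `q_low_d`, `q_high_d` and `write_q` are hypothetical names introduced only for illustration; they are not part of any ARM or compiler API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the NEON register file: 32 x 64-bit D registers,
 * viewed alternatively as 16 x 128-bit Q registers (Qn = D2n+1 : D2n). */
typedef struct {
    uint64_t d[32];
} neon_regfile;

/* Index of the D register that holds the low 64 bits of Qn. */
static unsigned q_low_d(unsigned q)
{
    assert(q < 16);
    return 2 * q;
}

/* Index of the D register that holds the high 64 bits of Qn. */
static unsigned q_high_d(unsigned q)
{
    assert(q < 16);
    return 2 * q + 1;
}

/* Write a 128-bit value into Qn, given as two 64-bit halves. */
static void write_q(neon_regfile *rf, unsigned q, uint64_t lo, uint64_t hi)
{
    rf->d[q_low_d(q)]  = lo;
    rf->d[q_high_d(q)] = hi;
}
```

For example, writing Q3 through `write_q` changes D6 and D7, so code that mixes 64-bit and 128-bit operations has to keep this overlap in mind.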
NEON instructions perform "Packed SIMD" processing:
- Registers are considered as vectors of elements of the same data type
- Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single precision floating point
- Instructions perform the same operation in all lanes
Diagram illustrating NEON packed SIMD processing
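To make the idea of lanes concrete, here is a small sketch in C that adds two vectors of sixteen unsigned 8-bit elements. When the compiler defines the ACLE macro `__ARM_NEON`, it uses the `arm_neon.h` intrinsics `vld1q_u8`, `vaddq_u8` and `vst1q_u8`; otherwise it falls back to a scalar loop with the same per-lane behaviour, so it also builds on non-ARM hosts. The function name `add_u8x16` is a hypothetical name chosen for this example.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Add two vectors of sixteen unsigned 8-bit lanes, lane by lane.
 * Each lane wraps around independently (modulo 256); no carry
 * propagates from one lane into its neighbour. */
static void add_u8x16(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
#if defined(__ARM_NEON)
    /* One 128-bit add processes all 16 lanes. */
    uint8x16_t va = vld1q_u8(a);
    uint8x16_t vb = vld1q_u8(b);
    vst1q_u8(out, vaddq_u8(va, vb));
#else
    /* Scalar fallback with the same per-lane semantics. */
    for (int i = 0; i < 16; ++i)
        out[i] = (uint8_t)(a[i] + b[i]);
#endif
}
```

With every lane of `a` set to 250 and lane i of `b` set to i, lane 5 of the result is 255 and lane 6 wraps to 0, independently of the other lanes.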
The ARM Cortex™-A series processors with NEON technology, as well as ARM's Mali multimedia hardware solutions are used in multimedia applications ranging from smartphones and mobile computing devices to HDTV.
Using NEON and VFPv3 on Allwinner A10
The compiler supports two different options to control NEON and VFP.
NEON and VFPv3 valid VFP compiler entry:
NEON and VFPv4 valid VFP compiler entry:
The --float_support=VFPv3 option instructs the compiler to generate code that uses the VFPv3 coprocessor for both double and single precision floating point operations. The option also enables the assembler to accept VFPv3 instructions in assembly source. To enable VFPv3, EABI mode must also be enabled through the --abi=eabi option. This is necessary because the calling convention for floating point parameters changes when VFPv3 is enabled, and that convention is only supported in EABI mode.
The --neon option instructs the compiler to automatically vectorize loops to use NEON instructions. To benefit from this option, use --opt_level=2 or higher and generate code for performance with the --opt_for_speed=[3-5] option.
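As an illustration of the kind of loop an auto-vectorizer can usually handle, the sketch below uses a simple counted loop with unit stride and `restrict`-qualified pointers. The function name `saxpy` is only an example, and whether a particular compiler actually emits NEON instructions for it depends on the compiler and on the options described above; this is an assumption, not a guarantee.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for n single-precision elements.
 * 'restrict' promises that x and y do not overlap, which removes an
 * aliasing hazard that could otherwise prevent vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Because the loop works on floating point data, vectorizing it with NEON would require floating point support to be enabled in addition to --neon.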
Valid VFP compiler entries
The compiler includes support for generating vector floating-point (VFP) co-processor instructions through the --float_support=vfp option. The VFP co-processor is available in many variants of ARM11 and higher. The valid vfp entries are:
VFPv3 architecture and instruction set:
VFPv3d16 architecture and instruction set:
VFPv4-SP architecture and instruction set:
FPA endianness is used to store double-precision floating-point values (most significant word occupies the lower memory address):
VFP endianness is used to store double-precision floating-point values (endianness used is that of the memory system). All VFP coprocessors use this endianness to represent doubles:
The compiler supports several modes related to NEON and VFPv3. By default neither NEON nor VFPv3 is enabled. In addition to the default, the following 3 modes are supported:
- VFP enabled without NEON
  - The compiler will generate VFPv3 instructions for single and double precision floating point operations.
- NEON enabled without VFP
  - In this mode the compiler will generate NEON instructions for SIMD integer operations. It will not generate NEON instructions to vectorize floating point operations. Floating point NEON instructions are not allowed without VFP because an integer-only variant of NEON can be implemented. For the NEON unit to support floating point operations, the VFPv3 coprocessor must be present.
- NEON enabled and VFP enabled
  - In this mode the compiler will generate a mix of NEON and VFP instructions. The NEON instructions can be either integer or floating point.
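Source code can check at compile time which of these modes it was built for. The sketch below uses the ACLE feature macros `__ARM_NEON` and `__ARM_FP`, which GCC and Clang define; this is an assumption about the toolchain, and the compiler whose options are described here may expose different predefined macros. The function name `simd_fp_mode` is hypothetical.

```c
/* Report which NEON/VFP combination this translation unit was compiled
 * for, based on ACLE feature macros (assumed: GCC or Clang). */
static const char *simd_fp_mode(void)
{
#if defined(__ARM_NEON) && defined(__ARM_FP)
    return "NEON and VFP";
#elif defined(__ARM_NEON)
    return "NEON only";
#elif defined(__ARM_FP)
    return "VFP only";
#else
    return "neither";
#endif
}
```

On a non-ARM host none of the macros are defined, so the function reports "neither".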
VFPv3 vs. NEON performance
A common question about the ARM compiler's NEON support is how to get more floating point operations to run on the NEON unit instead of the VFPv3 unit. This is desirable because on the Allwinner A10 the VFPv3 coprocessor is not pipelined, whereas NEON is. The compiler will always use VFP instructions for scalar floating point operations, even if the --neon option is used. However, the hardware is capable of issuing VFP instructions on the NEON coprocessor if the following conditions are met:
- The instruction must be a single precision data processing instruction
- The processor must be in flush-to-zero mode. In this mode the processor will treat all denormalized numbers as zero.
- The processor must be in default NaN mode. In this mode any operation whose result is a NaN returns the default NaN, regardless of the NaN inputs, whereas in full-compliance mode the returned NaN follows the rules in the ARM Architecture Reference Manual.
- The FPEXC.EX bit must be set to 0. This tells the processor that there is no additional state that must be handled by a context switch.
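The flush-to-zero and default NaN modes are controlled by the FZ (bit 24) and DN (bit 25) fields of the FPSCR register. The following is a minimal sketch in C, assuming a GNU-style toolchain (GCC or Clang) for the inline assembly, which is not the compiler syntax described above. The names `FPSCR_FZ`, `FPSCR_DN`, `fpscr_set_fz_dn` and `enable_fz_dn` are hypothetical. The sketch does not touch FPEXC.EX, and on targets other than 32-bit ARM with hardware floating point the register access compiles to nothing.

```c
#include <stdint.h>

#define FPSCR_FZ (UINT32_C(1) << 24)  /* flush-to-zero mode */
#define FPSCR_DN (UINT32_C(1) << 25)  /* default NaN mode */

/* Return the given FPSCR value with flush-to-zero and default NaN
 * enabled; all other bits are left unchanged. */
static uint32_t fpscr_set_fz_dn(uint32_t fpscr)
{
    return fpscr | FPSCR_FZ | FPSCR_DN;
}

/* Enable both modes on the current core. The register access is only
 * compiled for 32-bit ARM with hardware floating point and GNU-style
 * inline assembly; elsewhere this function does nothing. */
static void enable_fz_dn(void)
{
#if defined(__arm__) && defined(__ARM_FP) && defined(__GNUC__)
    uint32_t fpscr;
    __asm__ volatile("vmrs %0, fpscr" : "=r"(fpscr));
    fpscr = fpscr_set_fz_dn(fpscr);
    __asm__ volatile("vmsr fpscr, %0" : : "r"(fpscr));
#endif
}
```

Note that both modes trade strict IEEE 754 behaviour for speed: denormalized values are treated as zero and NaN payloads are lost, so this is only appropriate for code that can tolerate those differences.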
Thus most SIMD operations are better performed with ARM's NEON Advanced SIMD extension than with standard VFPv3, even though all Allwinner SoCs support both.
- http://elinux.org/images/4/40/Elc2011_anderson_arm.pdf ARM NEON instruction set and why you should care
- http://www.arm.com/products/processors/cortex-a/cortex-a9.php Cortex-A9 Processor
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409f/Chdceejc.html About the Cortex-A9 NEON MPE