ARM/Vector Floating Point Unit
VFP (Vector Floating Point) is a single-precision and double-precision FPU (Floating Point Unit) coprocessor available in most of ARM's CPUs in one version or another, and is more commonly known as a math coprocessor.
- Allwinner A10 / A10s / A13 SoCs feature a VFPv3 (Vector Floating Point v3) coprocessor
- Allwinner A20 and A31 SoCs feature a VFPv4 (Vector Floating Point v4) coprocessor that is also backwards compatible with VFPv3 (Vector Floating Point v3)
VFPv3 and VFPv4 class VFPUs (Vector Floating Point Units) provide the FPU (Floating Point Unit) hardware support in Linux; select CONFIG_VFP in the kernel configuration.
ARM EABI floating-point variants: software (soft), hardware (hard, Debian-based armhf), and a mix of both (softfp, armel). The soft and softfp variants run with some overhead relative to a full hard-float system.
With VFP support in the kernel you may chroot from one to the other, without still running softfp binaries.
VFPv3-D16 meets Debian based hard float port requirements.
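As a rough illustration of the three float ABIs, the sketch below reports which one a C file was compiled for. It assumes the predefined macros `__ARM_PCS_VFP`, `__SOFTFP__` and `__arm__` as provided by GCC and Clang; other compilers may spell these differently, and the function name `float_abi` is made up for this example.

```c
/* Report the floating-point calling convention this file was compiled for.
 * Assumption: GCC/Clang style predefined macros are available. */
static const char *float_abi(void)
{
#if defined(__ARM_PCS_VFP)
    return "hard";     /* FP arguments are passed in VFP registers (armhf) */
#elif defined(__arm__) && defined(__SOFTFP__)
    return "soft";     /* no FPU instructions, FP is done in library code */
#elif defined(__arm__)
    return "softfp";   /* FPU instructions, but soft-float calling convention */
#else
    return "non-ARM";  /* built for another architecture */
#endif
}
```

Printing the result with `printf("%s\n", float_abi());` from a binary built inside a given chroot is a quick way to see which ABI that userland's toolchain targets by default.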
Note! Most SIMD operations are better performed by the NEON (FPU) extensions provided by ARM, which are supported by all Allwinner SoCs.
Overview
A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root. Some systems (particularly older, microcode-based architectures) can also perform various transcendental functions such as exponential or trigonometric calculations, though in most modern processors these are done with software library routines.
In most modern general purpose computer architectures, one or more FPUs are integrated with the CPU; however many embedded processors, especially older designs, do not have hardware support for floating-point operations.
In the past, some systems have implemented floating point via a coprocessor rather than as an integrated unit; in the microcomputer era, this was generally a single integrated circuit, while in older systems it could be an entire circuit board or a cabinet.
Not all computer architectures have a hardware FPU. In the absence of an FPU, many FPU functions can be emulated, which saves the added hardware cost of an FPU but is significantly slower. Emulation can be implemented on any of several levels: in the CPU as microcode, as an operating system function, or in user space code.
In most modern computer architectures, there is some division of floating-point operations from integer operations. This division varies significantly by architecture; some, like the Intel x86, have dedicated floating-point registers, while some take it as far as independent clocking schemes.
Floating-point operations are often pipelined. In earlier superscalar architectures without general out-of-order execution, floating-point operations were sometimes pipelined separately from integer operations. Since the early and mid-1990s, many microprocessors for desktops and servers have more than one FPU.
When a CPU is executing a program that calls for a floating-point operation, there are three ways to carry it out:
- A floating-point unit emulator (a floating-point library)
- Add-on FPU
- Integrated FPU
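To give a feel for what a floating-point emulator has to do using integer instructions only, here is a minimal sketch of its first step: unpacking an IEEE 754 single-precision value into sign, exponent and fraction. It assumes that `float` is the 32-bit IEEE 754 format; the names `f32_fields` and `f32_unpack` are invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Fields of an IEEE 754 single-precision value. */
struct f32_fields {
    uint32_t sign;      /* 0 = positive, 1 = negative */
    uint32_t exponent;  /* biased exponent, 0..255 (the bias is 127) */
    uint32_t fraction;  /* the 23 stored fraction bits */
};

/* Split a float into its fields using only integer operations. */
static struct f32_fields f32_unpack(float value)
{
    uint32_t bits;
    struct f32_fields f;

    memcpy(&bits, &value, sizeof bits);  /* reinterpret the bit pattern */
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

A full emulator would go on to align, add or multiply the fractions, renormalise and round, which is why emulation is so much slower than a hardware FPU.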
VFP versions:
- VFPv1 - obsoleted by ARM
- VFPv2 - optional on ARMv5 and ARMv6 cores
  - Supports standard FPU arithmetic (add, sub, neg, mul, div), full square root
  - 16 64-bit FPU registers
- VFPv3[-D32]
  - Broadly compatible with VFPv2 but adds
    - Exception-less FPU usage
    - 32 64-bit FPU registers as standard
    - VCVT instructions to convert between scalar, float and double
    - Immediate mode for VMOV, so that constants can be loaded into FPU registers
- VFPv3-D16
  - As above, but with only 16 64-bit FPU registers
- VFPv3-F16 variant
  - Uncommon, but supports IEEE 754-2008 half-precision (16-bit) floating point
- VFPv4
  - Cortex-A5
  - Has a "fused multiply-accumulate"
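The practical difference made by a fused multiply-accumulate is that the product is not rounded before the addition. The sketch below illustrates this in portable C. It does not use the VFPv4 instruction itself: double arithmetic stands in for the fused operation, which is an approximation chosen for this example (in real code the C99 function `fmaf` from `<math.h>` is the standard way to request a fused operation). Both function names are hypothetical.

```c
/* Multiply-add with a rounding step after the multiply. */
static float mul_add_separate(float a, float b, float c)
{
    volatile float product = a * b;  /* volatile forces rounding to float here */
    return product + c;
}

/* Multiply-add that skips the intermediate float rounding.
 * Stand-in for a fused operation: the product of two floats is exact
 * in double (24 + 24 significand bits fit into 53). */
static float mul_add_single_rounding(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact result is 2^-24. The separate version loses it, because the product rounds to 1 + 2^-11 and the sum becomes 0, while the single-rounding version returns 2^-24.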
ARM Floating Point (VFP)
ARM Floating Point architecture (VFP) provides hardware support for floating point operations in half-, single- and double-precision floating point arithmetic. It is fully IEEE 754 compliant with full software library support.
The floating point capabilities of the ARM VFP offer increased performance for floating point arithmetic used in automotive powertrain and body control applications, imaging applications such as scaling, transforms and font generation in printing, 3D transforms, FFT and filtering in graphics. The next generation of consumer products such as Internet appliances, set-top boxes, and home gateways, can directly benefit from the ARM VFP.
VFP Applications
Automotive control applications
Powertrain
ABS, Traction control & active suspension
3D Graphics
Digital consumer products
Set-top boxes, games consoles
Imaging
Laser printers, still digital cameras, digital video cameras
Industrial control systems
Motion controls
Many real-time control applications in the industrial and automotive fields benefit from the dynamic range and precision of floating-point offered by the ARM VFP. Automotive powertrain, anti-lock braking, traction control, and active suspension systems are all mission-critical applications where precision and predictability are essential requirements.
VFP architecture versions
Before the ARMv7 architecture, VFP stood for Vector Floating-point Architecture, used for vector operations.
Provision of hardware floating point is essential for many applications, and can be used as part of a System on Chip (SoC) design flow using high-level design tools (e.g. MATLAB, MATRIXx and LabVIEW) to directly model the system and derive the application code. Using hardware floating point combined with the NEON™ multimedia processing capability, performance of imaging applications such as scaling, 2D and 3D transforms, font generation, and digital filters can be increased.
There have been three main versions of VFP to date:
- VFPv1 is obsolete. Details are available on request from ARM.
- VFPv2 is an optional extension to the ARM instruction set in the ARMv5TE, ARMv5TEJ and ARMv6 architectures.
- VFPv3 is an optional extension to the ARM, Thumb® and ThumbEE instruction sets in the ARMv7-A and ARMv7-R profiles. It is implemented with either thirty-two or sixteen doubleword registers; the terms VFPv3-D32 and VFPv3-D16 distinguish between these two implementation options. VFPv3 can be extended with the half-precision extensions, which provide conversion functions in both directions between half-precision and single-precision floating point.
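For reference, the conversion that the half-precision extension performs in hardware can be sketched in software as follows. This is an illustrative routine, not ARM's implementation; it assumes the IEEE 754 half-precision layout (1 sign bit, 5 exponent bits, 10 fraction bits) and a 32-bit IEEE 754 `float`, and the name `half_to_float` is made up.

```c
#include <stdint.h>
#include <string.h>

/* Convert an IEEE 754 half-precision bit pattern to a float. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t frac = h & 0x3FFu;
    uint32_t bits;
    float result;

    if (exp == 0x1Fu) {                 /* infinity or NaN */
        bits = sign | 0x7F800000u | (frac << 13);
    } else if (exp != 0) {              /* normal: rebias from 15 to 127 */
        bits = sign | ((exp + 112u) << 23) | (frac << 13);
    } else if (frac == 0) {             /* signed zero */
        bits = sign;
    } else {                            /* subnormal: normalise the fraction */
        uint32_t e = 113u;
        while ((frac & 0x400u) == 0) {
            frac <<= 1;
            e--;
        }
        bits = sign | (e << 23) | ((frac & 0x3FFu) << 13);
    }
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

Every half-precision value is exactly representable in single precision, so this direction needs no rounding; the opposite direction does.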
Using NEON and VFPv3 on Allwinner A10
The compiler supports two different options to control NEON and VFPv3.
--float_support=VFPv3 --neon
The --float_support=VFPv3 option instructs the compiler to generate code that utilizes the VFPv3 coprocessor for both double and single precision floating point operations. The option is also used to enable the assembler to accept VFPv3 instructions in assembly source. To enable VFPv3 the EABI mode must also be enabled through the --abi=eabi option. This is necessary because the calling convention for floating point parameters changes when VFPv3 is enabled and that convention is only supported in EABI mode.
The --neon option instructs the compiler to automatically vectorize loops to use the NEON instructions. To benefit from this option you should be using --opt_level=2 or higher and be generating code for performance by using the --opt_for_speed=[3-5] option.
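The kind of loop an auto-vectorizer looks for can be sketched as below: a simple counted loop over arrays with independent iterations. The function name `saxpy` and the use of `restrict` to promise that the arrays do not overlap are choices made for this example; whether a particular compiler and option set actually vectorizes it is not guaranteed.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * `restrict` tells the compiler that x and y do not overlap, which
 * removes one common obstacle to vectorization. */
static void saxpy(size_t n, float a, const float *restrict x,
                  float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Loops with data-dependent exits, function calls in the body, or possible aliasing between input and output are much less likely to be vectorized.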
Valid VFP compiler entries
The compiler includes support for generating vector floating-point (VFP) co-processor instructions through the --float_support=vfp option. The VFP co-processor is available in many variants of ARM11 and higher. The valid vfp entries are:
- vfpv3: VFPv3 architecture and instruction set.
- vfpv3d16: VFPv3-D16 architecture and instruction set.
- vfpv4spd16: VFPv4-SP architecture and instruction set.
- fpalib: FPA endianness is used to store double-precision floating-point values (the most significant word occupies the lower memory address).
- vfplib: VFP endianness is used to store double-precision floating-point values (the endianness used is that of the memory system). All VFP coprocessors use this endianness to represent doubles.
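The difference between the two double layouts comes down to the order of the two 32-bit words. As a hedged sketch, assuming a little-endian memory system in which the bytes inside each word are stored the same way in both layouts, converting an 8-byte double image from one layout to the other is a word swap. The function name is hypothetical.

```c
#include <stdint.h>

/* Exchange the two 32-bit words of an 8-byte double image in place.
 * On a little-endian memory system this converts between the FPA layout
 * (most significant word first) and the VFP layout (least significant
 * word first). Applying it twice restores the original image. */
static void swap_double_words(uint8_t image[8])
{
    for (int i = 0; i < 4; ++i) {
        uint8_t tmp = image[i];
        image[i] = image[i + 4];
        image[i + 4] = tmp;
    }
}
```

For example, the value 1.0 (0x3FF0000000000000) is stored as `00 00 00 00 00 00 F0 3F` in the little-endian VFP layout and as `00 00 F0 3F 00 00 00 00` after the swap.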
Combining options
The compiler supports several modes related to NEON and VFPv3. By default neither NEON nor VFPv3 is enabled. In addition to the default, the following three modes are supported:
- VFP enabled without NEON
  - The compiler will generate VFPv3 instructions for single and double precision floating point operations.
- NEON enabled without VFP
  - In this mode the compiler will generate NEON instructions for SIMD integer operations. It will not generate NEON instructions to vectorize floating point operations. Floating point NEON instructions are not allowed without VFP because an integer-only variant of NEON may be implemented: for the NEON unit to support floating point operations, the VFPv3 coprocessor must be present.
- NEON enabled and VFP enabled
  - In this mode the compiler will generate a mix of NEON and VFP instructions. The NEON instructions can be either integer or floating point.
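Source code can check at compile time which of these modes is in effect. The sketch below assumes the ACLE feature macros `__ARM_FP` and `__ARM_NEON` (plus the older `__ARM_NEON__`) as defined by GCC and Clang; the compiler described here may expose different macros, so treat the names as an assumption. The `UNIT_*` constants and both functions are invented for this example.

```c
/* Bit flags describing which units the compiler was told it may use. */
enum { UNIT_VFP = 1, UNIT_NEON = 2 };

/* Derive the enabled units from predefined macros (assumed ACLE names). */
static int enabled_units(void)
{
    int units = 0;
#if defined(__ARM_FP)
    units |= UNIT_VFP;   /* hardware floating point is available */
#endif
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    units |= UNIT_NEON;  /* Advanced SIMD (NEON) is available */
#endif
    return units;
}

/* Map the flags onto the modes listed above. */
static const char *mode_name(int units)
{
    switch (units) {
    case UNIT_VFP:             return "VFP enabled without NEON";
    case UNIT_NEON:            return "NEON enabled without VFP";
    case UNIT_VFP | UNIT_NEON: return "NEON enabled and VFP enabled";
    default:                   return "neither NEON nor VFP enabled";
    }
}
```

With toolchains where enabling NEON implies VFP, the NEON-only mode may never be reported.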
VFPv3 vs. NEON performance
A common question about the ARM compiler's support for NEON is how to get more floating point operations to run on the NEON unit instead of the VFPv3. This is desirable because, on the Allwinner A10, the VFPv3 coprocessor is not pipelined but the NEON unit is. The compiler will always use VFP instructions for scalar floating point operations, even if the --neon option is used. The hardware is capable of issuing VFP instructions on the NEON coprocessor if the following conditions are met:
- The instruction must be a single precision data processing instruction
- The processor must be in flush-to-zero mode. In this mode the processor will treat all denormalized numbers as zero.
- The processor must be in default NaN mode. In this mode the operation will return the default NaN regardless of the input, whereas in full-compliance mode the returned NaN follows the rules in the ARM Architecture Reference Manual.
- The FPEXC.EX bit must be set to 0. This tells the processor that there is no additional state that must be handled by a context switch.
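The flush-to-zero and default NaN conditions are controlled by the FZ (bit 24) and DN (bit 25) bits of the FPSCR register. The following is a minimal sketch of setting them, assuming a GCC-compatible compiler for the inline assembly; on any other target `enable_fz_dn` compiles to a no-op. The function and macro names are made up for this example. FPEXC is a privileged register and is not touched here.

```c
#include <stdint.h>

#define FPSCR_FZ (UINT32_C(1) << 24)  /* flush-to-zero mode */
#define FPSCR_DN (UINT32_C(1) << 25)  /* default NaN mode */

/* Return the given FPSCR value with flush-to-zero and default NaN set. */
static uint32_t fpscr_with_fz_dn(uint32_t fpscr)
{
    return fpscr | FPSCR_FZ | FPSCR_DN;
}

/* Apply the setting on the running core (32-bit ARM with VFP only). */
static void enable_fz_dn(void)
{
#if defined(__arm__) && defined(__ARM_FP) && defined(__GNUC__)
    uint32_t fpscr;

    __asm__ volatile("vmrs %0, fpscr" : "=r"(fpscr));
    fpscr = fpscr_with_fz_dn(fpscr);
    __asm__ volatile("vmsr fpscr, %0" : : "r"(fpscr));
#endif
}
```

Both modes give up strict IEEE 754 behaviour for denormals and NaN propagation, so this is only appropriate for code that does not depend on either.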
Thus most SIMD operations are better performed by the NEON (FPU) advanced SIMD extensions provided by ARM than by standard VFPv3, even though both are supported by all Allwinner SoCs.
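On a running Linux system, the units detected by the kernel appear as tokens such as `vfp`, `vfpv3`, `vfpv4` and `neon` on the Features line of /proc/cpuinfo. A small helper for testing such a line for a whole-word token might look like this; the function name is hypothetical and reading the file is left to the caller.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if `name` appears as a whole whitespace-separated token
 * in `features`, so that "vfp" does not match inside "vfpv3". */
static bool has_feature(const char *features, const char *name)
{
    size_t len = strlen(name);
    const char *p = features;

    if (len == 0)
        return false;
    while ((p = strstr(p, name)) != NULL) {
        bool starts = (p == features) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return true;
        p += 1;  /* keep searching after this partial match */
    }
    return false;
}
```

For an illustrative line such as `swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt`, the helper finds `vfpv3`, `vfpv4` and `neon`, but not `vfpv3d16`.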