Sunxi devices as NAS

The following article tries to give some hints how to build a sufficient file server or NAS (Network-attached storage) with sunxi devices. The focus is 'classic' LAN based file sharing using protocols like NFS, SMB or AFP and not internet or cloud optimised stuff (FTP, SFTP, SeaFile, ownCloud and the like).

Requirements / which device to choose
3 things are important for a performant NAS:


 * I/O bandwidth (how fast can the server access storage/disks?)
 * Network bandwidth (how fast can clients access the server?)
 * CPU horsepower (is the server able to deliver all accessing clients at maximum speed?)

For normal NAS use cases the first 2 requirements are more important. Some Allwinner SoCs feature a SATA port in addition to their USB ports, some provide 10/100 Mbits/sec EMAC Ethernet and some even Gbit Ethernet (GMAC). Since the A20 is the only SoC capable of both SATA and GMAC (see the comparison of different sunxi SoCs) it's the most interesting choice for a NAS. And while the A80 would also be a great choice due to one USB 3.0 port Linux support for this SoC is still very limited.

Since not every A20 based sunxi device uses GMAC networking (or even Ethernet at all) the following devices are the best choices: A20-OLinuXino-Lime2, Banana Pi, Banana Pro or Banana Pi M1+, Cubietruck, Hummingbird A20, Lamobo R1, Orange Pi, Orange Pi Mini or pcDuino3 Nano (Lite).

Architectural differences
Unlike dedicated server platforms with plenty of CPU horsepower and server features (eg. network adapters with TCP offload engine) CPU performance/behaviour on sunxi devices always directly affects network/NAS performance. A feature called CPU frequency scaling is responsible for adjusting CPU clock speeds according to needs. In case you use the wrong cpufreq settings it's impossible to achieve decent NAS performance (compare with SATA performance influences).

Per connection limits

 * Wi-Fi: Don't trust in marketing numbers like 300 Mbps since reality differs much and Wi-Fi uses a shared medium (the more devices use the same frequency bands the less bandwidth available)
 * Ethernet: different GMAC capable sunxi devices perform different for yet unknown reasons (see comparison of devices)
 * SD Card: limited to max. 16.x MB/s sequential speed on A10/A20, random I/O differs heavily and is only dependant on the SD card in use
 * SATA: max. 45/200 MB/s with some tweaking (see SATA performance influences)
 * USB: 3 independant USB ports each capable of max. 30-35 MB/s. When used concurrently total bandwidth is able to reach or even slightly exceed 100 MB/s

Network/storage not blocking each other
On A10/A20 devices Ethernet, SATA and the 3 USB ports are all connected directly to the SoC and do not have to share bandwidth between (but you will find some devices where this restriction applies to USB connected onboard Wi-Fi). This is a real advantage compared to many other ARM devices where 2 or more of these ports are behind a single USB connection. Compare with every model of Raspberry Pi for example: only one single connection exists between SoC and an internal USB hub with integrated Ethernet controller (LAN9512/LAN9514). All expansions ports share the bandwidth of this single USB 2.0 connection.

SATA and GBit capable ARM alternatives
There are 2 other affordable ARM SoC families available that also feature SATA, GBit Ethernet and even PCIe. Since these focus on different market segments they're way more expensive than Allwinner SoCs:


 * Marvell Kirkwood/Armada (used mainly in NAS boxes, recent models show excellent disk/network performance even when used concurrently, hardware accelerated crypto engine CESA)
 * i.MX6 (used mainly in industrial applications and also some SBC, SATA performance 90/100 MB/s write/read, Gbit Ethernet throughput limited to approx. 400 Mbits/sec, hardware accelerated crypto engine CAAM)

In the meantime also 64-bit ARM SoCs are available that do not only have a single SATA port and GBit Ethernet but that are able to access a couple of disks, provide 10 GbE Ethernet, several PCIe 3.0 lanes and use ECC RAM (eg. AMD's A1100). But they're in a different league regarding price, too.

What's different on sunxi compared to common server platforms (eg. x86)

 * CPU clock speeds always directly influence I/O and network bandwidth
 * Due to the limited CPU power even light workloads might heavily decrease NAS performance
 * The RTL8211 mainly used with A20's GMAC works in a different mode than normal: only used as PHY for the SoC's internal MAC implementation and not combining MAC/PHY in a PCIe attached network adapter but --> different drivers, different features (no WoL for example)
 * No ECC RAM available --> high(er) risk of Bit rotting

When to choose another device?

 * Need for transparent filesystem encryption (Allwinner's crypto engine seems to be slow/buggy and the CPU cores aren't that fast)
 * Need for more than 1 SATA port (port multipliers are unrealiable)
 * Need for data integrity (not possible without ECC RAM )

To be fair: the last two issues apply to nearly all cheap NAS boxes also.

Things to consider when using USB connected storage
USB connected disks are slower than expected (480 Mbps) due to protocol overhead (compare with 8b/10b encoding) and the inefficient BOT mode. If you're using mainline kernel, configured it accordingly and your USB device is USB Attached SCSI capable (an optional feature introduced with USB 3.0 that not only improves sequential transfer speeds but especially random I/O) only the first limitation applies.

When using BOT mode you won't be able to exceed 30 MB/s when accessing a disk connected to a single USB port (maybe 35 MB/s after extensive tuning). Using UASP it is possible to get close to 40 MB/s. Multiple disks on the same USB bus will not only have to share the available bandwidth but tend to partially block each other so overall bandwidth will be a bit lower when concurrent disk accesses happen.

When using USB attached disks it's important to use an enclosure/adapter that is capable of SCSI / ATA Translation (otherwise you won't be able to monitor drive health since no S.M.A.R.T. data can be accessed and no S.M.A.R.T. selftests can be triggered)

Influence of the chosen OS image on NAS performance
Most of the manufacturer supplied OS images for your sunxi device didn't have NAS but instead desktop/GUI useage in mind. This might have some severe implications on achievable fileserver performance. In the following we'll have a look at an extreme example: from all A20 devices tested so far the Banana Pi is able to achieve the highest network throughput: 940/750 Mbits/sec RX/TX measured with iperf, therefore combined SATA/network performance is able to reach 44/72 MB/s (in client -> server direction A20's slow SATA write performance is the bottleneck and responsible for just 44 MB/s and in server -> client direction the slower TX Ethernet throughput limits)

When using a desktop oriented OS image from the manufacturer (featuring an old sunxi 3.4 kernel, ARMv6 libs/userland and some unfavourable settings) performance dropped drastically with distro defaults: just 650/400 Mbits/sec RX/TX with iperf and 27.5/44.5 MB/s combined SATA/network. That's a whopping ~38 percent less compared to the maximum achievable with mainline kernel and NAS optimised settings. For details refer to this "Raspbian vs. Armbian" thread.

This does not just apply to OS images for Banana Pi but to most GUI oriented OS images manufacturers provide. In case you experience bad server performance check possible performance tweaks and if that does not help be prepared to build your own u-boot/kernel or switch to a headless distro that takes care of optimised server settings (eg. Armbian, a Debian based distro that supports all of the recommended A20 devices except of the Hummingbird).

Performance tweaks

 * Check general performance optimisation advises, especially if your rootfs (/tmp and /var/log) is on a slow SD card
 * Adjust CPU frequency scaling settings accordingly (eg. use ondemand always togehter with io_is_busy)
 * Run the device headless or set the framebuffer's resolution, depth and refresh rate to a minimum (increases memory bandwidth)
 * Avoid memory reservations for GPUs since on all sunxi devices CPU and GPU cores have to share memory.
 * Assign eth0 IRQs to cpu1 since irqbalancing neither works on sunxi/ARM nor increases performance when used with network interrupts
 * check out different I/O schedulers (can be set and read out using /sys/block/sda/queue/scheduler). On sunxi deadline seems to be the most performant
 * Use Mainline kernel and mainline U-Boot
 * When using Mainline kernel consider using a modern filesystem like btrfs with transparent file compression
 * Choose a device with more available DRAM if you experience memory shortages (the Cubietruck is available with 2GiB RAM)
 * Do some TCP/IP stack tuning to adjust parameters for GBit Ethernet (especially increasing buffer sizes and queue lenghts)
 * Adjust application settings accordingly (eg. using Samba search for tuning guides and edit the contents of smb.conf)

Benchmarking / Identifying bottlenecks
Always test from bottom to top (local disk performance, network performance, combined network/disk performance). Always have an eye on CPU utilisation especially when you're using only a single/dual core SoC. Consider switching temporarely to the performance cpufreq governor to avoid random test results and false conclusions.

Use the right tools
People experienced with SBCs normally use completely different – and mostly inappropriate – tools/methods to measure I/O and network performance compared to server professionals that do this for a living. Due to this you'll find many questionable benchmark results on the net regardings SBCs as servers. Always check the test methodology used.

Tools/methods that definitely lead to wrong results/assumptions:


 * dd with small filesizes and without approriate flags (testing mainly buffers/caches and therefore RAM not disk)
 * time cp $src $dst (same problem as with dd above)
 * hdparm -tT (tampering disk throughput with memory bandwidth)
 * wget/curl downloads from somewhere (many random unrelated influences on both sides of the network connection and in between)
 * scp/SFTP (random results based on SSH ciphers negotiated between client and server )

Use IOzone, bonnie++, IOmeter and the like with large filesizes (at least twice as much as RAM available) and also test random I/O and not just only sequential transfers. Use network tools that don't tamper network throughput with other stuff like disk performance on one or both sides of the network connection (use iperf/netperf and the like). When you're done testing individually always do a combined test using both synthetic benchmarks and real-world tasks (eg. copying a couple of small and afterwards a really large file between client and sunxi NAS. Always test both directions and keep in mind that copying many small files over the network is always slower than one big file, that in turn might be slower than a few large files transferred in a batch )

Identifying bottlenecks

 * iostat 5
 * htop
 * dstat -clnv --fs --vm --top-bio --top-cpu --top-io

Don't let you fool by platitudes. More CPU horsepower isn't always the key to more performance. Always check %user, %nice and %system individually when you experience performance problems. And keep in mind that %iowait is one key to understand when you're running in I/O bottlenecks

Unresolved issues

 * Slow SATA write performance (if this problem could be solved A20 based devices would be able to easily outperform most cheap GBit capable NAS)
 * GMAC settings (where do the variations in RX direction origin from?)
 * slow Lamobo R1 performance (maybe due to b53 driver quirks?)

= References =