Hardware Reliability Tests
The importance of the hardware diagnostic tools
Because of the diversity of various Allwinner based devices we have to deal with different DRAM, CPU clock speed and voltage settings. They are primarily derived from the fex files found in the vendor provided firmware on NAND and also based on the information retrieved by the meminfo tool. Some dishonest sellers also happily advertise way higher specs than the hardware can actually handle (for example, unrealistic 1.5GHz CPU clock speeds). Moreover, different chips have their own individual voltage and clock frequency tolerances within a certain range. So that the borderline stable settings may be good for one unit and bad for another.
In order to quickly identify obviously misconfigured hardware, we need some easy to use diagnostic tools. There is no magic involved. The general idea behind them is to find some real life or synthetic workloads, which are more likely to cause troubles. Then identify, which parts of these workloads are most problematic and convert them into (hopefully) user friendly test tools. Some of these tools are listed below.
Note: naturally, the majority of devices are expected to pass these tests and you are likely not to get any exciting results. Still this is not a good justification for the "nah, this may happen to anyone, but me" attitude. The experience shows that there are definitely faulty/misconfigured devices out there. If you are experiencing occasional crashes or freezes once in a while, then making sure that these tests pass is strongly recommended before trying anything else.
My device is failing these reliability tests. What to do now?!
First of all, make sure that you are using an up to date u-boot bootloader, an up to date linux kernel and correct hardware description data (script.bin for the sunxi-3.4 kernel or a *.dtb file for the mainline kernel). It may be a matter of some hardware misconfiguration (just like it happened with the cubieboard2/cubietruck before), and the intention is to have all of these problems ironed out eventually.
If the tests are still failing with the up to date software, be sure to report the problem in the linux-sunxi mailing list or in the irc channel. Please also include all the relevant information (the board type, the kernel version, the bootloader version, etc.). Even if you can resolve the problem yourself (by adjusting the voltages or the clock speed), reporting the incident is still rather important to ensure that other people don't encounter the same problem in the future. Thanks!
DRAM
Reliability
The Lima-memtester can be used to check if the DRAM settings are reasonable and dcdc3 voltage is sufficient. If you have a shell access to your device, then you can download a static binary compiled for ARM (or click on the 'Expand' link to see how to compile it from sources):
git clone https://github.com/ssvb/lima-memtester.git cd lima-memtester cmake . make -j2
# The wget option --no-check-certificate is only here to make it work even if the date is set wrong (no RTC battery) wget --no-check-certificate https://github.com/ssvb/lima-memtester/releases/download/static-binary-20150126/lima-memtester chmod +x lima-memtester
The lima-memtester static binary requires only the sunxi-3.4 kernel with the mali kernel module and framebuffer enabled to run. This test does not depend on anything in the userland and should work with any Linux distribution (this also means that it does NOT require the userland Mali binary driver). Just run lima-memtester with root privileges as:
./lima-memtester 100M
If everything is working fine, there should be a spinning cube demo running indefinitely. If something is very wrong, then the test fails after just a few seconds! If something is mildly wrong, this usually gets detected in less than 15-20 minutes. However even running for a few hours can still detect some problems. In the case of troubles, the following symptoms may be observed.
- the system freezes
- the display background starts glowing red (normally it is gray)
For even more confidence, it is a good idea to keep it running overnight (8-10 hours) at least once and if possible, over the weekend.
On the Lime2 page there are some test results.
CPU
Reliability of cpufreq voltage/frequency settings
The following ruby script can run basic reliability tests (correctness of jpeg decompression) for all cpufreq operating points. The kernel config needs to have support for 'userspace' cpufreq governor.
cd /tmp git clone https://github.com/ssvb/cpuburn-arm.git cd cpuburn-arm ./cpufreq-ljt-stress-test
Example output:
Creating './whitenoise-1920x1080.jpg' ... done CPU stress test, which is doing JPEG decoding by libjpeg-turbo at different cpufreq operating points. Testing CPU 0 1200 MHz SKIPPED 1152 MHz SKIPPED 1104 MHz SKIPPED 1056 MHz SKIPPED 1008 MHz ............................................................ OK 960 MHz ............................................................ OK 912 MHz ............................................................ OK 864 MHz ............................................................ OK 816 MHz ............................................................ OK 768 MHz ............................................................ OK 744 MHz ............................................................ OK 720 MHz ............................................................ OK 696 MHz SKIPPED 672 MHz SKIPPED 648 MHz SKIPPED 600 MHz SKIPPED 528 MHz SKIPPED 480 MHz SKIPPED 408 MHz SKIPPED 384 MHz SKIPPED 360 MHz SKIPPED 336 MHz SKIPPED 288 MHz SKIPPED 264 MHz SKIPPED 240 MHz SKIPPED 216 MHz SKIPPED 204 MHz SKIPPED 192 MHz SKIPPED 180 MHz SKIPPED 168 MHz SKIPPED 156 MHz SKIPPED 144 MHz SKIPPED 132 MHz SKIPPED 120 MHz SKIPPED 96 MHz SKIPPED 84 MHz SKIPPED 72 MHz SKIPPED 60 MHz SKIPPED
If voltage is configured wrong for one of the operating points, then data corruption may be detected and reported.
Be aware: Please take the results of this script with a grain of salt. There are border cases in which extended tests show a device might not be stable at certain settings even though they pass the tests in this script. Especially on a multi-core system you may want to run CPU-intensive tasks in the background while running cpufreq-ljt-stress-test in order to keep all cores busy. The cpuburn scripts (see below) or compiling a kernel might be suitable tasks for this end. You may also want to change the duration of the tests by changing the number of test iterations per frequency setting (see line 203 of the script: the default value is 60 - feel free to set a higher value to make this script run longer).
Overheating
If the CPU is overclocked and/or overvolted, then it may overheat and fail under heavy load. To check for the potential CPU overheating (with or without overclocking), it is possible to use the cpuburn tool.
To run this test on the Allwinner A10 hardware:
git clone https://github.com/ssvb/cpuburn-arm.git cd cpuburn-arm gcc -o cpuburn-a8 cpuburn-a8.S ./cpuburn-a8
Or on the Allwinner A20 hardware:
git clone https://github.com/ssvb/cpuburn-arm.git cd cpuburn-arm gcc -o cpuburn-a7 cpuburn-a7.S ./cpuburn-a7
The cpuburn programs are only heating the CPU and are not providing any visible feedback. It may make sense to also run some other non CPU hungry program simultaneously to monitor whether it is still alive or not. Running some Mali400 graphics demo is the best for this purpose. And the GPU is also providing an extra source of heat.
WARNING: if the device is recklessly overclocked/overvolted too much, then some permanent hardware damage may theoretically happen. technical explanation