Benchmarks


This page is not very useful as is. It needs to be split up, removed, or reworked to fit the wiki's current device- and component-centric structure.

A10 Benchmarks

CPU

Linpack

Download this[1] and rename it to linpack.c.

Build
root@linaro-alip:~/benchmarks# cc -Ofast -o linpack linpack.c -lm -mcpu=cortex-a8 -march=armv7-a -mfpu=neon -mfloat-abi=hard -funsafe-math-optimizations -fno-fast-math
linpack.c: In function ‘main’:
linpack.c:78:14: warning: ignoring return value of ‘fgets’, declared with attribute warn_unused_result [-Wunused-result]
Results

-mcpu=cortex-a8 -march=armv7-a -mfpu=neon -mfloat-abi=hard -funsafe-math-optimizations -fno-fast-math

Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      16   0.61  88.52%   6.56%   4.92%  37885.057
      32   1.21  85.12%   2.48%  12.40%  41459.119
      64   2.43  93.83%   2.47%   3.70%  37561.254
     128   4.86  91.77%   2.47%   5.76%  38381.368
     256   9.70  92.06%   2.89%   5.05%  38173.000
     512  19.41  91.29%   2.47%   6.23%  38634.432

-mcpu=cortex-a8 -mtune=cortex-a8 -march=armv7-a -mfpu=neon -mfloat-abi=hard -funsafe-math-optimizations -fomit-frame-pointer -ffast-math -funroll-loops -funsafe-loop-optimizations

Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      16   0.53  90.57%   1.89%   7.55%  44843.537
      32   1.05  90.48%   3.81%   5.71%  44390.572
      64   2.13  90.14%   2.35%   7.51%  44615.905
     128   4.23  90.54%   3.07%   6.38%  44390.572
     256   8.46  90.19%   2.84%   6.97%  44672.596
     512  17.03  90.55%   2.76%   6.69%  44250.892
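To compare the two flag sets with a single number, the KFLOPS column can be averaged per run. The following is a minimal sketch, not part of the original test procedure; the function names `linpack_mean_kflops` and `percent_gain` are made up for illustration, and the parser assumes the six-column result rows shown above.

```shell
# Average the KFLOPS column (last field) of linpack result rows read from stdin.
# Only rows with six fields whose first field is an integer repetition count are used,
# so headers and the "Memory required" lines are skipped.
linpack_mean_kflops() {
    awk '$1 ~ /^[0-9]+$/ && NF == 6 { sum += $NF; n++ }
         END { if (n) printf "%.1f\n", sum / n }'
}

# Percentage change from a baseline value ($1) to a new value ($2).
percent_gain() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Hypothetical usage, assuming each run was saved to its own file:
#   base=$(linpack_mean_kflops < linpack-flags1.txt)
#   fast=$(linpack_mean_kflops < linpack-flags2.txt)
#   percent_gain "$base" "$fast"
```

Applied to the two tables above, the means come out at roughly 38,700 and 44,500 KFLOPS, a gain of about 15% for the second flag set.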

Whetstone/Dhrystone

http://www.roylongbottom.org.uk/linux%20benchmarks.htm (requires File:Classic benchmarks.patch)

Building
linaro@linaro-alip:~/tmp$ wget 'http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz'
linaro@linaro-alip:~/tmp$ wget 'http://linux-sunxi.org/images/a/a1/Classic_benchmarks.patch'
linaro@linaro-alip:~/tmp$ tar -xzf classic_benchmarks.tar.gz 
linaro@linaro-alip:~/tmp$ patch -p0 < Classic_benchmarks.patch 
linaro@linaro-alip:~/tmp$ cd classic_benchmarks/source_code/
linaro@linaro-alip:~/tmp/classic_benchmarks/source_code$ make
Results
./whets (gcc-4.7 -static -O3 -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -funroll-loops)
          Single Precision C/C++ Whetstone Benchmark

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.12475013732910156       104.038               0.041
N2 floating point      -1.12274742126464844       105.829               0.282
N3 if then else         1.00000000000000000               14575.397     0.002
N4 fixed point         12.00000000000000000                 418.942     0.167
N5 sin,cos etc.         0.49911010265350342                   3.906     4.729
N6 floating point       0.99999982118606567        98.848               1.211
N7 assignments          3.00000000000000000                2254.666     0.018
N8 exp,sqrt etc.        0.75110864639282227                   2.335     3.537

MWIPS                                             222.285               9.987
./dhry1 (gcc-4.7 -static -O3 -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -funroll-loops)
Microseconds for one run through Dhrystone:         0.22 
Dhrystones per Second:                         4518788 
VAX  MIPS rating =                               2571.88 
./dhry2 (gcc-4.7 -static -O3 -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -funroll-loops)
Microseconds for one run through Dhrystone:         0.30 
Dhrystones per Second:                         3336166 
VAX  MIPS rating =                               1898.79 

Adding -Ofast and -flto:

./whets (gcc-4.7 -static -Ofast -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -funroll-loops -flto)
          Single Precision C/C++ Whetstone Benchmark

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12367534637451172       103.565              0.004
N2 floating point     -1.12167263031005859       105.531              0.028
N3 if then else        1.00000000000000000               14852.924    0.000
N4 fixed point        12.00000000000000000                6970.390    0.001
N5 sin,cos etc.        0.49911010265350342                   3.933    0.465
N6 floating point      0.99999982118606567        98.786              0.120
N7 assignments         3.00000000000000000                2211.433    0.002
N8 exp,sqrt etc.       0.75110864639282227                   2.698    0.303

MWIPS                                            238.120              0.924
./dhry1 (gcc-4.7 -static -Ofast -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -funroll-loops -flto)
Microseconds for one run through Dhrystone:         0.19 
Dhrystones per Second:                         5185531 
VAX  MIPS rating =                               2951.36 
./dhry2 (gcc-4.7 -static -Ofast -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -funroll-loops)
Microseconds for one run through Dhrystone:         0.19 
Dhrystones per Second:                         5262435 
VAX  MIPS rating =                               2995.13 
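The "VAX MIPS rating" lines are derived from the Dhrystones-per-second figure by dividing by 1757, the score conventionally attributed to the VAX 11/780. A small sketch of that arithmetic follows; the function name `dhry_to_vax_mips` is only illustrative.

```shell
# Convert a Dhrystones-per-second score ($1) to a VAX MIPS (DMIPS) rating.
# 1757 Dhrystones/s is the conventional VAX 11/780 reference score.
dhry_to_vax_mips() {
    awk -v dps="$1" 'BEGIN { printf "%.2f\n", dps / 1757 }'
}

# dhry1 result from the -O3 run above; the benchmark itself reported 2571.88.
dhry_to_vax_mips 4518788
```

Dividing the result by the CPU clock in MHz gives the commonly quoted DMIPS/MHz figure. The clock speed used for these runs is not recorded on this page, so that last step is left out here.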

OpenSSL

How to test

Run:

openssl speed
Results

Linaro-alip soft-float

OpenSSL 1.0.1 14 Mar 2012
built on: Tue Aug 21 05:35:49 UTC 2012
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) blowfish(ptr)
compiler: cc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_NO_TLS1_2_CLIENT -DOPENSSL_MAX_TLS1_2_CIPHER_LENGTH=50
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2                  0.00         0.00         0.00         0.00         0.00
mdc2                 0.00         0.00         0.00         0.00         0.00
md4               4539.13k    23584.98k    68988.33k   133520.04k   184363.69k
md5               5140.49k    17237.58k    46162.43k    79220.05k   100848.98k
hmac(md5)         6296.96k    20580.39k    51788.37k    83395.93k   101282.91k
sha1              5056.81k    15672.85k    36537.09k    54699.01k    64102.40k
rmd160            4733.01k    14162.58k    31460.95k    45231.10k    51950.93k
rc4              67049.00k    74935.98k    78372.86k    79348.39k    79623.51k
des cbc          17689.04k    18793.72k    19138.82k    19248.13k    19292.16k
des ede3          6748.10k     6951.38k     6998.10k     7015.77k     6950.87k
idea cbc             0.00         0.00         0.00         0.00         0.00
seed cbc         20640.20k    21906.09k    22347.52k    22450.52k    22500.69k
rc2 cbc          13089.00k    13998.74k    14224.73k    14294.36k    14164.76k
rc5-32/12 cbc        0.00         0.00         0.00         0.00         0.00
blowfish cbc     26759.62k    29755.75k    30726.06k    30958.59k    31053.14k
cast cbc         25870.12k    28393.51k    29254.23k    29501.78k    29570.39k
aes-128 cbc      19582.69k    20855.45k    21258.07k    21348.35k    21392.04k
aes-192 cbc      16902.33k    17731.03k    18009.26k    18094.42k    18117.97k
aes-256 cbc      14778.66k    15419.55k    15636.82k    15683.58k    15712.26k
camellia-128 cbc    26162.67k    28201.17k    28923.31k    29136.90k    28918.58k
camellia-192 cbc    20555.46k    22316.52k    22863.19k    22990.17k    23046.83k
camellia-256 cbc    20704.67k    22316.39k    22846.72k    23003.48k    23044.10k
sha256            4130.87k     9683.05k    17185.11k    21408.43k    23093.25k
sha512             804.45k     3218.84k     4525.99k     6147.07k     6873.09k
whirlpool         1201.69k     2457.88k     3979.18k     4716.20k     4917.93k
aes-128 ige      18517.42k    19858.50k    20280.58k    20406.61k    20838.75k
aes-192 ige      15950.20k    17003.69k    17323.18k    17393.32k    17408.00k
aes-256 ige      14102.48k    14868.65k    15100.93k    15172.95k    15174.31k
ghash            14806.49k    15383.55k    15564.03k    15625.22k    15652.18k
                  sign    verify    sign/s verify/s
rsa  512 bits 0.002293s 0.000203s    436.1   4920.6
rsa 1024 bits 0.012441s 0.000617s     80.4   1621.2
rsa 2048 bits 0.075263s 0.002055s     13.3    486.7
rsa 4096 bits 0.499048s 0.007148s      2.0    139.9
                  sign    verify    sign/s verify/s
dsa  512 bits 0.002058s 0.002299s    485.9    435.0
dsa 1024 bits 0.006101s 0.006964s    163.9    143.6
dsa 2048 bits 0.020326s 0.023641s     49.2     42.3
                              sign    verify    sign/s verify/s
 160 bit ecdsa (secp160r1)   0.0010s   0.0045s    977.2    222.1
 192 bit ecdsa (nistp192)   0.0011s   0.0046s    950.8    218.4
 224 bit ecdsa (nistp224)   0.0014s   0.0062s    739.1    160.2
 256 bit ecdsa (nistp256)   0.0016s   0.0079s    613.0    126.5
 384 bit ecdsa (nistp384)   0.0036s   0.0184s    281.4     54.3
 521 bit ecdsa (nistp521)   0.0096s   0.0510s    103.9     19.6
 163 bit ecdsa (nistk163)   0.0021s   0.0080s    473.6    125.3
 233 bit ecdsa (nistk233)   0.0044s   0.0155s    228.5     64.3
 283 bit ecdsa (nistk283)   0.0067s   0.0286s    150.2     35.0
 409 bit ecdsa (nistk409)   0.0178s   0.0667s     56.3     15.0
 571 bit ecdsa (nistk571)   0.0426s   0.1538s     23.5      6.5
 163 bit ecdsa (nistb163)   0.0021s   0.0086s    472.9    116.0
 233 bit ecdsa (nistb233)   0.0043s   0.0173s    230.3     57.9
 283 bit ecdsa (nistb283)   0.0067s   0.0320s    149.7     31.2
 409 bit ecdsa (nistb409)   0.0178s   0.0759s     56.1     13.2
 571 bit ecdsa (nistb571)   0.0428s   0.1760s     23.3      5.7
                              op      op/s
 160 bit ecdh (secp160r1)   0.0038s    264.8
 192 bit ecdh (nistp192)   0.0038s    263.9
 224 bit ecdh (nistp224)   0.0052s    191.9
 256 bit ecdh (nistp256)   0.0066s    151.4
 384 bit ecdh (nistp384)   0.0152s     66.0
 521 bit ecdh (nistp521)   0.0422s     23.7
 163 bit ecdh (nistk163)   0.0040s    253.0
 233 bit ecdh (nistk233)   0.0077s    130.0
 283 bit ecdh (nistk283)   0.0142s     70.6
 409 bit ecdh (nistk409)   0.0331s     30.2
 571 bit ecdh (nistk571)   0.0760s     13.2
 163 bit ecdh (nistb163)   0.0042s    235.8
 233 bit ecdh (nistb233)   0.0085s    117.0
 283 bit ecdh (nistb283)   0.0158s     63.1
 409 bit ecdh (nistb409)   0.0378s     26.5
 571 bit ecdh (nistb571)   0.0879s     11.4

ArchLinux-ARM hard-float

OpenSSL 1.0.1c 10 May 2012
built on: Sat May 12 16:58:09 UTC 2012
options:bn(64,32) md2(int) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -Wa,--noexecstack -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -D_FORTIFY_SOURCE=2 -DOPENSSL_NO_TLS1_2_CLIENT -DTERMIO -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2               1010.38k     2071.59k     2919.79k     3215.08k     3322.59k
mdc2              2238.70k     2724.06k     2915.34k     3063.91k     3044.11k
md4               8261.57k    28911.65k    81103.94k   148492.57k   200026.24k
md5               6456.03k    20979.84k    54995.08k    89176.35k   111244.43k
hmac(md5)         6319.94k    21289.87k    54631.79k    89444.75k   110983.92k
sha1              6633.24k    20150.08k    47302.98k    70280.66k    82581.21k
rmd160            5493.36k    15627.12k    34127.48k    48159.05k    54297.05k
rc4              66233.79k    74331.73k    77396.54k    77693.81k    78829.52k
des cbc          18532.54k    19769.99k    20273.26k    20313.14k    20323.82k
des ede3          7169.37k     7346.22k     7416.20k     7478.53k     7461.20k
idea cbc         15485.30k    16443.47k    16683.46k    16698.54k    16758.24k
seed cbc         20667.10k    22857.28k    23349.77k    23677.72k    23609.99k
rc2 cbc          13686.09k    14637.63k    14956.66k    15077.94k    14912.37k
rc5-32/12 cbc        0.00         0.00         0.00         0.00         0.00
blowfish cbc     27451.80k    30338.11k    31082.36k    31144.23k    31523.17k
cast cbc         27317.50k    30075.42k    31215.36k    31145.33k    31403.65k
aes-128 cbc      35895.60k    40605.48k    43274.31k    43880.05k    44219.12k
aes-192 cbc      30897.64k    35908.66k    37676.55k    38116.09k    38425.23k
aes-256 cbc      27594.10k    31650.74k    33180.37k    33427.34k    33498.80k
camellia-128 cbc    26308.13k    29661.26k    31114.20k    31346.19k    31582.08k
camellia-192 cbc    21422.53k    23395.33k    24418.24k    24554.27k    24599.57k
camellia-256 cbc    21457.81k    23333.78k    24369.82k    24582.58k    24617.25k
sha256           10078.10k    24314.98k    43970.03k    55573.89k    59677.84k
sha512            4133.94k    16576.53k    25365.00k    35504.81k    40002.80k
whirlpool         1216.98k     2492.34k     4065.11k     4781.12k     5059.59k
aes-128 ige      31316.97k    38357.40k    41833.06k    43101.16k    43264.37k
aes-192 ige      27502.07k    34078.09k    36575.95k    37367.50k    37251.58k
aes-256 ige      24869.69k    30543.02k    32333.16k    32777.21k    33054.87k
ghash            52904.92k    62310.47k    66025.17k    66775.11k    66985.81k
                  sign    verify    sign/s verify/s
rsa  512 bits 0.001042s 0.000104s    960.0   9587.7
rsa 1024 bits 0.005983s 0.000327s    167.2   3060.2
rsa 2048 bits 0.038947s 0.001188s     25.7    841.7
rsa 4096 bits 0.280000s 0.004561s      3.6    219.2
                  sign    verify    sign/s verify/s
dsa  512 bits 0.001062s 0.001161s    942.0    861.4
dsa 1024 bits 0.003206s 0.003716s    311.9    269.1
dsa 2048 bits 0.011507s 0.013283s     86.9     75.3
                              sign    verify    sign/s verify/s
 160 bit ecdsa (secp160r1)   0.0006s   0.0023s   1620.6    438.5
 192 bit ecdsa (nistp192)   0.0008s   0.0033s   1259.9    304.1
 224 bit ecdsa (nistp224)   0.0010s   0.0043s    991.3    232.9
 256 bit ecdsa (nistp256)   0.0013s   0.0058s    790.4    173.8
 384 bit ecdsa (nistp384)   0.0030s   0.0151s    338.2     66.2
 521 bit ecdsa (nistp521)   0.0062s   0.0346s    161.1     28.9
 163 bit ecdsa (nistk163)   0.0019s   0.0064s    536.1    157.0
 233 bit ecdsa (nistk233)   0.0039s   0.0116s    257.6     85.9
 283 bit ecdsa (nistk283)   0.0059s   0.0214s    169.2     46.8
 409 bit ecdsa (nistk409)   0.0161s   0.0469s     62.0     21.3
 571 bit ecdsa (nistk571)   0.0385s   0.1089s     25.9      9.2
 163 bit ecdsa (nistb163)   0.0018s   0.0069s    544.3    145.1
 233 bit ecdsa (nistb233)   0.0038s   0.0128s    259.9     78.3
 283 bit ecdsa (nistb283)   0.0059s   0.0238s    169.3     42.0
 409 bit ecdsa (nistb409)   0.0161s   0.0533s     62.1     18.8
 571 bit ecdsa (nistb571)   0.0385s   0.1241s     25.9      8.1
                              op      op/s
 160 bit ecdh (secp160r1)   0.0019s    515.5
 192 bit ecdh (nistp192)   0.0027s    374.3
 224 bit ecdh (nistp224)   0.0036s    278.7
 256 bit ecdh (nistp256)   0.0049s    203.8
 384 bit ecdh (nistp384)   0.0126s     79.2
 521 bit ecdh (nistp521)   0.0288s     34.7
 163 bit ecdh (nistk163)   0.0031s    319.4
 233 bit ecdh (nistk233)   0.0057s    176.9
 283 bit ecdh (nistk283)   0.0105s     94.9
 409 bit ecdh (nistk409)   0.0231s     43.2
 571 bit ecdh (nistk571)   0.0538s     18.6
 163 bit ecdh (nistb163)   0.0033s    300.7
 233 bit ecdh (nistb233)   0.0063s    158.9
 283 bit ecdh (nistb283)   0.0118s     85.1
 409 bit ecdh (nistb409)   0.0263s     38.1
 571 bit ecdh (nistb571)   0.0615s     16.3
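To compare two `openssl speed` runs row by row, the saved outputs can be reduced to a ratio per algorithm. This is a rough sketch that assumes each run was redirected to a text file; the file names and the function names `speed_8k` and `speed_ratio` are hypothetical.

```shell
# Print the 8192-byte throughput (in 1000s of bytes per second) for one row.
# $1 = row label exactly as printed by `openssl speed`, for example "aes-128 cbc"
# $2 = file holding the saved output
speed_8k() {
    awk -v label="$1" 'index($0, label) == 1 { v = $NF; sub(/k$/, "", v); print v; exit }' "$2"
}

# Ratio of the 8192-byte throughput in file $3 over file $2 for row label $1.
speed_ratio() {
    a=$(speed_8k "$1" "$2")
    b=$(speed_8k "$1" "$3")
    awk -v a="$a" -v b="$b" 'BEGIN { printf "%.2f\n", b / a }'
}

# Hypothetical usage:
#   speed_ratio "aes-128 cbc" linaro-softfloat.txt arch-hardfloat.txt
```

For the two result sets above, aes-128 cbc at 8192 bytes comes out at about 2.07 (44219.12k versus 21392.04k). Note that the ArchLinux-ARM build was also compiled with -DAES_ASM, -DSHA1_ASM and related options that do not appear in the Linaro compiler line, so the difference is probably not due to the float ABI alone.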

SciMark

Build
wget http://math.nist.gov/scimark2/scimark2_1c.zip
unzip -o scimark2_1c.zip -d scimark2_files
cd scimark2_files/
g++ -o scimark2 -O *.c -mcpu=cortex-a8 -mtune=cortex-a8 -march=armv7-a -mfpu=neon -mfloat-abi=hard -funsafe-math-optimizations -fomit-frame-pointer -ffast-math -funroll-loops -funsafe-loop-optimizations -fno-tree-vectorize
./scimark2 -large
Results
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to [email protected])     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:           29.32
FFT             Mflops:    13.57    (N=1048576)
SOR             Mflops:    48.51    (1000 x 1000)
MonteCarlo:     Mflops:    23.30
Sparse matmult  Mflops:    34.22    (N=100000, nz=1000000)
LU              Mflops:    26.97    (M=1000, N=1000)
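The composite score is the arithmetic mean of the five kernel results. The sketch below recomputes it from the printed lines; `scimark_composite` is an illustrative name, and the parser assumes the "Mflops:" layout shown above.

```shell
# Average every "Mflops:" value in SciMark output read from stdin.
scimark_composite() {
    awk '/Mflops:/ { sub(/^.*Mflops:[ \t]*/, ""); sum += $1; n++ }
         END { if (n) printf "%.2f\n", sum / n }'
}

# Hypothetical usage: ./scimark2 -large | scimark_composite
```

Fed with the five lines above this gives 29.31 rather than the reported 29.32, because the program averages the unrounded values.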

nbench

http://www.tux.org/~mayer/linux/bmark.html

Build
linaro@linaro-alip:~/tmp$ wget http://www.tux.org/~mayer/linux/nbench-byte-2.2.3.tar.gz
[...]
linaro@linaro-alip:~/tmp$ tar -xzf nbench-byte-2.2.3.tar.gz 
linaro@linaro-alip:~/tmp$ cd nbench-byte-2.2.3
linaro@linaro-alip:~/tmp/nbench-byte-2.2.3$ vi Makefile 
linaro@linaro-alip:~/tmp/nbench-byte-2.2.3$ make
[...]
linaro@linaro-alip:~/tmp/nbench-byte-2.2.3$ ./nbench 
Results
CC=gcc-4.7
CFLAGS=-s -static -Wall -O3 -mfpu=neon -mcpu=cortex-a8 -mtune=cortex-a8 -fomit-frame-pointer -marm -funroll-loops
BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          583.28  :      14.96  :       4.91
STRING SORT         :          58.353  :      26.07  :       4.04
BITFIELD            :      2.6754e+08  :      45.89  :       9.59
FP EMULATION        :          108.48  :      52.05  :      12.01
FOURIER             :          1866.1  :       2.12  :       1.19
ASSIGNMENT          :          9.0228  :      34.33  :       8.91
IDEA                :          1226.3  :      18.76  :       5.57
HUFFMAN             :          744.22  :      20.64  :       6.59
NEURAL NET          :            1.96  :       3.15  :       1.32
LU DECOMPOSITION    :          87.325  :       4.52  :       3.27
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 27.658
FLOATING-POINT INDEX: 3.115
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : 
L2 Cache            : 
OS                  : Linux 3.4.19-a10-aufs+
C compiler          : gcc-4.7
libc                : libc-2.15.so
MEMORY INDEX        : 7.010
INTEGER INDEX       : 6.822
FLOATING-POINT INDEX: 1.728
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
CC=gcc-4.7
CFLAGS=-s -static -Wall -Ofast -mfpu=neon -mcpu=cortex-a8 -mtune=cortex-a8 -fomit-frame-pointer -marm -funroll-loops
BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          586.72  :      15.05  :       4.94
STRING SORT         :          58.217  :      26.01  :       4.03
BITFIELD            :      2.6871e+08  :      46.09  :       9.63
FP EMULATION        :           108.2  :      51.92  :      11.98
FOURIER             :          1895.2  :       2.16  :       1.21
ASSIGNMENT          :          9.0192  :      34.32  :       8.90
IDEA                :          1226.8  :      18.76  :       5.57
HUFFMAN             :          804.24  :      22.30  :       7.12
NEURAL NET          :          2.0692  :       3.32  :       1.40
LU DECOMPOSITION    :          87.325  :       4.52  :       3.27
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 27.988
FLOATING-POINT INDEX: 3.188
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : 
L2 Cache            : 
OS                  : Linux 3.4.19-a10-aufs+
C compiler          : gcc-4.7
libc                : libc-2.15.so
MEMORY INDEX        : 7.014
INTEGER INDEX       : 6.962
FLOATING-POINT INDEX: 1.768
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
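The index lines summarise the per-test columns as geometric means. As a sketch of how to cross-check them, the helper below (the name `geomean` is made up) takes one value per line; the grouping of tests into indexes is inferred from the numbers above rather than taken from the nbench sources.

```shell
# Geometric mean of the numbers read from stdin, one per line.
geomean() {
    awk '{ s += log($1); n++ } END { if (n) printf "%.2f\n", exp(s / n) }'
}

# "Old Index" values of FOURIER, NEURAL NET and LU DECOMPOSITION from the -O3 run.
printf '%s\n' 2.12 3.15 4.52 | geomean
```

The result lands close to the reported FLOATING-POINT INDEX of 3.115; the small difference comes from using the rounded table values. The same check over NUMERIC SORT, STRING SORT, BITFIELD, FP EMULATION, ASSIGNMENT, IDEA and HUFFMAN reproduces the original INTEGER INDEX.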

Linux kernel build

Setup
root@debian:~$ wget 'http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.1.tar.bz2'
root@debian:~$ md5sum linux-3.1.tar.bz2
8d43453f8159b2332ad410b19d86a931  linux-3.1.tar.bz2
root@debian:~$ tar -xjf linux-3.1.tar.bz2
root@debian:~$ cd linux-3.1
root@debian:~/linux-3.1$ gcc --version
gcc (Debian 4.7.2-5) 4.7.2
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Tests
root@debian:~/linux-3.1$ time make -s vexpress_defconfig bzImage 
[...] 
real    45m26.121s 
user    43m6.080s 
sys     2m8.370s
root@debian:~$ time md5sum linux-3.1.tar.bz2  
8d43453f8159b2332ad410b19d86a931  linux-3.1.tar.bz2 
real    0m0.797s
user    0m0.610s
sys     0m0.180s
root@debian:~$ time bzip2 -t linux-3.1.tar.bz2  
real    1m47.884s
user    1m47.250s
sys     0m0.290s
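The `time` figures are easier to compare once converted to seconds. A minimal sketch, with illustrative function names, that also shows how much of the wall-clock time was spent on the CPU:

```shell
# Convert a bash `time` value such as 45m26.121s to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN { sub(/s$/, "", t); split(t, p, "m"); printf "%.3f\n", p[1] * 60 + p[2] }'
}

# CPU utilisation in percent: (user + sys) / real.
# $1 = real, $2 = user, $3 = sys, all in the NmS.SSSs format.
cpu_utilisation() {
    awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" -v s="$(to_seconds "$3")" \
        'BEGIN { printf "%.1f\n", (u + s) / r * 100 }'
}

# Kernel build figures from above.
cpu_utilisation 45m26.121s 43m6.080s 2m8.370s
```

For the kernel build this gives about 99.6%, which suggests the build was limited by the CPU rather than by storage.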

OpenBenchmark Phoronix Test Suite

Comparisons of Debian and Raspbian on the Raspberry Pi against the Cubieboard 1 and 2:
http://openbenchmarking.org/result/1308083-UT-1302242BY19
http://openbenchmarking.org/result/1308084-UT-1301189RA85

GPU

Results for X11 libraries and framebuffer libraries may differ.

ioquake3

See ioquake3

es2_gears

X11 libraries:

  • 131FPS
  • r3p0: 195-200 FPS
  • r3p0: 58-75 FPS - fullscreen (1024x768)

Framebuffer libraries: ?

glx_gears

X11 libraries + mesa:

  • 117 FPS
  • ~25 FPS - fullscreen (1024x768)

glmark2-es2

X11 libraries:

=======================================================
    glmark2 2012.08
=======================================================
    OpenGL Information
    GL_VENDOR:     ARM
    GL_RENDERER:   Mali-400 MP
    GL_VERSION:    OpenGL ES 2.0
=======================================================
[build] use-vbo=false: FPS: 48 FrameTime: 20.833 ms
[build] use-vbo=true: FPS: 55 FrameTime: 18.182 ms
[texture] texture-filter=nearest: FPS: 56 FrameTime: 17.857 ms
[texture] texture-filter=linear: FPS: 56 FrameTime: 17.857 ms
[texture] texture-filter=mipmap: FPS: 57 FrameTime: 17.544 ms
[shading] shading=gouraud: FPS: 50 FrameTime: 20.000 ms
[shading] shading=blinn-phong-inf: FPS: 50 FrameTime: 20.000 ms
[shading] shading=phong: FPS: 47 FrameTime: 21.277 ms
[bump] bump-render=high-poly: FPS: 37 FrameTime: 27.027 ms
[bump] bump-render=normals: FPS: 58 FrameTime: 17.241 ms
[bump] bump-render=height: FPS: 57 FrameTime: 17.544 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 30 FrameTime: 33.333 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 19 FrameTime: 52.632 ms
[pulsar] light=false:quads=5:texture=false: FPS: 59 FrameTime: 16.949 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 16 FrameTime: 62.500 ms
[desktop] effect=shadow:windows=4: FPS: 43 FrameTime: 23.256 ms
Error: Requested MapBuffer VBO update method but GL_OES_mapbuffer is not supported!
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: Unsupported
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 18 FrameTime: 55.556 ms
Error: Requested MapBuffer VBO update method but GL_OES_mapbuffer is not supported!
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: Unsupported
[ideas] speed=duration: FPS: 48 FrameTime: 20.833 ms
[jellyfish] <default>: FPS: 43 FrameTime: 23.256 ms
Error: SceneTerrain requires Vertex Texture Fetch support, but GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS is 0
[terrain] <default>: Unsupported
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 59 FrameTime: 16.949 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 54 FrameTime: 18.519 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 58 FrameTime: 17.241 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 57 FrameTime: 17.544 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 43 FrameTime: 23.256 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 56 FrameTime: 17.857 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 57 FrameTime: 17.544 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 56 FrameTime: 17.857 ms
=======================================================
                                  glmark2 Score: 47 
=======================================================
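Each FrameTime value is simply 1000 / FPS, and the final score appears to be the mean FPS over the scenes that actually ran. The sketch below recomputes both from saved output; the function names are illustrative, and the scoring rule is an assumption based on the numbers above.

```shell
# Frame time in milliseconds for a given frame rate ($1).
frame_time_ms() {
    awk -v fps="$1" 'BEGIN { printf "%.3f\n", 1000 / fps }'
}

# Mean FPS over all scenes that report an "FPS:" value; unsupported scenes are skipped.
glmark2_mean_fps() {
    awk '/ FPS: [0-9]+/ { for (i = 1; i < NF; i++) if ($i == "FPS:") { sum += $(i + 1); n++ } }
         END { if (n) printf "%.2f\n", sum / n }'
}

# Hypothetical usage: glmark2-es2 | glmark2_mean_fps
```

The 27 scenes that ran above average about 47.7 FPS, which would be consistent with the reported score of 47 if the mean is truncated to an integer.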

Video decoding

See CedarXVideoRenderingChart

IO

SATA

root@debian:~% sudo dd if=/dev/sda of=/dev/null bs=32M count=100 iflag=direct
100+0 records in
100+0 records out
3355443200 bytes (3.4 GB) copied, 15.3565 s, 219 MB/s

This may be limited by the comparatively old, cheap SSD being used.
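The rate reported by dd is the byte count divided by the elapsed time, in decimal megabytes (1 MB = 1,000,000 bytes). A small sketch to recompute it, with an illustrative function name:

```shell
# Throughput in MB/s (decimal megabytes) from a byte count ($1) and a duration in seconds ($2).
throughput_mbs() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

# 100 blocks of 32 MiB read in 15.3565 s, as in the dd run above.
throughput_mbs 3355443200 15.3565
```

This gives 218.5, which dd rounds to the 219 MB/s shown above.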

SD Card

NAND

Ethernet

Power consumption

A13 Benchmarks

The A13 needs its own CPU benchmarks because its DDR3 bus is crippled.

nbench

Tested on A13-olinuxino, debian wheezy.

CC=gcc-4.7
CFLAGS=-s -static -O3 -mfpu=neon -mcpu=cortex-a8 -mtune=cortex-a8 -fomit-frame-pointer -marm -funroll-loops

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :           578.4  :      14.83  :       4.87
STRING SORT         :          53.536  :      23.92  :       3.70
BITFIELD            :      2.5697e+08  :      44.08  :       9.21
FP EMULATION        :          105.84  :      50.79  :      11.72
FOURIER             :          1754.5  :       2.00  :       1.12
ASSIGNMENT          :          8.8536  :      33.69  :       8.74
IDEA                :          1206.5  :      18.45  :       5.48
HUFFMAN             :          719.14  :      19.94  :       6.37
NEURAL NET          :          1.9275  :       3.10  :       1.30
LU DECOMPOSITION    :          85.326  :       4.42  :       3.19
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 26.768
FLOATING-POINT INDEX: 3.011
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : 
L2 Cache            : 
OS                  : Linux 3.4.61stage+
C compiler          : gcc-4.7
libc                : libc-2.13.so
MEMORY INDEX        : 6.679
INTEGER INDEX       : 6.681
FLOATING-POINT INDEX: 1.670
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

A10S Benchmarks

Results should be the same as for the A13.

A20 Benchmarks

CPU

OpenSSL

OpenSSL 1.0.1c 10 May 2012
built on: Sun May 26 10:09:49 UTC 2013
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) blowfish(ptr)
compiler: cc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_NO_TLS1_2_CLIENT -DOPENSSL_MAX_TLS1_2_CIPHER_LENGTH=50 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2                  0.00         0.00         0.00         0.00         0.00
mdc2                 0.00         0.00         0.00         0.00         0.00
md4               2536.46k    20273.86k    52285.78k    85916.01k   106042.42k
md5               4658.85k    15543.72k    40538.41k    67159.72k    83646.07k
hmac(md5)         4977.63k    16353.73k    41931.60k    69000.08k    84975.62k
sha1              4701.95k    13683.01k    28788.78k    40693.55k    46074.54k
rmd160            3908.94k    10874.84k    22251.01k    30578.56k    34138.79k
rc4              48278.43k    54375.70k    56301.51k    57160.09k    53712.21k
des cbc          11529.27k    12242.68k    12399.79k    12450.82k    12465.49k
des ede3          4163.83k     4113.26k     4434.71k     4462.30k     3923.97k
idea cbc             0.00         0.00         0.00         0.00         0.00
seed cbc         15233.77k    16166.74k    16474.46k    16540.33k    16561.49k
rc2 cbc           8235.71k     8122.28k     9262.39k     9267.03k     9287.92k
rc5-32/12 cbc        0.00         0.00         0.00         0.00         0.00
blowfish cbc     18124.43k    19772.31k    20277.09k    20431.03k    20400.81k
cast cbc         17256.97k    17877.53k    20460.48k    20543.36k    20561.92k
aes-128 cbc      20783.69k    22336.55k    23158.58k    23271.08k    23328.09k
aes-192 cbc      16915.78k    17387.43k    19564.65k    19736.49k    16523.73k
aes-256 cbc      16038.03k    17123.52k    15135.01k    16914.49k    17337.45k
camellia-128 cbc    17164.96k    18374.83k    18757.21k    18865.92k    18882.56k
camellia-192 cbc    13680.37k    14486.59k    14760.33k    14803.97k    14824.79k
camellia-256 cbc    13662.32k    14485.67k    14743.13k    14816.15k    14827.52k
sha256            5340.42k    12254.95k    21022.63k    25548.63k    27282.68k
sha512            2550.36k    10262.62k    15025.92k    20669.44k    23343.09k
whirlpool         1401.77k     2917.73k     4762.11k     5679.58k     5982.89k
aes-128 ige      18517.31k    18765.50k    21879.30k    22013.26k    22099.22k
aes-192 ige      17356.40k    18653.29k    19075.34k    19152.21k    19177.47k
aes-256 ige      15542.49k    16533.78k    16846.17k    16933.33k    17005.93k
ghash            24851.30k    27019.22k    28001.93k    28195.84k    28265.13k
                  sign    verify    sign/s verify/s
rsa  512 bits 0.001276s 0.000122s    783.7   8212.9
rsa 1024 bits 0.006676s 0.000382s    149.8   2617.3
rsa 2048 bits 0.045991s 0.001380s     21.7    724.6
rsa 4096 bits 0.334000s 0.005418s      3.0    184.6
                  sign    verify    sign/s verify/s
dsa  512 bits 0.001230s 0.001300s    813.1    769.2
dsa 1024 bits 0.003737s 0.004349s    267.6    229.9
dsa 2048 bits 0.013634s 0.015876s     73.3     63.0
                              sign    verify    sign/s verify/s
 160 bit ecdsa (secp160r1)   0.0008s   0.0031s   1319.3    322.7
 192 bit ecdsa (nistp192)   0.0010s   0.0042s   1010.7    236.5
 224 bit ecdsa (nistp224)   0.0013s   0.0056s    797.9    179.9
 256 bit ecdsa (nistp256)   0.0016s   0.0074s    637.6    135.0
 384 bit ecdsa (nistp384)   0.0035s   0.0178s    287.7     56.1
 521 bit ecdsa (nistp521)   0.0073s   0.0393s    136.1     25.4
 163 bit ecdsa (nistk163)   0.0025s   0.0094s    402.0    106.5
 233 bit ecdsa (nistk233)   0.0055s   0.0163s    183.2     61.3
 283 bit ecdsa (nistk283)   0.0085s   0.0316s    117.3     31.7
 409 bit ecdsa (nistk409)   0.0209s   0.0644s     47.7     15.5
 571 bit ecdsa (nistk571)   0.0539s   0.1527s     18.6      6.5
 163 bit ecdsa (nistb163)   0.0026s   0.0096s    378.9    103.9
 233 bit ecdsa (nistb233)   0.0056s   0.0192s    178.7     52.2
 283 bit ecdsa (nistb283)   0.0088s   0.0336s    113.9     29.7
 409 bit ecdsa (nistb409)   0.0223s   0.0772s     44.7     13.0
 571 bit ecdsa (nistb571)   0.0538s   0.1719s     18.6      5.8
                              op      op/s
 160 bit ecdh (secp160r1)   0.0026s    385.4
 192 bit ecdh (nistp192)   0.0037s    270.4
 224 bit ecdh (nistp224)   0.0048s    208.8
 256 bit ecdh (nistp256)   0.0062s    162.2
 384 bit ecdh (nistp384)   0.0152s     65.9
 521 bit ecdh (nistp521)   0.0334s     30.0
 163 bit ecdh (nistk163)   0.0044s    226.9
 233 bit ecdh (nistk233)   0.0078s    128.5
 283 bit ecdh (nistk283)   0.0147s     68.1
 409 bit ecdh (nistk409)   0.0321s     31.2
 571 bit ecdh (nistk571)   0.0755s     13.2
 163 bit ecdh (nistb163)   0.0048s    209.1
 233 bit ecdh (nistb233)   0.0088s    113.7
 283 bit ecdh (nistb283)   0.0161s     62.2
 409 bit ecdh (nistb409)   0.0364s     27.5
 571 bit ecdh (nistb571)   0.0858s     11.7

Linpack

Compile:

linaro@localhost:~/bench$ cc -Ofast -o linpack linpack.c -lm -mcpu=cortex-a7 -mfpu=vfpv4 -mfloat-abi=hard -funsafe-math-optimizations -fomit-frame-pointer -ffast-math -funroll-loops -funsafe-loop-optimizations

Results

Enter array size (q to quit) [200]:  
Memory required:  315K.
 
 
LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
 
    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      32   0.73  87.67%   4.11%   8.22%  65592.040
      64   1.70  89.41%   2.94%   7.65%  55983.015
     128   1.19  90.76%   3.36%   5.88%  156952.381
     256   2.36  88.56%   3.39%   8.05%  162015.361
     512   5.03  89.86%   2.78%   7.36%  150889.843
    1024  10.19  89.99%   2.75%   7.26%  148814.109

Note: The linpack results suggest that the floating-point performance of the Cortex A7 core in the A20 is significantly higher (up to 3x more KFLOPS) than that of the Cortex A8 used in the A10. Running a floating-point intensive application (3D geometry processing) seems to confirm that the A20 is significantly faster.
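
As a rough check of the ratio mentioned in the note, the KFLOPS columns of the linpack tables can be compared directly. The following is only a sketch: the values are copied from the A10 run with the more aggressive compiler flags and from the A20 run above, and the variable names are made up for this example.

```python
# KFLOPS values copied from the linpack tables on this page.
a10_kflops = [44843.537, 44390.572, 44615.905, 44390.572, 44672.596, 44250.892]
a20_kflops = [65592.040, 55983.015, 156952.381, 162015.361, 150889.843, 148814.109]

best_a10 = max(a10_kflops)
best_a20 = max(a20_kflops)

# Ratio of the best A20 row to the best A10 row.
speedup = best_a20 / best_a10

print(f"best A10: {best_a10:.0f} KFLOPS")
print(f"best A20: {best_a20:.0f} KFLOPS")
print(f"ratio:    {speedup:.2f}x")
```

Using the best row of each table gives a ratio between 3 and 4. The first two A20 rows are much lower than the later ones, so a comparison based on those rows alone would look far less favourable.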

lmbench

lmbench (lmbench-3.0-a9) is an older and not very well-known benchmark, but it can provide some interesting low-level architectural details as well as memory speed benchmarks.

$ cache -N 2 -M 4M -L 64
L1 cache: 32768 bytes 2.99 nanoseconds 64 linesize 1.73 parallelism
L2 cache: 262144 bytes 14.36 nanoseconds 64 linesize 2.34 parallelism
Memory latency: 158.86 nanoseconds 1.75 parallelism

The reported latency and parallelism are not consistent between runs, but the L1 data cache and L2 cache sizes are probably correct. The L2 cache is not large, but that is not surprising for an SoC manufactured at 55nm (modern higher-end ARM SoCs manufactured at 28nm for smartphones and tablets have up to 2MB of L2 cache; the RK3188 has 512KB of L2 cache).

$ echo -n 'CPU speed: '
$ mhz
$ lat_ops -N 100
$ par_ops -N 10
CPU speed: 1008 MHz, 0.9921 nanosec clock
integer bit: 1.00 nanoseconds
integer add: 0.99 nanoseconds
integer mul: 2.98 nanoseconds
integer div: 73.66 nanoseconds
integer mod: 23.93 nanoseconds
int64 bit: 2.00 nanoseconds
uint64 add: 2.14 nanoseconds
int64 mul: 5.02 nanoseconds
int64 div: 311.98 nanoseconds
int64 mod: 195.88 nanoseconds
float add: 3.97 nanoseconds
float mul: 3.98 nanoseconds
float div: 17.88 nanoseconds
double add: 3.98 nanoseconds
double mul: 6.96 nanoseconds
double div: 31.81 nanoseconds
float bogomflops: 28.14 nanoseconds
double bogomflops: 45.05 nanoseconds
integer bit parallelism: 1.03
integer add parallelism: 1.43
integer mul parallelism: 2.78
integer div parallelism: 1.01
integer mod parallelism: 1.05
int64 bit parallelism: 1.00
int64 add parallelism: 1.14
int64 mul parallelism: 1.01
int64 div parallelism: 1.03
int64 mod parallelism: 1.02
float add parallelism: 3.98
float mul parallelism: 1.60
float div parallelism: 1.20
double add parallelism: 3.98
double mul parallelism: 1.27
double div parallelism: 1.10

This gives interesting info about the CPU instruction characteristics of the ARM Cortex A7 core used in the A20.

  • As with most ARM architectures, integer divide is slow.
  • Integer multiply is relatively fast.
  • Floating-point performance is OK, and single precision (float) is faster than double precision (double).
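
To make these observations easier to compare, the latencies can be expressed in CPU clock cycles by dividing by the clock period that mhz reported (0.9921 ns at 1008 MHz). The sketch below does this for a handful of rows copied from the lat_ops output; the helper name to_cycles is made up for this example.

```python
# Clock period reported by mhz above: 1008 MHz, 0.9921 ns per cycle.
CLOCK_NS = 0.9921

# A few lat_ops results from the listing above (nanoseconds per operation).
latency_ns = {
    "integer add": 0.99,
    "integer mul": 2.98,
    "integer div": 73.66,
    "float add": 3.97,
    "float mul": 3.98,
    "double add": 3.98,
    "double mul": 6.96,
    "double div": 31.81,
}


def to_cycles(ns, clock_ns=CLOCK_NS):
    """Convert a latency in nanoseconds to CPU clock cycles."""
    return ns / clock_ns


cycles = {op: to_cycles(ns) for op, ns in latency_ns.items()}

for op, c in cycles.items():
    print(f"{op:12s} {latency_ns[op]:6.2f} ns  ~{c:5.1f} cycles")
```

Seen this way, integer multiply takes about 3 cycles and integer divide over 70, while double multiply takes about 7 cycles against about 4 for float multiply.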

Memory latencies (columns: buffer size in MB, latency in ns):

$ lat_mem_rd -N 8 16 128
0.00049 3.013
0.00098 3.009
0.00195 3.014
0.00293 3.009
0.00391 3.019
0.00586 3.009
0.00781 3.018
0.01172 3.009
0.01562 2.996
0.02344 6.636
0.03125 5.792
0.04688 9.009
0.06250 10.178
0.09375 9.222
0.12500 10.254
0.18750 21.143
0.25000 27.971
0.37500 42.014
0.50000 46.024
0.75000 46.074
1.00000 46.855
1.50000 57.570
2.00000 58.869
3.00000 60.061
4.00000 60.225
6.00000 60.892
8.00000 61.057
12.00000 61.355
16.00000 61.225

This confirms the cache sizes of the A20. The L1 (data) cache size is 32KB and its latency seems to be about 3ns; as the buffer size approaches 32KB, the latency reported by the test starts to increase due to cache associativity effects (cache line conflicts). A similar transition is seen for the L2 cache: latency is about 9.3ns and starts to increase as the buffer size approaches 256KB. DRAM latency is about 60ns on the test configuration (1008 MHz CPU, 432 MHz DRAM clock, 432 MHz MBUS clock, 6 cycle CAS timing). With 9 cycle CAS timing, latency at a 16MB buffer size is 63.8ns.
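
The transitions described above can also be picked out of the lat_mem_rd output programmatically. The following is a minimal sketch, not part of lmbench: the data is copied from the listing above, and the cut-off values of 4 ns and 15 ns as well as the 3 MB lower bound for the DRAM plateau are arbitrary choices made for this example.

```python
# (buffer size in MB, latency in ns) pairs copied from the lat_mem_rd output above.
samples = [
    (0.00049, 3.013), (0.00098, 3.009), (0.00195, 3.014), (0.00293, 3.009),
    (0.00391, 3.019), (0.00586, 3.009), (0.00781, 3.018), (0.01172, 3.009),
    (0.01562, 2.996), (0.02344, 6.636), (0.03125, 5.792), (0.04688, 9.009),
    (0.06250, 10.178), (0.09375, 9.222), (0.12500, 10.254), (0.18750, 21.143),
    (0.25000, 27.971), (0.37500, 42.014), (0.50000, 46.024), (0.75000, 46.074),
    (1.00000, 46.855), (1.50000, 57.570), (2.00000, 58.869), (3.00000, 60.061),
    (4.00000, 60.225), (6.00000, 60.892), (8.00000, 61.057), (12.00000, 61.355),
    (16.00000, 61.225),
]


def largest_size_below(data, limit_ns):
    """Largest buffer size (MB) whose measured latency is still below limit_ns."""
    return max(size for size, ns in data if ns < limit_ns)


def mean_latency_from(data, min_mb):
    """Mean latency (ns) over all buffers of at least min_mb megabytes."""
    values = [ns for size, ns in data if size >= min_mb]
    return sum(values) / len(values)


# Cut-offs are assumptions for illustration, not lmbench parameters.
l1_edge = largest_size_below(samples, 4.0)
l2_edge = largest_size_below(samples, 15.0)
dram_ns = mean_latency_from(samples, 3.0)

print(f"L1-like latency up to  {l1_edge * 1024:.0f} KB")
print(f"L2-like latency up to  {l2_edge * 1024:.0f} KB")
print(f"mean latency >= 3 MB:  {dram_ns:.1f} ns")
```

With these cut-offs the last buffer size that still shows L1-like latency is 16 KB and the last one with L2-like latency is 128 KB, which matches the observation that latency already rises before the nominal 32KB and 256KB cache sizes are reached.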

Memory bandwidth (columns: test, buffer size in MB, bandwidth in MB/s):

One CPU core:
4K read  0.004000 5889.70
4K write 0.004000 10083.91
4K rdwr  0.004000 4401.45
4K copy  0.004000 8441.78
64K read  0.064000 4488.56
64K write 0.064000 6251.16
64K rdwr  0.064000 2420.51
64K copy  0.064000 2821.57
1M read  1.00 1273.65
1M write 1.00 542.30
1M rdwr  1.00 592.77
1M copy  1.00 293.34
16M read  16.00 1136.36
16M write 16.00 513.89
16M rdwr  16.00 549.53
16M copy  16.00 288.14
Two CPU cores:
4K read  0.004000 11623.95
4K write 0.004000 20121.14
4K rdwr  0.004000 8693.74
4K copy  0.004000 16691.51
64K read  0.064000 8449.77
64K write 0.064000 10443.66
64K rdwr  0.064000 3063.00
64K copy  0.064000 2790.98
1M read  1.00 1791.39
1M write 1.00 588.36
1M rdwr  1.00 595.09
1M copy  1.00 409.95
16M read  16.00 1533.03
16M write 16.00 719.84
16M rdwr  16.00 630.88
16M copy  16.00 413.78

Each core has its own L1 data cache, so the L1 cache bandwidth doubles with two cores active. With two cores active, the shared L2 cache bandwidth (64K buffer result above) increases compared to using only one core, except for copy, where one core already saturates the L2 cache bandwidth. With DRAM access (16M buffer), two active cores are able to utilize more DRAM bandwidth than a single core (this may depend on the lmbench implementation, but is probably a good sign for multi-tasking performance). With a CAS timing of 9 cycles instead of 6, performance is only slightly lower. However, with a lower DRAM or MBUS clock, performance will be lower.
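
The scaling from one to two cores can be quantified by dividing the two-core bandwidth by the one-core bandwidth for each test. The sketch below uses the numbers from the two listings above; the dictionary layout and names are made up for this example.

```python
# Bandwidth in MB/s, copied from the one-core and two-core listings above.
one_core = {
    ("4K", "read"): 5889.70, ("4K", "write"): 10083.91,
    ("4K", "rdwr"): 4401.45, ("4K", "copy"): 8441.78,
    ("64K", "read"): 4488.56, ("64K", "write"): 6251.16,
    ("64K", "rdwr"): 2420.51, ("64K", "copy"): 2821.57,
    ("1M", "read"): 1273.65, ("1M", "write"): 542.30,
    ("1M", "rdwr"): 592.77, ("1M", "copy"): 293.34,
    ("16M", "read"): 1136.36, ("16M", "write"): 513.89,
    ("16M", "rdwr"): 549.53, ("16M", "copy"): 288.14,
}
two_cores = {
    ("4K", "read"): 11623.95, ("4K", "write"): 20121.14,
    ("4K", "rdwr"): 8693.74, ("4K", "copy"): 16691.51,
    ("64K", "read"): 8449.77, ("64K", "write"): 10443.66,
    ("64K", "rdwr"): 3063.00, ("64K", "copy"): 2790.98,
    ("1M", "read"): 1791.39, ("1M", "write"): 588.36,
    ("1M", "rdwr"): 595.09, ("1M", "copy"): 409.95,
    ("16M", "read"): 1533.03, ("16M", "write"): 719.84,
    ("16M", "rdwr"): 630.88, ("16M", "copy"): 413.78,
}

# Two-core bandwidth relative to one-core bandwidth for every test.
scaling = {key: two_cores[key] / one_core[key] for key in one_core}

for (size, op), ratio in scaling.items():
    print(f"{size:>3s} {op:5s} {ratio:4.2f}x")
```

The 4K rows come out close to 2x, the 64K copy row stays slightly below 1x, and the 16M rows land between roughly 1.1x and 1.45x.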

GPU

The Mali-400MP2 GPU in the A20 has two pixel processors instead of the single one in the A10 (Mali-400MP). Because of that, fillrate-limited (i.e. high-resolution) performance in particular should be higher with proper drivers. You need the newer Mali r3p2 drivers (standard is r3p0) to really take advantage of the improved features of the Mali-400MP2. See the section Optimizing system performance for advanced instructions for using the r3p2 Mali drivers.

The following benchmarks were performed with 1280x720 60 Hz HDMI output (32bpp). The window size of glmark2 is the default 800x600. The device has the memory clock set to 408 MHz (lower than on some other devices, which may impact performance). The CPU governor was set to ondemand with custom settings. The SwapbuffersWait option was set to "false" in xorg.conf to eliminate the effect of vsync. The fb0_framebuffer_num in script.bin was set to 3 so that xf86-video-fbturbo can optimally provide Mali GLES integration.

The version of glmark2 (glmark2-es2) used is 2013.08.07. Source: https://github.com/ssvb/glmark2.git. Configure with

apt-get install libgles2-mesa-dev && ./waf configure --with-flavors x11-glesv2

You might need to apply a patch like this to the GLES header files for a clean compile:

*** /usr/include/GLES2/gl2.h-old	2013-11-25 22:00:09.287711308 +0100
--- /usr/include/GLES2/gl2.h	2013-11-25 22:00:45.147711324 +0100
***************
*** 32,37 ****
--- 32,38 ----
  typedef khronos_float_t  GLfloat;
  typedef khronos_float_t  GLclampf;
  typedef khronos_int32_t  GLfixed;
+ typedef char             GLchar;
  
  /* GL types for handling large vertex buffer objects */
  typedef khronos_intptr_t GLintptr;

Performance with standard Mali r3p0 drivers/kernel:

=======================================================
    glmark2 2013.08.07
=======================================================
    OpenGL Information
    GL_VENDOR:     ARM
    GL_RENDERER:   Mali-400 MP
    GL_VERSION:    OpenGL ES 2.0
=======================================================
[build] use-vbo=false: FPS: 190 FrameTime: 5.263 ms
[build] use-vbo=true: FPS: 225 FrameTime: 4.444 ms
[texture] texture-filter=nearest: FPS: 257 FrameTime: 3.891 ms
[texture] texture-filter=linear: FPS: 226 FrameTime: 4.425 ms
[texture] texture-filter=mipmap: FPS: 200 FrameTime: 5.000 ms
[shading] shading=gouraud: FPS: 174 FrameTime: 5.747 ms
[shading] shading=blinn-phong-inf: FPS: 161 FrameTime: 6.211 ms
[shading] shading=phong: FPS: 131 FrameTime: 7.634 ms
[bump] bump-render=high-poly: FPS: 74 FrameTime: 13.514 ms
[bump] bump-render=normals: FPS: 229 FrameTime: 4.367 ms
[bump] bump-render=height: FPS: 182 FrameTime: 5.495 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 37 FrameTime: 27.027 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 22 FrameTime: 45.455 ms
[pulsar] light=false:quads=5:texture=false: FPS: 279 FrameTime: 3.584 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 16 FrameTime: 62.500 ms
[desktop] effect=shadow:windows=4: FPS: 61 FrameTime: 16.393 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: Unsupported
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 32 FrameTime: 31.250 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: Unsupported
[ideas] speed=duration: FPS: 156 FrameTime: 6.410 ms
[jellyfish] <default>: FPS: 52 FrameTime: 19.231 ms
[terrain] <default>: Unsupported
[shadow] <default>: FPS: 27 FrameTime: 37.037 ms
[refract] <default>: FPS: 15 FrameTime: 66.667 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 249 FrameTime: 4.016 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 85 FrameTime: 11.765 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 256 FrameTime: 3.906 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 119 FrameTime: 8.403 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 58 FrameTime: 17.241 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 118 FrameTime: 8.475 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 118 FrameTime: 8.475 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 118 FrameTime: 8.475 ms
=======================================================
                                  glmark2 Score: 133 
=======================================================

Performance with Mali r3p2 drivers/kernel:

=======================================================
    glmark2 2013.08.07
=======================================================
    OpenGL Information
    GL_VENDOR:     ARM
    GL_RENDERER:   Mali-400 MP
    GL_VERSION:    OpenGL ES 2.0
=======================================================
[build] use-vbo=false: FPS: 181 FrameTime: 5.525 ms
[build] use-vbo=true: FPS: 199 FrameTime: 5.025 ms
[texture] texture-filter=nearest: FPS: 224 FrameTime: 4.464 ms
[texture] texture-filter=linear: FPS: 216 FrameTime: 4.630 ms
[texture] texture-filter=mipmap: FPS: 244 FrameTime: 4.098 ms
[shading] shading=gouraud: FPS: 159 FrameTime: 6.289 ms
[shading] shading=blinn-phong-inf: FPS: 163 FrameTime: 6.135 ms
[shading] shading=phong: FPS: 129 FrameTime: 7.752 ms
[bump] bump-render=high-poly: FPS: 67 FrameTime: 14.925 ms
[bump] bump-render=normals: FPS: 265 FrameTime: 3.774 ms
[bump] bump-render=height: FPS: 234 FrameTime: 4.274 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 87 FrameTime: 11.494 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 42 FrameTime: 23.810 ms
[pulsar] light=false:quads=5:texture=false: FPS: 356 FrameTime: 2.809 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 29 FrameTime: 34.483 ms
[desktop] effect=shadow:windows=4: FPS: 114 FrameTime: 8.772 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: Unsupported
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 36 FrameTime: 27.778 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: Unsupported
[ideas] speed=duration: FPS: 159 FrameTime: 6.289 ms
[jellyfish] <default>: FPS: 114 FrameTime: 8.772 ms
[terrain] <default>: Unsupported
[shadow] <default>: FPS: 95 FrameTime: 10.526 ms
[refract] <default>: FPS: 15 FrameTime: 66.667 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 334 FrameTime: 2.994 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 145 FrameTime: 6.897 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 298 FrameTime: 3.356 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 197 FrameTime: 5.076 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 104 FrameTime: 9.615 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 192 FrameTime: 5.208 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 196 FrameTime: 5.102 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 193 FrameTime: 5.181 ms
=======================================================
                                  glmark2 Score: 165 
=======================================================

Note that several sub-benchmarks show doubled performance or better. This includes 2D compositing and fillrate-limited tests.
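
To see which sub-benchmarks are affected, the FPS values of the two runs can be parsed and compared line by line. The following sketch uses a small subset of result lines copied from the r3p0 and r3p2 listings above; the regular expression and the parse_fps helper are written for this example and are not part of glmark2.

```python
import re

# A few result lines copied from the r3p0 and r3p2 listings above.
r3p0_log = """
[build] use-vbo=true: FPS: 225 FrameTime: 4.444 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 37 FrameTime: 27.027 ms
[desktop] effect=shadow:windows=4: FPS: 61 FrameTime: 16.393 ms
[jellyfish] <default>: FPS: 52 FrameTime: 19.231 ms
[terrain] <default>: Unsupported
[shadow] <default>: FPS: 27 FrameTime: 37.037 ms
[refract] <default>: FPS: 15 FrameTime: 66.667 ms
"""
r3p2_log = """
[build] use-vbo=true: FPS: 199 FrameTime: 5.025 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 87 FrameTime: 11.494 ms
[desktop] effect=shadow:windows=4: FPS: 114 FrameTime: 8.772 ms
[jellyfish] <default>: FPS: 114 FrameTime: 8.772 ms
[terrain] <default>: Unsupported
[shadow] <default>: FPS: 95 FrameTime: 10.526 ms
[refract] <default>: FPS: 15 FrameTime: 66.667 ms
"""

# Matches "[scene] options: FPS: N FrameTime: X ms"; "Unsupported" lines are skipped.
LINE_RE = re.compile(r"^(\[\w+\] .*?): FPS: (\d+) FrameTime: [\d.]+ ms$")


def parse_fps(log):
    """Return a {test name: FPS} mapping for all lines that report an FPS value."""
    result = {}
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


old, new = parse_fps(r3p0_log), parse_fps(r3p2_log)
ratios = {name: new[name] / old[name] for name in old if name in new}
doubled = [name for name, ratio in ratios.items() if ratio >= 2.0]

for name, ratio in ratios.items():
    print(f"{ratio:4.2f}x  {name}")
```

In this subset the effect2d, jellyfish and shadow lines reach at least 2x, the desktop shadow line comes close, refract is unchanged and the build line is slightly slower with r3p2.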

After tweaking the memory controller parameters (DRAM frequency 432 MHz, MBUS frequency 432 MHz, CAS timing 6 cycles), the glmark2 score increases to 189.
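
The three scores can be put in relation with a few lines of arithmetic. The sketch below also checks an observation about the listings above: the reported score is consistent with the integer part of the mean FPS over all scenes that produced a result. This is inferred from the numbers on this page only and is not a statement about glmark2 internals; all names are made up for this example.

```python
# Per-scene FPS from the r3p0 listing above ("Unsupported" scenes are omitted).
r3p0_fps = [190, 225, 257, 226, 200, 174, 161, 131, 74, 229, 182, 37, 22, 279,
            16, 61, 32, 156, 52, 27, 15, 249, 85, 256, 119, 58, 118, 118, 118]

mean_fps = sum(r3p0_fps) / len(r3p0_fps)
print(f"mean FPS over {len(r3p0_fps)} scenes: {mean_fps:.2f}")

# glmark2 scores reported on this page.
scores = {"r3p0": 133, "r3p2": 165, "r3p2, tuned DRAM": 189}

# Gain of each configuration relative to the r3p0 baseline, in percent.
baseline = scores["r3p0"]
gains = {name: (score - baseline) / baseline * 100 for name, score in scores.items()}

for name, gain in gains.items():
    print(f"{name:18s} score {scores[name]:3d}  {gain:+5.1f}%")
```

Relative to the r3p0 baseline, the r3p2 drivers gain about 24% and the tuned memory settings bring the total gain to about 42%.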

Another benchmark of r3p2 drivers, done on Cubietruck, with r3p2 kernel module, binary drivers and patched xf86-video-fbturbo:

root@cubietruck:/sata/build/xf86-video-fbturbo# DISPLAY=:0 glmark2-es2
=======================================================
    glmark2 2012.08
=======================================================
    OpenGL Information
    GL_VENDOR:     ARM
    GL_RENDERER:   Mali-400 MP
    GL_VERSION:    OpenGL ES 2.0
=======================================================
[build] use-vbo=false: FPS: 204 FrameTime: 4.902 ms
[build] use-vbo=true: FPS: 262 FrameTime: 3.817 ms
[texture] texture-filter=nearest: FPS: 291 FrameTime: 3.436 ms
[texture] texture-filter=linear: FPS: 282 FrameTime: 3.546 ms
[texture] texture-filter=mipmap: FPS: 315 FrameTime: 3.175 ms
[shading] shading=gouraud: FPS: 195 FrameTime: 5.128 ms
[shading] shading=blinn-phong-inf: FPS: 208 FrameTime: 4.808 ms
[shading] shading=phong: FPS: 167 FrameTime: 5.988 ms
[bump] bump-render=high-poly: FPS: 75 FrameTime: 13.333 ms
[bump] bump-render=normals: FPS: 341 FrameTime: 2.933 ms
[bump] bump-render=height: FPS: 299 FrameTime: 3.344 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 93 FrameTime: 10.753 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 43 FrameTime: 23.256 ms
[pulsar] light=false:quads=5:texture=false: FPS: 438 FrameTime: 2.283 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 33 FrameTime: 30.303 ms
[desktop] effect=shadow:windows=4: FPS: 138 FrameTime: 7.246 ms
Error: Requested MapBuffer VBO update method but GL_OES_mapbuffer is not supported!
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: Unsupported
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 40 FrameTime: 25.000 ms
Error: Requested MapBuffer VBO update method but GL_OES_mapbuffer is not supported!
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: Unsupported
[ideas] speed=duration: FPS: 182 FrameTime: 5.495 ms
[jellyfish] <default>: FPS: 126 FrameTime: 7.937 ms
Error: SceneTerrain requires Vertex Texture Fetch support, but GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS is 0
[terrain] <default>: Unsupported
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 386 FrameTime: 2.591 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 163 FrameTime: 6.135 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 401 FrameTime: 2.494 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 226 FrameTime: 4.425 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 114 FrameTime: 8.772 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 226 FrameTime: 4.425 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 227 FrameTime: 4.405 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 227 FrameTime: 4.405 ms
=======================================================
                                  glmark2 Score: 211 
=======================================================