A10 DRAM Controller Performance

Tests with the lima-memspeed program on a Cubietruck with a 1920x1080-32@60Hz monitor

The lima-memspeed program simulates different memory-intensive workloads and measures how much of the memory bandwidth is actually available to be consumed by the CPU, GPU and other peripherals. Basically, it should answer the question of what the optimal relationship between the MBUS and DRAM clock frequencies is, and whether increasing the DRAM clock speed provides any practical improvement. Systems with a 32-bit DRAM bus width cannot be analyzed well with the tinymembench program alone, because tinymembench only focuses on the memory bandwidth available to a single CPU core, and a single CPU core alone cannot consume all of the DRAM bandwidth, but it can get closer with some help from the other peripherals and CPU cores. A minimal sketch of how such a multi-threaded bandwidth consumer can be structured is shown after the workload list below.

Usage: lima-memspeed [workload1] [workload2] ... [workloadN]

Where the 'workload' arguments are the identifiers of different
memory bandwidth consuming workloads. Each workload is run in its
own thread.

The list of available workload identifiers:
	fb_blank                       (blank the screen in order not to drain memory bandwidth)
	fb_scanout                     (take the framebuffer scanout bandwidth into account)
	gpu_write                      (use the lima driver to solid fill the screen)
	gpu_copy                       (use the lima driver to copy a texture to the screen)
	neon_write                     (use ARM NEON to fill a memory buffer)
	neon_write_backwards           (use ARM NEON to fill a memory buffer)
	neon_read_pf32                 (use ARM NEON to read from a memory buffer)
	neon_read_pf64                 (use ARM NEON to read from a memory buffer)
	neon_copy_pf64                 (use ARM NEON to copy a memory buffer)
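For illustration, here is a minimal sketch of how a multi-threaded bandwidth consumer along these lines can be structured. This is not the actual lima-memspeed source (which uses hand-written ARM NEON loops and the lima GPU driver rather than plain memset()): several threads each fill a private buffer as fast as they can, and the aggregate rate approximates the write bandwidth available to the whole system. Build with something like 'gcc -O2 memfill.c -o memfill -lpthread' (the file name is arbitrary).

    /* Minimal multi-threaded memory fill benchmark sketch (not lima-memspeed). */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE    (16 * 1024 * 1024)  /* 16 MiB per thread, well above the L2 cache size */
    #define RUN_SECONDS 5

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    struct worker {
        pthread_t thread;
        volatile size_t bytes;              /* bytes written so far by this thread */
        volatile int stop;
    };

    static void *fill_workload(void *arg)
    {
        struct worker *w = arg;
        char *buf = malloc(BUF_SIZE);

        if (!buf)
            return NULL;
        while (!w->stop) {
            memset(buf, 0x55, BUF_SIZE);    /* stand-in for a hand-written NEON fill loop */
            w->bytes += BUF_SIZE;
        }
        free(buf);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 2;
        struct timespec run_time = { .tv_sec = RUN_SECONDS, .tv_nsec = 0 };
        size_t total = 0;

        if (nthreads < 1)
            nthreads = 1;

        struct worker *workers = calloc(nthreads, sizeof(*workers));
        double start = now_sec();

        for (int i = 0; i < nthreads; i++)
            pthread_create(&workers[i].thread, NULL, fill_workload, &workers[i]);

        nanosleep(&run_time, NULL);         /* let the workloads run for a while */

        for (int i = 0; i < nthreads; i++) {
            workers[i].stop = 1;
            pthread_join(workers[i].thread, NULL);
            total += workers[i].bytes;
        }

        double elapsed = now_sec() - start;
        printf("%d fill threads: %.1f MB/s aggregate write bandwidth\n",
               nthreads, (double)total / elapsed / 1e6);
        free(workers);
        return 0;
    }

Running it with different thread counts shows how the aggregate write bandwidth scales, similar in spirit to combining the neon_write and gpu_write workloads above.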
Cubietruck (standard 32-bit DRAM bus width)

MBUS clock | DRAM clock | fb_scanout + gpu_write | fb_scanout + neon_write | fb_scanout + neon_copy_pf64 (*) | fb_blank + gpu_write | fb_blank + neon_write | fb_blank + neon_write + neon_read_pf64 | fb_blank + neon_write + gpu_write
300 MHz | 432 MHz | 2209.3 MB/s | 2343.2 MB/s | 1826.3 MB/s | 1935.1 MB/s | 2035.5 MB/s | 2270.5 MB/s | 2018.0 MB/s
400 MHz | 432 MHz | 2301.6 MB/s | 2765.3 MB/s | 1835.9 MB/s | 2045.8 MB/s | 2703.7 MB/s | 2615.1 MB/s | 2610.4 MB/s
400 MHz | 528 MHz | 2448.0 MB/s | 3064.1 MB/s | 2096.7 MB/s | 2073.4 MB/s | 2713.5 MB/s | 2882.8 MB/s | 2673.7 MB/s
528 MHz | 528 MHz | 2461.8 MB/s | 3298.6 MB/s | 2107.1 MB/s | 2098.3 MB/s | 3288.1 MB/s | 3172.1 MB/s | 3370.1 MB/s
400 MHz | 600 MHz | 2511.5 MB/s | 3094.4 MB/s | 2146.1 MB/s | 2073.4 MB/s | 2717.7 MB/s | 2940.8 MB/s | 2675.7 MB/s
400 MHz | 648 MHz | 2542.7 MB/s | 3106.7 MB/s | 2243.0 MB/s | 2073.4 MB/s | 2721.9 MB/s | 2978.5 MB/s | 2683.2 MB/s
600 MHz | 648 MHz | 2543.3 MB/s | 3451.5 MB/s | 2262.7 MB/s | 2102.4 MB/s | 3293.7 MB/s | 3600.8 MB/s | 3630.9 MB/s
Cubietruck (artificially configured 16-bit DRAM bus width)

MBUS clock | DRAM clock | fb_scanout + gpu_write | fb_scanout + neon_write | fb_scanout + neon_copy_pf64 (*) | fb_blank + gpu_write | fb_blank + neon_write | fb_blank + neon_write + neon_read_pf64 | fb_blank + neon_write + gpu_write
300 MHz | 528 MHz | 1354.0 MB/s | 1873.6 MB/s | 1268.5 MB/s | 1188.7 MB/s | 1979.5 MB/s | 1627.1 MB/s | 1712.5 MB/s
400 MHz | 528 MHz | 1354.3 MB/s | 1903.0 MB/s | 1278.1 MB/s | 1181.7 MB/s | 1981.7 MB/s | 1637.7 MB/s | 1707.8 MB/s
528 MHz | 528 MHz | 1352.3 MB/s | 1892.6 MB/s | 1269.3 MB/s | 1188.7 MB/s | 1979.6 MB/s | 1626.0 MB/s | 1702.0 MB/s
300 MHz | 648 MHz | 1464.1 MB/s | 2219.9 MB/s | 1403.5 MB/s | 1256.4 MB/s | 2046.7 MB/s | 1923.1 MB/s | 1892.5 MB/s
400 MHz | 648 MHz | 1463.3 MB/s | 2332.0 MB/s | 1403.5 MB/s | 1258.1 MB/s | 2415.7 MB/s | 1925.5 MB/s | 1993.0 MB/s
600 MHz | 648 MHz | 1464.3 MB/s | 2348.3 MB/s | 1403.5 MB/s | 1255.0 MB/s | 2382.0 MB/s | 1926.9 MB/s | 1993.6 MB/s

(*) Copying 1 MB per second from one memory location to another counts as 2 MB/s of bandwidth in this table, because both the reads and the writes are accounted for.
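For reference, at 1920x1080 with 32 bits per pixel and a 60 Hz refresh rate, the framebuffer scanout alone reads about 1920 * 1080 * 4 bytes * 60 ≈ 498 MB/s from DRAM; per the workload descriptions above, this is the bandwidth that the fb_scanout workload accounts for in the reported totals.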

Based on the benchmark results above, it looks like systems with a full 32-bit memory interface want both MBUS and DRAM clocked at high speed, while Allwinner A20 systems with only a 16-bit memory interface (such as the Olimex A20-OLinuXino-Lime) should not suffer any obvious extra bandwidth penalty even if MBUS is clocked slower than DRAM.

Tests with the tinymembench program on an A13-OLinuXino-Micro with the screen blanked

The Allwinner A13 user manual specifies a 300 MHz clock speed limit for MBUS. And indeed, with only a 16-bit external DDR3 memory interface to feed, clocking MBUS at a very high speed may be unnecessary (assuming that MBUS internally has the same width as in the A10/A20 siblings). So it is quite interesting to check whether running MBUS at half the DRAM speed is fast enough for the A13.
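As a back-of-the-envelope check: a 16-bit DDR3 interface transfers 2 bytes twice per clock cycle, so even at a 648 MHz DRAM clock its theoretical peak is only about 2 * 2 * 648 ≈ 2.6 GB/s, half of what the 32-bit Cubietruck interface can move at the same clock. A proportionally slower MBUS therefore has a better chance of keeping up with a 16-bit DRAM interface.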

Benchmarks have been run on Olimex_A13-OLinuXino-Micro with different MBUS/DRAM clock settings. The CPU clock speed was 1008 MHz and the AXI clock speed 504 MHz (overclocked). The screen was blanked so that scanout does not drain memory bandwidth. The performance numbers are obtained with the tinymembench tool from the 'NEON read prefetched (64 bytes step)', 'NEON fill' and 'NEON copy prefetched (64 bytes step)' subtests. The DRAM timings are accurately calculated for each clock frequency, assuming JEDEC Speed Bin 1333H (DDR3-1333 9-9-9).
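To illustrate what "accurately calculated" means here: the JEDEC speed bin specifies the key timings in nanoseconds (for DDR3-1333H with 9-9-9 latencies, tAA = tRCD = tRP = 9 * 1.5 ns = 13.5 ns), and the controller settings are obtained by rounding those values up to whole cycles of the chosen DRAM clock. The sketch below only shows this ns-to-cycles conversion for the CAS latency and is not the actual u-boot-sunxi code (which programs the full parameter set into the DRAM controller registers). Build with 'gcc -O2 ddr3_timing.c -o ddr3_timing -lm' (the file name is arbitrary).

    #include <math.h>
    #include <stdio.h>

    /* Round a nanosecond timing up to whole cycles of the given DRAM clock. */
    static int ns_to_cycles(double ns, double dram_mhz)
    {
        double tck_ns = 1000.0 / dram_mhz;  /* DRAM clock period in ns */
        return (int)ceil(ns / tck_ns);      /* round up so the spec is never violated */
    }

    int main(void)
    {
        /* DDR3-1333H (9-9-9): tAA = tRCD = tRP = 9 * 1.5 ns = 13.5 ns */
        const double taa_ns = 13.5;
        const int clocks_mhz[] = { 408, 432, 456, 480, 504, 528, 552, 576, 600, 624, 648 };

        for (size_t i = 0; i < sizeof(clocks_mhz) / sizeof(clocks_mhz[0]); i++)
            printf("%d MHz DRAM clock: CAS latency = %d cycles\n",
                   clocks_mhz[i], ns_to_cycles(taa_ns, clocks_mhz[i]));
        return 0;
    }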

Please note that the DRAM clock speeds above 533 MHz (and MBUS above 300 MHz) may be considered overclocking if we trust the Allwinner manuals!

Original old u-boot-sunxi settings (with the wasteful DDR3-1333 timings)

MBUS clock | DRAM clock | Read | Write | Copy | Random read in 64MiB block
204 MHz | 408 MHz | 709.1 MB/s | 1026.7 MB/s | 475.4 MB/s | 311.8 ns
Half speed MBUS

MBUS clock | DRAM clock | Read | Write | Copy | Random read in 64MiB block
204 MHz | 408 MHz | 738.0 MB/s | 1027.1 MB/s | 512.7 MB/s | 286.3 ns
216 MHz | 432 MHz | 762.6 MB/s | 1095.4 MB/s | 544.4 MB/s | 274.8 ns
300 MHz MBUS

MBUS clock | DRAM clock | Read | Write | Copy | Random read in 64MiB block
300 MHz | 408 MHz | 785.7 MB/s | 1450.3 MB/s | 511.3 MB/s | 262.2 ns
300 MHz | 432 MHz | 825.2 MB/s | 1474.6 MB/s | 544.1 MB/s | 259.2 ns
300 MHz | 456 MHz | 854.3 MB/s | 1486.8 MB/s | 576.3 MB/s | 244.3 ns
300 MHz | 480 MHz | 880.2 MB/s | 1491.6 MB/s | 606.7 MB/s | 241.8 ns
300 MHz | 504 MHz | 889.8 MB/s | 1499.9 MB/s | 637.7 MB/s | 238.6 ns
300 MHz | 528 MHz | 936.2 MB/s | 1507.4 MB/s | 671.0 MB/s | 232.1 ns
300 MHz | 552 MHz | 921.0 MB/s | 1508.3 MB/s | 659.7 MB/s | 236.2 ns
300 MHz | 576 MHz | 943.8 MB/s | 1512.8 MB/s | 689.3 MB/s | 231.2 ns
300 MHz | 600 MHz | 988.3 MB/s | 1517.5 MB/s | 719.1 MB/s | 224.3 ns
300 MHz | 624 MHz | 979.5 MB/s | 1518.9 MB/s | 743.6 MB/s | 226.6 ns
300 MHz | 648 MHz | 1012.8 MB/s | 1522.5 MB/s | 770.1 MB/s | 221.2 ns
Balanced speed MBUS (2/3 of DRAM)

MBUS clock | DRAM clock | Read | Write | Copy | Random read in 64MiB block
272 MHz | 408 MHz | 770.2 MB/s | 1345.8 MB/s | 512.2 MB/s | 270.7 ns
288 MHz | 432 MHz | 816.6 MB/s | 1421.0 MB/s | 545.4 MB/s | 259.9 ns
304 MHz | 456 MHz | 857.3 MB/s | 1504.8 MB/s | 576.4 MB/s | 245.3 ns
320 MHz | 480 MHz | 897.5 MB/s | 1584.6 MB/s | 606.8 MB/s | 239.5 ns
336 MHz | 504 MHz | 933.5 MB/s | 1663.9 MB/s | 643.5 MB/s | 232.8 ns
352 MHz | 528 MHz | 954.4 MB/s | 1742.0 MB/s | 669.1 MB/s | 226.6 ns
368 MHz | 552 MHz | 979.4 MB/s | 1797.8 MB/s | 665.3 MB/s | 228.7 ns
384 MHz | 576 MHz | 1014.1 MB/s | 1789.7 MB/s | 689.1 MB/s | 222.6 ns
400 MHz | 600 MHz | 1039.7 MB/s | 1836.8 MB/s | 719.3 MB/s | 215.9 ns
416 MHz | 624 MHz | 1069.6 MB/s | 1819.2 MB/s | 747.5 MB/s | 219.1 ns
432 MHz | 648 MHz | 1092.5 MB/s | 1869.2 MB/s | 777.8 MB/s | 210.3 ns
Full speed MBUS (the same as DRAM)

MBUS clock | DRAM clock | Read | Write | Copy | Random read in 64MiB block
408 MHz | 408 MHz | 833.6 MB/s | 1460.8 MB/s | 513.0 MB/s | 252.3 ns
432 MHz | 432 MHz | 863.1 MB/s | 1555.5 MB/s | 543.2 MB/s | 243.8 ns
456 MHz | 456 MHz | 911.5 MB/s | 1647.3 MB/s | 576.7 MB/s | 233.2 ns
480 MHz | 480 MHz | 954.2 MB/s | 1727.3 MB/s | 609.5 MB/s | 231.5 ns
504 MHz | 504 MHz | 983.0 MB/s | 1813.1 MB/s | 638.3 MB/s | 220.6 ns
528 MHz | 528 MHz | 1030.9 MB/s | 1913.3 MB/s | 676.4 MB/s | 220.8 ns
552 MHz | 552 MHz | 1033.2 MB/s | 1943.4 MB/s | 623.4 MB/s | 217.8 ns
576 MHz | 576 MHz | 1065.7 MB/s | 1989.1 MB/s | 691.3 MB/s | 211.3 ns
600 MHz | 600 MHz | Reliability is poor
624 MHz | 624 MHz | Reliability is poor
648 MHz | 648 MHz | Reliability is poor
Exotic MBUS configurations (no practical use, just for the sake of research)

MBUS clock | DRAM clock | Read | Write | Copy | Random read in 64MiB block | Random non-cached read in 4KiB block
50 MHz | 600 MHz | 303.1 MB/s | 255.7 MB/s | 236.4 MB/s | 505.3 ns | 335.6 ns
100 MHz | 600 MHz | 535.0 MB/s | 516.3 MB/s | 389.3 MB/s | 316.0 ns | 215.7 ns
150 MHz | 600 MHz | 700.8 MB/s | 773.2 MB/s | 594.1 MB/s | 261.1 ns | 166.3 ns
200 MHz | 600 MHz | 818.1 MB/s | 1018.9 MB/s | 686.9 MB/s | 258.4 ns | 142.7 ns

The non-cached read latency numbers in the table above should involve no TLB misses and exactly one DRAM access per read. They seem to fit the formula (12 * mbus_cycle_time + 95 ns) quite nicely, so it might be that MBUS contributes 12 of its cycles to the memory access latency.
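For example, at 100 MHz MBUS the cycle time is 10 ns and the formula predicts 12 * 10 ns + 95 ns = 215 ns against the measured 215.7 ns; at 50 MHz it predicts 12 * 20 ns + 95 ns = 335 ns against the measured 335.6 ns.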