A10 DRAM Controller Performance

Tests with the lima-memspeed program on a Cubietruck with a 1920x1080-32@60Hz monitor
The lima-memspeed program is a tool that tries to simulate different memory intensive workloads and measures how much of the memory bandwidth is actually available to be consumed by the CPU, GPU and other peripherals. It should provide an answer to the question of the optimal relationship between the MBUS and DRAM clock frequencies, and whether increasing the DRAM clock speed provides any practical improvement. Systems with a 32-bit DRAM bus width can't be analyzed well with the tinymembench program alone, because tinymembench only measures the memory bandwidth available to a single CPU core. A single CPU core can't consume all the DRAM bandwidth by itself, but may get some help from the other CPU cores and the other peripherals.

Usage: lima-memspeed [workload1] [workload2] ... [workloadN]

Where the 'workload' arguments are the identifiers of different memory bandwidth consuming workloads. Each workload is run in its own thread.
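To illustrate the general idea (a minimal sketch, not the actual lima-memspeed source; the buffer size, iteration count and names are made up), several write workloads can be run in parallel threads and timed together to get an aggregate bandwidth figure:

 /* Minimal sketch of the multi-threaded bandwidth measurement idea:
  * each workload runs in its own thread, and the aggregate write
  * bandwidth is computed from the total bytes touched over wall time.
  * Illustrative only; this is not lima-memspeed code.
  * Build with: gcc -O2 memspeed_sketch.c -o memspeed_sketch -lpthread */
 #include <pthread.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <time.h>
 
 #define BUF_SIZE    (32 * 1024 * 1024)  /* 32 MiB per worker (arbitrary) */
 #define ITERATIONS  16
 #define MAX_THREADS 16
 
 static void *fill_workload(void *arg)
 {
     char *buf = malloc(BUF_SIZE);
     if (!buf)
         return NULL;
     for (int i = 0; i < ITERATIONS; i++)
         memset(buf, i, BUF_SIZE);      /* each pass writes BUF_SIZE bytes */
     free(buf);
     return NULL;
 }
 
 int main(int argc, char *argv[])
 {
     int nthreads = (argc > 1) ? atoi(argv[1]) : 2;
     if (nthreads < 1 || nthreads > MAX_THREADS)
         nthreads = 2;
     pthread_t th[MAX_THREADS];
     struct timespec t0, t1;
 
     clock_gettime(CLOCK_MONOTONIC, &t0);
     for (int i = 0; i < nthreads; i++)
         pthread_create(&th[i], NULL, fill_workload, NULL);
     for (int i = 0; i < nthreads; i++)
         pthread_join(th[i], NULL);
     clock_gettime(CLOCK_MONOTONIC, &t1);
 
     double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
     double mib = (double)nthreads * ITERATIONS * BUF_SIZE / (1024.0 * 1024.0);
     printf("%d threads: %.1f MiB/s aggregate write bandwidth\n",
            nthreads, mib / sec);
     return 0;
 }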

The list of available workload identifiers:

 fb_blank                      (blank the screen in order not to drain memory bandwidth)
 fb_scanout                    (take the framebuffer scanout bandwidth into account)
 gpu_write                     (use the lima driver to solid fill the screen)
 gpu_copy                      (use the lima driver to copy a texture to the screen)
 neon_write                    (use ARM NEON to fill a memory buffer)
 neon_write_backwards          (use ARM NEON to fill a memory buffer backwards)
 neon_read_pf32                (use ARM NEON to read from a memory buffer, 32 byte prefetch step)
 neon_read_pf64                (use ARM NEON to read from a memory buffer, 64 byte prefetch step)
 neon_copy_pf64                (use ARM NEON to copy a memory buffer, 64 byte prefetch step)
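For example, to keep the framebuffer scanout bandwidth accounted for while loading the memory with a NEON reader and a NEON writer in two more threads:

 lima-memspeed fb_scanout neon_read_pf64 neon_write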

(*) copying 1 MB per second from one memory location to another counts as 2 MB/s of bandwidth in this table, because both the reads and the writes are accounted for.

'''Based on the benchmark results above, it looks like systems with a full 32-bit memory interface want to have both MBUS and DRAM clocked at a high speed, while Allwinner A20 systems with only a 16-bit memory interface (such as the Olimex A20-OLinuXino-Lime) should not suffer any obvious extra bandwidth penalty even if MBUS is clocked slower than DRAM.'''

Tests with the tinymembench program on an A13-OLinuXino-Micro with the screen blanked
The Allwinner A13 user manual specifies a 300 MHz clock speed limit for MBUS. And indeed, with only a 16-bit external DDR3 memory interface to deal with, clocking MBUS at a very high speed may be unnecessary (assuming that MBUS internally has the same width as in its A10/A20 siblings). So it is quite interesting to check whether running MBUS at half the DRAM clock speed is fast enough for the A13.
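For a rough sense of the numbers on the DRAM side: a 16-bit DDR3 interface transfers 2 bytes twice per clock cycle, so its theoretical peak bandwidth is easy to compute (a back-of-the-envelope sketch; the clock frequencies below are just example values, and the MBUS internal width is deliberately left out as unknown):

 /* Theoretical peak bandwidth of a 16-bit DDR3 interface:
  * 2 transfers per clock (DDR) x 2 bytes per transfer. */
 #include <stdio.h>
 
 int main(void)
 {
     const int width_bytes = 16 / 8;              /* 16-bit bus = 2 bytes */
     const int clocks_mhz[] = { 300, 408, 533 };  /* example DRAM clocks */
 
     for (unsigned i = 0; i < sizeof(clocks_mhz) / sizeof(clocks_mhz[0]); i++) {
         double peak_mb_s = 2.0 * clocks_mhz[i] * width_bytes;
         printf("DRAM %d MHz: %.0f MB/s theoretical peak\n",
                clocks_mhz[i], peak_mb_s);
     }
     return 0;
 }

At 533 MHz this works out to roughly 2133 MB/s, which a half-speed MBUS only needs to match if its internal width is narrow enough to become the bottleneck.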

Benchmarks have been run on Olimex_A13-OLinuXino-Micro with different MBUS/DRAM clock settings. The CPU clock speed was 1008 MHz, the AXI clock speed 504 MHz (overclocked). The screen was blanked in order not to drain memory bandwidth. The performance numbers were obtained with the tinymembench tool, using the 'NEON read prefetched (64 bytes step)', 'NEON fill' and 'NEON copy prefetched (64 bytes step)' subtests. The DRAM timings are accurately calculated for each clock frequency, assuming JEDEC Speed Bin 1333H (DDR3-1333 9-9-9).
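To illustrate how such timing scaling works (a hedged sketch, not the actual script used for these benchmarks), the cycle counts can be derived from the speed bin's nanosecond minimums; the 13.125 ns value below is the JEDEC tAA/tRCD/tRP minimum for the 1333H bin, and the clock frequencies are just example settings:

 /* Sketch: derive DDR3 timing values (in clock cycles) from the JEDEC
  * 1333H speed bin nanosecond minimums, for an arbitrary DRAM clock.
  * Build with: gcc -O2 timings.c -o timings -lm */
 #include <math.h>
 #include <stdio.h>
 
 int main(void)
 {
     const double taa_ns = 13.125;  /* tAA/tRCD/tRP min for DDR3-1333H (CL9) */
     const int clocks_mhz[] = { 360, 408, 480, 533 };
 
     for (unsigned i = 0; i < sizeof(clocks_mhz) / sizeof(clocks_mhz[0]); i++) {
         double tck_ns = 1000.0 / clocks_mhz[i];
         int cl = (int)ceil(taa_ns / tck_ns);  /* round up to whole cycles */
         printf("DRAM %d MHz: tCK = %.3f ns, CL = %d cycles\n",
                clocks_mhz[i], tck_ns, cl);
     }
     return 0;
 }

For example, at a 533 MHz DRAM clock (tCK = 1.876 ns) this gives CL = 7 cycles, because 13.125 ns / 1.876 ns rounds up to 7.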

Please note that the DRAM clock speeds above 533 MHz (and MBUS above 300 MHz) may be considered overclocking if we trust the Allwinner manuals!

The non-cached read latency numbers from the table above should involve no TLB misses and exactly one DRAM access per read. They seem to fit the (12 * mbus_cycle_time + 95 ns) formula quite nicely. It might be that MBUS contributes 12 of its cycles to the memory access latency.
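As a quick sanity check of that fitted formula (plain arithmetic with the constants from above; the MBUS frequencies are just example values):

 /* Evaluate the fitted latency model: latency = 12 * mbus_cycle_time + 95 ns. */
 #include <stdio.h>
 
 int main(void)
 {
     const int mbus_mhz[] = { 200, 300, 400 };  /* example MBUS clock settings */
 
     for (unsigned i = 0; i < sizeof(mbus_mhz) / sizeof(mbus_mhz[0]); i++) {
         double cycle_ns = 1000.0 / mbus_mhz[i];
         printf("MBUS %d MHz: predicted latency = %.1f ns\n",
                mbus_mhz[i], 12.0 * cycle_ns + 95.0);
     }
     return 0;
 }

For example, at 300 MHz MBUS (a 3.33 ns cycle time) the model predicts 12 * 3.33 ns + 95 ns, i.e. about 135 ns per non-cached read.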