A10 DRAM Controller Calibration

=Overview of the DRAM controller features affecting the clock speed limit and reliability=

This section provides information about DDR3 memory in general and an overview of the relevant configuration features of the A10/A13/A20 DRAM controller.

DQS gate training
The DQ data lines and the DQS/DQS# strobe lines are used both for sending data to the DRAM chips and for receiving data back. As a result, the DRAM controller must switch between reading and writing at appropriate times. After sending a read command to the DRAM chip, we expect a response after a certain delay. When this response arrives, the DQS gate must be open to let the data in. Once the data is fully received and we need to switch back to writing, the DQS gate has to be closed. To allow a certain level of tolerance to timing skew, every batch of read operations is surrounded by a 0.9 cycle long "preamble" and a 0.3 cycle long "postamble". The gate has to open during the "preamble" and close during the "postamble".

An important parameter to configure is the delay between submitting read commands and opening the DQS gate to receive the responses. It is written to the SDR_RSLR0/SDR_RSLR1 and SDR_RDGR0/SDR_RDGR1 registers, which configure it with 1/4 cycle granularity. Luckily, this delay can be detected automatically by the hardware (triggered by setting the CCR_DATA_TRAINING bit in the SDR_CCR register). Unluckily, the automatic detection is a bit flaky and sometimes ends up with unreliable settings, especially on a cold system start. This is a problem only at high DRAM clock frequencies; low frequencies are reasonably safe. So it makes a lot of sense to identify the optimal DQS gating delay for each board and override the hardware detection with a pre-defined delay in the 'dram_para' struct.

Other than the delay value itself, we have two types of windowing to select from:
 * passive (the DQS gate close time is calculated as the gate open time plus the duration of the read operation)
 * active (the DQS gate is auto-closing, internally implemented by watching for the last rising edge on the DQS line)

The passive windowing mode is activated by setting the CCR_DQS_GATE bit in the SDR_CCR register. However, accurately hitting the 0.3 cycle long "postamble" is difficult in the passive mode with just 1/4 cycle delay granularity. The active windowing mode exists to address this particular problem and should be preferred. Still, there is one good use for the passive windowing mode: the hardware DQS gate training itself. Since the passive mode has stricter timing requirements, the gating delay value obtained by the hardware DQS gate training is more accurate in passive mode.
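To put the cycle fractions above into absolute terms, here is a minimal Python sketch that converts the "preamble", the "postamble" and the 1/4 cycle delay step into picoseconds for a given DRAM clock. The function names and the sample clock speeds are illustrative assumptions; only the 0.9, 0.3 and 1/4 cycle figures come from the text.

```python
from fractions import Fraction

# Durations in DRAM clock cycles, as quoted in the text above.
PREAMBLE_CYCLES = Fraction(9, 10)
POSTAMBLE_CYCLES = Fraction(3, 10)
GATE_STEP_CYCLES = Fraction(1, 4)  # granularity of the DQS gating delay


def cycles_to_ps(cycles, clock_mhz):
    """Convert a duration in clock cycles to picoseconds (exact Fraction)."""
    return Fraction(cycles) * Fraction(1_000_000, clock_mhz)


def gate_timing_ps(clock_mhz):
    """Return the preamble, postamble and delay step widths in picoseconds."""
    return {
        "preamble": cycles_to_ps(PREAMBLE_CYCLES, clock_mhz),
        "postamble": cycles_to_ps(POSTAMBLE_CYCLES, clock_mhz),
        "step": cycles_to_ps(GATE_STEP_CYCLES, clock_mhz),
    }


if __name__ == "__main__":
    # Example clock speeds, chosen arbitrarily for illustration.
    for clock_mhz in (360, 480, 600):
        t = gate_timing_ps(clock_mhz)
        print(f"{clock_mhz} MHz: "
              f"preamble {float(t['preamble']):.0f} ps, "
              f"postamble {float(t['postamble']):.0f} ps, "
              f"step {float(t['step']):.0f} ps")
```

Whatever the clock speed, the "postamble" is only 1.2 times as wide as one delay step (0.3 versus 0.25 cycles), which leaves very little slack for the gate close time in passive mode.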

The hardware also supports DQS gating delay drift compensation (the CCR_DQS_DRIFT_COMP bit in the SDR_CCR register), which automatically adjusts the delay at runtime if necessary. In practice, however, experiments show that enabling the drift compensation feature just makes reliability worse, so we should avoid it.

Impedance settings, ODT and ZQ calibration
The tracks on the PCB connect the DRAM controller with the DDR3 chip(s) and behave like any other wires. Signal integrity may vary a lot depending on whether the impedance matching has been done properly. Both the output drive impedance and the termination impedance can (and should) be adjusted on both ends of the track. For memory write operations, we deal with the DRAM controller output drive impedance and the DDR3 termination impedance. Vice versa, for memory read operations, we deal with the DRAM controller termination impedance and the DDR3 output drive impedance.

The ODT abbreviation means on-die termination. The internal resistors that implement the configurable impedance are located on-die both in the SoC (for the DRAM controller) and in the DDR3 chips. Because the accuracy of the on-die resistors is not so great, they are calibrated against external high precision 240 ohm resistors at initialization time (both on the DRAM controller side and on the DDR3 chip side) and optionally re-calibrated periodically at run time (on the DDR3 chip side, coupled with the refresh operation). This calibration against the external resistor is called ZQ calibration. When looking at the device schematics, one can normally find at least two high precision 240 ohm resistors: one connected to the SoC and one connected to the DRAM chip. For example, A13-OLinuXino-MICRO has these resistors connected to the DZQ and the ZQ pins.

The purpose of the ZQ calibration is only to ensure that the configured impedance settings are applied accurately. For example, if we configure the 240/4 ohm termination impedance, then we want to be sure that it is really 60 ohm on every board, regardless of the PVT (process-voltage-temperature) differences. ZQ calibration solves this. But the selection of optimal impedance divisors is still the responsibility of the user, because they are not configured automatically by the hardware. For Allwinner A10/A13/A20 based devices, the impedance divisors are specified in the 'dram_para' struct in the u-boot bootloader via the following parameters:
 * the 'zq' and the 'odt_en' variables (see the SDR_ZQCR0 register) for the impedance on the DRAM controller end of the wire
 * the 'emr1' variable (see the description of the MR1 configuration register bits in the DDR3 spec or the DRAM datasheet) for the impedance on the DDR3 chip end of the wire
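As an illustration of the DDR3 side, the following Python sketch decodes the termination (Rtt_nom) and output drive impedance from an 'emr1' value, assuming that 'emr1' carries the MR1 bits unchanged. The bit positions and divisor tables are written down from the standard JEDEC DDR3 MR1 layout as I understand it (Rtt_nom in bits 9, 6 and 2, output drive strength in bits 5 and 1); treat them as assumptions and verify them against the datasheet of the actual DRAM chip. The function and table names are made up for this example.

```python
RZQ_OHM = 240  # external high precision calibration resistor

# Assumed MR1 encodings (check the DRAM datasheet before relying on them).
# Rtt_nom code is formed from MR1 bits (9, 6, 2); None means ODT disabled.
RTT_NOM_DIVISOR = {0b000: None, 0b001: 4, 0b010: 2,
                   0b011: 6, 0b100: 12, 0b101: 8}
# Output drive strength code is formed from MR1 bits (5, 1).
DRIVE_DIVISOR = {0b00: 6, 0b01: 7}


def _bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def decode_emr1(emr1):
    """Decode the impedance related MR1 bits into ohms."""
    rtt_code = (_bit(emr1, 9) << 2) | (_bit(emr1, 6) << 1) | _bit(emr1, 2)
    drv_code = (_bit(emr1, 5) << 1) | _bit(emr1, 1)
    if rtt_code not in RTT_NOM_DIVISOR or drv_code not in DRIVE_DIVISOR:
        raise ValueError(f"reserved MR1 impedance encoding in {emr1:#x}")
    rtt_div = RTT_NOM_DIVISOR[rtt_code]
    return {
        # None is kept as-is to signal that termination is switched off.
        "rtt_nom_ohm": None if rtt_div is None else RZQ_OHM / rtt_div,
        "drive_ohm": RZQ_OHM / DRIVE_DIVISOR[drv_code],
    }


if __name__ == "__main__":
    for value in (0x0, 0x4, 0x42):  # sample values for illustration only
        print(f"emr1={value:#x}: {decode_emr1(value)}")
```

Under these assumptions, an 'emr1' value of 0x4 selects the 240/4 = 60 ohm termination used as the example in the text above.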

Additional references:
 * DDR3 Dynamic On-Die Termination
 * DDR3 ZQ Calibration

CLK-DQS timing de-skew, read and write leveling
If the PCB track lengths are mismatched, there may be some timing skew between the CMD/ADD/CLK, DQ and/or DQS/DQS# signals. A general overview can be found in the "New Features of DDR3 SDRAM" pdf. The "Altera - Utilizing Leveling Techniques in DDR3 SDRAM Memory Interfaces" pdf is also quite interesting, even though it talks about a different DRAM controller and is not directly applicable.

The A10/A13/A20 DRAM controller has a lot of knobs to configure various delays, even down to the level of individual bits. However, the DRAM controller does not implement any hardware assistance for automatic read/write leveling at all. So we have to use some other method to explore the vast space of possible configurations and find the one that works best. Once a good configuration for the delay adjustments is identified, we can hardcode it into the 'dram_para' struct in the u-boot bootloader for each board type.

Right now, all the delay related configuration is exposed as the 'tpr3' variable in the 'dram_para' struct. This variable is a hexadecimal number, composed of the following bit-fields:
 * bits [22:20] - mapped to MFWDLY bits of the command lane
 * bits [18:16] - mapped to MFBDLY bits of the command lane
 * bits [15:12] - mapped to SDPHASE bits of the byte lane 3
 * bits [11:8] - mapped to SDPHASE bits of the byte lane 2
 * bits [7:4] - mapped to SDPHASE bits of the byte lane 1
 * bits [3:0] - mapped to SDPHASE bits of the byte lane 0

Basically, adjusting bits 22:16 in the 'tpr3' parameter tweaks the delays on the command lane. Because this changes the relative delay between the signals on the command lane and the signals on the byte lanes, it effectively adjusts the delays for both write and read operations. Adjusting bits 15:0 in the 'tpr3' parameter makes it possible to postpone or advance the sampling of incoming data for read operations (relative to the default 90 degrees phase). Since we can control read and write delays almost independently of each other, the 'tpr3' parameter is good enough for simple de-skew adjustments. There are also other delay related knobs in the DRAM controller, but they are not exposed in the 'dram_para' struct yet.
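The 'tpr3' bit-field layout listed above can be packed and unpacked with a few shifts and masks. The Python sketch below is only an illustration: the field names in the table and the helper function names are chosen for this example, and the raw field values are not interpreted (the mapping from a field value to an actual delay or phase is not covered here).

```python
# name -> (lowest bit, width in bits), following the list above.
TPR3_FIELDS = {
    "mfwdly":   (20, 3),  # command lane MFWDLY
    "mfbdly":   (16, 3),  # command lane MFBDLY
    "sdphase3": (12, 4),  # byte lane 3 SDPHASE
    "sdphase2": (8, 4),   # byte lane 2 SDPHASE
    "sdphase1": (4, 4),   # byte lane 1 SDPHASE
    "sdphase0": (0, 4),   # byte lane 0 SDPHASE
}


def unpack_tpr3(tpr3):
    """Split a 'tpr3' value into its raw bit-fields."""
    return {name: (tpr3 >> shift) & ((1 << width) - 1)
            for name, (shift, width) in TPR3_FIELDS.items()}


def pack_tpr3(**fields):
    """Build a 'tpr3' value from raw bit-fields (missing fields are 0)."""
    tpr3 = 0
    for name, value in fields.items():
        shift, width = TPR3_FIELDS[name]  # KeyError on an unknown field name
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        tpr3 |= value << shift
    return tpr3


if __name__ == "__main__":
    value = 0x125432  # arbitrary sample value, not a recommended setting
    fields = unpack_tpr3(value)
    print(fields)
    print(hex(pack_tpr3(**fields)))
```

If every field were varied independently, this layout alone would give 8 * 8 * 16^4 = 4194304 combinations, which is why the search has to be narrowed down rather than tried exhaustively.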

DDR3 timing parameters
The description of DDR3 DRAM modules sometimes includes a sequence of 4 numbers separated by dashes, for example DDR3-1333 9-9-9-24. These four numbers are the values of the tCAS-tRCD-tRP-tRAS parameters, which matter the most for performance (lower is better). But there are more parameters than just these four. A complete list of timing parameters and their possible values can be found in the DDR3 spec (for the standard speed bins) and also in the datasheet of each DRAM chip, in case the chip supports tighter timings than required by the DDR3 standard. The A10/A13/A20 DRAM controller registers SDR_TPR0, SDR_TPR1 and SDR_TPR2 are used to configure these timing parameters. Please note that the DRAM controller expects these parameters in cycles, while DRAM datasheets usually provide them in nanoseconds. So a conversion is necessary to configure this right.

This configuration is provided by the 'tpr0', 'tpr1', 'tpr2' parameters in the u-boot 'dram_para' struct, which are directly written to the corresponding hardware registers on DRAM initialization.
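The nanoseconds to cycles conversion mentioned above can be sketched in Python as follows. Rounding up is an assumption made here as the conservative choice, so that a minimum timing is never violated. The function name is hypothetical, and the sketch does not cover how the resulting cycle counts are packed into the SDR_TPR0, SDR_TPR1 and SDR_TPR2 registers.

```python
import math
from fractions import Fraction


def ns_to_cycles(t_ns, clock_mhz):
    """Convert a minimum timing in nanoseconds to whole DRAM clock cycles.

    The result is rounded up, so the configured delay is never shorter
    than the datasheet value. Exact fractions avoid float rounding
    surprises when the result lands exactly on a cycle boundary.
    """
    cycles = Fraction(str(t_ns)) * Fraction(clock_mhz) / 1000
    return math.ceil(cycles)


if __name__ == "__main__":
    # 13.5 ns and 36 ns correspond to the 9 and 24 cycle figures of the
    # DDR3-1333 9-9-9-24 example at its nominal clock; the 480 MHz clock
    # below is just an example value.
    for name, t_ns in (("tRCD", 13.5), ("tRP", 13.5), ("tRAS", 36)):
        print(f"{name}: {t_ns} ns -> {ns_to_cycles(t_ns, 480)} cycles")
```

Note that some DDR3 parameters are specified as the larger of a cycle count and a time in nanoseconds, so the datasheet has to be consulted for each parameter individually.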

=Finding optimal DRAM settings for your board or device=

The DRAM controller overview in the previous chapters contains some passages highlighted in red. Basically, they say that the A10/A13/A20 DRAM controller lacks the decent automatic DDR3 configuration features enjoyed by high-end ARM or x86 desktop systems. Using a bad configuration, or just keeping the defaults, does not allow reaching really high DDR3 clock speeds.

To overcome this hardware limitation and allow significantly faster DRAM clock speeds, we essentially brute-force search for a good configuration, using the lima-memtester program as a tool to evaluate and compare the reliability of different settings. This method can be used by anyone and does not require any special lab equipment, simulation software or anything else. However, hardcoding the impedance and delays is not a perfectly universal solution. The optimal DRAM settings found with this method can only be used for a single device model (and in some cases are even limited to a single PCB revision).

The next chapters contain the description of the exact step by step procedure. Be warned that it is a very long iterative process and may take up to a week to find something useful! But the results typically pay off and reward you with much better memory performance.

The kernel, distro rootfs and userland tools
The sunxi-3.4 kernel is required (specifically for the mali kernel module). The mainline kernel is not supported yet because it is lacking in the graphics department. Also, the process of probing different DRAM settings involves a lot of watchdog triggered reboots, which may corrupt the file system pretty fast. So it is strongly recommended to set up booting over the network with an NFS root file system. Any other configuration is only going to bring unnecessary trouble and is completely unsupported by this guide.

Once the system is up and running, we need to install some prerequisites: git, cmake and the ruby scripting language interpreter. For example, on a Debian/Ubuntu distro it would be:

apt-get install git cmake ruby

And then compile and install the lima-memtester and a10-dram-tools:

cd /tmp
git clone https://github.com/ssvb/lima-memtester.git
cd lima-memtester
cmake -DCMAKE_INSTALL_PREFIX=/usr .
make -j2 install

cd /tmp
git clone https://github.com/ssvb/a10-dram-tools.git
cd a10-dram-tools
cmake -DCMAKE_INSTALL_PREFIX=/usr .
make -j2 install

The bootloader
The DRAM settings are configured in u-boot. And we are going to use the mainline u-boot with some extra pending DRAM patches:

Finding good impedance settings
=Other links=

Some links, which are not directly describing sunxi hardware, but may be useful for grasping the general concept:
 * Altera - Utilizing Leveling Techniques in DDR3 SDRAM Memory Interfaces
 * Freescale - i.MX 6 Series DDR Calibration
 * DDR3 introduction slides
 * Samsung - Mobile DRAM’s Frequently violated parameters Application Note