A10 DRAM Controller Calibration

=Overview of the DRAM controller features affecting the clock speed limit and reliability=

This section provides information about DDR3 memory in general and an overview of the relevant configuration features of the A10/A13/A20 DRAM controller.

DQS gate training
The DQ data lines and the DQS/DQS# strobe lines are used both for sending data to the DRAM chips and for receiving data back, so the DRAM controller must switch between reading and writing at the appropriate time. After a read command is sent to the DRAM chip, the response arrives after a certain delay. When it arrives, the DQS gate must be open to let the data in. Once the data has been fully received and the controller needs to switch to writing, the DQS gate has to be closed. To tolerate some timing skew, the read burst is surrounded by a 0.9 cycle long "preamble" and a 0.3 cycle long "postamble". The gate has to be opened during the preamble and closed during the postamble.

The parameter that needs calibration here is the delay before opening the DQS gate. It is configured separately for each byte lane, with 1/4 cycle precision, in the SDR_RSLR0/SDR_RSLR1 and SDR_RDGR0/SDR_RDGR1 registers. Luckily, the hardware can detect this delay automatically. Unfortunately, the automatic detection is somewhat flaky and sometimes ends up with unreliable settings, especially on a cold restart. So it makes a lot of sense to identify the optimal DQS gating delay for each board (based on reliability tests) and override the hardware detection with a predefined delay in the 'dram_para' struct.
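To make the 1/4 cycle encoding concrete, here is a minimal C sketch that splits a per-lane DQS gating delay, counted in quarter cycles, into a whole-cycle part and a quarter-cycle remainder. It assumes (this is not stated above) that 'dqs_gating_delay' holds one byte per lane, as in the 0x07070707 value in the Cubietruck configuration below, that the whole cycles go into a 3-bit field per lane in SDR_RSLRx, and that the remainder goes into a 2-bit field per lane in SDR_RDGRx. The struct and function names are hypothetical; check the field layout against the A10 DRAM Controller Register Guide before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register images for one rank (not real u-boot names). */
struct dqs_gate_regs {
    uint32_t rslr; /* assumed: 3 bits of whole-cycle delay per byte lane */
    uint32_t rdgr; /* assumed: 2 bits of quarter-cycle delay per byte lane */
};

/*
 * Split a packed delay (one byte per lane, in units of 1/4 cycle) into
 * whole cycles and remaining quarter cycles for each byte lane.
 */
static struct dqs_gate_regs split_dqs_gating_delay(uint32_t dqs_gating_delay,
                                                   unsigned lanes)
{
    struct dqs_gate_regs r = {0, 0};
    assert(lanes <= 4);
    for (unsigned lane = 0; lane < lanes; lane++) {
        uint32_t quarters = (dqs_gating_delay >> (lane * 8)) & 0xffu;
        uint32_t cycles = quarters >> 2;  /* quarters / 4 */
        assert(cycles <= 7);              /* must fit the assumed 3-bit field */
        r.rslr |= cycles << (lane * 3);
        r.rdgr |= (quarters & 3u) << (lane * 2); /* remainder, in 1/4 cycles */
    }
    return r;
}
```

With 0x07070707 and four lanes, every lane gets 7 quarter cycles (1 whole cycle plus 3/4), so this sketch yields rslr = 0x249 and rdgr = 0xff.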

Apart from the delay value itself, there are two types of windowing:
 * passive (the DQS gate close time is calculated as the gate open time plus the duration of the read operation)
 * active (the DQS gate closes automatically, based on monitoring the DQS line)

Accurately hitting the 0.3 cycle long postamble in passive mode is difficult with only 1/4 cycle delay granularity. The active windowing mode solves exactly this problem.
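To put a number on this, here is a small C sketch under a deliberately simplified model (an assumption for illustration, not a description of how the A10 hardware works): in passive mode the close point can only land on a 1/4 cycle grid, while the 0.3 cycle postamble may start at any offset relative to that grid. For each offset, the sketch picks the grid point farthest from both postamble edges, then reports the worst case over all offsets. Times are counted in 1/1000 cycle units to keep the arithmetic exact; the macro and function names are hypothetical.

```c
#include <assert.h>

#define GRID_STEP 250 /* 1/4 cycle delay granularity, in 1/1000 cycle units */
#define POSTAMBLE 300 /* 0.3 cycle postamble, in 1/1000 cycle units */

/* Best margin (distance to the nearer window edge) for a postamble window
 * starting at offset t, using only grid-aligned close points. */
static int best_margin(int t)
{
    int best = -1;
    for (int p = 0; p <= t + POSTAMBLE + GRID_STEP; p += GRID_STEP) {
        if (p < t || p > t + POSTAMBLE)
            continue; /* this close point is outside the postamble */
        int to_start = p - t;
        int to_end = t + POSTAMBLE - p;
        int margin = to_start < to_end ? to_start : to_end;
        if (margin > best)
            best = margin;
    }
    return best;
}

/* Worst case of best_margin() over all offsets within one grid step. */
static int worst_case_margin(void)
{
    int worst = POSTAMBLE;
    for (int t = 0; t < GRID_STEP; t++) {
        int m = best_margin(t);
        assert(m >= 0); /* a 0.3 cycle window always contains a grid point */
        if (m < worst)
            worst = m;
    }
    return worst;
}
```

In this model, best_margin(100) is 150 (a grid point sits right in the middle of the postamble), but worst_case_margin() is only 25, i.e. 0.025 cycle or roughly 39 ps at the 648 MHz dram_clk used in the Cubietruck example below. Any skew beyond that either cuts off the end of the burst or keeps the gate open past the postamble. Active windowing does not depend on this margin, because the gate closes based on the DQS line itself.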

The hardware supposedly also supports DQS gating delay drift compensation, which adjusts the delay automatically at runtime if necessary. In practice, however, enabling the drift compensation feature only makes reliability worse.

Impedance settings, ODT and ZQ calibration
The PCB tracks that connect the DRAM controller to the DDR3 chips behave as transmission lines, and impedance matching is used to improve signal integrity. Both the drive and the termination impedance can (and should) be adjusted on both ends of each track. For memory write operations, the relevant settings are the DRAM controller output drive impedance and the DDR3 termination impedance. For memory read operations, they are the DRAM controller termination impedance and the DDR3 output drive impedance.

ODT stands for on-die termination. The internal resistors that implement the configurable impedance are located on-die, both in the SoC (for the DRAM controller) and in the DDR3 chips. Because on-die resistors are not very accurate, they are calibrated against high-precision 240 ohm resistors at initialization time (on both the DRAM controller side and the DDR3 side) and optionally re-calibrated periodically at run time (on the DDR3 side, together with the refresh operation). This process is called ZQ calibration. On the device schematics, you can find at least two high-precision 240 ohm resistors: one connected to the SoC and one connected to the DRAM chip.

The purpose of ZQ calibration is only to ensure that the configured impedance settings are applied accurately. For example, if we configure a 1/4 divisor for the termination impedance, we want to be sure that it really is "240 ohm / 4 = 60 ohm" on every board, and ZQ calibration takes care of this. Selecting the optimal impedance divisors, however, remains the user's responsibility, because they are not configured automatically. For Allwinner A10/A13/A20 based devices, these divisors are configured in the 'dram_para' struct in u-boot:
 * For the DRAM controller side of the wire, these are the 'zq' and the 'odt_en' variables (see A10_DRAM_Controller_Register_Guide)
 * For the DDR3 chip side of the wire, this is the 'emr1' variable (see the description of the MR1 configuration register bits in the DDR3 spec or the DRAM datasheet).
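As an illustration of the DDR3 side, the following C sketch decodes the output driver impedance and nominal termination (Rtt_Nom) divisors from an 'emr1' value, using the MR1 bit positions defined in the DDR3 spec (JEDEC JESD79-3): bits A1/A5 select the output driver impedance (RZQ/6 or RZQ/7) and bits A2/A6/A9 select Rtt_Nom, where RZQ is the 240 ohm reference resistor. The helper names are hypothetical, and a return value of 0 stands for "disabled or reserved". The DRAM controller side ('zq', 'odt_en') is not covered.

```c
#include <assert.h>
#include <stdint.h>

#define RZQ_OHM 240 /* external ZQ reference resistor */

/* Output driver impedance divisor from MR1 bits A5 and A1:
 * 00 -> RZQ/6 (40 ohm), 01 -> RZQ/7 (34 ohm), 1x -> reserved. */
static int mr1_drive_divisor(uint32_t emr1)
{
    unsigned dic = (((emr1 >> 5) & 1u) << 1) | ((emr1 >> 1) & 1u);
    switch (dic) {
    case 0: return 6;
    case 1: return 7;
    default: return 0; /* reserved encoding */
    }
}

/* Rtt_Nom divisor from MR1 bits A9, A6 and A2 (0 = disabled or reserved):
 * 001 -> RZQ/4, 010 -> RZQ/2, 011 -> RZQ/6, 100 -> RZQ/12, 101 -> RZQ/8. */
static int mr1_rtt_nom_divisor(uint32_t emr1)
{
    static const int divisors[8] = {0, 4, 2, 6, 12, 8, 0, 0};
    unsigned rtt = (((emr1 >> 9) & 1u) << 2) | (((emr1 >> 6) & 1u) << 1) |
                   ((emr1 >> 2) & 1u);
    return divisors[rtt];
}

/* Nominal impedance in ohms for a divisor (integer division, 0 if unset). */
static int divisor_to_ohm(int divisor)
{
    return divisor ? RZQ_OHM / divisor : 0;
}
```

For the emr1 = 0x42 value in the Cubietruck configuration below, this gives RZQ/7 (about 34 ohm) output drive and RZQ/2 = 120 ohm Rtt_Nom on the DDR3 side.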

Additional references:
 * DDR3 Dynamic On-Die Termination
 * DDR3 ZQ Calibration

CLK-DQS timing de-skew (Read and Write leveling)
Even if the DQS gating window is opened and closed at the right time and the impedance is perfectly matched, one more potential problem still affects reliability: the timing skew between the CMD/ADD/CLK, DQ and DQS/DQS# lines. A general overview can be found in the "New Features of DDR3 SDRAM" PDF.

Fortunately, the A10/A13/A20 DRAM controller has a number of knobs for configuring various delays, down to the individual bit level. Unfortunately, it does not implement any hardware support for automatic read/write leveling at all. So once again, it is up to us to explore the vast space of possible configurations, find the one that works best, and then hardcode it into the 'dram_para' struct in u-boot for each board.

A somewhat simplistic way to configure read/write leveling is the 'tpr3' variable in the 'dram_para' struct. It is a single integer, usually written in hexadecimal, composed of the following bit fields:
 * bits [22:20] - MFWDLY of the command lane
 * bits [18:16] - MFBDLY of the command lane
 * bits [15:12] - SDPHASE of the byte lane 3
 * bits [11:8] - SDPHASE of the byte lane 2
 * bits [7:4] - SDPHASE of the byte lane 1
 * bits [3:0] - SDPHASE of the byte lane 0
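The bit-field list above translates directly into shifts and masks. A minimal C sketch that follows the field positions exactly as listed; the struct and function names are hypothetical and not part of u-boot.

```c
#include <assert.h>
#include <stdint.h>

struct tpr3_fields {
    unsigned mfwdly;     /* bits [22:20], command lane */
    unsigned mfbdly;     /* bits [18:16], command lane */
    unsigned sdphase[4]; /* byte lanes 0..3: bits [3:0], [7:4], [11:8], [15:12] */
};

static struct tpr3_fields tpr3_decode(uint32_t tpr3)
{
    struct tpr3_fields f;
    f.mfwdly = (tpr3 >> 20) & 0x7u;
    f.mfbdly = (tpr3 >> 16) & 0x7u;
    for (int lane = 0; lane < 4; lane++)
        f.sdphase[lane] = (tpr3 >> (lane * 4)) & 0xfu;
    return f;
}

static uint32_t tpr3_encode(const struct tpr3_fields *f)
{
    assert(f->mfwdly <= 7 && f->mfbdly <= 7);
    uint32_t v = ((uint32_t)f->mfwdly << 20) | ((uint32_t)f->mfbdly << 16);
    for (int lane = 0; lane < 4; lane++) {
        assert(f->sdphase[lane] <= 15); /* 4-bit field */
        v |= (uint32_t)f->sdphase[lane] << (lane * 4);
    }
    return v;
}
```

Decoding and re-encoding round-trips any value that only uses the listed bits; bits outside these fields are ignored by the decoder.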

In short, adjusting bits 22:16 of 'tpr3' affects the delays on the command lane as well as the write timings for all byte lanes. Adjusting bits 15:0 (keeping SDPHASE the same for every byte lane for the sake of even more simplicity) affects the read timings for all byte lanes. One of these configurations ought to work better than the others, and we can find it. A nice property of this approach is that the test results are easy to present as a table, where moving horizontally between cells changes the read timings (of all byte lanes as a whole) and moving vertically changes the write timings (of all byte lanes as a whole). Here is an example of such a color coded table (one of the steps from the Cubietruck DRAM settings calibration):

  dcdc3_vol         = 1325
  dram_clk          = 648
  mbus_clk          = 600
  dram_type         = 3
  dram_rank_num     = 1
  dram_chip_density = 4096
  dram_io_width     = 8
  dram_bus_width    = 32
  dram_cas          = 9
  dram_zq           = 0x2c
  dram_odt_en       = 3
  dram_tpr0         = 0x429899b4
  dram_tpr1         = 0xa0a0
  dram_tpr2         = 0x2c200
  dram_tpr3         = 0x182222
  dram_emr1         = 0x42
  dram_emr2         = 0x10
  dram_emr3         = 0x0
  dqs_gating_delay  = 0x07070707
  active_windowing  = 1

There may well be some read or write delay mismatch between individual lanes (not to mention individual bits), but at this stage we can at least find the least problematic common delay across all lanes. Selecting the 'tpr3' value from the middle of the green area in the table above improves reliability compared to the .tpr3 = 0x000000 default (no delay adjustments at all).
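A minimal C sketch of how the candidate 'tpr3' values for such a table could be generated. It assumes (a simplification not spelled out above) that a row selects MFWDLY with MFBDLY held at 0, and that a column selects one SDPHASE value applied to all four byte lanes. The function names and the row/column ranges are hypothetical (not every 4-bit SDPHASE encoding is necessarily a meaningful phase setting), and running the actual reliability test for each cell is left out.

```c
#include <assert.h>
#include <stdint.h>

#define TPR3_ROWS 8  /* MFWDLY 0..7 (3-bit field) */
#define TPR3_COLS 16 /* SDPHASE 0..15, assumed sweep range */

/* tpr3 for one table cell: row -> write/command delay, column -> read phase.
 * Assumption: MFBDLY (bits 18:16) is kept at 0 and only MFWDLY varies. */
static uint32_t tpr3_cell(unsigned row_mfwdly, unsigned col_sdphase)
{
    assert(row_mfwdly < TPR3_ROWS && col_sdphase < TPR3_COLS);
    return ((uint32_t)row_mfwdly << 20) |
           (uint32_t)col_sdphase * 0x1111u; /* same SDPHASE in lanes 0..3 */
}

/* Fill the whole grid of candidate tpr3 values. */
static void tpr3_grid(uint32_t grid[TPR3_ROWS][TPR3_COLS])
{
    for (unsigned r = 0; r < TPR3_ROWS; r++)
        for (unsigned c = 0; c < TPR3_COLS; c++)
            grid[r][c] = tpr3_cell(r, c);
}
```

Each grid cell would then be tested for reliability; the cell at row 1, column 2 corresponds to tpr3 = 0x102222, and cell (0, 0) is the 0x000000 default.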

DDR3 timing parameters
The description of DDR3 DRAM modules sometimes includes a sequence of four numbers separated by dashes, for example DDR3-1333 9-9-9-24. These are the values of the tCAS (CL), tRCD, tRP and tRAS parameters in clock cycles, which matter most for performance (lower is better). But there are many more timing parameters than these four. A complete list of timing parameters and their possible values can be found in the DDR3 spec (for the standard speed bins), and also in the datasheet of each DRAM chip in case the chip supports tighter timings than the DDR3 standard requires. The A10/A13/A20 DRAM controller registers SDR_TPR0, SDR_TPR1 and SDR_TPR2 are used to configure these timing parameters. Note that the DRAM controller expects these parameters in clock cycles, whereas DRAM datasheets usually specify them in nanoseconds, so they have to be converted using the actual DRAM clock frequency.
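The conversion itself is simple arithmetic, but it must round up: a timing value from the datasheet is a minimum, and rounding down would violate it. A minimal C sketch that works in picoseconds and MHz to stay in integer arithmetic; the function name is hypothetical, the sample values in the test are only illustrative (always take the real ones from the DRAM chip datasheet), and packing the results into the SDR_TPR0..2 fields is not covered.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a minimum timing value in picoseconds to DRAM clock cycles at the
 * given DRAM clock in MHz. One cycle at f MHz lasts 1e6 / f ps, so
 * cycles = ps * f / 1e6, rounded up so the result never violates the minimum.
 */
static uint32_t ps_to_cycles(uint32_t ps, uint32_t dram_clk_mhz)
{
    assert(dram_clk_mhz > 0);
    uint64_t num = (uint64_t)ps * dram_clk_mhz;
    return (uint32_t)((num + 999999u) / 1000000u);
}
```

For example, at dram_clk = 648 (MHz, as in the Cubietruck configuration above), 13.125 ns becomes ceil(8.505) = 9 cycles and 35 ns becomes ceil(22.68) = 23 cycles.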

=Finding optimal DRAM settings for your board or device=

This section describes a procedure that can be used to find optimal DRAM settings for each Allwinner A10/A13/A20 based development board or device.

=Other links=

Some links that do not describe sunxi hardware directly, but may be useful for grasping the general concepts:
 * Altera - Utilizing Leveling Techniques in DDR3 SDRAM Memory Interfaces
 * Freescale - i.MX 6 Series DDR Calibration
 * DDR3 introduction slides
 * Samsung - Mobile DRAM’s Frequently violated parameters Application Note