Sunxi-Cedrus

sunxi-cedrus is an effort to bring hardware accelerated video de-/encoding on Allwinner SoC on a mainline kernel. It is made of a v4l kernel driver and a libVA backend. Those components are still in development and are not as advanced as libvdpau-sunxi which is designed to run on vendor's kernel and has been developed for much longer.

If you want to try it, you need a board with an A13 SoC, this page will try to guide you from your first steps with the driver to the details of its implementation. Make sure you've checked the known bugs and limitations before using it.

Installation
This procedure has been tested on the NextThingCo's CHIP board but should be adaptable to other A13 boards supported by the 4.8 mainline kernel. The first thing you need to do is to recompile a kernel with the sunxi-cedrus v4l driver and the corresponding device tree entry. You can follow the Mainline Kernel Howto but using the following repository:

https://github.com/FlorentRevest/linux-sunxi-cedrus

Use menuconfig to enable the VPU driver in Device Drivers -> Multimedia support -> Memory-to-memory multimedia devices -> Sunxi CEDRUS VPU driver. The driver should be compiled into the kernel and not as a module.

Your kernel will be located in arch/arm/boot/zImage and device tree in arch/arm/boot/dts/sun5i-r8-chip.dtb Don't forget to install kernel headers to your distribution if you want to be able to compile sunxi-cedrus-drv-video.

On a standard debian jessie system you will need to install the following build dependencies:

apt install git autoconf automake libtool pkg-config gcc libdrm-dev libva-dev libx11-dev make g++ vlc xorg

You will be able to compile and install a newer version of libVA supporting the LIBVA_DRIVER_NAME environment variable and then the sunxi-cedrus libVA backend:

git clone https://github.com/01org/libva cd libva git checkout 695f99ef0405cf4255e7767b44effb0da2fe706e ./autogen.sh --prefix=/usr --libdir=/usr/lib/arm-linux-gnueabihf/ make sudo make install

git clone https://github.com/FlorentRevest/sunxi-cedrus-drv-video cd sunxi-cedrus-drv-video ./autogen.sh make # DRM_CFLAGS=-I/path/to/your/linux/headers sudo make install

Once installed, make sure you've exported the environment variable telling VA to use the sunxi_cedrus backend, activate the VA X11 decoding in VLC settings (Tools -> Preferences -> Input / Codecs -> set 'Hardware-accelerated decoding' to 'VA-API video decoder via X11') and then run one of the sample media file you can find here and here

xinit& export DISPLAY=:0.0 export LIBVA_DRIVER_NAME=sunxi_cedrus vlc big_buck_bunny_480p_MPEG2_MP2_25fps_1800K.MPG vlc ducks_take_off_420_720p25.mp4

Supported features

 * MPEG2 Decoding
 * Partial MPEG4 Decoding

Known bugs and limitations

 * H264 and H265 are not supported (some of the underlying problems include: allocating 19 surfaces at once and queueing several slices by frames)
 * MPEG4 decoding has some glitches. When something moves in a video it usualy draws some kind of trace behind it, this behavior is believed to come from an inconsistency in movement prediction. (VLC also tries to SyncSurface more often than needed which results in error when dequeuing the capture buffers)
 * No encoding: this can be added later on but hasn't been tried yet
 * Direct rendering: currently, buffers coming out of the v4l driver in a tiled pixel format are converted to a standard YUV pixel format and then rendered on screen by ffmpeg/vlc. As soon as the support for YUV DRM planes will be added to the kernel, this behavior can be replaced and the performances will be much better.
 * Currently the video can only be played at a zoom of 1:1, otherwise the scaling is done by ffmpeg in full CPU and it is too slow. Having that in the DRM driver would also allow for hardware accelerated frames scaling.
 * We currently can't play a MPEG2 file and then a MPEG4 file (or vice versa) or the output will be full of garbage pixels. This is probably due to some registers of the MPEG engine being kept between the two decoding and "polluting" the MPEG4 decoding with older values from MPEG2 decoding. We should find a way to clear those registers when receiving a S_FMT ioctl in the kernel driver or when closing the video device.

Technical details and implementation
The Cedrus project has provided reverse engineering of the Allwinner's proprietary CedarX blob for a couple of years. This work has been done on the Allwinner's 3.4 kernel and led to the creation of a libVDPAU backend interfacing with the "cedar_dev" and "disp" kernel drivers available in the vendor's kernel. The "cedar_dev" kernel driver directly mapped registers and memory from the VE to the userspace and could potentially be a security risk. Those two drivers couldn't be upstreamed because they don't use any standard API or framework.

In order to use the VE on a mainline kernel, a new proper kernel driver had to be written from scratch with mainlinable methods in mind. The correct way to implement a codec device is to use the "video4linux" framework, referenced as v4l2. v4l2 handles many kind of video devices, some of them are cameras and sends data to a CAPTURE queue, others are screens and use data from an OUTPUT queue. Codec devices require a flow of data from OUTPUT to CAPTURE, they are "memory-to-memory" devices.

Until now, most of the codec devices were able to handle raw bitstream via a firmware, which means that the OUTPUT queue (containing the compressed input data) could directly contain a MPEG file and the CAPTURE queue was filled with video frames. But the inner working of Allwinner's VPU is different, indeed it requires prior bitstream parsing into smaller frames/slices alongside headers' data. This parsing can not be done in the kernel side since it would be a complex codebase to maintain so it requires a new user-space usecase, hence a new API.

The "Frame API" described here has been designed for the Rockchip's VP8 decoding support and is implemented here and here The "Frame API" aims to standardize the way VPU drivers should communicate frame by frame with the userspace. The advocated method is to bind "buffers" containing slice data from the OUTPUT queue to "extended controls" containing frame's header. The extended controls mechanism allows to send complex data structures to the kernel and program device's registers accordingly. However, the userspace might want to queue several frames in a row and set the corresponding extended controls at the same time. If the registers are programmed at the time an extended control is received, this means that at the time of processing a buffer, the registers might be programmed for another frame. This scenario is to be fixed by the "Request API".

The idea behind this API is to allow atomic operations like a QBUF and a S_EXT_CTRLS. As of August 2016, the "Request API" is still at the state of RFC, it has had quite a few proposals for the past few years but none of them got accepted into the kernel. The latest RFCs, related to the Media API are not able to handle controls so sunxi-cedrus had to use an older RFC.

The "sunxi-cedrus" kernel driver is hence made of a m2m v4l2 driver handling requests of MPEG2 or MPEG4 frames data with a standard header extended control. At the time of processing the m2m queue, it programs the VPU's registers depending on the used codec. Currently MPEG2 and MPEG4 are the only supported formats but H264 and H265 would be the next step.

A second limitation of the Allwinner's VPU is the need for buffers in the lower 256M of RAM. In order to allocate large sets of data in this area, "sunxi-cedrus" reserves a DMA pool that is then used by videobuf's dma-contig backend to allocate input and output buffers easily and integrate that with the v4l QBUF/DQBUF APIs.

From the userspace side of things, all the prior bitstream parsing is done by VA users(such as ffmpeg). Standard VA-API headers are given to the VA backend "sunxi-cedrus-drv-video" which is just in charge of ensuring a correspondence between v4l2 buffers and controls and VA structures. We can compare a VAPicture to a buffer plus an extended control in the OUTPUT queue and a VASurface to a multiplanar buffer in the CAPTURE queue. An Image is then "derived" from a Surface to produce a standard set of NV12 buffers that can be shown on screen by VLC for example. VA-API was an extremely appropriate choice compared to VDPAU since the data it provides are often very similar to the ones the VPU expects.

More info
You can HL Florent Revest (kido) on #cedrus on irc.freenode.net for in depth questions