Fb2Gratis.com

Multi-Processor System-on-Chip 2. Liliana Andrade
Чтение книги онлайн.

Читать онлайн книгу Multi-Processor System-on-Chip 2 - Liliana Andrade страница 19

Информация о книге:

Название: Multi-Processor System-on-Chip 2

Автор: Liliana Andrade

Издательство: John Wiley & Sons Limited

Жанр: Зарубежная компьютерная литература

Серия:

isbn: 9781119818403

isbn:

СКАЧАТЬ component decoder (MAP recursions) and on Turbo decoder level (iterative loop). The kernel operation in the MAP is the Add-Compare-Select (ACS) operation. Depending on the encoder’s memory depth, several ACS operations have to be executed to process a single trellis step in the forward and backward recursion, respectively. In total, 4 · K trellis computations and two interleavings have to be performed for one single Turbo decoding iteration. Hence, P is determined by the number of parallel trellis step computations performed by a decoding architecture. The maximum value of P for one decoding iteration, i.e. P = 1, can be achieved by an architecture if the forward/backward recursions of the MAP algorithm are unrolled and pipelined (functional parallelism), yielding the XMAP architecture. P = I if, in addition, iterations are unrolled and pipelined. For a detailed overview and discussion of such highly parallel decoders, we refer to (Weithoffer et al. 2018). Figure 2.2 (left) shows the layout of a Turbo decoder exploiting all levels of parallelism (P = I) and an optimized interleaver such that ω is very small (Weithoffer et al. 2018). The decoder achieves 102 Gbit/s information bit throughput for an information block length of 128 bit, on a low Vt 28 nm Fully Depleted Silicon on Insulator (FD-SOI) technology under worst-case PVT conditions, running with 800 MHz and performing four decoding iterations. Taking into account the code rate of R = 1/3, the corresponding coded throughput is 306 Gbit/s. The total area is 23.61 mm². Different colors represent the eight different MAP decoders originating from the four unrolled iterations (each iteration requires two MAP decoders). Although this architecture achieves an information bit throughput of more than 100 Gbit/s for small block sizes, it still suffers from limited flexibility in terms of block sizes and varying number of iterations, a large area and high power consumption. Hence, further research is mandatory.

Schematic illustration of 306 Gbit/s turbo decoder, 288 Gbit/s LDPC decoder, and 764 Gbit/s polar decoder.

Figure 2.2. Left: 306 Gbit/s turbo decoder. Middle: 288 Gbit/s LDPC decoder. Right: 764 Gbit/s polar decoder

2.3.2. LDPC decoder

LDPC decoding is based on an iterative message exchange between variable and check nodes on the Tanner graph that is represented by the parity check matrix H. Unlike the Turbo decoding, this BP has some inherent parallelism, since all check nodes (variable nodes) can be processed independently from each other. The decoder throughput is mainly limited by the iterative data exchange between the check and variable nodes. The result of each check node or variable node calculation has to be spread via the edges of the Tanner graph to all other connected nodes. The Tanner graph has very limited locality to provide good communications performance, which challenges an efficient implementation of the data transfers and largely impacts ω. Hence, in contrast to Turbo decoding, the BP algorithm is data transfer and not compute dominated. Let us consider an LDPC block code with a parity check matrix H that has #edges(H) 1-entries (number of 1’s in H equals the number of edges in the Tanner graph). Furthermore, let #proc_edges(A) denote the number of edges that can be processed in one clock cycle by a decoder architecture. The corresponding parallelism is . We can classify different decoder architectures. Partially parallel architectures: Only a subset of edges and nodes are processed in parallel, i.e. P < 1. These architectures are very common for large block sizes that use quasi-cyclic (QC) block codes. An example is the DVB-S2 standard. Fully parallel architectures at iteration level: All edges are processed in parallel, i.e. P = 1. All check nodes and variable nodes are instantiated as hardware units and the corresponding edges are hardwired. Because of the low locality in the Tanner graph, routing congestion (minimizing ω) is a big challenge in these architectures. Unrolled fully parallel architectures: These architectures are similar to the fully parallel architectures, but, in addition, the iterations are unrolled and pipelined, i.e. P = I. In every clock cycle, a new block is processed. Only this architectural approach is feasible to achieve a throughput towards 1 Tbit/s (Schläfer et al. 2013; Ghanaatian et al. 2018). Figure 2.2 (middle) shows the layout of an (672,420) LDPC decoder using the unrolled fully parallel architecture approach. The decoder is implemented in the same technology and under same PVT assumptions as the Turbo decoder and runs with 400 MHz, achieving a (coded) throughput of 288 Gbit/s. The total area is 3.31 mm². Different colors represent the individual iterations (in total 10).

2.3.3. Polar decoder

As discussed earlier, Polar codes have recently attracted much attention in the context of 5G. Successive Cancelation (SC), Successive Cancelation List (SCL) and BP are the most prominent decoding algorithms for Polar codes. Decoding corresponds to a breadth-first (BP) or depth-first traversal (SC, SCL) of the Polar Factor Tree (PFT) (Alamdar-Yazdi and Kschischang 2011), in which the received log-likelihood ratios from the channel are processed by the tree nodes. BP needs a large set of iterations to achieve the error correction performance of SC and is, thus, not well suited for very high throughput and low latency (Abbas et al. 2017). SC and SCL, on the other hand, have sequential behavior due to the mandatory depth-first tree traversal. To achieve a very high throughput, the tree traversal can be unrolled and pipelined (Giard et al. 2015), very similar to the iteration unrolling in Turbo code and LDPC code decoding as described above. Whenever a node is visited during the tree traversal, a corresponding pipeline stage can be instantiated. In this way, for a block length of N (= number of leaves in the PFT), the maximum number of pipeline stages is 2 ∗ (2N − 2) + 1. In this way, all N ∗ log N operations are executed in parallel, i.e. P = 1. Obviously, the complexity of the decoding architecture is directly proportional to the size of the PFT. However, for a given code, the tree can be reduced by various transformations. For example, if a subtree represents a repetition code or a parity check code, the corresponding subtree can be replaced by a single node. Similarly, we can merge rate-0 and rate-1 nodes into its parent nodes (Sarkis et al. 2014) or use majority logic decoding in subtrees. The achievable tree reduction strongly depends on the underlying Polar code. SC is the best candidate to achieve throughput towards 1 Tbit/s. Here, we consider a (1,024, 512) Polar code that is decoded with the SC algorithm. Again we use the same technology and PVT setup as for the Turbo and LDPC decoders. The unoptimized PFT has 2,047 nodes, corresponding to 4,093 pipeline stages. Performing the aforementioned tree optimization yields a reduction to 385 stages. Implementing this tree in a fully pipelined architecture would require about 310 KB pipeline memory. This huge memory requirement illustrates a further challenge for high-throughput decoder architectures. Although pipelining enables highest throughput for decoders, these architectures suffer from large latency and, more importantly, large power consumption. A huge amount of data has to be stored inside the pipelined architecture. These data sets are stored in registers that have to be driven by a clock tree. If a design contains many registers, the clock tree can become the dominant power consumer. Hence, minimizing the register load is a main optimization goal. On the code level, it can be performed by the aforementioned tree optimizations. On an architectural level, register balancing (retiming) is a very efficient technique to further reduce the pipeline stages. Advanced retiming with a frequency constraint of 700 MHz results in a partially pipelined architecture with only 105 pipeline stages (85 KB memory). The power consumption of this design is 5.7 W, of which more than 70% is consumed in registers and the clock tree. In a further step, some of the registers can be replaced by latches that have a much lower load. Latch-based design with some further optimizations reduces the area from 3.14 mm² to 2.79 mm² and the power consumption from 5.7 W to 2.7 W. The final (coded) throughput is 664 Gbit/s. Only 32% of the overall power accounts for registers/latches and the clock tree. Figure 2.2 (right) shows the layout of this Polar decoder. Each color represents a pipeline stage (105), the black color is memory. The decoder was fully automatically optimized and generated with the framework presented in Kestel et al. (2018b) and Lehnigk-Emden et al. (2019).

СКАЧАТЬ

Multi-Processor System-on-Chip 2. Liliana Andrade Чтение книги онлайн.

Читать онлайн книгу Multi-Processor System-on-Chip 2 - Liliana Andrade страница 19

2.3.2. LDPC decoder

2.3.3. Polar decoder

Multi-Processor System-on-Chip 2. Liliana Andrade
Чтение книги онлайн.