# **High Throughput Coding An Implementation Centric View** ...when spectral efficiency meets nm/pJ... ...when Shannon meets Moore... #### Norbert Wehn ## Birth of Information Theory 70 Years Ago Reprinted with corrections from The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948. INTRODUCTION Theorem 11: Let a discrete channel have the capacity C and a discrete source the entropy per second H. If $H \leq C$ there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors (or an arbitrarily small equivocation). If H > C it is possible to encode the source so that the equivocation is less than $H - C + \epsilon$ where $\epsilon$ is arbitrarily small. There is no method of encoding which gives an equivocation less than H - C. ### Birth of Microelectronics 70 Years Ago Christmas 1947 (Bell): Bardeen, Brattain discover point contact transistor January 1948 (Bell): Shockley discovers junction transistor - 2 gold foils pressed onto germanium - Gold contact in forward direction - Gold contact in backward direction FIG. 3. Entry in Bardeen's lab notebook dated 24 December 1947, giving his conception of how the point-contact transistor functions. (Reprinted by permission of AT&T Archives.) FIG. 4. Entry in Shockley's lab notebook dated 23 Januar, 1948 recording his conception of the junction transistor. It wrote this page at home on a piece of paper, which he late pasted into his notebook. (Reprinted by permission of AT&I Bell System Tech. Journal, Vol 27, 1948: A Mathematical Theory of Communication, C. Shannon Bell System Tech. Journal, Vol 28, 1949: The Theory of p-n junctions in semiconductors and p-n junctions transistors, W. Schockley # Moore's Law Forever ? Source: G. Intel ## Limits of Moore's Law #### Cost - Wafer cost 28nm -> 7nm: more than doubles, but area density increases by 6x - Average IC design cost for 16nm/14nm chip is ~ \$80 million - In 7nm it will cost > \$200 million - ⇒ Only high volume chips justify use of advanced technology nodes #### Performance gain - 28nm -> 7nm: 3x improvement in frequency (4 nodes) - 2018 -> 2033: 1.7x frequency gain (7 nodes) #### **Power density** - Power per transistor decreases slower than transistors density increases - → power per mm² increases - Until 2033, power density increases by 8x - Same TDP: frequency has to be reduced by 8x 4.2GHz -> 0.5GHz ## Limits of Moore's Law #### Interconnect - 14nm technology delay for 1mm wire ~400ps - Until 2033 - 9% delay improvement in logic per technology node without wires - 10% node-to-node penalty for data path with tight metal pitches - Energy challenge Source: NVIDIA # MICROELECTRONIC SYSTEMS DESIGN RESEARCH GROUP # Wireless Communication Source: G. Fettweis # MICROELECTRONIC SYSTEMS DESIGN RESEARCH GROUP ## Microelectronic Contribution to Channel Coding #### 2 Turbo-Code decoders in different technologies - Both decoders designed with the same methodology - Similar basic architecture: exploit spatial parallelism, subblocks on several MAP decoders in parallel # MAP<sub>1</sub> Subblock 1 read MAP<sub>2</sub> Subblock 2 Deinterleaver/ Deinterleaver Network MAP<sub>P</sub> Subblock P #### Decoder 1 (2004) - UMTS compliant decoder in 180nm technology - Max frequency 166 MHz - 16 MAP decoders in parallel - Throughput 80 Mbit/s @ 6 iterations - Journal of VLSI Signal Processing 39, 63–77, 2005 © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands ■ 30 mm<sup>2</sup> #### A Scalable System Architecture for High-Throughput Turbo-Decoders MICHAEL J. THUL, FRANK GILBERT, TIMO VOGT, GERD KREISELMAIER AND NORBERT WEHN Microelectronic System Design Research Group, University of Kaiserslautern, Erwin-Schroedinger-Strasse, # MICROELECTRONIC SYSTEMS DESIGN RESEARCH GROUP ## **Channel Coding** # 2012 7th International Symposium on Turbo Codes and Readile Information Proceeding (6TC) #### Decoder 2 (2011) - LTE compliant decoder, 65nm technology - Max frequency 450 MHz - 32 parallel MAP decoders - Throughput 2.15Gbit/s @ 6 iteration - 7.7 mm² #### A 2.15GBit/s Turbo Code Decoder for LTE Advanced Base Station Applications Thomas Buseher, Frank Kienle, Christian Weis, Norbert Wehn Microelectronic Systems Design Research Group, University of Kaiserslaus 67663 Kaiserslauten, Germany Unseher kienle weis wehn Breit midd de #### Comparison - **180nm**, 130nm, 90nm, **65nm** - Throughput increase 27x, but frequency increase only 3x - Improvement in area efficiency (area/throughput) 100x - ⇒ Progress due to microelectronic mainly in area efficiency - ⇒ Throughput increase mainly due to code, algorithm, architecture: e.g. conflict free interleaver, NII, radix-4, re-computation, advanced normalization, larger parallelism... # MICROELECTRONIC SYSTEMS DESIGN RESEARCH GROUP ## Communications Performance versus Implementation Efficiency Transactions Papers On Complexity, Energy- and Implementation-Efficiency of Channel Decoders Various decoders in same technology, same design methodology - Communications performance: FER/BER over SNR - Implementation efficiency - Area efficiency: decoded bits/s/mm<sup>2</sup> - Energy efficiency: decoded bits/s/power = decoded bits/energy Comparison of TC decoder and LDPC decoder - LTE compliant Turbo-Code decoder: rate 1/3...9/10 (puncturing) - Flexible LDPC decoder: rate 1/4...9/10 - Blocksize for comparison 6154 bits Strong interrelation communication performance and implementation efficiency → Design space exploration ## Compute vs Data Transfer/Storage #### E.g. Belief propagation - Inherent parallel (check/variable node processing) - Data transfer dominated #### Data transfers Highly parallel architectures -> wires routing congestion: area and power | Design Step | Area<br>mm² | Clock<br>Frequency | Power | |-------------|-------------|--------------------|----------| | Synthesis | 2,9 | 322 MHz | ~ 600 mW | | P & R | 4,6 | 275 MHz | 1110 mW | Large difference: area 58%, power 83% increase, frequency 17% decrease ## Compute vs Data Transfer/Storage High throughput, (partially) parallel architectures: SRAM memories Largely contribute to power | Memory | | |--------|---------| | Cache | (64bit) | | 8KB | 10pJ | | 32KB | 20pJ | | 1MB | 100pJ | - Generate access conflicts - E.g. TC interleaver, double diagonal Q matrix LDPC... - Reliability issue due to high energy particles - ECC protection becomes mandatory #### Complexity of data transfer (routing) / storage ~ N x P x q - N blocklength, P parallel decoded codewords, q average LLR precision - => Efficient LLR quantization q is a major optimization step E.g. information bottleneck, finite alphabets... # Towards 1Tb/s FEC Decoders Power envelope 1 Watt@10mm<sup>2</sup>, throughput 1Tb/s@1GHz $\Rightarrow$ ~1pJ/bit, ~100mW/mm<sup>2</sup>, ~1000 bits in 1ns #### **Energy efficient high throughput architectures** Large locality and regularity, large parallelism #### Information theory Irregularity, Iterative/sequential decoding algorithms | Code | Decoding algorithms | Parallel vs. serial | Locality | Compute kernels | Transfers vs. compute | |------------|-----------------------------|---------------------|--------------------|---------------------|-----------------------| | Turbo code | MAP | serial/iterative | low (interleaver) | Add-Compare-select | compute dominated | | LDPC code | Belief propagation | parallel/iterative | low (Tanner graph) | Min-Sum/add | transfer dominated | | Polar code | Successive cancelation/List | serial | high | Min-Sum/add/sorting | balanced | # MICROELECTRONIC SYSTEMS DESIGN RESEARCH GROUP # Towards 1Tb/s TC Decoding - MAP algorithm: data dependencies in the trellis - Spitting of trellis in independent sub-trellises → spatial parallelization of different sub-trellis: fast processing of a single block - "Unrolling" of recursions and pipelining: several blocks are processed in parallel - TC level: unrolling of iterations 102 Gbit/s Turbo code decoder, area 23.61 mm<sup>2</sup> ## Towards 1Tb/s LDPC Decoding LDPC Block code length N, parity check matrix H, I iterations #edges: 1\_entries(H), #proc\_edges(H) edges processed in 1 clock cycle $$T_{BC}(H,A) = \underbrace{ \text{"proc\_edges(A)} \atop \text{#edges(H)}}_{\text{Parallelism P}} \text{Parallelism P}$$ $$N * \frac{1}{i} * f \text{ [bits/sec]}$$ E.g. IEEE 802.11ad, N=672, #edges(H)=1890, f=400MHz (28nm FDSOI), 9 iter. - ~10 Gb/s P<1 e.g. row wise partially parallel architecture (10 Gbit/s) - >10 Gb/sP=1 i.e. fully parallel on Tanner graph (29 Gbit/s) - >100 Gb/s P>1 i.e. several H matrices are processed in parallel, unrolled fully parallel architecture (268 Gbit/s) MICROELECTRONIC SYSTEMS DESIGN RESEARCH GROUP ## **BCH Decoder Scaling** Extended Euclidian Algorithm, piplined architecture 28nm FDSOI, WC PVT, blocksize 511, correctable errors: 2..12 | Place & Route | 2 | 3 | 4 | 6 | 12 | |----------------------------|-------|-------|--------|--------|--------| | Frequency (MHz) | 935 | 885 | | 403 | 170 | | Throughput (Gbps) | 477,6 | 452,2 | 327,6 | 206 | 86,8 | | Total Cell Area (um2) | 59198 | 93014 | 128475 | 224906 | 545406 | | Area Efficiency (Gbps/mm2) | 5247 | 3397 | 1793 | 670 | 119 | | Power Total (mW) | 33,42 | 52,7 | 52,24 | 54,44 | 76,57 | | Energy Efficiency (pJ/bit) | 0,07 | 0,12 | 0,16 | 0,26 | 0,88 | 28nm FDSOI, WC PVT, 6 correctable errors, blocksize: 255...1023 | Place & Route | 255 | 511 | 1023 | |----------------------------|--------|--------|--------| | Frequency (MHz) | 364 | 403 | 297 | | Throughput (Gbps) | 92,7 | 206 | 303,6 | | Total Cell Area (um2) | 111902 | 224906 | 442433 | | Area Efficiency (Gbps/mm2) | 617 | 670 | 481 | | Power Total (mW) | 30,15 | 54,44 | 79,82 | | Energy Efficiency (pJ/bit) | 0,33 | 0,26 | 0,26 | # Towards 1Tb/s Polar Decoding Decoding algorithms SC, SCL: "unrolling" of tree traversal on polar factor tree - Reduction of tree size by different optimizations e.g. - Replace repetition codes and parity check code by one single nodes - Merge rate-0 codes and rate-1 nodes into parent nodes Original Tree Replaced subtrees # Towards 1Tb/s SC Polar Decoding #### 1024/512 Code, fast SC decoding algorithms - Worst case PVT timing 28nm technology, optimized factor tree, - Logic stages 385, retimed pipeline stages 105 | Place&Route | Register | Latches | |-----------------------------------|----------|---------| | Area [mm <sup>2</sup> ] | 3.14 | 2.79 | | - Combinat. | 0.96 | 0.91 | | - Buf/Inv | 0.65 | 0.27 | | - Noncomb | 1.55 | 1.12 | | Area Eff. [Gbps/mm <sup>2</sup> ] | 205 | 231 | | Utilization | 78% | 72% | | Frequency [MHz] | 621 | 629 | | Throughput [Gbps] | 636 | 644 | | Power [W] | < 5.7 | 2.7 | | - Clock | 47% | 19% | | - Registers | 24% | 13% | | - Combinat. | 29% | 68% | | Energy Eff. [pJ/bit] | 8.8 | 4.2 | Each colour represents a stage (105) black color is memory ## Summary - Applications require ever higher throughput, lower latency, better communication performance, higher energy efficiency and low power - Microelectronic progress can not keep pace with these requirements - Throughput towards 1 Tb/s are feasible for TC, LDPC, PC but - Limited to smaller block sizes, low iterations (TC, LDPC) → comm. performance - Flexibility challenge - Heavy pipelining increases latency, power in clock tree is a major challenge - Power density one of the biggest challenges - High communication performance under architectural constraints for very high throughput is challenging Thank you for attention! For more information please visit http://ems.eit.uni-kl.de