

# Configurable FPGA TPC FEC Unit for Tb/s Optical Communication

### 2019 OWHTC

Søren Forchhammer

DTU Fotonik







### Outline

- Motivation.
- The product code (TPC).
- VHDL implementation.
- Implementation results.
- Lab Experiment.
- Performance results.
- Error floor.
- Conclusion.





### Motivation

- FEC plays an important role in high-speed optical communication.
- Soft-decision FEC can be very complex.
  - Can limit the data rate.
- We focus on hard-decision FEC.
  - Target beyond 1 Tb/s.
- Data rate: 100 Gb/s, NCG: 9 dB, overhead: 7% [1].
- A FEC decoder unit is designed to operate at data rates up to 1.6 Tb/s on a single FPGA.

[1] Vitesse, "100G CI-BCH-4 eFEC Encoder/Decoder Core and Design Package," [Online]. Available: https://www.vitesse.com/products/product/VSC9804.





### The Product Code

- BCH code
  - Can correct up to 3 errors.
  - Expurgation allows detection of 4 errors.
- Code parameters
  - We chose code lengths (n) so that parallel input widths divide the block lengths.
  - Parity bits (n - k) = 32.
  - Overhead ranges from 6.7% to 7.0%.
- Minimum distance (d) = 8.
- Generator polynomial

$$g(z) = (z^{10} + z^3 + 1)(z^{10} + z^3 + z^2 + z + 1)(z^{10} + z^8 + z^3 + z^2 + 1)(z^2 + 1)$$
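The code parameters can be cross-checked with a few lines of Python. This is a minimal sketch, not part of the original design: it multiplies the four factors of g(z) over GF(2) to confirm that the degree equals the 32 parity bits, and computes the product-code overhead for n = 1008 under the assumption that the same (n, n - 32) code is used for both rows and columns. The helper names `poly`, `gf2_poly_mul` and `overhead` are illustrative only.

```python
def poly(*exponents):
    """Encode a GF(2) polynomial as an integer bitmask (bit e set <=> term z^e)."""
    value = 0
    for e in exponents:
        value |= 1 << e
    return value


def gf2_poly_mul(a, b):
    """Carry-less multiplication of two GF(2) polynomials given as bitmasks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


# The four factors of g(z) as listed on the slide.
factors = [
    poly(10, 3, 0),
    poly(10, 3, 2, 1, 0),
    poly(10, 8, 3, 2, 0),
    poly(2, 0),
]

g = 1
for f in factors:
    g = gf2_poly_mul(g, f)

degree = g.bit_length() - 1  # should equal n - k


def overhead(n, parity_bits=32):
    """Overhead of an n x n product code (assumes the same code in both dimensions)."""
    k = n - parity_bits
    return (n / k) ** 2 - 1


print("deg g(z) =", degree)
print(f"overhead for n = 1008: {overhead(1008):.2%}")
```

For n = 1008 this gives an overhead of about 6.7%, the lower end of the range quoted above.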







Fig. 1. Frame with product code structure.



### **VHDL** Implementation

- We implemented a generic decoder, tested it in a 40 Gb/s system, and then extended and synthesized the FEC decoder for rates up to 1.6 Tb/s.
- Implementation consists of three independent parts:
  - Synchronization of frames and lanes.
  - Syndrome calculation.
  - Fast Gravano decoding based on syndromes.
- The implementation is not vendor specific, except for the transceiver and synchronization parts.





### Main Decoder Design I

- Computation of syndromes.
- Each row and column is treated as a polynomial and evaluated at $\alpha$, $\alpha^3$ and $\alpha^5$.
  - $\alpha$ is a primitive element of GF(1024).
- The actual determination of errors is done by the Gravano algorithm<sup>[2]</sup>.
  - We call this part of the decoding the *Gravano decoder*.
- It takes the syndromes and prepares 2<sup>nd</sup> and 3<sup>rd</sup> degree polynomials.
- These polynomials are solved using look-up tables.
- The decoded result is verified by the two parity check bits.
- The output is the set of error positions (0 to 3 of them).
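The syndrome computation can be sketched in software as follows. This is an illustration under stated assumptions, not the VHDL design: it assumes GF(1024) is built from the primitive polynomial z^10 + z^3 + 1 (the first factor of g(z)), and the names `EXP`, `LOG`, `syndromes` and `gf_pow` are hypothetical. Because the code is linear, the syndromes of a received word equal those of its error pattern, so the example uses an all-zero codeword with a single error.

```python
M = 10
ORDER = (1 << M) - 1                    # 1023 non-zero field elements
PRIM_POLY = (1 << 10) | (1 << 3) | 1    # z^10 + z^3 + 1 (assumed field polynomial)

# EXP[i] = alpha^i, LOG[alpha^i] = i
EXP = [0] * ORDER
LOG = [0] * (ORDER + 1)
x = 1
for i in range(ORDER):
    EXP[i] = x
    LOG[x] = i
    x <<= 1                 # multiply by alpha
    if x >> M:              # reduce modulo the field polynomial
        x ^= PRIM_POLY


def syndromes(bits):
    """Return (S1, S3, S5): the received polynomial evaluated at alpha, alpha^3, alpha^5."""
    s = [0, 0, 0]
    for pos, bit in enumerate(bits):
        if bit:
            for idx, j in enumerate((1, 3, 5)):
                s[idx] ^= EXP[(j * pos) % ORDER]
    return tuple(s)


def gf_pow(a, e):
    """Raise a field element to the power e using the log/antilog tables."""
    if a == 0:
        return 0
    return EXP[(LOG[a] * e) % ORDER]


# All-zero codeword of length 1008 with one bit error at position 100.
word = [0] * 1008
word[100] = 1
s1, s3, s5 = syndromes(word)
print("error position recovered from S1:", LOG[s1])
```

For a single error, S1 directly identifies the position, and S3 = S1^3 and S5 = S1^5 hold; two or three errors lead to the 2nd and 3rd degree polynomials mentioned above.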

[2] S. Gravano, "Decoding the triple-error-correcting (15,5) binary BCH code by the analytic solution of the cubic error-locator polynomial over GF(2<sup>4</sup>)," *Int. J. Electronics*, vol. 68, no. 2, pp. 175-180, 1990.

[3] Okano, Imai, "A construction of high-speed decoders using ROMs for BCH and RS codes," *IEEE Trans. Comput.*, 1987.







### Main Decoder Design II

- Decoding is performed with hard decisions: first rows, then columns, then rows again, and so on.
- During row/column decoding, up to 3 errored bits can be flipped in the frame.
- Corresponding column/row syndromes are now invalid.
- Row/column syndromes are updated by table lookups.
- A syndrome is only updated when an error is corrected.
- The actual data frame must also be corrected, i.e. the error positions have to be read, inverted and written back.
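The syndrome update after a correction can be illustrated with a short sketch. When a bit at position `pos` of a column (or row) is flipped, each of its syndromes changes by one table entry, so no full recomputation is needed. The field construction (z^10 + z^3 + 1) and the names `EXP`, `syndromes` and `update_after_flip` are assumptions made for this example, not details of the VHDL implementation.

```python
M = 10
ORDER = (1 << M) - 1
PRIM_POLY = (1 << 10) | (1 << 3) | 1    # assumed field polynomial z^10 + z^3 + 1

# Antilog table: EXP[i] = alpha^i
EXP = []
x = 1
for _ in range(ORDER):
    EXP.append(x)
    x <<= 1
    if x >> M:
        x ^= PRIM_POLY


def syndromes(bits):
    """Full computation of (S1, S3, S5) from scratch."""
    s1 = s3 = s5 = 0
    for pos, bit in enumerate(bits):
        if bit:
            s1 ^= EXP[pos % ORDER]
            s3 ^= EXP[(3 * pos) % ORDER]
            s5 ^= EXP[(5 * pos) % ORDER]
    return (s1, s3, s5)


def update_after_flip(synd, pos):
    """Incremental update: XOR in the table entries for the flipped position."""
    s1, s3, s5 = synd
    return (
        s1 ^ EXP[pos % ORDER],
        s3 ^ EXP[(3 * pos) % ORDER],
        s5 ^ EXP[(5 * pos) % ORDER],
    )


# A column with three bit errors; the row decoder then corrects position 77.
column = [0] * 1008
for pos in (5, 77, 900):
    column[pos] = 1
before = syndromes(column)

column[77] ^= 1                          # bit flipped by a row correction
after_incremental = update_after_flip(before, 77)
after_full = syndromes(column)
print(after_incremental == after_full)
```

The incremental result matches a full recomputation, which is why a syndrome only has to be touched when an error is actually corrected.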







### Main Decoder Design III

- Multiple Gravano decoders in parallel.
- Receives syndromes and delivers corrections.
- Corrections are performed if time permits.
- All corrections from rows/columns must be finished before decoding can start in the other dimension.



Fig 2. Top-level decoder datapath (40G).





### **RAM Configuration**

- The data is treated as a 1008x1008 frame.
- Bits should be accessed both row-wise and column-wise.
- We consider a rectangular block (*tile*) within the full 2-D codeword.
- For 128-bit input, the tile size is 16x8.
- The RAM is also divided into slices that work in parallel.
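One possible tile addressing scheme is sketched below. The slides do not specify the tile orientation or the address mapping, so everything here is an assumption: the tile is taken to be 16 rows by 8 columns, each tile is stored as one 128-bit word, and tiles are numbered row-major. The function name `tile_address` is hypothetical.

```python
N = 1008                        # frame is N x N bits
TILE_ROWS, TILE_COLS = 16, 8    # assumed orientation: 16 rows by 8 columns
TILES_PER_ROW = N // TILE_COLS  # tiles across the frame
TILES_PER_COL = N // TILE_ROWS  # tiles down the frame
NUM_TILES = TILES_PER_ROW * TILES_PER_COL


def tile_address(row, col):
    """Map a frame coordinate to (128-bit word index, bit index inside the word)."""
    tile_r, in_r = divmod(row, TILE_ROWS)
    tile_c, in_c = divmod(col, TILE_COLS)
    word = tile_r * TILES_PER_ROW + tile_c
    bit = in_r * TILE_COLS + in_c
    return word, bit


# Words touched when reading 128 consecutive bits along a row or a column.
row_words = {tile_address(0, c)[0] for c in range(128)}
col_words = {tile_address(r, 0)[0] for r in range(128)}
print(NUM_TILES, len(row_words), len(col_words))
```

Under these assumptions the frame holds 7938 tiles, a 128-bit row segment touches 16 words and a 128-bit column segment touches 8, so several RAM slices would be read in parallel for either access direction.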



Fig 3. Input frame.





#### **DTU Fotonik** Department of Photonics Engineering





- The decoder has a latency of two frames.
- The initial values of the row and column syndromes are calculated in one frame slot, while the Gravano decoder works on the previous frame.



Fig. 6 Frame and syndrome buffering.






### Scalability

- Higher data rates can be achieved by processing a larger number of input bits in parallel.
  - We have used *data widths* of 128, 256, 512, 1024 and 2048.
  - The input width must divide the full frame width.
- With 128-wide input, we have 1008<sup>2</sup>/128 = 7938 cycles per frame slot, and only 450 cycles with 2048-wide input.
- For very high data rates, this can limit the *number of iterations* to an unacceptable level.
- Solution:
  - Increase latency.
  - Increase number of Gravano decoders.
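The cycle budget per frame slot follows directly from the frame size and the data width. The sketch below reproduces the two figures quoted above. Note that the 450-cycle figure is only reproduced with n = 960; that value is an inference from the numbers (960 x 960 / 2048 = 450) and is not stated on the slides. The function name `cycles_per_frame` is illustrative.

```python
def cycles_per_frame(n, width):
    """Clock cycles needed to stream one n x n frame at `width` bits per cycle."""
    bits = n * n
    if bits % width:
        raise ValueError("data width must divide the number of bits in the frame")
    return bits // width


print(cycles_per_frame(1008, 128))   # 1008 x 1008 frame, 128-bit input
print(cycles_per_frame(960, 2048))   # n = 960 is an inferred value, see above
```

All decoding iterations for a frame have to fit in this budget, which is why the wider configurations need more Gravano decoders or more latency.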







### Synchronization of Lanes and Frames

- Input data arrive on 4 transceivers, each at 10 Gb/s.
- The four lanes are synchronized and multiplexed to get 40 Gb/s.
- Synchronization of lanes is accomplished by designating one lane as master with a fixed delay.
- The slave lanes have a variable delay, implemented by a FIFO controlled by a small state machine.
- Frame synchronization is achieved by a unique word.





### Implementation Results I

- For a single frame decoder, the latency is 2 frames, i.e. approx. 2 Mbits.
- With a single frame decoder, 452 Gb/s is achieved on a modest FPGA (Altera Stratix V).
- At 452 Gb/s, the latency will be  $< 5 \ \mu s$ .

| Data width | Gravano decoders | Transceivers (in bits) | fmax (MHz) | Max gross rate | ALM/mem (%) | No. of iterations |
|------------|------------------|------------------------|------------|----------------|-------------|-------------------|
| 128        | 2                | 4*32 bit               | 339.79     | 43 G           | 5/10        | 7                 |
| 128        | 8                | 4*32 bit               | 339.21     | 43 G           | 8/14        | ~25               |
| 512        | 8                | 16*32 bit              | 322.48     | 164 G          | 14/16       | 6½                |
| 1024       | 8                | 32*32 bit              | 323.21     | 330 G          | 24/17       | 3                 |
| 1024       | 16               | 32*32 bit              | 280.27     | 286 G          | 40/23       | 5½                |
| 2048       | 32               | 64*32 bit              | 221.78     | 452 G          | 72/34       | 3                 |

#### TABLE 1. SYNTHESIS RESULTS FOR ALTERA STRATIX V (5SGXEA7N2F40C2 - CURRENT BOARD)
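The latency figure for the single frame decoder can be checked with a short calculation. This sketch assumes a frame of 1008 x 1008 bits (the frame size given for the RAM configuration; the wider configurations may use a slightly different code length) and the 452 Gb/s gross rate from Table 1. The function name `latency_seconds` is illustrative.

```python
FRAME_BITS = 1008 * 1008     # assumed frame size, roughly 1 Mbit


def latency_seconds(frames, gross_rate_bps, frame_bits=FRAME_BITS):
    """Time needed to stream `frames` frames at the given gross bit rate."""
    return frames * frame_bits / gross_rate_bps


latency_us = latency_seconds(2, 452e9) * 1e6
print(f"latency of 2 frames at 452 Gb/s: {latency_us:.2f} us")
```

Two frames at 452 Gb/s take about 4.5 µs, consistent with the stated bound of less than 5 µs.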







### Implementation Results II

- With *d* frame decoders, the latency becomes 2*d* frames, i.e. approx. 2*d* Mbits.
- A gross rate of 1.6 Tb/s is achieved with
  - Latency of 10 frames.
  - 5 parallel frame decoders.
  - 8 Gravano decoders.
- A more powerful FPGA (Altera Arria 10) is used to achieve higher rates.

| Data width       | Gravano decoders | Transceivers (in bits) | fmax (MHz) | Max gross rate | ALM/mem (%) | No. of iterations |
|------------------|------------------|------------------------|------------|----------------|-------------|-------------------|
| 2048             | 32               | 64*32                  | 222.27     | 455 G          | 39/32       | 3                 |
| 2x2048 (d = 2)   | 32               | 64*40 + 32*48          | 195.47     | 800 G          | 79/64       | 3                 |
| 3x1024 (d = 3)   | 8                | 64*48                  | 338.87     | 1041 G         | 40/49       | 3                 |
| 4x1024 (d = 4)   | 8                | 64*40 + 32*48          | 328.19     | 1344 G         | 53/65       | 3                 |
| 5x1024 (d = 5)   | 8                | 64*56 + 32*48          | 313.58     | 1605 G         | 69/81       | 3                 |

#### TABLE 2. SYNTHESIS RESULTS FOR ALTERA ARRIA 10 (10AX115U1F45I1SG)
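The gross rates in Table 2 appear to follow from the number of frame decoders, the data width and the clock frequency. The sketch below checks that reading, assuming gross rate = d x data width x fmax; the relation is an observation from the tabulated numbers rather than a formula given on the slides.

```python
# (d, data width in bits, fmax in MHz, listed max gross rate in Gb/s), from Table 2
TABLE_2 = [
    (1, 2048, 222.27, 455),
    (2, 2048, 195.47, 800),
    (3, 1024, 338.87, 1041),
    (4, 1024, 328.19, 1344),
    (5, 1024, 313.58, 1605),
]


def gross_rate_gbps(d, width, fmax_mhz):
    """Gross rate in Gb/s for d parallel frame decoders (assumed relation)."""
    return d * width * fmax_mhz * 1e6 / 1e9


for d, width, fmax, listed in TABLE_2:
    computed = gross_rate_gbps(d, width, fmax)
    print(f"d={d} width={width}: computed {computed:.1f} G, listed {listed} G")
```

Each computed value is within 1 Gb/s of the listed rate, so the 1.6 Tb/s configuration corresponds to 5 decoders of 1024 bits each at about 314 MHz.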






### Lab Experiment with 40 G

- 128-bit input and 2 Gravano decoders.
- The test frame is loaded into the pulse pattern generator.
- A variable optical attenuator was used to degrade the OSNR of the signal.
- A 45 GHz photodiode provided the optical-to-electrical conversion.
- The 40 Gb/s signal was demultiplexed into 4x10 Gb/s subcarriers/lanes.




Fig. 7 Experimental setup for 40 Gb/s.






### **Performance Results**



Fig. 8 Simulation results for 2, 3, 7 and 25 iterations and experimental results from 40 G experiment.



#### Danmarks Grundforskningsfond Danish National Research Foundation




### Forward Error-correction (FEC) for optical channels - performance

Theoretical analysis using density evolution

With an overhead of ~6.6%, we reach an output (post-FEC) bit error rate well below $10^{-17}$ at input BERs of $3 \times 10^{-3}$ (BER $10^{-20}$) to $4.4 \times 10^{-3}$.

Experimental runs of 20 petabits were error free.





### **Error Floor**

- The error floor is dominated by the probability of a 4x4-core of errors in the frame.
- For a pre-FEC BER *p*, the probability of frame loss (*PFL*) is

$$\mathrm{PFL} = \binom{n}{4}^2 p^{4 \times 4}$$

$$\mathrm{Error\_floor}(p) = \frac{4^2}{n^2} \binom{n}{4}^2 p^{4 \times 4}$$

| X | X | X | X |
|---|---|---|---|
| X | X | X | X |
| X | X | X | X |
| X | X | X | X |
4x4-core of errors in the received frame.

Errors of the decoding scheme may also contribute to the error floor\*.

\* J. D. Andersen et al., "A configurable FPGA FEC unit for Tb/s optical communication," Proc. ICC 2017.
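The two error-floor expressions can be evaluated numerically. The sketch below is a direct transcription of the formulas, assuming n = 1008 (the frame size used elsewhere in the slides); the pre-FEC BER values in the loop are illustrative inputs, and the function names are hypothetical.

```python
from math import comb


def frame_loss_probability(n, p):
    """PFL: probability of a 4x4 core of errors, C(n,4)^2 * p^16."""
    return comb(n, 4) ** 2 * p ** (4 * 4)


def error_floor(n, p):
    """Post-FEC bit error floor: 16 wrong bits out of n^2, weighted by PFL."""
    return (4 ** 2 / n ** 2) * frame_loss_probability(n, p)


N = 1008  # assumed frame dimension
for p in (1e-3, 3e-3, 4.4e-3):
    print(f"p = {p:.1e}: PFL = {frame_loss_probability(N, p):.2e}, "
          f"floor = {error_floor(N, p):.2e}")
```

The factor 4²/n² converts a lost frame into a bit error rate: a 4x4 core leaves 16 erroneous bits in a frame of n² bits.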







### Conclusion

- A configurable HD-FEC design was presented, capable of achieving up to 1.6 Tb/s.
- NCGs of 9.0-9.3 dB were estimated.
- For high data rates, the decoder delay may be kept in the range of 5-6 $\mu s$.
- The simulation results were verified with the optical transmission system in the lab.







### Acknowledgments

- Thanks to DTU Fotonik team members:
  - Jakob D. Andersen, Knud J. Larsen, Shajeel Iqbal, Christian B. Bøgh
  - Francesco Da Ros, Kjeld Dalgaard





### Thank you



