Geno-Weaving: Capacity-Achieving Coding for DNA Data Storage
Description
Background:
DNA data storage has emerged as a compelling medium for archival data, offering extraordinary density and durability. In practice, data is encoded into millions of short synthetic DNA strands stored in a liquid container. When retrieved, the strands are read by nanopore sequencers in random order and a random number of times, and are corrupted by substitution and deletion errors. Weinberger and Merhav [4] formally characterized the capacity of this channel, accounting for the Poisson-distributed read counts and the permutation of strands.
The classical approach to protect data in this setting is a concatenation scheme: a per-strand inner code handles sequence errors, and an outer Reed–Solomon code handles strand-level erasures. A fundamental limitation of this approach is that current DNA synthesis technology restricts strand lengths to roughly 200–300 nucleotides—a regime in which inner codes suffer a significant finite-block-length rate penalty.
Modern Relevance:
Wang and Guruswami [1] proposed Geno-Weaving to overcome this bottleneck. Rather than coding along each strand, the key idea is to code across strands: a single block code protects the same nucleotide position across all strands simultaneously. Since the number of strands (millions) is orders of magnitude larger than the strand length, this orthogonal direction provides a vastly larger effective block length, eliminating the finite-length penalty. The scheme uses polar codes and provably achieves the capacity of the DNA storage channel. Lin, Wang, and Guruswami [2] extended this framework to deletion errors, demonstrating—theoretically and through simulations—that a Geno-Weaving design for substitution channels already performs well on deletion channels at realistic rates of 0.1%–10%, outperforming even optimal concatenation schemes.
Seminar Focus:
This seminar explores the Geno-Weaving framework and its extension to deletion errors. The goal is to develop a deep understanding of the channel model, the capacity-achieving code construction, and the block-length advantage over concatenation. The student will work through the encoding and decoding procedures in detail, develop illustrative examples, and carry out a quantitative comparison of Geno-Weaving's code rates against those of concatenation schemes for different deletion probabilities.
Expected Outcomes:
- Channel Model & Capacity: A clear explanation of the DNA storage channel, including Poissonization and the three challenge factors (noise, permutation, and sampling), with worked examples.
- Code Construction: Step-by-step illustrations of the Geno-Weaving encoding and decoding, including the push/pull synchronization mechanism for deletion errors.
- Rate Comparison: Plots comparing Geno-Weaving's achievable rates against concatenation schemes across a range of deletion probabilities, with a discussion of the block-length gain.
References:
[1] H.-P. Wang and V. Guruswami, "Geno-Weaving: A Framework for Low-Complexity Capacity-Achieving DNA Data Storage," *IEEE Journal on Selected Areas in Information Theory*, vol. 6, pp. 383–393, 2025, doi: [10.1109/JSAIT.2025.3610643](https://doi.org/10.1109/JSAIT.2025.3610643).
[2] Y.-T. Lin, H.-P. Wang, and V. Guruswami, "Block Length Gain for Nanopore Channels," *arXiv preprint arXiv:2511.18027*, Nov. 2025.
[3] L. Organick, S. D. Ang, Y.-J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, et al., "Random Access in Large-Scale DNA Data Storage," *Nature Biotechnology*, vol. 36, no. 3, pp. 242–248, Mar. 2018, doi: [10.1038/nbt.4079](https://doi.org/10.1038/nbt.4079).
[4] N. Weinberger and N. Merhav, "The DNA Storage Channel: Capacity and Error Probability Bounds," *IEEE Transactions on Information Theory*, vol. 68, no. 9, pp. 5657–5700, Sep. 2022, doi: [10.1109/TIT.2022.3176371](https://doi.org/10.1109/TIT.2022.3176371).
Prerequisites
- Channel Coding