Foto von Frederik Walter

M.Sc. Frederik Walter

Technische Universität München

Professur für Codierung und Kryptographie (Prof. Wachter-Zeh)

Postadresse

Postal:
Theresienstr. 90
80333 München

Biography

Frederik is currently a doctoral candidate in the Associate Professorship of Coding and Cryptography (Prof. Wachter-Zeh) with the Institute for Communications Engineering, School of Computation, Information and Technology, TUM, Munich, Germany.

Before joining the institute, he was part of a medical technology startup that was funded by the eXist program of the German Ministry of Economic Affairs. From 2017 to 2020, he worked in a management consultancy, advising clients on complex strategic transformations.

Frederik received his M.Sc degree with high distinction (1.0) in Electrical Engineering and Information Technology from TUM in 2016. During this time, he spent a semester at the National University of Singapore. He received his B.Sc. also with high distinction (1.0) in Electrical Engineering from Ulm University in 2014. 

Research Interests

Frederik's research interests focus on coding and information theory applied to the area of DNA data storage. In particular, he studies improvements in the synthesis of DNA strands. 

One field is the use of composite DNA as a tool to increase the alphabet size and, therefore, the information density. Instead of synthesizing only one nucleotide in each cycle, mixing several nucleotides and encoding information in the occurring ratio is possible. Therefore, many interesting coding-related challenges arise when the strands are exposed to substitution, deletion, and/or insertion errors. 

Another area in his research is the efficient synthesis for gene expression analysis. He applies coding theoretic concepts to minimize the synthesis time while ensuring all genes are uniquely represented on a testing array.

Teaching

Teaching Assistant an der Universität Ulm

  • Grundlagen der Elektrotechnik I
  • Grundlagen der Elektrotechnik II
  • Signale und Systeme

Available Theses

Theses in Progress

Coding for DNA Data Storage

Beschreibung

Modern information-based society largely relies on trusted information storage in a large variety of setups, ranging from long-term digital archiving to embedding information into products. DNA is a promising future storage medium for such diverse applications as it offers sustainable and robust long-term information storage at extraordinary information density. DNA will also never become obsolete.

A 5min video with an overview of DNA storage: https://youtu.be/r8qWc9X4f6k?si=Yzm5sOW-a6VDnBu3 

The challenges in DNA storage are new error models and processing operations compared to conventional systems. A tradeoff between reading and writing costs needs to be found [1]. Furthermore, it is possible to read coverage analytically. [2]

This seminar focuses on the DNA data storage channel and the different approaches to handling the new challenges. The goal is to understand these challenges and methods of how to handle them. 

References: 

[1] S. Chandak et al., “Improved read/write cost tradeoff in DNA-based data storage using LDPC codes,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sep. 2019, pp. 147–156. doi: 10.1109/ALLERTON.2019.8919890.

 

[2] D. Bar-Lev, O. Sabary, R. Gabrys, and E. Yaakobi, “Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems.” arXiv, May 12, 2023. Accessed: Sep. 06, 2023. [Online]. Available: http://arxiv.org/abs/2305.05656

 

 

 

Voraussetzungen

  • Channel Coding Lecture is recommended
  • A good understanding of probability theory

Betreuer:

Frederik Walter

Coding for DNA Microarrays

Beschreibung

Gene expression analysis

(From Wikipedia: https://en.wikipedia.org/wiki/Gene_expression)

Measuring gene expression is an important part of many life sciences, as the ability to quantify the level at which a particular gene is expressed within a cell, tissue or organism can provide a lot of valuable information. For example, measuring gene expression can:

  • Identify viral infection of a cell (viral protein expression).
  • Determine an individual's susceptibility to cancer (oncogene expression).
  • Find if a bacterium is resistant to penicillin (beta-lactamase expression).

Similarly, the analysis of the location of protein expression is a powerful tool, and this can be done on an organismal or cellular scale. Investigation of localization is particularly important for the study of development in multicellular organisms and as an indicator of protein function in single cells. Ideally, measurement of expression is done by detecting the final gene product (for many genes, this is the protein); however, it is often easier to detect one of the precursors, typically mRNA and to infer gene-expression levels from these measurements.

An approach for gene expression analysis is the hybridization microarray. A single array or "chip" may contain probes to determine transcript levels for every known gene in the genome of one or more organisms.

Here is a video explaining the concepts of gene expression analysis and DNA microarrays: https://www.youtube.com/watch?v=Hv5flUOsE0s

Relation to coding

The generation of the DNA microarray bears several interesting challenges from a coding perspective. The microarray needs to be generated, which means that artificial DNA strands are synthesized. Many of these DNA strands are synthesised simultaneously. However, only one of the four bases of the DNA (A, C, G, T) can be written at once. The challenge is to select the best possible sequence in which the bases are written to minimize the overall running time. This problem is called finding the Shortest Common Supersequence (SCS) and is unfortunately np-hard (https://en.wikipedia.org/wiki/Shortest_common_supersequence). Furthermore, there are more than one possible probes that can be used to signal a specific gene. Hence, this is another parameter to optimize.

Problem description

This thesis aims to analyze the problem of finding the best probes for a DNA microarray and implement an algorithm to run small-scale experiments. 

This should only roughly introduce the problem. No need to understand everything or review all references. If you are interested, please reach out to me, and we can discuss a suitable direction for you.

Related references

[1] M. Abu-Sini, A. Lenz, and E. Yaakobi, “DNA Synthesis Using Shortmers,” in 2023 IEEE International Symposium on Information Theory (ISIT), Jun. 2023, pp. 585–590. doi: 10.1109/ISIT54713.2023.10206609.

[2] A. Lenz, Y. Liu, C. Rashtchian, P. H. Siegel, A. Wachter-Zeh, and E. Yaakobi, “Coding for Efficient DNA Synthesis,” in 2020 IEEE International Symposium on Information Theory (ISIT), Jun. 2020, pp. 2885–2890. doi: 10.1109/ISIT44484.2020.9174272.

[3] D. Maier, “The Complexity of Some Problems on Subsequences and Supersequences,” J. ACM, vol. 25, no. 2, pp. 322–336, Apr. 1978, doi: 10.1145/322063.322075.

[4] K. Makarychev, M. Z. Rácz, C. Rashtchian, and S. Yekhanin, “Batch Optimization for DNA Synthesis,” IEEE Transactions on Information Theory, vol. 68, no. 11, pp. 7454–7470, Nov. 2022, doi: 10.1109/TIT.2022.3184903.

[5] N. Xie, S. Xu, and Y. Xu, “A new coding-based algorithm for finding closest pair of vectors,” Theoretical Computer Science, vol. 782, pp. 129–144, Aug. 2019, doi: 10.1016/j.tcs.2019.03.011.

Voraussetzungen

Good knowledge and high interest in mathematics, especially

  • Linear algebra
  • Combinatorics

Good programming knowledge is required for this topic. 

Kontakt

frederik.walter@tum.de

Betreuer:

Frederik Walter

Coding Methods for Composite DNA Synthesis

Beschreibung

DNA Data Storage

Data storage on DNA molecules is a promising approach for archiving massive data.

In classical DNA storage systems, binary information is encoded into sequences consisting of the four DNA bases {A, C, G, T}. The encoded sequences are used to generate DNA molecules called strands using the biochemical process of DNA synthesis. The synthesized strands are stored together in a tube. To retrieve the binary information, the strand must be read via DNA sequencing and decoded back into the binary representation.

The synthesis and the sequencing procedures are error-prone, and with the natural degradation of DNA they introduce errors to the DNA strands. To ensure data reliability, the errors have to be corrected by algorithms and error-correcting codes (ECCs).

A 5min video with an overview of DNA storage: https://youtu.be/r8qWc9X4f6k?si=Yzm5sOW-a6VDnBu3

Composite DNA

Recently, to allow higher potential information capacity, [1,2] introduced the DNA composite synthesis method. In this method, the multiple copies created by the standard DNA synthesis method are utilized to create composite DNA symbols, defined by a mixture of DNA bases and their ratios in a specific position of the strands. By defining different mixtures and ratios, the alphabet can be extended to have more than 4 symbols. More formally, a composite DNA symbol in a specific position can be abstracted as a quartet of probabilities {p_A, p_C, p_G, p_T}, in which p_X, 0 ≤ p_X ≤ 1, is the fraction of the base X in {A, C, G, T} in the mixture and p_A+p_C+ p_G+ p_T =1. Thus, to identify composite symbols it is required to sequence multiple reads and then to estimate p_A, p_C, p_G, p_T in each position.

Problem description

ECCs for DNA data storage differ in many aspects from classical error correction codes. In this model, new error type gain relevance, like deletions and insertions which affect the synchronization of the sequences. Especially for composite DNA data storage, these error types received only little attention.

The most related work to this problem was recently studied by Zhang et al. in [6]. The authors initiated the study of error-correcting codes for DNA composite. They considered an error model for composite symbols, which assumes that errors occur in at most t symbols, and their magnitude is limited by l. They presented several code constructions as well as bounds for this model. In this thesis, we want to analyse a different way to model the composite synthesis method and studies additional error models. We already have some results for substitution and single deletion errors. This thesis should focus on extending these to combinations of error models [5] or two deletions [3,4].

This should only roughly introduce the problem. No need to review all references. If you are interested, please reach out to me, and we can discuss a suitable direction for you.

References

[1] L. Anavy, I. Vaknin, O. Atar, R. Amit, and Z. Yakhini, “Data storage in DNA with fewer synthesis cycles using composite DNA letters,” Nat Biotechnol, vol. 37, no. 10, pp. 1229–1236, Oct. 2019, doi: 10.1038/s41587-019-0240-x.

[2] Y. Choi et al., “High information capacity DNA-based data storage with augmented encoding characters using degenerate bases,” Sci Rep, vol. 9, no. 1, Art. no. 1, Apr. 2019, doi: 10.1038/s41598-019-43105-w.

[3] V. Guruswami and J. Håstad, “Explicit Two-Deletion Codes With Redundancy Matching the Existential Bound,” IEEE Transactions on Information Theory, vol. 67, no. 10, pp. 6384–6394, Oct. 2021, doi: 10.1109/TIT.2021.3069446.

[4] J. Sima, N. Raviv, and J. Bruck, “Two Deletion Correcting Codes From Indicator Vectors,” IEEE Trans. Inform. Theory, vol. 66, no. 4, pp. 2375–2391, Apr. 2020, doi: 10.1109/TIT.2019.2950290.

[5] I. Smagloy, L. Welter, A. Wachter-Zeh, and E. Yaakobi, “Single-Deletion Single-Substitution Correcting Codes,” IEEE Transactions on Information Theory, pp. 1–1, 2023, doi: 10.1109/TIT.2023.3319088.

[6] W. Zhang, Z. Chen, and Z. Wang, “Limited-Magnitude Error Correction for Probability Vectors in DNA Storage,” in ICC 2022 - IEEE International Conference on Communications, Seoul, Korea, Republic of: IEEE, May 2022, pp. 3460–3465. doi: 10.1109/ICC45855.2022.9838471.

Voraussetzungen

Good knowledge and high interest in mathematics, especially

  • Linear algebra
  • Combinatorics

I highly recommend the channel coding lecture for this thesis.

Kontakt

Frederik Walter (frederik.walter@tum.de)

Betreuer:

Frederik Walter