
M.Sc. Luis Maßny

Technische Universität München

Professur für Codierung und Kryptographie (Prof. Wachter-Zeh)


Theresienstr. 90
80333 München


Offered Theses

Ongoing Theses

Byzantine-resilient distributed training

Analyzing and understanding a coding-based approach against Byzantine failures in distributed training.


The training of many machine learning models is based on iterative gradient descent. For large data sets, the training is distributed across a computation cluster. However, computation errors, transmission errors, and malicious nodes degrade the training process and may lead to an erroneous model. So-called Byzantine failures, in which nodes may return arbitrary incorrect results, pose a particular challenge for distributed training. An important research objective is therefore the development of techniques that are robust against Byzantine failures. The goal of this seminar work is to understand and analyze a coding-based approach to Byzantine-resilient distributed training.
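As an illustrative sketch of the redundancy idea (not the DRACO construction itself, whose encoding is more general), replicated gradient tasks can be decoded by majority vote so that a single Byzantine replica is outvoted; all function and variable names below are invented for this toy example:

```python
from collections import Counter

def local_gradient(partition):
    # toy "gradient": the sum of the partition's samples stands in for a real gradient
    return sum(partition)

def byzantine(partition):
    # a malicious worker returns an arbitrary wrong value
    return 9999.0

def majority_vote(values):
    # exact-match majority vote over replicated results (repetition-code decoding)
    value, count = Counter(values).most_common(1)[0]
    assert count > len(values) // 2, "too many Byzantine replicas"
    return value

# two data partitions, each replicated on 3 workers; one replica of partition 1 is Byzantine
partitions = [[1.0, 2.0], [3.0, 4.0]]
replicas = {0: [local_gradient, local_gradient, local_gradient],
            1: [local_gradient, byzantine, local_gradient]}

aggregate = 0.0
for idx, part in enumerate(partitions):
    results = [worker(part) for worker in replicas[idx]]
    aggregate += majority_vote(results)

print(aggregate)  # 10.0: the Byzantine result is outvoted
```

With replication factor r, such a repetition scheme tolerates up to (r - 1)/2 Byzantine replicas per partition, at the cost of an r-fold computation overhead.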

Main paper:

  • Chen, L., Wang, H., Charles, Z. & Papailiopoulos, D. (2018). DRACO: Byzantine-Resilient Distributed Training via Redundant Gradients. arXiv:1803.09877 [cs, math, stat].

Related papers:

  • Tandon, R., Lei, Q., Dimakis, A. G. & Karampatziakis, N. (2017). Gradient Coding. arXiv:1612.03301 [cs, math, stat].
  • Chen, Y., Su, L. & Xu, J. (2017). Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2), 1-25.
  • Blanchard, P., El Mhamdi, E. M., Guerraoui, R., & Stainer, J. (2017). Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in Neural Information Processing Systems, 30.


Prerequisites:

  • Knowledge in coding theory
  • Proficiency in linear algebra
  • Basic knowledge of machine learning and gradient descent algorithms


Implementation of a Demonstrator for Coded Computing


In future 6G mobile communication networks, machine learning and other complex tasks will be executed on distributed computing clusters. To reduce latency, coding schemes can be applied: redundant computation tasks are scheduled so that the impact of slow worker nodes (stragglers) is alleviated.
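A classic toy instance of such a scheme (an illustration rather than any particular scheme from the literature) is coded matrix-vector multiplication: the matrix is split into two halves, a third parity task computes their sum, and any two of the three worker results suffice to recover the full product, so one straggler can be ignored:

```python
def matvec(A, x):
    # plain matrix-vector product over Python lists
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]

# split A row-wise into two halves A1, A2
A1 = [[1, 0], [0, 1]]
A2 = [[2, 3], [4, 5]]
x = [1.0, 2.0]

# three worker tasks: A1 x, A2 x, and the parity task (A1 + A2) x
tasks = {
    "w1": lambda: matvec(A1, x),
    "w2": lambda: matvec(A2, x),
    "w3": lambda: matvec([vec_add(r1, r2) for r1, r2 in zip(A1, A2)], x),
}

# suppose worker w2 straggles: recover A2 x from the other two results
r1, r3 = tasks["w1"](), tasks["w3"]()
recovered = vec_sub(r3, r1)

print(matvec(A1, x) + recovered)  # [1.0, 2.0, 8.0, 14.0] = the full product A x
```

This is the simplest (3, 2) case; general schemes use MDS codes so that any k of n worker results suffice.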

In the context of our related research, we plan to implement a demonstrator that presents a potential application of coded computation schemes and highlights their benefits. For example, coded computations could be used to run a machine learning algorithm in a distributed manner on different worker machines.

The objective of this project is to become familiar with different coded computing schemes proposed in the literature and to implement them for a small wireless distributed computing network. In a first step, a simple Python-based distributed computation cluster shall be set up using the Ray framework and TensorFlow (or PyTorch). Subsequently, a coding scheme shall be implemented on top of this cluster.
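The project itself would use Ray (e.g. ray.wait to collect only the fastest results), but the wait-for-the-fastest-k pattern at the heart of straggler mitigation can be sketched with only the Python standard library; the task IDs, delays, and return values below are invented for illustration:

```python
import concurrent.futures
import time

def worker_task(task_id, delay, value):
    # each worker sleeps to emulate heterogeneous speeds; one task is a straggler
    time.sleep(delay)
    return task_id, value

# n = 3 redundant tasks, any k = 2 of which suffice to decode (as in an MDS scheme)
jobs = [(0, 0.01, 11), (1, 0.01, 22), (2, 1.0, 33)]  # task 2 is the straggler

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(worker_task, *job) for job in jobs]
    done = []
    # plays the role of ray.wait(futures, num_returns=2): take the first k finishers
    for fut in concurrent.futures.as_completed(futures):
        done.append(fut.result())
        if len(done) == 2:
            break

fast_ids = sorted(task_id for task_id, _ in done)
print(fast_ids)  # [0, 1]: the straggler (task 2) is not waited for
```

In Ray, the same pattern replaces ThreadPoolExecutor with @ray.remote tasks distributed across the cluster, which is what makes the scheme interesting for a wireless multi-machine setup.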


Prerequisites:

  • Knowledge in channel coding
  • Good programming skills in Python
  • Basic knowledge of machine learning with TensorFlow
  • Interest in distributed computing