Upon completion of this module, students will be able to:
* Understand the design flow and design steps for deploying machine learning workloads on embedded devices.
* Evaluate the trade-offs involved in executing machine learning workloads, such as neural network inference, in software and hardware on embedded processors and dedicated accelerators.
* Effectively apply model compression methods to embedded machine learning workloads and understand the theory behind these methods.
* Apply hardware acceleration principles (SIMD, vector processing, 2D systolic arrays) to accelerate ML workloads and understand the influence of the memory system.
* Apply a state-of-the-art machine learning deployment flow to a simple machine learning application such as keyword recognition.
* Implement the deployment code on an embedded processor platform (micro-controller board) and design a simple hardware accelerator for the application that works in simulation.
Description
The lecture will cover the following contents:
* Introduction to the design flow and design steps to deploy machine learning workloads on embedded devices
* Machine learning theory to understand the typical structure, operators and trade-offs in accuracy, memory and performance demands of machine learning workloads
* Neural Network Model Compression Methods: Number systems, Integer and sub-byte Quantization, Quantization-aware training, Pruning, Rank Reduction
* Software Optimization Methods: Memory planning, target-aware operator optimization, operator fusing and tiling
* Methods and basic HW blocks for embedded HW acceleration (SIMD, Vector Instructions, loosely-coupled accelerators, memory systems)
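To give a flavor of one compression method from the list above, the following is a minimal sketch of symmetric per-tensor 8-bit integer quantization (the function names and the scaling scheme are illustrative assumptions, not the API of any particular toolchain):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~ scale * q."""
    scale = np.abs(weights).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

# Example: quantize random float32 weights and measure the error,
# which is bounded by half a quantization step (scale / 2).
w = np.random.randn(64).astype(np.float32)
q, s = quantize_int8(w)
max_err = np.abs(w - dequantize(q, s)).max()
```

The memory saving (4x over float32) and the bounded rounding error illustrate the accuracy/memory trade-off that the lecture's quantization theory formalizes.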
The lab part will cover the following contents to transfer the theory into practice:
* Introduction to the Machine Learning Deployment Toolchain TVM
* Training, model optimization and deployment of a keyword recognition application onto a low-power micro-controller board using TVM
* Design of a simple HW accelerator to offload machine learning workloads into hardware, tested by simulation
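As a flavor of the last lab task, the core datapath of such an accelerator is typically a multiply-accumulate (MAC) unit. The following is a behavioral sketch in plain Python (not HDL; the function name and parameters are illustrative assumptions) of an int8 dot-product MAC with a fixed-width accumulator:

```python
def mac_dot(a, b, acc_bits=32):
    """Behavioral model of an int8 dot-product MAC unit.

    Multiplies int8 operand pairs and accumulates into a register of
    acc_bits width, wrapping around like a hardware accumulator.
    """
    mask = (1 << acc_bits) - 1
    acc = 0
    for x, y in zip(a, b):
        assert -128 <= x <= 127 and -128 <= y <= 127, "int8 operands only"
        acc = (acc + x * y) & mask  # wrap-around of a fixed-width register
    # reinterpret the unsigned accumulator bits as two's-complement signed
    if acc >= 1 << (acc_bits - 1):
        acc -= 1 << acc_bits
    return acc

# e.g. mac_dot([1, 2, 3], [4, 5, 6]) accumulates 4 + 10 + 18 = 32
```

A behavioral model like this serves as the golden reference against which an HDL implementation can be checked in simulation.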
Prerequisites
Basic knowledge of embedded C and micro-controllers (e.g. in the form of a micro-controller programming lab) is assumed.
Basic knowledge of hardware design in VHDL or Verilog (e.g. in the form of a hardware design lab or the lecture "Entwurf Digitaler Systeme mit VHDL und SystemC" (Prof. Ecker)) is assumed.
Basic knowledge of machine learning algorithms (e.g. in the form of the computational intelligence lecture or the "Machine Learning: Methods and Tools" lecture) is recommended.
Teaching and learning methods
The module will consist of two parts: a weekly lecture and a parallel lab comprising three tasks.
The lecture will consist of classroom sessions with slide presentations. Exercises will be integrated into the lecture flow so that students apply the learned content directly to example problems, using activating methods such as group work.
The lab will be split into three major tutored tasks, each introduced in one classroom lab session.
The students will work on these tasks in small groups on their own schedule, which also trains teamwork and independent work. The lab tasks directly put the lecture content into practice, following a problem-oriented learning approach.
The course will be taught based on lecture material in the form of slides with additional exercises. The lab part will involve work with a state-of-the-art open-source simulation and deployment flow (TVM) that can be used either on university lab PCs or on private machines. Additionally, low-power micro-controller boards will be used to demonstrate the application on real hardware.
Examination
1) Code submission for the lab part (coursework, 20% of final grade)
2) Written final exam (90 min) (examination, 80% of final grade)
1) The students will hand in code submissions for the three lab parts, which will be graded. This grade counts for 20% of the final grade.
2) In the final exam (90 min written or 30 min oral), the students will answer questions on the lecture content and the lab part to test their understanding of the theoretical and practical aspects of embedded machine learning. This grade counts for 80% of the final grade.
The first part (1) tests that the students have acquired the practical programming skills to deploy an embedded ML project, while part (2) tests their understanding of the practical and theoretical concepts of embedded ML.
Recommended literature
There exist textbooks covering parts of the lecture content. It is planned to provide a lecture script based on the lecture contents in the future. The following books cover parts of the lecture's content and provide related information:
* Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel S. Emer; Efficient Processing of Deep Neural Networks; Morgan & Claypool Publishers
* David Patterson, John Hennessy; Computer Organization and Design RISC-V Edition - The Hardware Software Interface; Elsevier