Robotic Gaze Control using Reinforcement Learning

Master's Thesis, Martin Rothbucher


This thesis examines how an adaptive controller can learn to point a camera at the active speaker in a conversation. A motivating scenario is a video conferencing system that moves its camera to track whoever is currently speaking. This domain is challenging for approaches based on control theory, which rely on specific hardware (such as a microphone array) and on accurate models of the system dynamics (for example, beam-forming approaches must estimate the direction of sound arrival, which is difficult in an echoic environment). Instead, we use methods from reinforcement learning, where the task is specified by an observable objective referred to as the reward signal. Specifying the task this way enables an adaptive controller to improve its performance with experience. One might envision the reward for this task coming from detected lip motion of the speakers, but in this work we restrict our experiments to simpler, direct visual feedback from the speakers. We performed multiple experiments on a physical robot system with real audio data to examine the feasibility and potential of this approach, and we tested the utility of a variety of audio features. Our experimental results demonstrate that the system learns in real time (within five minutes) to identify the active speakers and that it uses sound data to improve performance.