Imagine you are driving a car up a hill. Normally this poses no problem. Now imagine your car is very heavy and its engine is not powerful enough. Fortunately, on the opposite side of the valley you are in there is another hill, which you can use to gain momentum in order to reach the other side.

Now your actions are heavily limited: you can accelerate forward, accelerate backward, or do nothing at all. The goal is to drive up the hill with only these three distinct actions.
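The task described above is the classic Mountain Car problem. A minimal sketch of its dynamics, following the standard formulation from Sutton and Barto (the function name and return convention are illustrative, not part of the original text):

```python
import math

def mountain_car_step(position, velocity, action):
    """One step of the classic mountain-car dynamics.

    action: 0 = accelerate backward, 1 = do nothing, 2 = accelerate forward.
    Returns (position, velocity, reward, done); reward is -1 per step,
    so the agent is pushed to reach the goal as quickly as possible.
    """
    velocity += 0.001 * (action - 1) - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))  # speed limit
    position += velocity
    if position < -1.2:                         # inelastic left wall
        position, velocity = -1.2, 0.0
    done = position >= 0.5                      # goal on the right hilltop
    return position, velocity, -1.0, done
```

The underpowered engine shows up directly in the constants: the thrust term (0.001) is smaller than the gravity term (0.0025), so driving forward from a standstill in the valley never reaches the goal, and the car must first swing backward.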

In this work a Reinforcement Learning (RL) controller is to be implemented that can learn tasks such as the one described above. The experiments are to be done in a simulated environment.


In Reinforcement Learning (RL) an intelligent agent acts in an environment of which it can only perceive the current state and a reward. By trying out actions from a given set and perceiving the resulting state changes and rewards, the agent tries to learn a policy that maximizes the expected cumulative reward.
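This interaction loop can be illustrated with tabular Q-learning, one of the standard RL algorithms. The sketch below assumes a small discrete environment given by an `env_step` function and a `reset` function; all names and defaults are illustrative:

```python
import random

def q_learning(env_step, reset, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: estimate action values from sampled transitions.

    env_step(s, a) -> (s_next, reward, done); reset() -> start state.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = reset(), False
        while not done:
            # epsilon-greedy: mostly exploit, sometimes explore
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = env_step(s, a)
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])   # move toward the target
            s = s2
    return Q
```

The table `Q` is exactly the tabular value function mentioned below; the next paragraph explains why this table breaks down in continuous domains.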

In the course of learning an optimal policy to gain maximum reward, the agent builds up a so-called value function. In continuous domains it is no longer possible to build up a table of all possible states (e.g. pole balancing, where a pole is mounted by a joint on a cart that can move left or right; here the position is a continuous quantity). A tool often used in these cases is value function approximation. Unfortunately, under some conditions the traditional RL algorithms can diverge when used with value function approximation.

A recently proposed family of algorithms, Gradient Temporal Difference (GTD) learning, is proven to be stable under these conditions.

The aim of this work is to investigate the family of GTD algorithms. Alternative gradient updates are then to be implemented and their empirical performance is to be evaluated.
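As a concrete example of a member of this family, here is a sketch of TD with gradient correction (TDC), after Sutton et al. (2009). Besides the primary weights `w` it maintains an auxiliary weight vector `h` that tracks the expected TD error and corrects the semi-gradient bias; names and step sizes are illustrative:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tdc(transitions, phi, n_features, alpha=0.05, beta=0.1, gamma=0.99):
    """TD with gradient correction (TDC), a member of the GTD family.

    transitions: iterable of (s, reward, s_next, done) tuples;
    phi(s) returns a feature vector of length n_features.
    """
    w = [0.0] * n_features   # primary value-function weights
    h = [0.0] * n_features   # auxiliary weights estimating E[delta * phi]
    for s, r, s2, done in transitions:
        x = phi(s)
        x2 = [0.0] * n_features if done else phi(s2)
        delta = r + gamma * dot(w, x2) - dot(w, x)   # TD error
        hx = dot(h, x)
        for i in range(n_features):
            # semi-gradient term plus the gradient correction -gamma*x2*hx
            w[i] += alpha * (delta * x[i] - gamma * x2[i] * hx)
            # h follows delta via least-mean-squares
            h[i] += beta * (delta - hx) * x[i]
    return w
```

On on-policy problems TDC converges to the same fixed point as TD(0), but unlike TD(0) it remains stable under off-policy sampling, which is the property to be studied empirically in this work.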

Further Information:

Gradient Temporal Differences (1)

Gradient Temporal Differences (2)

The work can be done as a Bachelor's or Master's thesis or as an Interdisciplinary Project (IDP). If interested, please contact Dominik Meyer.