## Deep Reinforcement Learning

**Goal**

In this week’s summary we introduce the basic concepts behind reinforcement learning and some ways it is applied in very controlled environments.

**Motivation**

Reinforcement learning methods recently attracted a lot of attention when AlphaGo reached the level of the best human Go players.

Furthermore, methods that cope with the complexity of Go might ease the transfer of reinforcement learning to very large NLP tasks like dialog handling.

**Ingredients**

Markov Decision Process, exploration and exploitation, policy, Monte Carlo Tree Search, Q-Learning

**Steps**

Reinforcement learning is usually applied to tasks where an agent can observe its environment at least partially and has to choose an action.

The action influences the state of the environment, which responds with some reward.

Basically any kind of game fits this description.

More mathematically, such situations are described by Markov Decision Processes.

An agent in a certain state of the environment performs an action according to some policy and receives a reward related to that action.

In a lot of games the Markov property holds in addition, which states that only the current state and the action taken determine the next state.
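As a minimal sketch of these ingredients, consider a hypothetical corridor with five positions and a goal at the right end. The names `N_STATES`, `ACTIONS`, `step` and `policy` are assumptions made for illustration and do not come from the resources. The point is that `step` needs nothing but the current state and the chosen action, which is the Markov property in code.

```python
# Hypothetical toy MDP: a corridor with positions 0..4 and the goal at position 4.
N_STATES = 5
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Return (next_state, reward) from the current state and action only."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def policy(state):
    """A fixed illustrative policy: always move right."""
    return +1


# Roll out one episode and record the reward received at every step.
state, rewards = 0, []
while state != N_STATES - 1:
    state, reward = step(state, policy(state))
    rewards.append(reward)

print(rewards)
```

Here the reward is zero everywhere except on the step that reaches the goal, so the agent only learns at the very end of the episode whether its actions were useful.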

Rolling out one Markov Decision Process in time, we can sum up the rewards of all steps to obtain a total reward.

Accordingly, at each point in time we find a total future reward, in which rewards further in the future can be given less weight by a discount factor.
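A common choice for this weighting is a geometric discount, where the reward k steps ahead is multiplied by gamma to the power k. The sketch below assumes this form. The function names and the value of `gamma` are chosen for illustration only.

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted reward of a reward sequence, starting at its first step."""
    total = 0.0
    # Work backwards so that each step only needs the return of the step after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def returns_per_step(rewards, gamma=0.9):
    """Discounted future reward at every point in time of the sequence."""
    returns, total = [], 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
        returns.append(total)
    returns.reverse()
    return returns


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
print(returns_per_step([1.0, 1.0, 1.0], gamma=0.5))
```

With `gamma=1.0` this reduces to the plain sum of rewards, while smaller values make the agent care more about immediate rewards.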

Then the agent should learn a policy that allows it to pick an action that maximizes future rewards.

One way to do this is to define a value function for every state that ranks possible actions. Other approaches try to find the policy directly or find transition models for the environment.
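To illustrate the value-function approach, the sketch below stores one value per state and action in a plain dictionary and picks the best-ranked action. The states, actions and numbers are made up for illustration.

```python
# Hypothetical action values: q_values[state][action] is the estimated future reward.
q_values = {
    "start": {"left": 0.1, "right": 0.7},
    "middle": {"left": 0.4, "right": 0.2},
}


def greedy_action(q_values, state):
    """Pick the action with the highest estimated value in the given state."""
    actions = q_values[state]
    return max(actions, key=actions.get)


print(greedy_action(q_values, "start"))
print(greedy_action(q_values, "middle"))
```

The policy is then implicit: it is whatever the value function ranks highest in each state.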

When setting up the agent, one usually runs into the dilemma between exploring the space of possible states and exploiting the states with the highest rewards.

Random initialization of the value function leads to exploration in the early phase since the value function will vary a lot in the beginning.

However, later on a greedy strategy is found and exploration comes to an end. Introducing a probabilistic element into picking the next action can provide more variation. Recently, Monte Carlo Tree Search methods have been deployed to unfold the tree of the most promising actions and then pick the branch visited most often. This is somewhat similar to approaches to the multi-armed bandit problem.
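One simple way to add such a probabilistic element is epsilon-greedy action selection, sketched below. It is named here as one possible technique and is not necessarily the one used in the resources. The function name, the `rng` parameter and the example values are assumptions.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))  # explore
    return max(action_values, key=action_values.get)  # exploit


# Hypothetical value estimates for three actions in one state.
action_values = {"left": 0.1, "stay": 0.3, "right": 0.7}
rng = random.Random(0)
picks = [epsilon_greedy(action_values, epsilon=0.2, rng=rng) for _ in range(10)]
print(picks)
```

A common design choice is to start with a large `epsilon` and reduce it over time, so that the agent explores early and exploits later.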

A major problem for agents that only learn online is that the states and actions their learning relies on are highly correlated, and hence learning might get stuck in a local minimum. Experience replay is one way to overcome this problem. By storing previous experiences and learning from random samples of them in a supervised fashion, it is possible to break the correlation between states.
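A replay memory can be sketched as a bounded buffer of transitions from which random mini-batches are drawn. The class name, the tuple layout and the capacity below are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw transitions uniformly at random, without replacement."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Fill the buffer with consecutive (and therefore correlated) dummy transitions.
buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(150):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=8)
print([transition[0] for transition in batch])
```

Because the batch mixes transitions from different points in time, consecutive training examples are no longer neighbours in the original trajectory.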

Deep neural networks come into play for learning appropriate value functions.

The Bellman expectation and optimality equations introduce a recursive relation between a target value function and the current value function.
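In one common notation, which is assumed here and not taken from the resources, the two relations for an action-value function read as follows. Here `s` and `a` are the current state and action, `r` is the reward, `s'` and `a'` are the next state and action, and `\gamma` is the discount factor.

```latex
% Bellman expectation equation for a fixed policy \pi
Q^{\pi}(s, a) = \mathbb{E}\left[\, r + \gamma \, Q^{\pi}(s', a') \;\middle|\; s, a \,\right]

% Bellman optimality equation
Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]
```

The right-hand side plays the role of the target, and the left-hand side is the current estimate that should move towards it.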

This relation can be used to define a mean squared error function which provides a way of backpropagating errors and updating the weights of the network.
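The sketch below shows one such update step. To stay self-contained it replaces the deep network with a linear model that has one weight vector per action, which is an assumption made for brevity. The function names, features and hyperparameters are likewise illustrative.

```python
def q_value(weights, features, action):
    """Linear stand-in for a network: Q(s, a) = weights[a] . features(s)."""
    return sum(w * x for w, x in zip(weights[action], features))


def td_update(weights, features, action, reward, next_features, done,
              gamma=0.9, learning_rate=0.1):
    """One gradient step on 0.5 * (target - Q(s, a))**2 with the target held fixed."""
    if done:
        target = reward
    else:
        best_next = max(q_value(weights, next_features, a)
                        for a in range(len(weights)))
        target = reward + gamma * best_next
    error = target - q_value(weights, features, action)
    # The gradient of the loss with respect to weights[action] is -error * features.
    weights[action] = [w + learning_rate * error * x
                       for w, x in zip(weights[action], features)]
    return error


# Two actions, two features; repeat the same terminal transition twice.
weights = [[0.0, 0.0], [0.0, 0.0]]
first = td_update(weights, [1.0, 0.0], 1, 1.0, [0.0, 1.0], done=True)
second = td_update(weights, [1.0, 0.0], 1, 1.0, [0.0, 1.0], done=True)
print(first, second)
```

The error shrinks from the first call to the second, because the estimate for the chosen action has moved towards the target. In a deep network the same error would be backpropagated through all layers.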

Recent attention towards reinforcement learning was mostly driven by the success of AlphaGo.

Games in general provide simple enough environments to apply reinforcement learning, since the Markov property holds, the agents have perfect information about their environment, and the number of possible states and actions is limited.

Therefore it is not too surprising that reinforcement learning was successfully applied to other, simpler games like Backgammon almost 25 years ago.

Progress was furthermore made in the field of computer games.

In particular, agents playing simple Atari games have reached human-level performance in recent years.

The game of Go is a new milestone in this area of research, since the state space and the number of possible actions are orders of magnitude larger than, for example, in chess.

**Outlook**

The complexity of Go and the corresponding reinforcement learning methods developed are of interest to the field of NLP.

Even simple tasks for a dialogue management system easily reach a higher complexity, so transferring methods from Go to NLP is an obvious way forward.

However, in non-game environments it is naturally hard to define a reward function, since it is difficult to quantify the goal or purpose of a conversation.

**Resources**

“Deep Reinforcement Learning” (WEB). *Deep Reinforcement Learning*. June 2015. Accessed 2nd April 2016.

“Guest Post (Part I): Demystifying Deep Reinforcement Learning” (WEB). *Guest Post (Part I): Demystifying Deep Reinforcement Learning*. December 2015. Accessed 2nd April 2016.