Unity recently released v0.5 of its ML-Agents toolkit, and I am really excited to use it for learning and research.
The toolkit provides OpenAI Gym-like environments that can be configured to a degree and built for your experiment's needs.
The environment I used for training an agent with DQN is the Banana Collector environment, where an agent moves around an area trying to collect as many rewarding yellow bananas as possible while avoiding the blue bananas, which penalize the agent for stepping on them.
To get a sense of the difficulty of the problem, here the agent is taking random actions (it is only allowed to move):
The agent perceives this environment as a state of 37 dimensions, which includes the agent's velocity and a ray-based perception of the objects around it. It shouldn't take more than 1000 episodes to solve this environment; this particular project states that an average reward of 13 points over 100 episodes counts as solved. Here is how I fared: (Github link)
432 episodes: because the reward of 13 was averaged over the preceding 100 episodes
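In code, that stopping criterion amounts to a rolling average over the last 100 episode scores. A minimal sketch (the function name and structure are mine, not the project's):

```python
from collections import deque

def is_solved(scores_window, target=13.0):
    """Solved once the average score over the last 100 full
    episodes reaches the target."""
    return (len(scores_window) == scores_window.maxlen
            and sum(scores_window) / len(scores_window) >= target)

# deque with maxlen=100 keeps only the most recent 100 scores
window = deque(maxlen=100)
for episode_score in [13.0] * 100:
    window.append(episode_score)
```

Because the average needs 100 completed episodes before it is meaningful, the earliest an agent can register as solved is episode 100, even if it plays perfectly from the start.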
So let’s take a look at the difference
The agent learns to choose the yellow bananas and avoids the blue ones like the plague!
While it was fun to have the agent trained, the difficult part was gaining an intuition for why this works and for its inner workings. I had heard many times that deep reinforcement learning algorithms are unstable, and that it is really difficult to accomplish something of value with them, but that didn't help me understand what it really meant.
I consider myself lucky when I run into things that don't work, as that gives more perspective on the amount of work needed to make them reliable, and on the challenges that would appear in real-world applications. Here is a glimpse:
An agent stuck between the choices
I had programmed the agent to early-stop when the environment is solved (13 points), but this picture shows the value of the work done over 432 episodes. To be honest, this happened many times: about 40% of the time I would end up with a scenario like this, and it would make me wonder what went wrong.
In an environment where the exploration space is infinite, the agent will never have enough experience to know the best outcome for every state.
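A standard way to keep exploring such a space is an epsilon-greedy policy: mostly act on current value estimates, but occasionally act at random. A generic sketch (not the project's code; the function name is mine):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (explore),
    otherwise pick the highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In practice epsilon is usually annealed from 1.0 toward a small floor, so early episodes are nearly random and later ones mostly exploit what the network has learned.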
Using neural networks as approximators compounds the difficulty by adding the challenges involved in training them.
The agent sees the environment change really slowly, and little of that helps until it starts getting a reward. What's worse is the association of sequential states with actions: being rewarded for taking the same action in a similar environment convinces the agent that the action is superior, when it has really only learnt to exploit a small section of the environment.
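The standard DQN remedy for this correlation between sequential states is experience replay: store transitions and train on random minibatches drawn from them. A minimal sketch (class and field names are illustrative):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Stores transitions and samples them uniformly at random,
    breaking the temporal correlation between sequential states."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # old experiences fall off

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

Because a minibatch mixes transitions from many different parts of many episodes, gradient updates no longer reinforce one narrow streak of similar states.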
The agent has to estimate the value of an action in a given state such that the value of the resulting state is optimal. These are two separate approximations that need to agree. It is easy to make a mistake while selecting a good action for the current state, and also in assuming that the perceived next state is actually the optimal one.