Reward Function The Obstacle Tower is designed to be a sparse-reward environment. The environment comes with two reward configurations: sparse and dense. In the sparse configuration, a positive reward of +1 is provided only when the agent reaches the exit stairs of a floor of the tower. In the dense configuration, an additional positive reward of +0.1 is provided for opening doors, solving puzzles, or picking up keys. In many cases even the dense reward version of the Obstacle Tower will likely resemble the sparsity seen in previous sparse-reward benchmarks, such as Montezuma's Revenge (Bellemare et al. 2013). Given the sparse-reward nature of this task, we encourage researchers to develop novel intrinsic reward-based systems, such as curiosity (Pathak et al. 2017), empowerment (Mohamed and Rezende 2015), or other signals to augment the external reward signal provided by the environment.
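For concreteness, here is a minimal sketch of what augmenting the external reward with an intrinsic signal could look like, assuming the environment is exposed through the standard Gym API (as the Unity Gym wrapper is). The count-based bonus, the observation hashing, and the `bonus_scale` value are illustrative assumptions, not part of the Obstacle Tower release; a learned curiosity model such as Pathak et al. (2017) would replace the hashing step.

```python
import gym
import numpy as np


class CountBonusWrapper(gym.Wrapper):
    """Adds a simple count-based intrinsic bonus to the sparse extrinsic reward."""

    def __init__(self, env, bonus_scale=0.01):
        super().__init__(env)
        self.bonus_scale = bonus_scale
        self.counts = {}

    def _key(self, obs):
        # Coarsely discretise the visual observation so that similar frames fall
        # into the same bucket (a crude stand-in for a learned state embedding).
        small = np.asarray(obs)[::16, ::16] // 64
        return small.tobytes()

    def step(self, action):
        obs, extrinsic, done, info = self.env.step(action)
        key = self._key(obs)
        self.counts[key] = self.counts.get(key, 0) + 1
        # Rarely visited states receive a larger bonus; familiar states decay.
        intrinsic = self.bonus_scale / np.sqrt(self.counts[key])
        return obs, extrinsic + intrinsic, done, info
```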
I am wondering why this benchmark is aimed at being a sparsely rewarding environment. I ask because my current understanding is that the choice between sparse and dense rewards is essentially a trade-off in learning time, and little more: sometimes we have the luxury of implementing dense rewards that cut down training time, and other times we do not.
I would actually find the reward engineering for some of the tasks mentioned in the paper quite challenging, or rather the articulation (implementation) of such a reward policy. I have been looking into RL for a number of years now, but I cannot say I have been closely tied to the community up until this challenge, so I may be missing some of the intention behind setting the tower up as a sparse-reward environment in the first place.
Could the tower not have been set up without a reward function, or rather, could the provided reward function be treated as more of a default?
When measuring performance across agents, it makes sense that they should be measured on the same reward. However, in the context of this competition, I wonder whether we are allowed to augment the reward function that our agent/neural net receives. As I understand it, just about anything goes: I can build whatever neural network ensemble I like, taking in any parameters I feed it along with the data provided by the Unity Gym environment. I could in fact go as far as setting up an imitation learning pipeline in which the rewards are not used at all. Are there any technical limits?
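To make the "augmented reward" idea concrete, a sketch of what I have in mind is below, again assuming the standard Gym wrapper interface. The agent would train on a shaped signal while the environment's own reward is kept in `info` so evaluation can still use the common metric; the per-step penalty and the `env_reward` key are hypothetical choices of mine, and whether this is permitted at training time is exactly the rules question I am asking.

```python
import gym


class ShapedRewardWrapper(gym.Wrapper):
    """Lets the learner see a shaped reward while preserving the evaluation reward."""

    def __init__(self, env, step_penalty=0.001):
        super().__init__(env)
        self.step_penalty = step_penalty

    def step(self, action):
        obs, env_reward, done, info = self.env.step(action)
        info["env_reward"] = env_reward          # untouched signal for evaluation
        shaped = env_reward - self.step_penalty  # what the agent actually trains on
        return obs, shaped, done, info
```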