Let's talk sparse / dense rewards

TruthMaker · April 30, 2019, 8:31pm

Hi Community

I’ve been contemplating the idea around sparse / dense rewards, and have been reading into @arthurj’s OT paper. I suppose the paper inspired my curiosity more than my agent’s so far… (ha ha)

“Reward Function. The Obstacle Tower is designed to be a sparse-reward environment. The environment comes with two possible configurations: a sparse and dense reward configuration. In the sparse reward configuration, a positive reward of +1 is provided only upon the agent reaching the exit stairs of a floor of the tower. In the dense reward version a positive reward of +0.1 is provided for opening doors, solving puzzles, or picking up keys. In many cases even the dense reward version of the Obstacle Tower will likely resemble the sparsity seen in previously sparse rewarding benchmarks, such as Montezuma’s Revenge (Bellemare et al. 2013). Given the sparse-reward nature of this task, we encourage researchers to develop novel intrinsic reward-based systems, such as curiosity (Pathak et al. 2017), empowerment (Mohamed and Rezende 2015), or other signals to augment the external reward signal provided by the environment.”
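To make the two configurations in the excerpt concrete, here is a rough sketch of how they might be written down. The `StepEvents` flags and the `reward` function are hypothetical names for illustration, not part of the environment's API, and the sketch assumes the dense configuration keeps the +1 for reaching the exit, which the excerpt does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class StepEvents:
    """Hypothetical per-step event flags (illustration only, not the real API)."""
    reached_exit: bool = False
    opened_door: bool = False
    solved_puzzle: bool = False
    picked_up_key: bool = False


def reward(events: StepEvents, dense: bool) -> float:
    """Reward for one step under the sparse or dense configuration."""
    total = 0.0
    if events.reached_exit:
        # Assumption: both configurations pay +1 at the exit stairs.
        total += 1.0
    if dense:
        # +0.1 for each intermediate event named in the excerpt.
        intermediate = (events.opened_door, events.solved_puzzle, events.picked_up_key)
        total += 0.1 * sum(intermediate)
    return total
```

Under the sparse setting, a step that only opens a door yields 0, which is why an agent can go a long time without seeing any signal at all.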

I am wondering why this benchmark is designed as a sparse-reward environment. I ask because my current understanding is that sparse vs dense reward is mostly a trade-off in learning time, and little else. Sometimes we have the luxury of implementing dense rewards that cut down the learning time, other times not.
I would actually find the reward engineering needed to solve some of the tasks mentioned in the paper quite challenging. Or rather, the articulation (implementation) of such a reward scheme. I’ve been looking into RL for a number of years now, but I can’t say that I’ve been closely tied to the community up until this challenge, so I may be missing some of the intention behind why the tower was initially set up around sparse rewards.
Could the tower not have been set up without a reward function, or with the reward function treated as more of a default?

When comparing performance between agents, yes, it makes sense to measure them on the same reward. But in the context of this competition, I then wonder: are we allowed to augment the reward function that our agent/neural net receives? As I understand it, just about anything goes, i.e. I can build whatever neural net ensemble I like, taking in any params that I feed it along with the data provided by the Unity gym env. I could in fact go as far as setting up an imitation learning environment, where the rewards are not used at all. Are there any technical limits?
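As a sketch of what "augmenting the reward the agent receives" could look like in practice: the wrapper below adds a bonus to the training reward while keeping the unmodified environment reward in `info`, so agents can still be compared on the same score. Everything here is an assumption for illustration: `ShapedRewardEnv`, `bonus_fn`, and the toy `CorridorEnv` are hypothetical names, and the four-value `step` return follows the older gym convention rather than anything confirmed in this thread.

```python
class CorridorEnv:
    """Toy stand-in for a sparse-reward task: walk right to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action: +1 moves right, -1 moves left (clamped at 0)
        self.pos = max(0, self.pos + action)
        done = self.pos >= self.length
        reward = 1.0 if done else 0.0  # sparse: only paid at the goal
        return self.pos, reward, done, {}


class ShapedRewardEnv:
    """Wraps any env with gym-style reset()/step() and adds a bonus reward."""

    def __init__(self, env, bonus_fn):
        self.env = env
        self.bonus_fn = bonus_fn  # hypothetical: (obs, action, next_obs) -> float
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, action):
        next_obs, env_reward, done, info = self.env.step(action)
        bonus = self.bonus_fn(self._last_obs, action, next_obs)
        # Keep the original reward visible for evaluation.
        info = dict(info, env_reward=env_reward, bonus=bonus)
        self._last_obs = next_obs
        return next_obs, env_reward + bonus, done, info


# Usage: a small bonus for progress to the right.
env = ShapedRewardEnv(CorridorEnv(length=3),
                      bonus_fn=lambda obs, action, next_obs: 0.01 * (next_obs - obs))
obs = env.reset()
obs, reward, done, info = env.step(1)
```

The agent would train on `reward`, while `info["env_reward"]` remains the number used for comparison between agents.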

arthurj · May 1, 2019, 12:46am

Hi @TruthMaker

These are good questions. You are right that providing a denser reward typically makes the problem easier. The problem is that dense rewards aren’t always available, for a couple of reasons. The first is that in real-world problems such as robotics, a reward may only be received at certain points and not others. It may not be possible to provide rewards arbitrarily, as can be done in a simulation. The second is that in many cases, even if you could provide a dense reward, it is unclear what reward to provide. In the case of an agent performing a backflip, for example, aside from 1 (backflip completed) and 0 (backflip failed), there isn’t a great way to provide a denser signal.

Because of these two issues, there is a lot of interest in the community around designing algorithms which can perform well in the sparse reward case. In Obstacle Tower we provide two reward functions, but in a sense both of them are pretty sparse, with the dense one still not providing a reward signal all that often. This is to allow Obstacle Tower to be useful as a benchmark for algorithms designed to handle sparse rewards. It is also because there isn’t a meaningful way to provide a denser signal without overly constraining the expected behavior of the agent.

I hope this provides some insight into our thoughts when designing it.

TruthMaker · May 1, 2019, 1:23am

Yes, thanks a lot for your insights. The idea around robots not having availability of a dense reward function like in the back-flip example is particularly interesting.

TruthMaker · May 1, 2019, 11:44am

@arthurj Now that I have caught up on the problems that have been observed in sparse-reward systems, the objective of this competition is clearer to me. I was initially pulled into this competition thinking from a reward engineering perspective, i.e. classical RL. I realise I may have been mistaken.

I would like clarity on the technical limits within this competition. There are known solutions (or solution attempts) for generalising in sparse-reward systems, whereby the agents generate their own rewards. It’s clear to me now that these kinds of algorithms are what this competition is really after. However, does this mean that manually engineering rewards for OT is not among the allowed solutions? Or is it expected that the agent only self-learns particular goals/rewards? For example, I previously mentioned imitation learning as a possible way for the AI to learn. Although the agent may learn to generalise on OT that way, it may not be a generic solution that can be applied to other problems: each environment would require imitation, and who knows, the top of this tower may not be reachable by humans (looking forward to seeing round 2).
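For a concrete picture of an agent "generating its own rewards", here is a minimal sketch of a count-based novelty bonus. It is a much simpler stand-in for the curiosity and empowerment methods cited in the paper excerpt, not an implementation of either. `CountBonus`, `beta`, and `state_key` are hypothetical names, and the sketch assumes observations can be reduced to a hashable key (for images that would need some discretisation or hashing step, not shown here).

```python
import math
from collections import Counter


class CountBonus:
    """Novelty bonus that shrinks as a state is revisited: beta / sqrt(N(s))."""

    def __init__(self, beta=0.1):
        self.beta = beta          # scale of the intrinsic reward (illustrative value)
        self.counts = Counter()   # visit count per state key

    def __call__(self, state_key):
        # Count this visit, then pay a bonus that decays with familiarity.
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])
```

The bonus would be added to the environment reward during training only. Because it depends on visit counts rather than on anything specific to OT, nothing about it is hand-engineered for a particular floor or puzzle.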

Thinking a bit more about your robot: I’d essentially try to cut down on training time by breaking the back-flip into smaller rewardable steps for the agent, perhaps best thought of as a form of curriculum learning. First learn to jump high enough and get a grip on gravity, then learn to rotate in a particular direction. This assumes the robot is equipped with an accelerometer, as I doubt the robot would ever reach a back-flip goal in a reasonable/feasible time with only CV. I’m not saying it can’t happen, but the amount of effort/resource cost doesn’t appeal as an effective solution. Hence I’d definitely exploit an accelerometer, and attach one if it didn’t have one. There is a lot I could do with an accelerometer, and our world consists of more than just a visual perspective: we have access to physics and other interesting perceptions, and creating new ways of perceiving is all part of the challenge in most learning (including CV).
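A rough sketch of the staged idea, under heavy assumptions: `JumpSummary` is a hypothetical per-episode summary that would have to be derived from inertial sensors, and the stage boundaries and thresholds (0.5 m, 360 degrees) are made-up illustrative values, not measurements from any real robot.

```python
from dataclasses import dataclass


@dataclass
class JumpSummary:
    """Hypothetical per-episode summary derived from inertial sensors."""
    peak_height_m: float           # highest point reached during the jump
    backward_rotation_deg: float   # total backward rotation while airborne
    landed_upright: bool


def curriculum_reward(stage: int, jump: JumpSummary) -> float:
    """Reward for one attempt, depending on the current curriculum stage."""
    if stage == 0:
        # Stage 0: learn to jump high enough (threshold is illustrative).
        return 1.0 if jump.peak_height_m >= 0.5 else 0.0
    if stage == 1:
        # Stage 1: reward the fraction of a full backward rotation, capped at 1.
        return min(max(jump.backward_rotation_deg, 0.0) / 360.0, 1.0)
    # Final stage: the original sparse signal, 1 only for a completed back-flip.
    completed = jump.backward_rotation_deg >= 360.0 and jump.landed_upright
    return 1.0 if completed else 0.0
```

The final stage falls back to the same 1/0 signal described earlier in the thread, so the earlier stages only shape how the agent gets there. Whether that shaping over-constrains the behaviour is exactly the trade-off raised above.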