What is considered an internal reward?


If the episode terminates before I reach the max step count or max reward, is adding a negative reward to signal death (by drowning or being killed) considered tweaking the reward?

This seems like a natural thing to be allowed in a real-world context. If a robot manipulation task terminates prematurely, say because an experimenter has to step in as the robot is wreaking havoc, it seems natural to automatically classify the terminal state as negative.


Hi jonchuang - Any manual reward shaping based on the environment, including how the agent terminated the episode, is not allowed in this iteration of the competition.
Part of the challenge of the competition is to learn agents without hard-coding features or rewards, by leveraging human demonstrations instead. By relying on methods tied neither to this specific environment nor to domain knowledge, results from this competition will be more widely applicable to other problems.
Luckily, avoiding death shouldn’t be too hard for an agent to learn. All rewards in the MineRLObtainDiamond-v0 environment are positive, so all Q-values/etc. are positive. End of episode has a fixed value of 0, so avoiding early termination of an episode strictly increases the expected reward obtained.
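To make that concrete, here is a minimal sketch (not competition code; the reward values are made up) showing why, with nonnegative rewards and a terminal value of 0, surviving longer can never decrease the discounted return:

```python
# Sketch: with nonnegative rewards, the discounted return of a longer
# episode is never smaller than the same reward prefix that ends in
# early termination, since termination contributes a fixed value of 0.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t; the terminal state adds nothing."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical rewards: an agent that dies after 3 steps vs. one that
# survives for 6 steps, collecting the same rewards along the way.
early = discounted_return([1.0, 0.0, 2.0])
longer = discounted_return([1.0, 0.0, 2.0, 0.0, 4.0, 1.0])
assert longer >= early  # surviving only adds nonnegative terms
```

So even without an explicit death penalty, a value-based agent is already incentivized to stay alive.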


Hi @BrandonHoughton, what if the RL method changes the reward automatically, i.e. the reward is not tweaked manually, but the algorithm modifies it as explained in, for example:

Reinforcement Learning with Unsupervised Auxiliary Tasks - DeepMind: https://arxiv.org/abs/1611.05397
Curiosity Driven Exploration - UC Berkeley: https://arxiv.org/abs/1705.05363
Hindsight Experience Replay - OpenAI: https://arxiv.org/abs/1707.01495
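For illustration, a curiosity-style intrinsic reward in the spirit of the second paper can be sketched as the prediction error of a forward model. The linear "model" and the bonus weight of 0.01 below are stand-in assumptions; the real ICM learns both a feature encoder and the forward model:

```python
import numpy as np

# Toy forward model mapping (state, action) -> predicted next state.
# A fixed random linear map stands in for a learned network here.
rng = np.random.default_rng(0)
STATE_DIM = 4
W = rng.normal(scale=0.1, size=(STATE_DIM + 1, STATE_DIM))

def intrinsic_reward(state, action, next_state):
    """Prediction error of the forward model, used as an exploration bonus."""
    x = np.append(state, action)                 # concatenate state and action
    pred = x @ W                                 # predicted next state
    return 0.5 * float(np.sum((pred - next_state) ** 2))

# The agent trains on environment reward plus the scaled bonus.
s, a, s_next = rng.normal(size=STATE_DIM), 1.0, rng.normal(size=STATE_DIM)
env_reward = 0.0
total_reward = env_reward + 0.01 * intrinsic_reward(s, a, s_next)
```

The key point for the rules question: the bonus is computed from the agent's own model, not from hand-coded knowledge of the environment.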



That would be allowed!


Hi @BrandonHoughton. Is it allowed to re-scale the rewards? Training with rewards spanning 1 to 1024 can become unstable.

Also, in a hierarchical setting, is it allowed to train low-level policies on a subset of rewards, if the subset is constant and not state dependent?


Hi @BrandonHoughton. If we use supervised learning algorithms to extract some rules from the MineRL dataset, and then keep these rules fixed (non-trainable) during reinforcement learning, is this permitted?