What is considered an internal reward?


If the episode terminates before I reach the max step count or max reward, is adding a negative reward to signal death (by drowning or being killed) considered tweaking the reward?

This seems like a natural thing to allow in a real-world context. If a robot manipulation task terminates prematurely, say because an experimenter has to step in as the robot is wreaking havoc, it seems natural to automatically classify the terminal state as negative.


Hi jonchuang - Any manual reward shaping based on the environment, including how the agent terminated the episode, is not allowed in this iteration of the competition.
Part of the complexity of the competition is to learn agents without hard-coding features or rewards by leveraging human demonstrations. By relying on methods not tied to this specific environment nor on domain knowledge, results from this competition will be more widely applicable to other problems.
Luckily, avoiding death shouldn’t be too hard for an agent to learn. All rewards in the MineRLObtainDiamond-v0 environment are positive, so all Q-values/etc. are positive. The end of an episode has a fixed value of 0, so avoiding early termination of an episode strictly increases the expected reward obtained.
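To see why no explicit death penalty is needed, here is a minimal sketch (with made-up reward and value numbers) of the standard one-step Q-learning target: terminal states bootstrap to 0, so with nonnegative rewards a transition that ends the episode can never have a higher target than the same transition continuing.

```python
GAMMA = 0.99  # illustrative discount factor, not a competition setting

def q_target(reward, done, max_next_q):
    """Standard one-step TD target; terminal states bootstrap to 0."""
    return reward + (0.0 if done else GAMMA * max_next_q)

# Same reward, same successor values -- dying just zeroes out the future.
surviving = q_target(reward=1.0, done=False, max_next_q=5.0)
dying = q_target(reward=1.0, done=True, max_next_q=5.0)
assert surviving >= dying
```

Because the gap comes entirely from the discounted future term, the agent learns to avoid termination without any hand-added negative reward.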


Hi @BrandonHoughton, what if the RL method changes the reward automatically, i.e. the reward is not tweaked manually but is modified by the algorithm itself, as explained in, for example:

Reinforcement Learning with Unsupervised Auxiliary Tasks - DeepMind: https://arxiv.org/abs/1611.05397
Curiosity Driven Exploration - UC Berkeley: https://arxiv.org/abs/1705.05363
Hindsight Experience Replay - OpenAI: https://arxiv.org/abs/1707.01495
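As an illustration of what "the algorithm changes the reward" means here, below is a toy curiosity-style bonus in the spirit of Pathak et al. (2017): the extrinsic reward is augmented by the agent's own prediction error, with no environment-specific rule. The linear "forward model" `W` and the feature vectors are stand-ins for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))  # toy stand-in for a learned forward model

def intrinsic_bonus(phi_s, phi_s_next, scale=0.1):
    """Squared prediction error of the forward model as an exploration bonus.

    A real ICM would also condition the prediction on the action taken.
    """
    pred = W @ phi_s
    return scale * float(np.sum((pred - phi_s_next) ** 2))

# The bonus is always nonnegative and depends only on the model's error,
# not on any hand-written knowledge of the environment.
total_reward = 0.0 + intrinsic_bonus(rng.normal(size=4), rng.normal(size=4))
```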



That would be allowed!


Hi @BrandonHoughton. Is it allowed to re-scale the rewards? Rewards ranging from 1 to 1024 can make training unstable.

Also, in a hierarchical setting, is it allowed to train low-level policies on a subset of rewards, if the subset is constant and not state dependent?


Hi @BrandonHoughton. If we use supervised learning algorithms to extract some rules from the MineRL dataset, and then set these rules as non-trainable during the reinforcement learning process, is this permitted?



I would also be interested in that, although I assume the answer is no.

I also have a related question regarding rewards: are we allowed to filter out rewards from the dataset of non-obtain_diamond_sparse replays such that the reward function equals the reward function of the evaluation environment?

And since you are already answering these questions, I am also wondering about the neural network architecture: can it be modified to fit the observation dict? For example, I would like to use the inventory embedding layer in some specific places of my architecture. This is an end-to-end RL approach, but it still specializes somewhat to the MineRL env.


@william_guss or @BrandonHoughton could we get some clarifications in this thread?


Thanks for the ping! In general, the rule is that any reward shaping cannot be dependent on the agent's state. Scaling, including zeroing rewards, is fine as long as it is not used to somehow take advantage of the agent's state to shape the rewards.

@weel2019 How are you intending to supervise the learning? We intended learning features from human demonstration data to be an integral part of the competition! However, we don't allow external datasets to be included in the learning procedure, so, for example, hand-annotating 1000 frames with new labels is not allowed. Furthermore, in round 2 we modify Minecraft with a new texture pack, so any annotations would not be valid or carry over.

Note that even when parameters are learned, they can still count as hard-coded if they are used to fix a policy or meta-controller. For example, if you use summaries of the dataset to encode constants, e.g. max(num_diamonds * c) = c, this would still count as hard-coding.


Okay great. So then my reward shaping would be allowed it seems.

What about removing "done"s? For example, I would like to use the ObtainIronPickaxe data, but it is a bit misleading to tell the agent that the task is done when the pickaxe is obtained, if I want the agent to learn to get a diamond later on.
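A minimal sketch of this relabeling, assuming a hypothetical transition layout of dicts with "reward" and "done" keys (not the actual MineRL data format): clear the terminal flag on transitions where the demonstration ended only because the iron-pickaxe goal was reached, so the agent does not learn to treat that state as terminal for the diamond task.

```python
def relabel_done(transitions):
    """Return a copy of the trajectory with all terminal flags cleared.

    transitions: list of dicts with at least "reward" and "done" keys
    (a hypothetical layout for illustration).
    """
    out = []
    for t in transitions:
        t = dict(t)        # don't mutate the original transition
        if t["done"]:      # episode ended at the iron-pickaxe goal...
            t["done"] = False  # ...but it is not terminal for ObtainDiamond
        out.append(t)
    return out

# Toy demonstration ending when the pickaxe reward is collected.
demo = [{"reward": 0.0, "done": False}, {"reward": 256.0, "done": True}]
relabeled = relabel_done(demo)
assert all(not t["done"] for t in relabeled)
```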


Sure - that's a reasonable mapping that lets you properly take advantage of the other environments for training!