If the episode terminates before I reach the max step count or max reward, is adding a negative reward to signal death (by drowning or being killed) considered tweaking the reward?
This seems like a natural thing to allow in a real-world context. If a robot manipulation task terminates prematurely, say because an experimenter has to step in as the robot is wreaking havoc, it seems natural to automatically classify the terminal state as negative.
Hi jonchuang - Any manual reward shaping based on the environment, including shaping based on how the agent terminated the episode, is not allowed in this iteration of the competition.
Part of the challenge of the competition is to learn agents without hard-coding features or rewards, by leveraging human demonstrations instead. By relying on methods tied neither to this specific environment nor to domain knowledge, results from this competition will be more widely applicable to other problems.
Luckily, avoiding death shouldn’t be too hard for an agent to learn. All rewards in the MineRLObtainDiamond-v0 environment are positive, so all Q-values (and similar value estimates) are non-negative. The end of an episode has a fixed value of 0, so avoiding early termination strictly increases the expected return.
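To illustrate the argument (a rough sketch in generic Q-learning terms, not competition code):

```python
import numpy as np

# Rough sketch: with non-negative rewards, the standard one-step Q-learning
# target already makes early termination unattractive, because terminal
# transitions bootstrap with a value of 0.
def td_target(reward, next_q_values, done, gamma=0.99):
    bootstrap = 0.0 if done else gamma * np.max(next_q_values)
    return reward + bootstrap

# Staying alive keeps a non-negative bootstrap term; dying fixes it at 0,
# so no extra "death penalty" is needed for the agent to prefer survival.
alive_target = td_target(reward=0.0, next_q_values=np.array([0.5, 1.2]), done=False)
dead_target = td_target(reward=0.0, next_q_values=np.array([0.5, 1.2]), done=True)
assert alive_target >= dead_target
```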
Hi @BrandonHoughton, what if the RL method changes the reward automatically, i.e. the reward is not tweaked manually but the algorithm changes it, as explained in, for example:
Hi @BrandonHoughton. If we use supervised learning algorithms to extract some rules from the MineRL dataset, and then keep these rules fixed (non-trainable) during the reinforcement learning process, is this permitted?
I would also be interested in that, although I assume the answer is no.
I also have a related question regarding rewards: are we allowed to filter out rewards from the dataset of non-obtain_diamond_sparse replays so that the reward function equals that of the evaluation environment?
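Concretely, I mean something like this (a rough sketch; the milestone reward values are my own placeholders, not taken from the official spec):

```python
# Keep only reward values that the evaluation environment would also give,
# and zero out everything else. DIAMOND_MILESTONE_REWARDS is a placeholder;
# the real values should come from the MineRLObtainDiamond-v0 spec.
DIAMOND_MILESTONE_REWARDS = {1, 2, 4, 8, 16, 32, 64, 128, 256, 1024}

def filter_rewards(trajectory):
    """trajectory: iterable of (obs, action, reward, next_obs, done) tuples."""
    for obs, action, reward, next_obs, done in trajectory:
        if reward not in DIAMOND_MILESTONE_REWARDS:
            reward = 0.0  # zero out rewards the evaluation env would not give
        yield obs, action, reward, next_obs, done
```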
And while you are answering these questions, I am also wondering about the neural network architecture: can it be modified to fit the observation dict? For example, I would like to use an inventory embedding layer in specific places in my architecture. This is still an end-to-end RL approach, but it does specialize somewhat to the MineRL environment.
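Something along these lines (a hypothetical sketch; the item count and layer sizes are my own assumptions, not the actual MineRL observation spec):

```python
import torch
import torch.nn as nn

class InventoryEmbedding(nn.Module):
    """Hypothetical sketch: embed the flattened inventory counts from the
    observation dict so they can be concatenated with visual features
    elsewhere in the network."""
    def __init__(self, num_items=18, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_items, embed_dim), nn.ReLU())

    def forward(self, inventory_counts):  # shape: (batch, num_items)
        # log1p keeps large stack counts from dominating the embedding
        return self.net(torch.log1p(inventory_counts.float()))
```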
Thanks for the ping! In general, the rule is that any reward shaping cannot depend on the agent’s state. Scaling, including zeroing out rewards, is fine as long as it is not used to somehow take advantage of the agent’s state to shape the rewards.
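To illustrate the distinction (a rough sketch, not official competition code), a state-independent scaling wrapper like the one below is fine, whereas anything that inspects the observation to decide the reward is not:

```python
import gym

class ScaledReward(gym.RewardWrapper):
    """State-independent reward scaling: fine, because it never inspects
    the observation or any other part of the agent's state."""
    def __init__(self, env, scale=0.01):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return self.scale * reward

# By contrast, something like
#   reward - 10.0 if obs["inventory"]["log"] == 0 else reward
# depends on the agent's state and would count as reward shaping.
```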
@weel2019 How are you intending to supervise the learning? We intended for learning features from human demonstrations to be an integral part of the competition! However, we don’t allow external datasets to be included in the learning procedure, so, for example, hand-annotating 1000 frames with new labels is not allowed. Furthermore, in round 2 we modify Minecraft with a new texture pack, so any annotations would not be valid or carry over.
Note that even when parameters are learned, they can still count as hard-coded if they are used to fix a policy or meta-controller. For example, if you use summaries of the dataset to encode constants, e.g. max(num_diamonds * c) = c, that would still count as hard-coding.
Okay, great. So it seems my reward shaping would be allowed.
What about removing the “done” flags? For example, I would like to use the ObtainIronPickaxe data, but it is a bit misleading to tell the agent that the task is done when the pickaxe is obtained if I want the agent to learn to go on and get a diamond later.
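Roughly, I mean something like this (a hypothetical sketch of a demonstration loader; the tuple layout is an assumption, not the actual MineRL data API):

```python
# Treat the end of an ObtainIronPickaxe demonstration as a truncation rather
# than a true terminal state, so value estimates still bootstrap from the
# final observation.
def relabel_dones(trajectory):
    """trajectory: iterable of (obs, action, reward, next_obs, done) tuples."""
    for obs, action, reward, next_obs, done in trajectory:
        # Crafting the iron pickaxe ends the demo, but it is not "task complete"
        # for ObtainDiamond, so never mark the transition as terminal.
        yield obs, action, reward, next_obs, False
```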