As far as I understand, we definitely cannot hardcode any decisions based on the state, which makes sense. We are allowed to shape rewards, but only as a function of the reward itself, not of the state.
Now I am also wondering to what extent we are allowed to filter the dataset based on the done flag and the action.
For example, I would like to:
- Filter out bad trajectories that never reach a diamond (based on the reward signal)
- Filter out transitions in which the player takes a noop action (the player is most likely crafting during that time, and a noop action never makes sense)
I assume this would be allowed?
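
To make the question concrete, here is a minimal sketch of the two filters I have in mind. Everything in it is an assumption for illustration, not the actual dataset API: transitions are plain dicts with `action`, `reward` and `done` keys, a noop is an action whose components are all zero, and `DIAMOND_REWARD` is a placeholder for whatever reward the environment gives for obtaining a diamond. Neither filter ever looks at the state.

```python
from typing import Any, Dict, List

# Hypothetical data layout: a trajectory is a list of transition dicts.
Transition = Dict[str, Any]
Trajectory = List[Transition]

# Assumed placeholder; check the environment's reward table for the real value.
DIAMOND_REWARD = 1024.0


def reaches_diamond(trajectory: Trajectory,
                    diamond_reward: float = DIAMOND_REWARD) -> bool:
    """True if any transition yields at least the diamond reward."""
    return any(t["reward"] >= diamond_reward for t in trajectory)


def is_noop(action: Dict[str, Any]) -> bool:
    """Assumes scalar action components, with 0 meaning 'not pressed'."""
    return all(value == 0 for value in action.values())


def filter_dataset(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Drop trajectories without a diamond, then drop noop transitions.

    Only the reward, action and done fields are read, never the state.
    A noop transition is kept if it is terminal, so the done flag survives.
    """
    kept = []
    for trajectory in trajectories:
        if not reaches_diamond(trajectory):
            continue
        kept.append([
            t for t in trajectory
            if t["done"] or not is_noop(t["action"])
        ])
    return kept


if __name__ == "__main__":
    good = [
        {"action": {"forward": 1, "attack": 0}, "reward": 0.0, "done": False},
        {"action": {"forward": 0, "attack": 0}, "reward": 0.0, "done": False},
        {"action": {"forward": 0, "attack": 1}, "reward": DIAMOND_REWARD, "done": True},
    ]
    bad = [
        {"action": {"forward": 1, "attack": 0}, "reward": 1.0, "done": True},
    ]
    filtered = filter_dataset([good, bad])
    print(len(filtered), len(filtered[0]))
```

One thing I am unsure about in this sketch: a dropped noop transition could still carry a non-zero reward, in which case that reward would have to be folded into a neighbouring transition rather than discarded.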