So far I have understood that we definitely cannot hardcode any decisions based on the state (which makes sense), and that we are allowed to shape rewards, but only on the basis of the reward itself, not as a function of the state.
Now I am also wondering to what extent we are allowed to filter the dataset based on the done flag and the action.
For example, I would like to:
- Filter out bad trajectories which do not reach a diamond (based on the reward signal)
- Filter out transitions in which the player takes a noop action (as the player is most likely crafting during that time, and a noop action never makes sense)
I assume this would be allowed?
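Concretely, both filters could look something like this. This is only a rough sketch: it assumes the MineRL data pipeline exposes `get_trajectory_names()` and per-step `load_data()` tuples, and the 1024 diamond reward value and the noop defaults are taken from the ObtainDiamond spec as I understand it, so treat those as assumptions:

```python
import minerl
import numpy as np

DIAMOND_REWARD = 1024  # assumed reward granted for obtaining a diamond

def is_noop(action):
    """True if every component of the action dict is at its default (no-op) value."""
    for key, value in action.items():
        if key == 'camera':
            if np.any(np.asarray(value) != 0):  # camera deltas default to (0, 0)
                return False
        elif isinstance(value, str):
            if value != 'none':                 # enum actions (craft, place, ...) default to 'none'
                return False
        elif value != 0:                        # binary actions default to 0
            return False
    return True

data = minerl.data.make('MineRLObtainDiamond-v0', data_dir='data')

good_transitions = []
for name in data.get_trajectory_names():
    transitions = list(data.load_data(name))    # per-step (s, a, r, s', done) tuples
    if max((t[2] for t in transitions), default=0) < DIAMOND_REWARD:
        continue                                # trajectory never reached a diamond
    good_transitions.extend(t for t in transitions if not is_noop(t[1]))
```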
Yes - what you described are perfectly valid filtering steps. In addition, if you add the ability to generate a blacklist from your rules, we can build a way to feed that blacklist to the data loader so those trajectories never appear in the `sarsd_iter` stream!
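Something like this is what I have in mind (just a sketch of the idea, not an existing loader feature; the one-name-per-line `blacklist.txt` format and the helper names are made up):

```python
def write_blacklist(data, keep_rule, path='blacklist.txt'):
    """Write the names of all trajectories that fail `keep_rule` to a file."""
    with open(path, 'w') as f:
        for name in data.get_trajectory_names():
            if not keep_rule(list(data.load_data(name))):
                f.write(name + '\n')

def iter_whitelisted(data, path='blacklist.txt'):
    """Iterate transitions from all trajectories that are not blacklisted."""
    with open(path) as f:
        blacklist = {line.strip() for line in f}
    for name in data.get_trajectory_names():
        if name not in blacklist:
            yield from data.load_data(name)
```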
I’m not sure everyone would want to do the same filtering as I do. It also seems hard to build a blacklist for my case, as I want to iterate backwards from the end of each episode to the last received reward, which trims individual transitions rather than whole trajectories.
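To make that trimming concrete, this is roughly what I mean (a sketch over per-trajectory (s, a, r, s', done) tuples; discarding episodes that received no reward at all is my own choice here):

```python
def trim_after_last_reward(transitions):
    """Drop every transition after the last nonzero reward of an episode."""
    last = max((i for i, t in enumerate(transitions) if t[2] > 0), default=-1)
    if last < 0:
        return []          # no reward received at all -> discard the whole episode
    return transitions[:last + 1]
```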
I have an additional question: since iterating over the dataset takes a long time (and the dataset is huge with all the videos), would it be possible to preprocess the data according to our needs and then store it in the repo? I guess that would be problematic due to the size (although it would be much less than the full 15 GB), but maybe you can provide an official way to do it?
I just don’t want to step through all observations with `sarsd_iter` every time I start my program. Also, for the 4-day evaluation it would be annoying if the first hours went just into loading the expert data.
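For now I would probably just cache the result of the one slow pass locally, something like this (a sketch; the cache file name is made up, and pickling raw video observations may well be too large, so one would cache only the preprocessed transitions):

```python
import os
import pickle

CACHE_PATH = 'preprocessed_transitions.pkl'  # hypothetical local cache file

def load_or_build(data, build_fn):
    """Return cached preprocessed transitions, building them once if absent."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, 'rb') as f:
            return pickle.load(f)
    transitions = build_fn(data)     # the one expensive pass over the dataset
    with open(CACHE_PATH, 'wb') as f:
        pickle.dump(transitions, f)
    return transitions
```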