Division of datasets based on reward

Is it possible to divide the datasets based on the reward?
If that is possible, participants could take approaches such as the following:

・Have the agent learn from the divided datasets in turn.

・Create multiple sub-agents and have each sub-agent learn one of the divided datasets, then switch between the sub-agents with a single master-agent.
(Similar to the approach of the team that finished 7th at MineRL 2019, https://openai.com/blog/learning-a-hierarchy/, and so on. A rough sketch of the dataset split I have in mind is shown below.)
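As an illustration of what I mean by "dividing the datasets", here is a minimal Python sketch. The `(trajectory, total_reward)` episode format and the reward thresholds are made up for illustration and are not taken from the actual dataset:

```python
# Hypothetical sketch: split demonstration episodes into buckets by their
# cumulative reward. `episodes` is assumed to be a list of
# (trajectory, total_reward) pairs; the thresholds are illustrative only.
from collections import defaultdict

REWARD_THRESHOLDS = [0, 32, 128]  # example cut points, not real values


def bucket_for(total_reward):
    """Return the index of the highest threshold reached by the reward."""
    bucket = 0
    for i, threshold in enumerate(REWARD_THRESHOLDS):
        if total_reward >= threshold:
            bucket = i
    return bucket


def split_by_reward(episodes):
    """Group episodes into sub-datasets keyed by reward bucket."""
    buckets = defaultdict(list)
    for trajectory, total_reward in episodes:
        buckets[bucket_for(total_reward)].append(trajectory)
    return dict(buckets)
```

Each sub-agent would then be trained on one of the returned buckets.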

Do these approaches of dividing the datasets based on reward violate the rules?


Hi @michael-tanaka,

Thanks for your question! This year we decided that you CAN use reward when learning the action distribution from human demonstrations. For example, it is permitted to learn the joint distribution over reward and human actions and to condition on reward when sampling from it.
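To make that concrete, here is a minimal sketch of a reward-conditioned policy learned from demonstrations. The network sizes, the flat observation features, and the dummy training step are illustrative assumptions, not a reference implementation:

```python
# Minimal sketch (not official reference code): a reward-conditioned
# behavioural-cloning policy. Observation encoding, action space size and
# the training loop are illustrative assumptions.
import torch
import torch.nn as nn


class RewardConditionedPolicy(nn.Module):
    """Predicts an action distribution from an observation and a reward."""

    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden),  # +1 for the scalar reward input
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, reward):
        # Condition on reward by concatenating it to the observation features.
        x = torch.cat([obs, reward.unsqueeze(-1)], dim=-1)
        return self.net(x)  # logits over a discretised action set


# Behavioural cloning on human demos: cross-entropy between predicted logits
# and the demonstrated action, with the episode reward as the condition.
policy = RewardConditionedPolicy(obs_dim=64, n_actions=10)
obs = torch.randn(32, 64)              # dummy batch of observation features
reward = torch.rand(32)                # dummy rewards attached to the demos
actions = torch.randint(0, 10, (32,))  # dummy demonstrated actions
loss = nn.functional.cross_entropy(policy(obs, reward), actions)
loss.backward()
```

At rollout time the same network would be queried with the reward observed so far, so the conditioning is learned rather than hand-coded.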

What you are describing, however, sounds like a hard-coded meta-controller, as the policy is dictated by hand-coded reward thresholds.

One option to mitigate this would simply be to learn a meta-controller that only observes reward and chooses among a fixed number of policies. You could then weight demonstrations by their reward so that the sampling distribution over reward levels is uniform. A rough sketch of both ideas is given below.
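As a sketch only, under assumed names and sizes that are not part of the rules, the learned gate and the reward-based weighting could look roughly like this:

```python
# Illustrative sketch: a small learned gate that sees only the cumulative
# reward and outputs a distribution over a fixed number of sub-policies,
# plus reward-based weights for roughly uniform sampling of demonstrations.
import torch
import torch.nn as nn


class RewardMetaController(nn.Module):
    """Chooses among `n_policies` sub-policies from the reward alone."""

    def __init__(self, n_policies, hidden=32):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_policies),
        )

    def forward(self, cumulative_reward):
        # cumulative_reward: shape (batch,); the gate is learned, not hand-coded.
        logits = self.gate(cumulative_reward.unsqueeze(-1))
        return torch.distributions.Categorical(logits=logits)


def uniform_reward_weights(episode_rewards):
    """Weight each episode inversely to how common its reward value is."""
    rewards = torch.as_tensor(episode_rewards, dtype=torch.float32)
    _, inverse, counts = torch.unique(
        rewards, return_inverse=True, return_counts=True
    )
    return 1.0 / counts[inverse].float()
```

The weights could then be fed to an ordinary weighted sampler when drawing demonstrations for training.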


Hi @BrandonHoughton,

Just to confirm, is it against the rules to split the human demonstrations with hard-coded (if-then) rules based on the accumulated rewards?