Can I train a metacontroller on expert data that has been manually divided into several stages by reward, or does this count as hard coding?
E.g., stage 1 is the period before any reward has been received.
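To make the question concrete, here is a minimal sketch of what splitting by reward could look like. It assumes each expert trajectory is available as a plain list of per-step rewards; the function name `label_stages` and the rule "a new stage starts after every nonzero reward" are illustrative assumptions, not anything defined by the rules.

```python
from typing import List


def label_stages(rewards: List[float]) -> List[int]:
    """Assign a stage number to every timestep of one expert trajectory.

    Hypothetical rule: stage 1 covers all steps up to and including the
    first nonzero reward, and the stage number increases by one after
    each step that carries a nonzero reward.
    """
    stages = []
    stage = 1
    for r in rewards:
        stages.append(stage)  # the rewarded step closes the current stage
        if r != 0:
            stage += 1
    return stages


if __name__ == "__main__":
    # Two reward events split the trajectory into three stages.
    print(label_stages([0, 0, 1, 0, 2, 0]))
```

The stage labels would then serve as training targets for the metacontroller. Note that they are derived purely from the reward signal in the expert data.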
I guess it is not allowed. The reward is closely tied to the inventory, so using reward to split stages is equivalent to using the inventory observation.