Training a metacontroller on expert data that is manually divided into stages by reward

Can I train a metacontroller on expert data that I have manually divided into several stages based on reward, or does this count as hard-coding?

E.g., stage 1 would be the period before the agent has received any reward.
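
To make the idea concrete, here is a minimal sketch (in Python, with a hypothetical trajectory format) of the kind of splitting I mean: a stage ends whenever a nonzero reward is received.

```python
# Minimal sketch (hypothetical data format): split an expert trajectory
# into stages at the timesteps where a nonzero reward is received.
# Stage 1 is the prefix before the first reward, stage 2 runs until the
# second reward, and so on.

def split_by_reward(trajectory):
    """trajectory: list of (obs, action, reward) tuples."""
    stages = []
    current = []
    for obs, action, reward in trajectory:
        current.append((obs, action, reward))
        if reward > 0:          # a reward marks the end of the current stage
            stages.append(current)
            current = []
    if current:                 # trailing steps after the last reward
        stages.append(current)
    return stages

# Each stage would then get its own sub-policy, with the metacontroller
# trained to choose among them.
```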

I would guess this is not allowed. The reward is closely tied to the inventory, so splitting stages by reward is effectively the same as conditioning on the inventory observation.
