May I ask if we are allowed to train a meta-controller from demo data and then keep it fixed while training a policy?
And can we use the number of steps elapsed as a meta-controller signal (e.g. after 8000 steps in the env, stop chopping trees and start mining stone)? The number of steps elapsed is not part of the observation.
Learning a meta-controller through unsupervised means such as option extraction, and then using that meta-controller to orchestrate learning for new policies, sounds like a great idea and is permitted!
While the number of steps is not an observation, it is a function of the agent's state. You may be able to build a meta-controller with zero transition probability back to an initial state and start the agent in that state, but encoding transitions that trigger after a specified number of time-steps is not permitted.
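To illustrate the distinction, here is a minimal sketch of a meta-controller that switches sub-policies based only on the current observation, not on elapsed time. The observation layout (`obs["inventory"]`), the item names, the threshold, and the sub-policy labels are all hypothetical example values, not part of any specific environment API:

```python
def meta_controller(obs):
    """Select a sub-policy from the current observation alone."""
    # Switch from wood-gathering to stone-mining once enough logs are held.
    # The threshold (8 logs) is an arbitrary illustrative value.
    if obs.get("inventory", {}).get("log", 0) >= 8:
        return "mine_stone"
    return "chop_trees"

# The disallowed alternative would condition on a step counter instead:
#   if steps_elapsed >= 8000: return "mine_stone"
# which is not permitted, since elapsed steps are not in the observation.
```

The permitted version conditions only on observable state, so the same switching behavior emerges whenever the agent's inventory reaches the threshold, regardless of how many steps that takes.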