With the TorchBeast system, I understand I have access, at each step, to the environment and the actor state in the inference function while learning, but the batch size is dynamic and I have found no identifier that would let me know which environment/actor is in the batch.
It seems all of batch.get_inputs() is managed by the torchbeast library, and I would rather not recompile it.
Does anyone have a smart hack?
Hey Olivier,
Thanks for your question. Could you elaborate on what it is you are trying to do?
Typically, it’s easier to do things on the learner side, but it depends on what you are trying to achieve.
My idea was to make something like an “assisted AI”. Instead of waiting millions of steps for it to learn to avoid climbing the entrance stairs, eating kobold corpses, and walking into walls, I would like to make the actor able to choose among not-obviously-stupid actions.
So for this I would like to have some state attached to the actor or the environment (where I would store things like “the stairs down are there even if a bow is dropped on them”, the intrinsics, etc.) and to alter the “call to inference”. Obviously this prevents the model from learning about some actions (for example, when asked y/n at “Beware, there will be no return”, I force Direction.SE while the actor happily samples randomly over the action space), but I am not sure there is any value, in this case, in letting it try something else.
So while training and testing I want to be able to “help choose sensible options”.
Note that while training, overriding the environment’s step() function does not do what I want, because I want the model to learn the forced action, not the original one.
I understand I should use a gym wrapper around the environment; this seems to be an allowed method. I still have to see whether, this way, I will be able to add one more key to the observations (for example, a uuid computed at reset() time).
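For illustration, a minimal sketch of such a wrapper (assuming the classic gym API and dict observations as in NLE; the key name "env_uuid" is arbitrary, and if the batching pipeline only accepts numeric arrays, a per-instance integer id may be easier than a uuid string):

```python
import uuid

import gym


class EnvIdWrapper(gym.Wrapper):
    """Attach a per-episode identifier to every observation dict."""

    def reset(self, **kwargs):
        # Fresh id at every reset(); move this to __init__ if you want one id
        # per environment instance rather than per episode.
        self._env_uuid = uuid.uuid4().hex
        obs = self.env.reset(**kwargs)
        obs["env_uuid"] = self._env_uuid
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        obs["env_uuid"] = self._env_uuid
        return obs, reward, done, info
```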
Hey Olivier,
Yes, that makes sense. To do something like that, you’d have to change the logic in the actor. As you suggest, one way of doing that is to wrap the environment and add your custom logic to the wrapper, even if it’s only an id identifying the specific environment, and then change how the actor function handles the environment based on that.
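As a rough sketch of that idea (this is not TorchBeast’s actual API, just generic Python): keep per-environment state keyed by the id the wrapper injects, and substitute the forced action before it is stepped and written into the rollout, so the learner trains on the forced action rather than the original sample.

```python
# Generic sketch, independent of TorchBeast internals.
per_env_state = {}  # env_uuid -> whatever you track (stairs seen, intrinsics, ...)


def forced_action(state, observation, sampled_action):
    """Placeholder for your domain logic (e.g., don't climb the entrance stairs).

    Return an action to force, or None to keep the policy's sample.
    """
    return None


def choose_action(env_uuid, observation, sampled_action):
    state = per_env_state.setdefault(env_uuid, {})
    forced = forced_action(state, observation, sampled_action)
    # The returned action is the one that should be stepped *and* stored in the
    # rollout, so the learner sees the forced action, not the original sample.
    return sampled_action if forced is None else forced
```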
You could also consider not using the asynchronous setup of TorchBeast and doing something more direct (e.g., batched A2C, standard PPO, etc.) instead, where the actor/learner split doesn’t happen in the same way, and batch entry i always refers to the same environment instance.
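A schematic of that synchronous setup (not A2C/PPO itself, just the rollout loop; CartPole and random actions are stand-ins for your environments and policy, and the classic gym step API is assumed):

```python
import gym

# Index i in every batch always refers to envs[i], so per-environment
# bookkeeping can live in a simple parallel list.
envs = [gym.make("CartPole-v1") for _ in range(8)]
states = [{} for _ in envs]
observations = [env.reset() for env in envs]

for _ in range(100):
    # A real agent would compute actions from a policy over the batched
    # observations; random sampling keeps the sketch self-contained.
    actions = [env.action_space.sample() for env in envs]
    for i, env in enumerate(envs):
        obs, reward, done, info = env.step(actions[i])
        states[i]["steps"] = states[i].get("steps", 0) + 1
        observations[i] = env.reset() if done else obs
```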