Problem with pytorch multiprocessing

kim135797531 · November 24, 2019, 5:55pm

The current local_evaluation.py has no if __name__ == "__main__": guard.

This means users can't share CUDA tensors between Python subprocesses with torch.multiprocessing, because we can't call mp.set_start_method("forkserver").

Even if we could use multiprocessing, real_robots (0.1.16) never calls the end_extrinsic_phase() method, so we have no way of knowing when to terminate our subprocesses. (Maybe a bug?)
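For readers unfamiliar with the guard, here is a minimal sketch of the pattern being asked for. It uses the standard library multiprocessing module, whose set_start_method call torch.multiprocessing mirrors; the names worker and run_workers are hypothetical and are not part of real_robots or local_evaluation.py. Note that "forkserver" is only available on Unix.

```python
import multiprocessing as mp  # torch.multiprocessing exposes the same start-method API


def worker(counter):
    # Hypothetical worker: increments a shared integer under its lock.
    with counter.get_lock():
        counter.value += 1


def run_workers(n_workers=2):
    # Shared value created in the parent and passed explicitly to each child.
    counter = mp.Value("i", 0)
    procs = [mp.Process(target=worker, args=(counter,)) for _ in range(n_workers)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join(timeout=30)
    return counter.value


if __name__ == "__main__":
    # Runs only in the launching process. Children created with "spawn" or
    # "forkserver" re-import this file under another module name, so they
    # skip this block and never call set_start_method a second time.
    mp.set_start_method("forkserver")
    print(run_workers())
```

Without the guard, the top-level code would run again in every child that re-imports the script, which is why the start method cannot safely be set there.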

ec_ai · November 25, 2019, 7:31am

Dear kim135797531,
I am not familiar with mp.set_start_method, but as far as I can see, the reason to call it under main is to ensure it is called only once.
So a possible workaround is to use an empty file as a lock:
before you invoke set_start_method, check whether the file is present. If it is not, create it and call set_start_method; if it is, another process has already called set_start_method.
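As a related sketch (not the lock-file approach itself), the standard library can also report whether a start method has already been fixed in the current interpreter. The helper name ensure_start_method is hypothetical, and this check only covers one interpreter; it does not coordinate between separate processes.

```python
import multiprocessing as mp


def ensure_start_method(method="forkserver"):
    # allow_none=True returns None when no start method has been fixed yet,
    # instead of silently fixing the platform default.
    current = mp.get_start_method(allow_none=True)
    if current is None:
        mp.set_start_method(method)
        return method
    # Already set in this interpreter: leave it alone and report it.
    return current
```

Calling ensure_start_method repeatedly is harmless, because set_start_method is only reached the first time.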

Even if we could use multiprocessing, real_robots (0.1.16) never calls the end_extrinsic_phase() method, so we have no way of knowing when to terminate our subprocesses. (Maybe a bug?)

Yes, it is a bug. I see now that at the end of the extrinsic phase, end_extrinsic_trial() is called again instead of end_extrinsic_phase().
It is too late now to release a fix (it might be disruptive with just a few hours left for the final submissions); however, you can still catch the end of the extrinsic phase by detecting when end_extrinsic_trial is called twice in a row.
The controller code might look like this:

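# Methods of your controller class.

def start_extrinsic_trial(self):
    # A new trial has started, so reset the count of consecutive trial ends.
    self.trial_ends = 0

def end_extrinsic_trial(self):
    self.trial_ends += 1
    # A second end without a start in between marks the end of the phase.
    if self.trial_ends > 1:
        self.end_extrinsic_phase()

def end_extrinsic_phase(self):
    # Clean up here, e.g. terminate your subprocesses.
    print("Extrinsic phase has ended!")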
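
To check the counting logic without running the simulator, the workaround can be exercised with a small stand-in class. This is only a sketch: DemoController and simulate are hypothetical names, and the call order in simulate is an assumption based on the behaviour described above for real_robots 0.1.16.

```python
class DemoController:
    """Hypothetical stand-in for a controller; not the real_robots base class."""

    def __init__(self):
        self.trial_ends = 0
        self.phase_ended = False

    def start_extrinsic_trial(self):
        # Reset the count of consecutive trial ends.
        self.trial_ends = 0

    def end_extrinsic_trial(self):
        self.trial_ends += 1
        # Two ends in a row (no start in between) signal the end of the phase.
        if self.trial_ends > 1:
            self.end_extrinsic_phase()

    def end_extrinsic_phase(self):
        self.phase_ended = True


def simulate(controller, n_trials=3):
    # Assumed call order: each trial is a start followed by an end.
    for _ in range(n_trials):
        controller.start_extrinsic_trial()
        controller.end_extrinsic_trial()
    # The extra call made instead of end_extrinsic_phase().
    controller.end_extrinsic_trial()
    return controller.phase_ended
```

Calling simulate(DemoController()) walks through three trials followed by the extra end_extrinsic_trial call, after which phase_ended is set.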

kim135797531 · November 26, 2019, 7:30am

Thanks for the answers. With the ‘spawn’ or ‘forkserver’ start methods, Python reloads the script whenever it receives a request to create a subprocess, which leads to an unavoidable re-execution of the ‘EvaluationService’ for real_robots in a separate Python interpreter. So I think a lock file can’t prevent this kind of problem.
(https://docs.python.org/3/library/multiprocessing.html)

Anyway, we gave up on training our agent in parallel on the remote scoring environment, since our agent still tended to settle on a ‘do nothing’ policy.

Our team's main challenge was getting our VAE to learn the environment correctly. The original VAE (or β-VAE, etc.) couldn’t distinguish the objects because interactions were rare.

We tried to make the robot arms approach the objects (using an intrinsic algorithm) to produce more varied changes in the dataset. At the same time, we tried to improve the VAE so that it could learn latent information from a rarely changing dataset.

Even with a better VAE, training RL on the retina and joint positions was difficult. The main reason was that the goal images don’t include the robot’s shape.

It was a great experience to take part in this competition. Thanks to everyone who organized this challenge!