Local evaluation is very slow and uses a large amount of GPU memory

The local evaluation process is very slow and uses a large amount of GPU memory.
I have four 80 GB A100 GPUs, and the evaluation process takes up almost all of the memory on the first GPU.
Is anyone else experiencing similar issues? Could it be related to the CUDA version?
I am using CUDA 11.6 and otherwise followed the standard procedures provided. Training and evaluating a random policy both run fine, but evaluation with a trained policy takes forever to finish.
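
For anyone trying to reproduce this, per-GPU memory use can be checked with something like the sketch below. It assumes `nvidia-smi` is on the PATH and only reads its CSV query output. The helper names `parse_gpu_memory` and `read_gpu_memory` are made up for illustration and are not part of the evaluation code.

```python
import subprocess

# nvidia-smi query: one CSV line per GPU with index, used and total memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' CSV lines (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def read_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Example usage (requires an NVIDIA driver and nvidia-smi):
# for gpu in read_gpu_memory():
#     print(f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB")
```

Running the commented loop once before and once during evaluation should show whether only GPU 0 fills up while the other three stay idle.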