Hello,
Can you please check the policies dictionary of the sample target example provided for the N-1 and N-2 stages, i.e., the first and second elements of the policies list?

I am getting an exact match on the expected reward values, but there is a mismatch in the optimal actions at the N-1 and N-2 stages.

I’m unable to reproduce this, and if all the values are the same, the greedy policy should also be the same. Could you elaborate a bit more?

You could share the results you get without sharing your code.

I’d like to clarify that when two or more actions have the same expected value, I always take the first action in the list of possible actions; but for the shared input this tie-breaking is not relevant.
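For reference, this first-action tie-breaking falls out naturally if the greedy step uses `np.argmax`, which returns the first index attaining the maximum (the array below is illustrative, not from the assignment):

```python
import numpy as np

# Hypothetical Q-values for one state at one stage; actions 1 and 2 tie.
q_values = np.array([0.5, 0.7, 0.7])  # index = action

# np.argmax returns the FIRST index attaining the maximum,
# so ties are broken in favour of the earlier action.
best_action = int(np.argmax(q_values))
print(best_action)  # prints 1
```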

The optimal actions I get for stage N-1 are 1, 1, 1 and for N-2 are 1, 2, 2 (according to targets_0.npy, both should be 2, 2, 2).
From inspection of the MDP in the question, we can immediately see that 2, 2, 2 is wrong, at least for the N-1 case.

You are right, I’ve found the bug in my code: it’s a Python dictionary aliasing (call-by-reference) bug. Apologies for the confusion, and thanks to everyone who spotted the error.
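For anyone curious, the failure mode is likely the classic aliasing pattern below, where the values dictionary carried between stages is assigned rather than copied (names and values here are illustrative, not the actual assignment code):

```python
# Aliasing: both names point at the SAME dict object, so mutating the
# current stage silently changes the "previous stage" values as well.
values_prev = {"s1": 0.0, "s2": 0.0}
values_curr = values_prev          # bug: alias, not a copy
values_curr["s1"] = 1.0
print(values_prev["s1"])           # prints 1.0 -- previous stage corrupted

# Fix: take an explicit copy per stage so the stages stay independent.
values_prev = {"s1": 0.0, "s2": 0.0}
values_curr = dict(values_prev)    # independent shallow copy
values_curr["s1"] = 1.0
print(values_prev["s1"])           # prints 0.0 -- previous stage intact
```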

I’ll fix it soon and update here once done. Redeploying takes a while, so thanks for your patience.

I’ll also check for any similar errors in the VI problem.