RL: questions about the 2nd assignment.  

Questions related to the 2nd assignment

This page contains questions from students and my answers to them. If a question is asked that has already been answered, then I will not post another answer. So if your question seems to go unanswered, you should check over the previous questions.

Q1
I program in Java. Should I name my programs via.ext and pia.ext or via.java and pia.java?
A1
Name your programs according to the programming language that you use; i.e., name them via.java and pia.java if you program in Java, or via.c and pia.c if you program in C. For other programming languages, replace "ext" according to the standard naming convention of that language. Remember to compile and run your programs on moon under Linux before you submit them, because the TA will run them on moon. Also, make sure that the TA can use his own file for testing.

Q2
Where can I find an example of a C implementation of the policy iteration algorithm?
A2
You can read Example 4.3: Gambler's Problem and its solution in Handouts.

Q3
I need a bit of clarification. In the programming part of the assignment, you say:
"In both cases, output the number of iterations, computation time (for each iteration) and the optimal policy."
The part that I'm unsure about is what you consider to be an iteration. In value iteration there is only one repeated part in the algorithm, so that's fairly obvious. In policy iteration, however, there is an inner loop and an outer loop. So, looking at the pseudocode in Section 4.3, is the iteration the "Loop ... until" in step 2, or is it the cycle of policy evaluation and policy improvement?
A3
In the case of policy iteration, we are interested in iterations of policy improvement (the cycle of policy evaluation and policy improvement, as you say), i.e., in iterations of the outer loop. The number of iterations of the inner loop (the "repeat ... until" loop in step 2 of the pseudocode in Section 4.3) used to approximate the value function of the current policy is less important, and it is accounted for anyway because you also measure the time taken per iteration.
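
For the output itself, here is a minimal C sketch (an assumed structure, not the required one) of counting the outer-loop iterations and timing each of them with clock() from <time.h>; policy_evaluation and policy_improvement are hypothetical placeholders for your own implementation of steps 2 and 3:

    #include <stdio.h>
    #include <time.h>

    /* hypothetical stand-ins for your own steps 2 and 3 of policy iteration */
    int policy_evaluation(void)  { return 0; }
    int policy_improvement(void) { return 1; }  /* returns 1 when the policy is stable */

    int main(void) {
        int iteration = 0, policy_stable = 0;
        while (!policy_stable) {
            clock_t start = clock();
            policy_evaluation();
            policy_stable = policy_improvement();
            double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
            iteration++;
            printf("iteration %d took %f seconds\n", iteration, seconds);
        }
        printf("number of iterations: %d\n", iteration);
        return 0;
    }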

Q4
In Part 3 it seems that we have 4 different types of rewards (*r-up, r-down, r-right, r-left*). From what I understand, these are given to the agent depending on where the agent ends up after making a move, regardless of where it was intending to go. E.g., the agent moved right, but due to the stochastic nature of the environment it ended up going left; therefore the reward is going to be *r-left*. Could you confirm whether this is correct?
A4
No, the reward is determined by the action that the agent was attempting, not by the real outcome of the action. For example, if the agent was trying to do the action "up", then it will get the reward r_up even if in reality it remains in the same state.

Q5
I assume in this setup there is only one policy which we have to improve. Or should we make multiple policies and iterate over them?
A5
You are told that the initial policy is the equiprobable random policy, the same as in Example 4.1. See also Q13/A13 below.

Q6
I have read chapters 3 and 4 many times and I am still unsure how to compute V(s). The formula given is vague. According to the formula in the book, in order to compute V(s) we have to know V(s'). How do I know what V(s') is?
A6

Consider the system of the following 3 equations:

x = -1 + 2y + z
y = -7 + 3x + 2z
z = -3 + 5x + 3y
There are several approaches to solving this system of linear equations. Note that, structurally, it is similar to the system of Bellman equations:
    1) on the left-hand side we have a single unknown variable and
    2) on the right-hand side we have a linear combination of unknown variables.
Now, consider the system of Bellman equations (4.4) in Section 4.1 of the textbook. The authors write that to solve this system of linear equations one can use the method of successive approximations. As the initial approximation we can take an arbitrary V(s), e.g., V(s)=0 for all s. Then, we consider each equation as an assignment operation, i.e., we compute the right-hand side using the previous approximation and then assign the computed value to V(s) on the left-hand side. We continue updating the values V(s) until the difference between two successive approximations becomes very small. When this algorithm stops, the result is a value function V(s) that is an (approximate) solution of the system of Bellman equations.
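
To illustrate the mechanics only, here is a small C sketch of the method of successive approximations applied to a made-up system whose right-hand-side coefficients are small, so that the plain iteration is guaranteed to converge; for the Bellman equations (4.4) the textbook explains why the analogous iteration converges:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* made-up system: x = 1 + 0.3y + 0.2z,  y = 2 + 0.1x + 0.3z,  z = 0.5 + 0.2x + 0.1y */
        double x = 0.0, y = 0.0, z = 0.0;   /* arbitrary initial approximation */
        double theta = 0.000000001;          /* precision threshold */
        double delta;
        int sweeps = 0;
        do {
            /* compute the right-hand sides from the previous approximation */
            double nx = 1.0 + 0.3 * y + 0.2 * z;
            double ny = 2.0 + 0.1 * x + 0.3 * z;
            double nz = 0.5 + 0.2 * x + 0.1 * y;
            /* delta = largest change between two successive approximations */
            delta = fabs(nx - x);
            if (fabs(ny - y) > delta) delta = fabs(ny - y);
            if (fabs(nz - z) > delta) delta = fabs(nz - z);
            x = nx; y = ny; z = nz;          /* assign the computed values */
            sweeps++;
        } while (delta >= theta);
        printf("x=%f y=%f z=%f after %d sweeps\n", x, y, z, sweeps);
        return 0;
    }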

Q7
In terms of the programming question, I've read the algorithm and kind of understand it but being able to actually apply it in a program is a different matter. Hopefully you can give me and my partner some insight.
A7
First of all, you need some data structures. If you work with C or Java, you can use an array v[s] to represent the values of states s. Because the problem has 14 ordinary states and 1 terminal state, this array must have 15 elements. Similarly, you need an array of transition probabilities p[s][a][s']; remember that we have 4 different actions a in each state. You also need an array of rewards r[a]. In the general case one may need an array r[s][a][s'], but because in this problem the reward depends only on the agent's action, and neither on the current state nor on the successor state, a 1-dimensional array of rewards is enough. The values for the array p[s][a][s'] are defined in the problem: you can initialize the array using the double numbers p1 and p2 entered by the user. Probabilities of transitions from the terminal state are 0 no matter what the agent does. Finally, you need an array pi[s] to represent a policy: its elements can be numbers 0, 1, 2, 3 or characters (u, d, l, r), whichever you find more convenient. This array needs only 14 elements, because no transitions are made from the terminal state.

Now you have everything you need to convert the pseudo-code given in the textbook into a working program. In your program, for every element s of the array v[s], you add the numbers v[s'] from this array to numbers from the rewards array and multiply them by numbers from the probability array. The sum over s' in the pseudo-code means that you can use a for-loop over s' to add all the required numbers. The max operation over actions can be implemented by another for-loop that compares the 4 resulting numbers to find the greatest one.
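
Here is a minimal C sketch of these data structures and of one backup, i.e., the inner sum over s' and the max over the 4 actions. The arrays are left at zero; in your program you would fill p[s][a][s'] from p1 and p2, and r[a] from r_up, r_down, r_left and r_right. The choice of index 0 for the terminal state is just an assumption of this sketch:

    #include <stdio.h>

    #define NS 15   /* 14 ordinary states + 1 terminal state (index 0 assumed terminal) */
    #define NA 4    /* actions: 0 = up, 1 = down, 2 = left, 3 = right */

    double v[NS];           /* current approximation of the value function */
    double p[NS][NA][NS];   /* transition probabilities p[s][a][s'] */
    double r[NA];           /* reward depends only on the attempted action */
    int    pi[NS];          /* deterministic policy: one action per state */

    /* expected value of taking action a in state s: the inner sum over s' */
    double q(int s, int a) {
        double total = 0.0;
        int s2;
        for (s2 = 0; s2 < NS; s2++)
            total += p[s][a][s2] * (r[a] + v[s2]);
        return total;
    }

    /* greedy backup for state s: the max over the 4 actions */
    double max_q(int s) {
        double best = q(s, 0);
        int a;
        for (a = 1; a < NA; a++)
            if (q(s, a) > best)
                best = q(s, a);
        return best;
    }

    int main(void) {
        /* with all-zero arrays this prints 0; it only shows the shape of the backup */
        printf("backup for state 1: %f\n", max_q(1));
        return 0;
    }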

Q8
When I compute a new approximation of V(s), should I use one array v[s] or two arrays? For example, I can keep my previous approximation in v[s], compute the values of an array new_v[s], and later update all elements of v[s] from new_v[s] when I exit the for-loop over states in the policy evaluation step of the algorithm (step 2).
A8
A quote from the textbook (Section 4.1): "To write a sequential computer program to implement iterative policy evaluation, as given by (4.5), you would have to use two arrays, one for the old values, and one for the new values. This way the new values can be computed one by one from the old values without the old values being changed. Of course it is easier simply to use one array and update the values "in place", that is, with each new backed-up value immediately overwriting the old one. Then, depending on the order in which the states are backed up, sometimes new values are used instead of old ones on the right-hand side of (4.5). This slightly different algorithm also converges to V(s); in fact, it usually converges faster than the two-array version, as you might expect since it uses new data as soon as it is available. We think of the backups as being done in a sweep through the state space. The order in which states are backed up during the sweep has a significant influence on the rate of convergence. We usually have the in-place version in mind when we think of DP algorithms."
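
For concreteness, here is a small C sketch of the two sweep styles described in the quote; backup() is a hypothetical placeholder for the right-hand side of (4.5), chosen only so that the snippet runs and the two orders can give different intermediate values:

    #include <stdio.h>

    #define NS 15
    double v[NS];

    /* hypothetical placeholder for the right-hand side of (4.5);
       it looks at the previous state so the two sweep styles can differ */
    double backup(int s) {
        return (s > 0) ? 0.5 * v[s - 1] + 1.0 : 1.0;
    }

    /* two-array sweep: every new value is computed from old values only */
    void sweep_two_arrays(void) {
        double new_v[NS];
        int s;
        for (s = 0; s < NS; s++)
            new_v[s] = backup(s);
        for (s = 0; s < NS; s++)
            v[s] = new_v[s];
    }

    /* in-place sweep: states backed up later in the sweep already see new values */
    void sweep_in_place(void) {
        int s;
        for (s = 0; s < NS; s++)
            v[s] = backup(s);
    }

    int main(void) {
        sweep_two_arrays();
        sweep_in_place();
        printf("v[%d] = %f\n", NS - 1, v[NS - 1]);
        return 0;
    }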

Q9
For question 3, there are 14 states plus 1 terminal. One terminal is between states 4 and 1 and another terminal is between states 11 and 14. That means there should be 2 terminals and 14 states, so we would have to make the array of size 16. (We are referring to Figure 4.1.)
A9
According to Example 4.1, there is only 1 terminal state; all other states are not terminal (i.e., they are not absorbing states). A quote from the textbook: "The terminal state is shaded in the figure (although it is shown in two places it is formally one state)." This means that there is no need to introduce 2 different terminal states; it is enough to have just one.

Q10
I would like to ask you this question:

   +---+---+---+---+
   | 0 | 1 | 2 | 3 |
   +---+---+---+---+
   | 4 | 5 | 6 | 7 |
   +---+---+---+---+
   | 8 | 9 |10 |11 |
   +---+---+---+---+
   |12 |13 |14 |15 |
   +---+---+---+---+
If I am in 8, and move UP:
        p1              bring me to 4
        p2              bring me to 8
        (1-p1-p2)/2     bring me to 5
        (1-p1-p2)/2     bring me to which state?

A10
You can use the following probabilities:
        prob[8][up][4] = p1 + (1-p1-p2)/2 = (1+p1-p2)/2
        prob[8][up][8] = p2
        prob[8][up][5] = (1-p1-p2)/2
For all other states s',   prob[8][up][s'] = 0
Similarly,
        prob[8][down][12]= p1 + (1-p1-p2)/2
        prob[8][down][8]= p2
        prob[8][down][13]= (1-p1-p2)/2
For all other states s',   prob[8][down][s'] = 0
This means that the probability of moving in the desired direction is a bit higher when the agent moves along the edge of the grid, in comparison to the case when the agent moves in "the free space" and can drift by chance (due to slippage or inaccuracies in the wheels) to one of the two adjacent states. Note that this minor difference between states inside the grid and states along the edge of the grid has no effect on the policy if all actions {up, down, left, right} are deterministic and always move the agent in the direction it heads for, i.e., if p1 = 1.0 and p2 = 0.0.
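
For concreteness, here is a small C sketch of how these numbers could be placed into the probability array. The state numbering follows the 4x4 grid drawn in Q10 above (0..15), not the numbering of Example 4.1, and the values of p1 and p2 are arbitrary placeholders for the numbers entered by the user:

    #include <stdio.h>

    #define NS 16                       /* cells 0..15 of the grid drawn in Q10 */
    enum { UP, DOWN, LEFT, RIGHT, NA }; /* NA = number of actions = 4 */

    double prob[NS][NA][NS];

    int main(void) {
        double p1 = 0.8, p2 = 0.1;              /* placeholders: read from the user in your program */
        double drift = (1.0 - p1 - p2) / 2.0;   /* probability of each sideways drift */

        /* action "up" in state 8 (state 8 is on the left edge of the grid) */
        prob[8][UP][4] = p1 + drift;            /* desired direction, plus the off-grid drift */
        prob[8][UP][8] = p2;                    /* stays in place */
        prob[8][UP][5] = drift;                 /* drifts to the state adjacent to 4 */

        /* action "down" in state 8 */
        prob[8][DOWN][12] = p1 + drift;
        prob[8][DOWN][8]  = p2;
        prob[8][DOWN][13] = drift;

        /* quick sanity check: each row of probabilities should sum to 1 */
        double sum = 0.0;
        int s2;
        for (s2 = 0; s2 < NS; s2++) sum += prob[8][UP][s2];
        printf("sum of prob[8][up][.] = %f\n", sum);
        return 0;
    }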

  • Q11
    In the assignment it says: "All actions that would take the agent off the grid do not change the state with the probability p1+p2, and take the agent to one of the two adjacent states with the probability (1-p1-p2)/2." For example, using Figure 4.1 of the textbook:
    prob[8][left][8] = p1+p2
    prob[8][left][?] = (1-p1-p2)/2
    
    are the adjacent states [?] (5 and 13) or (4 and 12)?
  • A11
    States 4 and 12 are adjacent to the state 8:
        prob[8][left][8] = p1+p2 ;
        prob[8][left][4] = (1-p1-p2)/2 ;
        prob[8][left][12] = (1-p1-p2)/2 ;
    
    In addition,
        prob[12][left][12] = p1+p2 ;
        prob[12][left][8] = (1-p1-p2)/2;
        prob[12][left][13] = (1-p1-p2)/2;
    
    and
           prob[12][down][12] = p1+p2 ;
           prob[12][down][8] = (1-p1-p2)/2;
           prob[12][down][13] = (1-p1-p2)/2;
    

  • Q12
    Can you please explain me the meaning of the following:
    delta <- max(delta, |v - V(s)|)
    does it mean: assign to delta whatever is greater, either the previous delta or |v - V(s)|?
  • A12
    Yes, it means that delta must be assigned the greater of the two values: the previous value of delta or |v - V(s)|. The following is the longer answer. There are several possible implementations of the policy evaluation step; e.g., you can write a recursive implementation, or you can write a while-loop that computes a new approximation of the value function V[s] on every iteration (for a given policy pi). In all cases, the termination condition depends on "theta", the precision value. Let's assume that you decided to implement the while-loop. Your while-loop is supposed to compare two consecutive approximations of the correct value function V[s] for the current policy pi. As soon as two consecutive approximations are very close to each other, we can stop computing new approximations, because we know that subsequent approximations will not change the value function significantly. Hence, to compare two approximations, we compute the differences between V_old[s] and V_new[s] for all states s. The largest difference is assigned to delta. As soon as (delta < theta), we can terminate the while-loop. Remember to initialize "delta" to 0 (or to a negative number) each time *inside* the while-loop, because you compute the largest difference between two approximations during each iteration of the while-loop.
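    Here is a minimal C sketch of this while-loop, with delta reset inside the loop on every sweep; backup(s) is a hypothetical placeholder for the right-hand side of the Bellman equation for the current policy, and state 0 is assumed to be the terminal state:

        #include <stdio.h>
        #include <math.h>

        #define NS 15               /* 14 ordinary states + 1 terminal state (index 0 assumed terminal) */

        double V[NS];
        double theta = 0.000001;    /* precision value */

        /* hypothetical placeholder for the right-hand side of the Bellman equation
           for the current policy; your program computes the real sum here */
        double backup(int s) {
            return 0.0;
        }

        void policy_evaluation(void) {
            double delta;
            int s;
            do {
                delta = 0.0;                          /* reset inside the loop, on every sweep */
                for (s = 1; s < NS; s++) {
                    double old_v = V[s];
                    V[s] = backup(s);
                    if (fabs(old_v - V[s]) > delta)
                        delta = fabs(old_v - V[s]);   /* delta <- max(delta, |v - V(s)|) */
                }
            } while (delta >= theta);
        }

        int main(void) {
            policy_evaluation();
            printf("V[1] = %f\n", V[1]);
            return 0;
        }
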
  • Q13
    I have a question about the starting value of pi[s]. Are we supposed to use a 2-dimensional array pi[s,a] which will hold the probability of choosing each action from each state, and set it so they are all equally likely to start, e.g., pi[s,a] = 1/(number of actions)? Then, as we iterate, we change the probability of performing each action from each state to improve the policy?
    OR
    Are we supposed to use a 1 dimensional array: pi[s] which will keep the best action for the state s, and set initially pi[s] to a random starting value out of left, right, up or down for each s in S?
    OR
    Something completely different?
    Currently I'm using the 2nd approach, but every time I run my program it results in a different ending policy.
  • A13
    Because the pseudo-code for policy iteration considered in the textbook (Section 4.3) assumes that policies are deterministic, i.e., it assumes that for every state "s" there is only one action pi[s] to execute in this state, you can implement the 2nd approach (as you do now). In other words, initially you can choose a random action pi[s] for every state s in S. Then, in the policy evaluation step, you evaluate the current policy pi[s] and compute the value function V[s] for this policy. Next, in the policy improvement step, you improve the current policy by choosing the policy that is "greedy" with respect to the current value function. This new policy is obtained by changing one or several values in your array pi[s]. Then you continue to iterate the last two steps (policy evaluation and policy improvement) until the policy no longer changes in the policy improvement step.
    When you implement the policy improvement step, make sure that you always break ties consistently. In other words, if two different actions give the maximal value in a state s, make sure that your program always chooses the same one of them (see the sketch below).
    As for the 1st approach, try it only if you already have a correctly working program that uses the one-dimensional array pi[s]. You are not required to implement the 1st approach in this assignment. But if you are interested in implementing algorithms for the case when policies are stochastic (i.e., defined by a two-dimensional array), you will need to modify the pseudo-code accordingly.
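    Here is a minimal C sketch of a policy improvement step that breaks ties consistently: actions are scanned in a fixed order and the policy changes only when a strictly better action is found. The helper q(s,a) is the same hypothetical helper as in A7, and state 0 is assumed to be the terminal state:

        #include <stdio.h>

        #define NS 15   /* 14 ordinary states + 1 terminal state (index 0 assumed terminal) */
        #define NA 4    /* actions: 0 = up, 1 = down, 2 = left, 3 = right */

        double v[NS], p[NS][NA][NS], r[NA];
        int pi[NS];

        /* expected value of action a in state s (same helper as in A7) */
        double q(int s, int a) {
            double total = 0.0;
            int s2;
            for (s2 = 0; s2 < NS; s2++)
                total += p[s][a][s2] * (r[a] + v[s2]);
            return total;
        }

        /* one policy improvement step; returns 1 if the policy did not change */
        int policy_improvement(void) {
            int s, a, stable = 1;
            for (s = 1; s < NS; s++) {       /* no action is chosen in the terminal state 0 */
                int old_action = pi[s];
                int best = 0;
                for (a = 1; a < NA; a++)     /* fixed scan order, so ties are always broken the same way */
                    if (q(s, a) > q(s, best))
                        best = a;
                pi[s] = best;
                if (pi[s] != old_action)
                    stable = 0;
            }
            return stable;
        }

        int main(void) {
            printf("policy stable: %d\n", policy_improvement());
            return 0;
        }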

  • Q14
    Another question: if I am at a state along the border (i.e., cell 7) and I choose an action along that border (i.e., up or down), how will the probabilities of the adjacent target states change?
  • A14
    This is similar to Q11/A11 and Q10/A10 answered above.
     p(7, r_up | 7, up) = p2
     p(3, r_up | 7, up) = p1 + (1-p1-p2)/2
     p(2, r_up | 7, up) = (1-p1-p2)/2
     p(s', r_up | 7, up) = 0, for all other states s'.
    
     p(7, r_right | 7, right) = p1 + p2
     p(3, r_right | 7, right) = (1-p1-p2)/2
     p(11, r_right | 7, right) = (1-p1-p2)/2
     p(s', r_right | 7, right) = 0, for all other states s'.
    

  • Q15
    In the example from the textbook (figure 4.1), the optimal policy shown is non-deterministic. There are several spots on the grid where there are multiple optimal actions. Currently, both my algorithms output non-deterministic optimal policies like the one in the example mentioned above. If there is more than one optimal action, the policy is to pick from these optimal actions with equal probability. Is this acceptable, or should our algorithms just randomly assign one of the optimal actions to be the policy for a given state?
  • A15
    As long as your programs produce correct optimal stochastic policies, it should be fine. Please write a brief note about this in your report to inform the TA. You can also mention Q13/A13 from the FAQs where I mentioned at the end that it is acceptable for your programs to produce a correct stochastic optimal policy.
  • Q16
    I can't figure out which two of the four adjacent states I should pick. For example, if I am going from 9 -> 10, would the adjacent states be 6, 9, 11 and 14, or not?
  • A16
    If the agent is going from 9 to 10, then the two states adjacent to 10 are 6 and 14, but not 11. The agent can also stay at 9.


    RL Main page