I'm a bit confused about why the Q values of actions drawn from three distinct distributions can be used to compute this quantity:
q1_rand: uniform distribution
q1_pred: dataset distribution
q1_curr_actions and q1_next_actions: last-iteration policy
Here are my questions:
In the CQL(rho) section of Appendix A, isn't the expectation taken with respect to the rho distribution only (which we have chosen to be the last-iteration policy)?
Why do we use log-sum-exp here while the corresponding term (the first term) in Equation 7 of the paper does not contain log at all?
I'm able to completely understand how CQL(H) works in the codebase though.
I think they only implemented CQL(H). In their codebase, min_q_version is always set to 3, which corresponds to CQL(H). The equation with log-sum-exp appears in Appendix F (Additional Experimental Setup and Implementation Details).
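For what it's worth, here is one way to see why samples from several distributions can be pooled inside a single log-sum-exp. The CQL(H) regularizer contains the log of the sum (or integral) of exp Q(s, a) over actions, and that integral can be estimated by importance sampling: each sampled Q-value is corrected by the log-density of the distribution it was drawn from before everything is pooled. The sketch below is a toy, pure-Python illustration of that idea, not the repository's code. The 1-D Q-function, the stand-in "policy", and all function names are assumptions made up for the example.

```python
import math
import random


def q_value(action: float) -> float:
    """Toy Q-function for one fixed state (hypothetical, for illustration only)."""
    return -action * action


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sample_policy(rng: random.Random) -> float:
    """Stand-in 'policy': equal mixture of uniform and triangular on [-1, 1]."""
    if rng.random() < 0.5:
        return rng.uniform(-1.0, 1.0)
    return rng.triangular(-1.0, 1.0, 0.0)


def policy_log_density(action: float) -> float:
    """Log-density of the mixture above: 0.5 * 0.5 + 0.5 * (1 - |a|)."""
    return math.log(0.75 - 0.5 * abs(action))


def estimate_log_integral(num_samples: int = 20000, seed: int = 0) -> float:
    """Estimate log of the integral of exp(Q(a)) over [-1, 1] from two proposals."""
    rng = random.Random(seed)
    uniform_log_density = math.log(0.5)  # density of Uniform(-1, 1)
    corrected = []
    for _ in range(num_samples):
        # Uniformly sampled action, corrected by the uniform log-density.
        a_rand = rng.uniform(-1.0, 1.0)
        corrected.append(q_value(a_rand) - uniform_log_density)
        # Policy-sampled action, corrected by the policy log-density.
        a_pi = sample_policy(rng)
        corrected.append(q_value(a_pi) - policy_log_density(a_pi))
    # Subtracting log(N) turns the sum into an average of importance weights.
    return logsumexp(corrected) - math.log(len(corrected))


if __name__ == "__main__":
    exact = math.log(math.sqrt(math.pi) * math.erf(1.0))
    print(estimate_log_integral(), exact)
```

With the defaults, the estimate should land close to log(sqrt(pi) * erf(1)), the exact value for this toy Q-function. Dropping the log(N) term only shifts the result by a constant, so it would not change gradients with respect to Q.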
This is a question regarding how CQL(rho) works in terms of code 😊.
In the CQL section (starting from line 235) within /CQL/d4rl/rlkit/torch/sac/cql.py, we first computed: and then used them to compute