Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential mismatch between math and code for CQL(rho) #8

Open
zhihanyang2022 opened this issue Dec 25, 2020 · 2 comments
Open

Potential mismatch between math and code for CQL(rho) #8

zhihanyang2022 opened this issue Dec 25, 2020 · 2 comments

Comments

@zhihanyang2022
Copy link

zhihanyang2022 commented Dec 25, 2020

This is a question regarding how CQL(rho) works in terms of code 😊.

In the CQL section (starting from line 235) within /CQL/d4rl/rlkit/torch/sac/cql.py, we first computed:

cat_q1 = torch.cat(
    [q1_rand, q1_pred.unsqueeze(1), q1_next_actions, q1_curr_actions], 1
)
cat_q2 = torch.cat(
    [q2_rand, q2_pred.unsqueeze(1), q2_next_actions, q2_curr_actions], 1
)

and then used them to compute

min_qf1_loss = torch.logsumexp(cat_q1 / self.temp, dim=1,).mean() * self.min_q_weight * self.temp
min_qf2_loss = torch.logsumexp(cat_q2 / self.temp, dim=1,).mean() * self.min_q_weight * self.temp

I'm a bit confused about why the Q values of actions drawn from three distinct distributions can be used to compute this quantity:

  • q1_rand: uniform distribution
  • q1_pred: dataset distribution
  • q1_curr_actions and q1_next_actions: last-iteration policy

Here are my questions:

  • In Appendix A section CQL(rho), don't we have that the expectation is with respect to the rho distribution only (which we have chosen to be the last-iteration policy)?
  • Why do we use log-sum-exp here while the corresponding term (the first term) in Equation 7 of the paper does not contain log at all?

I'm able to completely understand how CQL(H) works in the codebase though.

@loicsacre
Copy link

I think they only gave the implementation of CQL(H). In their code base, the min_q_version is always set to 3, which corresponds to CQL(H). The equation with log-sum-exp is present in Appendix F (Additional Experimental Setup and Implementation Details).

@qsa-fox
Copy link

qsa-fox commented May 28, 2022

Equation 7 missed some item, i.e. the KL-divergence, after adding this item any you can deduce logsumexp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants