
QF_Loss backprops policy network #5

Open
00schen opened this issue Dec 4, 2020 · 17 comments · May be fixed by #13

Comments

@00schen

00schen commented Dec 4, 2020

In the CQL trainer, the policy_loss is formulated before the QF_Loss, but the QF_Loss backprops through the policy network before policy_loss does, which causes a Torch error. Is the intended use to optimize the policy network on the policy_loss before formulating the QF_Loss (and still optimize the policy through the QF_Loss), or to not reparametrize the policy output when formulating the QF_Loss (e.g. line 201)?

@olliejday

Is this the error you are talking about? I have been trying to debug this too; I can add the full output if helpful.

/home/.../torch/autograd/__init__.py:132: UserWarning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 1]], which is output 0 of TBackward, is at version 40001; expected version 40000 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
python-BaseException
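
Side note: the "Traceback of forward call that caused the error" line in the warning above comes from PyTorch's anomaly detection. For anyone reproducing this, a minimal way to get that forward traceback is:

import torch

# Anomaly detection makes autograd record the forward-pass traceback of the
# operation whose backward later fails, and print it alongside the error.
torch.autograd.set_detect_anomaly(True)

# ... run the training step that triggers the RuntimeError here ...

This only helps locate the offending in-place modification; it does not fix it.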

@dosssman

dosssman commented Dec 9, 2020

In the CQL trainer, the policy_loss is formulated before the QF_Loss is, but the QF_Loss backprops the policy network before policy_loss does, which causes a Torch error.

I don't think the QF_Loss backprops the policy network, because they use different optimizers for the policy and the Q networks, respectively.
It is also hard to tell what is really causing that error without the stack trace.
Furthermore, what arguments are you using for the training?
(The script has a lot of parameterization: max-backup, n-qs, min-q-version, with_lagrange, etc. Depending on the combination you use, there might be an unforeseen computation that happens, thus causing the error.)

In any case, have you tried to move the:

self._num_policy_update_steps += 1
self.policy_optimizer.zero_grad()
policy_loss.backward(retain_graph=False)
self.policy_optimizer.step()

just after the computation of the policy loss?
This way, the policy network would be optimized early on, and the subsequent operations involving the policy should not cause further errors when used to compute the Q losses.
(The problem might also be somewhere around automatic entropy tuning, which also uses the log_pi.)
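
As a rough sketch (keeping the names quoted in this thread, with everything else in the training step elided), the suggested reordering would look like:

# Illustration of the suggestion above, not the repository's exact code:
# step the policy optimizer immediately after the policy loss is formed,
# before any Q-function losses are built.

new_obs_actions, policy_mean, policy_log_std, log_pi, *_ = self.policy(
    obs, reparameterize=True, return_log_prob=True,
)

if self.num_qs == 1:
    q_new_actions = self.qf1(obs, new_obs_actions)
else:
    q_new_actions = torch.min(
        self.qf1(obs, new_obs_actions),
        self.qf2(obs, new_obs_actions),
    )

policy_loss = (alpha * log_pi - q_new_actions).mean()

# Policy update moved up, directly after the policy loss.
self._num_policy_update_steps += 1
self.policy_optimizer.zero_grad()
policy_loss.backward(retain_graph=False)  # may need retain_graph=True if a later loss reuses this graph
self.policy_optimizer.step()

# ... the QF_Loss computation and the Q-function optimizer steps follow here ...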

@olliejday

I'm currently testing this (small change) PR. It blocks the gradient flow to the Q functions in the policy update, which prevents the error.

olliejday mentioned this issue Dec 9, 2020
@dosssman

dosssman commented Dec 9, 2020

I am afraid that change will break the learning of the policy itself, because the q_new_actions.detach() in

policy_loss = (alpha*log_pi - q_new_actions.detach()).mean()

will also block the gradient flow to the policy, since q_new_actions is computed as below:

if self.num_qs == 1:
    q_new_actions = self.qf1(obs, new_obs_actions)
else:
    q_new_actions = torch.min(
          self.qf1(obs, new_obs_actions),
          self.qf2(obs, new_obs_actions),
    )

and the new_obs_actions are sampled using the re-parameterization trick (.rsample) in the self.policy() method:

new_obs_actions, policy_mean, policy_log_std, log_pi, *_ = self.policy(
     obs, reparameterize=True, return_log_prob=True,
)

Therefore, the policy weights will only be updated to minimize the alpha*log_pi term, not to actually maximize the action value q_new_actions.

(As a personal anecdote, I made the exact same mistake when implementing SAC a while ago. It is critical not to detach the Q values when updating the policy. I think that is also the main reason the optimizers are separated for the policy and the Q networks: so that the policy's action value can be backpropped through the Q functions without altering the weights of the latter.)
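
To make that last point concrete, here is a minimal, self-contained illustration with toy networks (nothing here is the repository's code): because the policy optimizer only holds the policy's parameters, backpropagating the policy loss through a Q network fills gradients on both networks, but the optimizer step only changes the policy weights.

import torch
import torch.nn as nn

# Toy stand-ins for the policy and a Q function (illustrative only).
policy = nn.Linear(4, 2)
qf = nn.Linear(4 + 2, 1)

policy_optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
qf_optimizer = torch.optim.Adam(qf.parameters(), lr=3e-4)

obs = torch.randn(8, 4)
actions = torch.tanh(policy(obs))                 # reparameterized-style actions
q_values = qf(torch.cat([obs, actions], dim=-1))

# No .detach(): the gradient must flow through the Q function into the policy.
policy_loss = (-q_values).mean()

policy_optimizer.zero_grad()
policy_loss.backward()   # fills .grad on BOTH policy and qf parameters...
policy_optimizer.step()  # ...but only the policy parameters are updated,
                         # because qf's parameters are not in this optimizer.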

@aviralkumar2907
Owner

@olliejday I think the error is caused by the PyTorch version. If you try torch 1.4, that could fix it. Something more might break it. Could you please confirm whether this is the issue or not?

@aviralkumar2907
Owner

@olliejday @dosssman The Q-function detach will not work, since then the policy is not trained using the Q-function, which is incorrect.

@olliejday

I'm looking at what @dosssman says but in reverse (i.e. moving the Q-function evaluation rather than the policy update).

So moving

        if self.num_qs == 1:
            q_new_actions = self.qf1(obs, new_obs_actions)
        else:
            q_new_actions = torch.min(
                self.qf1(obs, new_obs_actions),
                self.qf2(obs, new_obs_actions),
            )

        policy_loss = (alpha*log_pi - q_new_actions).mean()

To just before the policy update but after the Q updates, i.e. directly above here:

        self._num_policy_update_steps += 1
        self.policy_optimizer.zero_grad()
        policy_loss.backward(retain_graph=False)
        self.policy_optimizer.step()

This stops the error and seems to match the order in the paper:

[screenshot of the update order from the CQL paper]

I'm trying it now; otherwise I will test which torch versions work.
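
For readers following along, a rough outline of the reordered step (a sketch only: the Q-function optimizer attribute names are assumptions, and the bodies of the Q losses are elided):

# 1. Sample reparameterized actions (new_obs_actions, log_pi) and handle the
#    entropy/alpha terms as before.

# 2. Compute qf1_loss and qf2_loss (including the conservative min-Q terms)
#    and step the Q-function optimizers first.
self.qf1_optimizer.zero_grad()
qf1_loss.backward(retain_graph=True)   # in case the two Q losses share a graph
self.qf1_optimizer.step()

self.qf2_optimizer.zero_grad()
qf2_loss.backward(retain_graph=True)
self.qf2_optimizer.step()

# 3. Only now evaluate the Q functions at the policy's actions and form the
#    policy loss, so its graph is built from the already-updated Q weights and
#    its backward pass no longer depends on tensors modified in-place above.
if self.num_qs == 1:
    q_new_actions = self.qf1(obs, new_obs_actions)
else:
    q_new_actions = torch.min(
        self.qf1(obs, new_obs_actions),
        self.qf2(obs, new_obs_actions),
    )
policy_loss = (alpha * log_pi - q_new_actions).mean()

self._num_policy_update_steps += 1
self.policy_optimizer.zero_grad()
policy_loss.backward(retain_graph=False)
self.policy_optimizer.step()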

Thanks

@dssrgu

dssrgu commented Feb 18, 2021

Hello,

Any updates on this issue?
I tried using @olliejday's solution, but the results are different in some environments. For example, on hopper-expert-v0 with policy_lr=1e-4, min_q_weight=5.0, and lagrange_thresh=-1.0, the average return results are:

Unmodified code w/ torch==1.4: 3638.71
Modified code w/ torch==1.7: 3.08

@olliejday

Hi, I ended up just reverting the torch version to 1.4.

dssrgu linked a pull request Feb 26, 2021 that will close this issue
@dssrgu

dssrgu commented Feb 26, 2021

Adding .detach() to the outputs of _get_policy_actions() and switching the update order of the policy network and the q-function networks seem to solve the issue (Tested in torch=1.7).
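
For later readers, a hedged sketch of what that change amounts to (the _get_policy_actions name comes from the comment above; its exact signature and the num_actions value are assumptions, not the pull request's code):

# The actions sampled purely for the CQL regularizer (the logsumexp / min-Q
# term) are detached, so the Q-function losses no longer hold a graph through
# the policy network.
curr_actions, curr_log_pis = self._get_policy_actions(
    obs, num_actions=10, network=self.policy
)
curr_actions, curr_log_pis = curr_actions.detach(), curr_log_pis.detach()

new_curr_actions, new_log_pis = self._get_policy_actions(
    next_obs, num_actions=10, network=self.policy
)
new_curr_actions, new_log_pis = new_curr_actions.detach(), new_log_pis.detach()

# The policy loss keeps the non-detached q_new_actions, and the policy
# optimizer is stepped before the Q-function optimizers, so each backward pass
# only runs over weights that have not yet been modified in-place.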

@sweetice

@dssrgu Thanks for your contribution! I have adapted your commits, but I have some questions.
Does your modification work? Have you compared it with the original version (which works on PyTorch 1.4)? If yes, which one is better?

@dssrgu

dssrgu commented Apr 14, 2021

@sweetice Hi, I did test the modified version against the original version (which was run on torch==1.4), and the two versions had similar performance on the d4rl datasets. I do not have the actual values right now, though.

Note: you may additionally have to correct the retain_graph parameters on the backward steps to match the changed update order.
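
To illustrate that note with a self-contained toy example: backward() frees the autograd graph by default, so when two losses share part of a graph, every backward call through the shared part except the last one needs retain_graph=True.

import torch

# Two losses built from the same intermediate tensor share a graph.
w = torch.randn(3, requires_grad=True)
shared = (w * 2).sum()   # shared part of the graph
loss_a = shared ** 2
loss_b = shared * 3

loss_a.backward(retain_graph=True)  # keep the shared graph alive for loss_b
loss_b.backward()                   # last backward; the graph can be freed now

When the update order changes, the same rule determines which of the qf / policy backward calls still needs retain_graph=True.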

@Zhendong-Wang

@dssrgu Did you get results similar to the values reported in the D4RL paper?

I tried both the paper's hyperparameters (policy_lr=3e-5, lagrange_thresh=10.0) and the ones recommended in this GitHub repo (policy_lr=1e-4, lagrange_thresh=-1.0), in PyTorch 1.4 and 1.7+, but I cannot obtain similar values in some environments. For example, there is a big difference on 'halfcheetah-medium-expert-v0', and a huge difference on the Adroit tasks, like 'pen-human', 'hammer-human', and 'door-human'.

Do you know how to set the hyperparameters to make CQL work in most cases? Thanks!

@dssrgu

dssrgu commented May 7, 2021

@Zhendong-Wang I found policy_lr=1e-4, min_q_weight=10.0, lagrange_thresh=-1.0 to work fairly well on most of the gym environments, though I used the '*-v2' datasets. The exception is 'halfcheetah-random-v2', where policy_lr=1e-4, min_q_weight=1.0, lagrange_thresh=10.0 works well. If the problem is only with the medium-expert datasets, it seems the algorithm needs to run for 3000 epochs to converge.

For the Adroit tasks, I also could not reproduce the results...

@glorgao

glorgao commented Jul 16, 2021

@dssrgu Could you give me some advice?

I used the hyperparameters you recommended, and the results on the 'medium' envs are in line with the CQL paper results.
However, the result for the 'walker2d-expert-v2' task stops improving after reaching 5000, while the paper reports about 7000.
The learning curve looks capped at 5000, since it goes almost flat after reaching that value.

I believe there must be something wrong in my settings, which are:
mujoco200
pytorch=1.4 or 1.1
d4rl=1.1
for the walker2d-expert-v2 and walker2d-expert-v0 tasks

Do you have any suggestions for me?

@Zhendong-Wang

@cangcn Actually, with the GitHub code and the hyperparameters recommended in the README file, I cannot reproduce the results reported in the D4RL paper, even on the Gym tasks. I tried both 'v2' and 'v0'; the performance on 'v2' is generally better than on 'v0', but it still cannot match most of the reported results, even though D4RL mentions they used 'v0' for fair comparison.

@jihwan-jeong

Hi.

@olliejday: I think this issue shouldn't be resolved by just switching back to torch versions below 1.5 (i.e. <=1.4), because then reproducibility relies on a bug in the torch code (see this thread). According to the linked discussion, in torch < 1.5, even when the code runs and trains the network parameters, the computed gradients can be incorrect; this is fixed in torch >= 1.5.

Hopefully, the PR that @dssrgu posted can solve this issue, but for some tasks it seems the results cannot be reproduced. I hope the original author @aviralkumar2907 can provide some feedback on this matter :) In the meantime, I think I'll use @dssrgu's modifications to make the code runnable.

Thanks!
