Conservative Q-learning (CQL) [1] is an algorithmic framework for offline RL that learns a lower bound on the expected policy value by penalizing the Q-function at dataset states for actions not observed in the dataset. This yields a conservative estimate of the value function for any policy, mitigating the challenges of over-estimation bias and distribution shift. On the D4RL tasks, CQL is implemented on top of Soft Actor-Critic (SAC). The Q-function iteration is shown below.
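One way to write this update, following the Lagrangian form of CQL in [1] (with the soft-maximum term arising from the regularized inner maximization over actions), is roughly

$$
\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\,\max_{\alpha \ge 0}\;\alpha\left(\mathbb{E}_{s\sim\mathcal{D}}\!\left[\log\sum_{a}\exp\big(Q(s,a)\big)-\mathbb{E}_{a\sim\hat{\pi}_{\beta}(a\mid s)}\big[Q(s,a)\big]\right]-\tau\right)+\frac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\!\left[\Big(Q(s,a)-\hat{\mathcal{B}}^{\pi_{k}}\hat{Q}^{k}(s,a)\Big)^{2}\right]
$$

where $\mathcal{D}$ is the offline dataset, $\hat{\pi}_{\beta}$ the behavior policy, and $\hat{\mathcal{B}}^{\pi_{k}}$ the empirical Bellman backup under the current policy $\pi_{k}$.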
The objective consists of a bootstrap (Bellman) error term and a CQL regularization term. The regularizer results from solving for the optimal action distribution when the inner maximization is regularized toward a uniform prior over the current policy: it minimizes the Q-values of actions sampled from the current policy while maximizing the Q-values of actions sampled from the behavior policy, corresponding to the first and second terms inside the 'max' respectively. The weight $\alpha$ is adjusted automatically via Lagrangian dual gradient descent, and $\tau$ is a threshold value. When CQL runs on continuous benchmarks such as the MuJoCo tasks, the log-sum-exp term is computed using importance sampling, as follows.
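Roughly, following the sampling scheme described in [1], with $N$ actions drawn from a uniform distribution $\mathrm{Unif}(\mathcal{A})$ over the action space and $N$ actions drawn from the current policy $\pi_{\phi}$:

$$
\log\sum_{a}\exp\big(Q(s,a)\big)\;\approx\;\log\left(\frac{1}{2N}\sum_{a_{i}\sim\mathrm{Unif}(\mathcal{A})}\frac{\exp\big(Q(s,a_{i})\big)}{\mathrm{Unif}(a_{i})}\;+\;\frac{1}{2N}\sum_{a_{i}\sim\pi_{\phi}(a\mid s)}\frac{\exp\big(Q(s,a_{i})\big)}{\pi_{\phi}(a_{i}\mid s)}\right)
$$

where $\mathrm{Unif}(a_{i})$ and $\pi_{\phi}(a_{i}\mid s)$ in the denominators denote the densities of the two proposal distributions.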
The policy improvement step is the same as SAC's.
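To make the structure of the critic update concrete, below is a minimal PyTorch-style sketch of a CQL critic loss. The `q_net`, `target_q_net`, and `policy.sample` interfaces are hypothetical placeholders, and the multiplier `cql_alpha` is fixed rather than tuned by the Lagrangian dual update described above; this is an illustration of the structure, not the reference implementation.

```python
import math

import torch
import torch.nn.functional as F


def cql_critic_loss(q_net, target_q_net, policy, batch,
                    cql_alpha=5.0, num_sampled_actions=10, gamma=0.99):
    """Sketch of a CQL critic loss: SAC-style Bellman error plus a conservative
    regularizer that pushes Q down on sampled actions and up on dataset actions.

    Assumed (hypothetical) interfaces:
      q_net(s, a), target_q_net(s, a) -> Q-values of shape (batch, 1)
      policy.sample(s) -> (actions, log_probs) of shapes (batch, act_dim), (batch, 1)
    """
    s, a, r, s_next, done = batch  # dataset transitions

    # Bellman (bootstrap) error; the SAC entropy bonus is omitted for brevity.
    with torch.no_grad():
        a_next, _ = policy.sample(s_next)
        target_q = r + gamma * (1.0 - done) * target_q_net(s_next, a_next)
    bellman_error = F.mse_loss(q_net(s, a), target_q)

    # CQL regularizer: importance-sampled approximation of log sum_a exp Q(s, a),
    # using actions from a uniform distribution over [-1, 1]^d and from the policy.
    batch_size, act_dim = a.shape
    unif_a = torch.empty(num_sampled_actions, batch_size, act_dim).uniform_(-1.0, 1.0)
    unif_logp = torch.full((num_sampled_actions, batch_size, 1), -act_dim * math.log(2.0))

    pi_samples = [policy.sample(s) for _ in range(num_sampled_actions)]
    pi_a = torch.stack([x[0] for x in pi_samples])
    pi_logp = torch.stack([x[1] for x in pi_samples]).detach()

    def q_values(actions):
        # Q(s, a_i) for every sampled action, shape (num_sampled_actions, batch, 1).
        return torch.stack([q_net(s, actions[i]) for i in range(num_sampled_actions)])

    # Subtracting the proposal log-density implements the importance weights.
    cat_q = torch.cat([q_values(unif_a) - unif_logp,
                       q_values(pi_a) - pi_logp], dim=0)
    logsumexp_q = torch.logsumexp(cat_q, dim=0) - math.log(2 * num_sampled_actions)

    # Push Q down on sampled (out-of-distribution) actions, up on dataset actions.
    # In [1], the weight is tuned by dual gradient descent against a threshold tau;
    # a fixed cql_alpha is used here to keep the sketch short.
    cql_term = (logsumexp_q - q_net(s, a)).mean()

    return bellman_error + cql_alpha * cql_term
```

In the full algorithm this loss would be combined with the dual update for $\alpha$ and the SAC-style policy and temperature updates.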
- [1] Kumar A, Zhou A, Tucker G, Levine S. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems. 2020;33:1179-1191.