Shows that the surrogate rewards optimized by PPO and TRPO do not necessarily lead in the direction of the true policy gradient.
Citation:
@article{DBLP:journals/corr/abs-1811-02553,
author = {Andrew Ilyas and
Logan Engstrom and
Shibani Santurkar and
Dimitris Tsipras and
Firdaus Janoos and
Larry Rudolph and
Aleksander Madry},
title = {Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?},
journal = {CoRR},
volume = {abs/1811.02553},
year = {2018},
url = {http://arxiv.org/abs/1811.02553},
eprinttype = {arXiv},
eprint = {1811.02553},
timestamp = {Thu, 22 Nov 2018 17:58:30 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1811-02553.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}