Shows that the surrogate rewards optimized by PPO/TRPO often do not lead in the direction of the true policy gradient.

Citation:

@article{DBLP:journals/corr/abs-1811-02553,
  author    = {Andrew Ilyas and
               Logan Engstrom and
               Shibani Santurkar and
               Dimitris Tsipras and
               Firdaus Janoos and
               Larry Rudolph and
               Aleksander Madry},
  title     = {Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?},
  journal   = {CoRR},
  volume    = {abs/1811.02553},
  year      = {2018},
  url       = {http://arxiv.org/abs/1811.02553},
  eprinttype = {arXiv},
  eprint    = {1811.02553},
  timestamp = {Thu, 22 Nov 2018 17:58:30 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1811-02553.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}