A combination of 3 different insights

If I am right, this should have:

  1. Better sample efficiency, due to reward factorization (which gives a stronger learning signal)
  2. A less biased value estimator, due to target value learning
  3. Unsupervised skill discovery, due to the information encoding from clock-DIAYN
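
For reference, point 3 builds on the standard DIAYN pseudo-reward, r = log q(z|s) - log p(z), where q is a learned skill discriminator and p is the prior over skills. The sketch below shows only that plain DIAYN term, assuming a uniform prior; the clock-based variant is not reproduced here, and the names `diayn_reward` and `discriminator_logits` are hypothetical, chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def diayn_reward(discriminator_logits, skill):
    """Plain DIAYN pseudo-reward: log q(z|s) - log p(z).

    Assumption: p(z) is uniform over len(discriminator_logits) skills.
    discriminator_logits are the discriminator's unnormalized scores
    for each skill at the current state.
    """
    num_skills = len(discriminator_logits)
    log_q = log_softmax(discriminator_logits)[skill]
    log_p = -math.log(num_skills)  # uniform prior over skills
    return log_q - log_p


# The reward is positive when the discriminator can tell which skill
# produced the state, and negative when it points at another skill.
confident = diayn_reward([10.0, 0.0, 0.0, 0.0], skill=0)
mistaken = diayn_reward([10.0, 0.0, 0.0, 0.0], skill=1)
uninformed = diayn_reward([0.0, 0.0, 0.0, 0.0], skill=2)
```

With a uniform prior the reward is bounded above by log(num_skills), so it is zero when the discriminator is no better than chance and approaches the bound as the skill becomes fully identifiable from the state.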