A combination of three different insights.
If I am right, this should have:
- Better sample efficiency due to reward factorization (which gives a stronger learning signal)
- A less biased value estimator (due to target value learning)
- Unsupervised skill discovery (due to information encoding from clock-DIAYN)
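For reference on the skill discovery point, here is a minimal sketch of the pseudo-reward used in standard DIAYN, r = log q(z|s) - log p(z), assuming a uniform prior over skills. It is not the clock variant, which is not specified here. The names `diayn_reward` and `discriminator_logits` are hypothetical, and the logits stand in for the output of whatever skill discriminator is trained.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def diayn_reward(discriminator_logits, skill):
    """Standard DIAYN pseudo-reward: log q(z|s) - log p(z).

    discriminator_logits: unnormalized scores over skills for the current
        state (assumed to come from a learned discriminator q(z|s)).
    skill: index z of the skill the policy is currently conditioned on.
    Assumes p(z) is uniform over the skills.
    """
    num_skills = len(discriminator_logits)
    log_q = log_softmax(discriminator_logits)[skill]
    log_p = -math.log(num_skills)  # uniform prior over skills
    return log_q - log_p
```

The reward is zero when the discriminator cannot tell the skills apart, positive when the visited state identifies the active skill, and negative when the state looks like it belongs to a different skill.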