Out-of-sample value estimation is known to be a hard problem. However, one would hope that RL should be capable of producing conservative value estimates.
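Ordinarily this should be solvable with double Q networks. If one network is biased upwards for some action, evaluate that action with the other network: chances are its error is independent, so it will not be biased the same way. The remaining problem is systematic bias, meaning bias that both networks share. But what systematic bias can occur?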
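To make the double Q argument concrete, here is a minimal simulation sketch, not a real RL training loop. It assumes a toy setting that is not from the text above: every action has a true value of 0, each "network" is just the true value plus independent Gaussian noise, and a hypothetical `shared_bias` parameter adds the same offset to both networks to stand in for systematic bias. The function name and all parameters are illustrative.

```python
import numpy as np


def estimate_max_value(n_actions=10, n_trials=20000, noise=1.0,
                       shared_bias=0.0, seed=0):
    """Compare the single and double estimators of max_a Q(a) in a toy setting."""
    rng = np.random.default_rng(seed)

    # Assumption: all actions are truly worth 0, so the true maximum is 0.
    true_q = np.zeros(n_actions)

    # Two estimators with independent noise, plus an optional bias common to both.
    q_a = true_q + shared_bias + rng.normal(0.0, noise, size=(n_trials, n_actions))
    q_b = true_q + shared_bias + rng.normal(0.0, noise, size=(n_trials, n_actions))

    # Single estimator: select and evaluate with the same network.
    single = q_a.max(axis=1).mean()

    # Double estimator: select the action with network A, evaluate it with network B.
    best_a = q_a.argmax(axis=1)
    double = q_b[np.arange(n_trials), best_a].mean()

    return single, double


if __name__ == "__main__":
    print("independent noise only:", estimate_max_value())
    print("with shared bias 0.5:  ", estimate_max_value(shared_bias=0.5))
```

In this toy setting the single estimate comes out well above the true value of 0, while the double estimate lands close to 0 because the evaluating network's noise is independent of the selection. With a nonzero `shared_bias`, the double estimate lands close to that bias rather than 0: decoupling selection from evaluation cancels independent errors but does nothing about an error both networks have in common.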