Training RL policies on edge cases by using humans to collect and instrument previously closed data systems.