The 42 -> 137 also jumped out at me. On the face of it, the associated improvement sure does sound like overfitting to the eval set.