IanCal 8 hours ago
I'm not arguing that this is the purpose here, but data augmentation has been done for ages. It just kind of sucks a lot of the time. You take your images and crop, shift, etc. them so that your model doesn't learn "all x are in the middle of the image". For text you might automatically replace days of the week with other days; there's a lot of work in that area. Broadly, the intent is to keep the key information and generate realistic but irrelevant noise, so that you train a model that correctly ignores the noise. You don't want a model that identifies some class of ship to base its decision on how choppy the water is just because that was a simple signal that correlated well. There was a case of a radiology model that detected cancer well but was actually detecting rulers in the image, because images with tumors often included a ruler so the tumor could be sized. (I think it was cancer; the broad point applies if it was something else.)
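
To make the crop and weekday examples concrete, here is a minimal sketch in plain Python. The function names (`swap_weekdays`, `random_crop`) and the toy 6x6 "image" are made up for illustration; a real pipeline would more likely use a library's augmentation tools. The point is only that the label-relevant content stays while position and incidental details change.

```python
import random
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
DAY_PATTERN = re.compile(r"\b(" + "|".join(DAYS) + r")\b")


def swap_weekdays(text, rng):
    """Replace each weekday name with a randomly chosen weekday.

    Assumes the label does not depend on which day is mentioned.
    """
    return DAY_PATTERN.sub(lambda match: rng.choice(DAYS), text)


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random offset from a 2D list."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


rng = random.Random(0)  # seeded so repeated runs give the same augmentations

# Toy grayscale "image": a 6x6 grid of distinct pixel values.
image = [[r * 10 + c for c in range(6)] for r in range(6)]
crop = random_crop(image, 4, 4, rng)

augmented = swap_weekdays("Meeting moved from Monday to Friday", rng)
print(crop)
print(augmented)
```

Each call produces a different variant of the same example, so the model sees the object at varying positions and the sentence with varying days, while the label you train against is unchanged.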