antichronology 4 days ago
I watched an interview with one of the co-founders of Anthropic where his point was that although benchmarks saturate, they're still an important signal for model development. We think the situation is similar here - one of the challenges is aligning the benchmark with the function of the models. Genomic benchmarks for gLMs and RNA foundation models have been very resistant to saturation. In NLP, I think the problem is that benchmarks are victims of their own success: models can be overfit to a particular benchmark really fast. In genomics we're a bit behind. A good paper on this is DART-Eval, which defines levels of task complexity: https://arxiv.org/abs/2412.05430. RNA models currently work much better than DNA models at prediction tasks, but either way it's key to have benchmarks to measure progress.