| ▲ | msp26 5 hours ago | |
> Data extraction tasks are amongst the easiest to evaluate because there’s a known “right” answer.

Wrong. There can be a lot of subjectivity, and pretending that some golden answer exists does more harm than good and narrows the scope of what you can build.

My other main problem with data extraction tasks, and why I'm not satisfied with any of the existing eval tools, is that the schemas I write can change drastically as my understanding of the problem increases. Nothing really seems to handle that well; I mostly just resort to reading diffs of what happens when I change something and reading the input/output data very closely. Marimo is fantastic for anything visual like this btw.

Also, there is a difference between: the problem in reality → the business model → your db/application schema → the schema you send to the LLM. To actually improve your schema/prompt you have to be mindful of the entire problem stack, and of which things are better handled through post-processing than by the LLM directly.

> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.

And in practice random limitations, like structured output API schema limits that differ between providers, can make this non-trivial. God I hate the Gemini API. | ||
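For what it's worth, the "reading diffs after a schema or prompt change" workflow can be roughly sketched as below. This is only a minimal illustration, not any particular eval tool: `flatten`, `diff_outputs`, and the invoice-style sample data are all hypothetical names and values made up for the example.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict:
    """Flatten nested dicts/lists into {path: leaf_value}.

    Empty dicts and lists produce no entries.
    """
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    flat = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def diff_outputs(old: Any, new: Any) -> dict:
    """Compare two extraction outputs path by path."""
    a, b = flatten(old), flatten(new)
    return {
        "added": {p: b[p] for p in b.keys() - a.keys()},
        "removed": {p: a[p] for p in a.keys() - b.keys()},
        "changed": {p: (a[p], b[p]) for p in a.keys() & b.keys() if a[p] != b[p]},
    }


# Hypothetical outputs for the same input document, before and after a schema change.
old_output = {"vendor": "Acme", "total": "1,200.00", "items": [{"sku": "A1"}]}
new_output = {
    "vendor": "Acme",
    "total": 1200.0,
    "currency": "USD",
    "items": [{"sku": "A1", "qty": 2}],
}

report = diff_outputs(old_output, new_output)
for kind, entries in report.items():
    for path, value in sorted(entries.items()):
        print(kind, path, value)
```

A path-level diff like this survives schema changes better than comparing against a fixed golden answer, since new or renamed fields show up as "added"/"removed" instead of failing the whole record. You still have to read the results closely.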
| ▲ | sbpayne 5 hours ago | |
This is very true! I could have been more careful/precise in how I worded this. I was really trying to just get across that it's in a sense easier than some tasks that can be much more open ended. I'll think about how to word this better, thanks for the feedback! | ||
| ▲ | sethkim 4 hours ago | |
This is extremely true. In fact, from what we see many/most of the problems to be solved with LLMs do not have ground-truth values; even hand-labeled data tends to be mostly subjective. | ||
| ▲ | rco8786 5 hours ago | |
I think they're just saying that data extraction tasks are easy to evaluate because for a given input text/file you can specify the exact structured output you expect from it. | ||
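To make "specify the exact structured output you expect" concrete, here is a minimal sketch of a field-level exact-match check. `score_extraction` and the sample records are hypothetical, and strict equality is an assumption that will not fit every field (the subjectivity point raised upthread still applies).

```python
def score_extraction(expected: dict, actual: dict) -> dict:
    """Score one extraction against a hand-written expected output.

    A field counts as correct only if it is present and exactly equal.
    """
    fields = {
        key: (key in actual and actual[key] == want)
        for key, want in expected.items()
    }
    # Keys the model returned that the expected output does not mention.
    extra = sorted(set(actual) - set(expected))
    accuracy = sum(fields.values()) / len(fields) if fields else 1.0
    return {"fields": fields, "extra": extra, "accuracy": accuracy}


# Hypothetical expected output for one input file, and a model's actual output.
expected = {"invoice_number": "INV-001", "total": 1200.0, "currency": "USD"}
actual = {"invoice_number": "INV-001", "total": 1200.0, "currency": "usd", "notes": "n/a"}

result = score_extraction(expected, actual)
print(result)
```

Note that the `currency` field fails here only because of casing, which is the kind of case where "exact" expected output starts to need normalization or judgment calls.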