▲ | vunderba 3 days ago | |
Hi Isharmla, the giraffe one was a tough call. IMHO, even when correcting for perspective, I do feel like it managed to follow the directive of the prompt and shorten the neck. To answer your question, all of the evaluations are performed manually. On the trickier results I'll occasionally conscript some friends to get a group evaluation. The bottom section of the site has an FAQ that gives more detail, I'll include it here: It's hard to define a discrete rubric for grading at an inherently qualitative level. To keep things simple, this test is purely PASS/FAIL - unsuccessful means that the model NEVER managed to generate an image adhering to the prompt. In many cases, we often attempt a generous interpretation of the prompt - if it gets close enough, we might consider it a pass. To paraphrase former Supreme Court Justice Potter Stewart, "I may not be able to define a passing image, but I know it when I see it." |