| ▲ | albumen 2 hours ago | |
Your evidence seems like anecdata. The graphite.io study does make an effort to quantify the false positive and false negative rates of the three detectors, rather than just saying “they work”. They generate 2000 AI articles and ask the detectors to evaluate them, measuring the false negatives (AI articles falsely identified as human-written); and they use a separate pre-AI dataset (years 2000-2022) to determine false positives (human-written articles falsely identified as AI). | ||
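For anyone who wants to run the same kind of check on their own samples, here is a minimal sketch of how those two rates could be computed. It assumes each detector verdict is reduced to a boolean (True = "flagged as AI"); the function name `error_rates` and the sample verdicts are made up for illustration and are not the study's code or data.

```python
def error_rates(ai_flags, human_flags):
    """Return (false_positive_rate, false_negative_rate).

    ai_flags:    detector verdicts on articles known to be AI-generated
    human_flags: detector verdicts on articles known to be human-written
    A verdict of True means the detector flagged the article as AI.
    """
    # False negative: an AI article the detector called human.
    false_negative_rate = sum(1 for flagged in ai_flags if not flagged) / len(ai_flags)
    # False positive: a human article the detector called AI.
    false_positive_rate = sum(1 for flagged in human_flags if flagged) / len(human_flags)
    return false_positive_rate, false_negative_rate


# Hypothetical verdicts, purely illustrative.
ai_flags = [True] * 9 + [False]          # 1 of 10 AI articles missed
human_flags = [False] * 8 + [True] * 2   # 2 of 10 human articles flagged

fpr, fnr = error_rates(ai_flags, human_flags)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

Keeping the AI and human sets separate, as the study does, means each rate is measured against a set whose true label is known.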
| ▲ | embedding-shape 2 hours ago | parent [-] | |
Yeah, I suppose it is. I haven't finished my dissertation on it yet, I'll get right on that :) Ever since they became available I've tried them every now and then, both with AI-generated trash and my own pre-LLM writing, and had about 0% success in getting them to accurately report which is which. Maybe my writing style and the specific LLM you use matter a lot. I'm sure these platforms' training data is mostly from the mainstream models, so as soon as you use anything else, they get trivially lost. But again, I don't have any evidence or proof behind this; it's based only on my own attempts to evaluate them in the past. | ||