popinman322 · 2 days ago
It doesn't look like the code anonymizes usernames when sending the thread for grading. This likely biases the grades toward past and current prevailing opinions of certain users. It would be interesting to see the whole thing run again, once with usernames randomly re-assigned (to measure the bias) and once with procedurally generated pseudonyms (to see whether the bias can be removed that way). I'd expect de-biasing to deflate the grades of well-known users. It might also be interesting to use a search-grounded model that provides citations for its grading claims; Gemini models expose this via their API, for example.
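The procedurally-generated-pseudonym idea could be sketched roughly like this. This is a hypothetical illustration, not the grader's actual code: `pseudonymize` and the `(username, text)` thread representation are assumptions, and the hash-based tags are just one way to get stable, meaning-free names.

```python
import hashlib

def pseudonymize(comments):
    """Replace usernames with stable, procedurally generated pseudonyms.

    `comments` is a hypothetical list of (username, text) pairs.
    The same user always gets the same pseudonym within a thread,
    so reply structure is preserved while identity cues are stripped.
    (Usernames mentioned inside comment text are not handled here.)
    """
    mapping = {}
    anonymized = []
    for user, text in comments:
        if user not in mapping:
            # Deterministic, meaning-free tag derived from the username.
            tag = hashlib.sha256(user.encode()).hexdigest()[:6]
            mapping[user] = f"user_{tag}"
        anonymized.append((mapping[user], text))
    return anonymized, mapping
```

A run over a toy thread would keep the conversational structure (the same pseudonym recurs for a returning commenter) while hiding who said what.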
ProllyInfamous · 2 days ago
What a human-like criticism of human-like behavior. I [as a human] do the same thing when observing others in IRL and forum interactions. Reputation matters™

A further question is whether a bespoke username could bias the reading of a particular comment. E.g., a username like HatesPython might color the interpretation of that commenter's take on the Python language, even when the comment is actually expressing positivity — the username's irony lost on the AI?
khafra · 2 days ago
You can't anonymize comments from well-known users to an LLM: https://gwern.net/doc/statistics/stylometry/truesight/index