|
| ▲ | aembleton 17 minutes ago | parent | next [-] |
| > Grok will absolutely do the same thing another time you try it. True; it just hasn't happened yet. It will at some point though. With the Sunnypilot example it told me outright that it is not possible on that fork, which I appreciated. The others all seem to hallucinate some setting. |
|
| ▲ | ToucanLoucan 2 hours ago | parent | prev | next [-] |
| It is really, really genuinely concerning how many people think there are profound measurable differences between these things. Like yeah, tonally I guess there are. But with regard to references and information? You’re literally just using three different slot machines and claiming one is hot. I suppose I shouldn’t be that surprised though, since Vegas and every other casino on Earth have been built on duping people in that exact way. |
| |
| ▲ | aembleton 20 minutes ago | parent [-] | | > You’re literally just using three different slot machines and claiming one is hot. It's a fair point. I haven't tested many queries across them all and checked their answers, but if I want to ask one of them a question, right now it's Grok, just because I trust its answers more. | | |
| ▲ | ToucanLoucan 12 minutes ago | parent [-] | | It's not a methodology problem, it's a testability problem. LLMs are not deterministic. You can ask the same question to the same LLM five times and you'll likely get at least three different answers. Again. Slot machine. | | |
| ▲ | Ukv 2 minutes ago | parent [-] | | You can meaningfully test whether one slot machine hits the jackpot more often than another; it's just that the methodology would need a large number of repeats rather than a few anecdotes. There are some LLM leaderboard sites that do it with blind comparisons. |
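To make the "large number of repeats" point concrete, here is a rough sketch of how two nondeterministic systems could be compared on hit rate. It uses a standard two-proportion z-test, which is one common choice and not something anyone in the thread proposed. The function name and the counts in the usage lines are made up for illustration, not real measurements of any model.

```python
import math


def two_proportion_z_test(hits_a, trials_a, hits_b, trials_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value). Hypothetical helper, stdlib only.
    """
    p_a = hits_a / trials_a
    p_b = hits_b / trials_b
    # Pooled success rate under the null hypothesis that both rates are equal.
    pooled = (hits_a + hits_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0:
        raise ValueError("standard error is zero (all hits or all misses)")
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts only: the same 80% vs 60% hit rates at two sample sizes.
_, p_few = two_proportion_z_test(4, 5, 3, 5)
_, p_many = two_proportion_z_test(800, 1000, 600, 1000)
print(f"5 trials each:    p = {p_few:.3f}")
print(f"1000 trials each: p = {p_many:.3g}")
```

With five tries each, the gap is indistinguishable from noise; with a thousand each, the same gap is very unlikely to be chance. The normal approximation is itself shaky at tiny sample sizes, which is one more reason a handful of anecdotes settles nothing.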
|
| ▲ | cyanydeez 2 hours ago | parent | prev [-] |
| humans make poor scientists. most people have already made a decision before they run any tests. the smartest among them just make the tests complicated and biased; the less intelligent just cherry-pick. of course, would you really expect anyone to do real research in this economy? |