Remix clone Hacker News

	▲	plagiarist 2 hours ago
		IMO it should require a third party running the LLM anyway. Otherwise the evaluated company could notice it's receiving the same requests daily and discover the benchmarking that way.
	▲	jabedude 42 minutes ago
		But that removes a component that's critical to the test. As users and benchmark consumers, we care that the service as provided by Anthropic/OpenAI/Google is consistent over time given the same model/prompt/context.