mossTechnician | 7 months ago
We both assumed, so I didn't expect to need to back up my thoughts, but their own website ticks the "for humans" trope checkbox: their "purpose is the responsible development and maintenance of advanced AI for the long-term benefit of humanity." I acknowledge and appreciate Anthropic's addition to the corpus of scraped data, but that data (both input and output) is still ultimately from others; if it did not exist, there would be no product. This is very different from a video editing tool, which I purchase or lease with the understanding that I will provide my own content, or maybe use licensed footage for B-roll.
derefr | 7 months ago | parent
> I acknowledge and appreciate Anthropic's addition to the corpus of scraped data, but that data (both input and output) is still ultimately from others; if it did not exist, there would be no product.

There's a Ship of Theseus thing going on here with the training corpus, though.

Consider the progression of DeepMind's Go-playing models. AlphaGo needed a training corpus of real human games of Go. But AlphaGo Zero, and the generalized AlphaZero after it, never saw any training corpus authored by humans: each started from random play and was trained purely against frozen earlier versions of itself. The training loop only ever reacts to an opponent as a black box, so it doesn't actually matter what that opponent was seeded with, or where its knowledge of Go came from. (A toy sketch of this generational loop is at the end of this comment.)

Another analogy might be to compilers. The first version of a (systems) programming language's compiler must necessarily be written in some other language. But usually, a compiler is then written in the language itself, and the non-self-hosted compiler is used to compile the self-hosted compiler. From then on the compiler is rebuilt with itself, and the host language's fingerprints wash out entirely; that washing-out is exactly the fixed point a multi-stage bootstrap checks for. (Also sketched below.)

Would it be common sense to say that AlphaZero, or the self-hosted compiler, is derived from data "ultimately from others"? IMHO no. Why? I think because, in both cases:

1. the "bootstrap data" is a fungible commodity: many possible datasets (Go games, host languages) are "good enough" to make the bootstrap phase work, with no particular need to be picky; and

2. the particulars of the original "bootstrap data" become irrelevant as soon as the bootstrapping phase is complete, no longer having any impact on further iterations of the product.

———

Now, mind you, I'm not saying that LLMs fit this mental model perfectly. LLMs have a certain structure to their connections that, like AlphaZero's, could be (and at this point likely has been) fully Ship-of-Theseus-ed with a replacement dataset. But LLMs also know specific things (the concrete associations that hang off that structure), and that data does need to come from somewhere; a single company has no hope of ever just "internally sourcing" an Encyclopedia Galactica's worth of knowledge.

My argument is that this dataset can eventually be Ship-of-Theseus-ed as well: not by "internally sourced" data, but rather by ethically sourced data.

Consider one of those AI "character" chatbot websites, but one that doesn't just shove a click-wrap disclaimer in your face saying your responses will be used for training; instead, it advertises that as the premise of the site, in a way that makes people actually interested in giving their "explicit, enthusiastic consent" to participating in model training.

Can't picture that? Imagine the site isn't owned by a company trying to capture the data to build a proprietary model, but rather by a co-op you implicitly join when you agree to participate, where your ownership stake in the resulting model / training dataset is proportionate to your contributed training data, and where you can then earn royalties from any ML companies that want to license the training dataset for use [probably along with many other such licensed training datasets] in training an "ethically-sourced" model on top of their Theseus-ed core. (The royalty math is simple pro-rata arithmetic; also sketched below.)
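———

The promised sketches. First, the generational self-play loop, as a toy Python model; every class, name, and number here is invented for illustration, and none of it is DeepMind's actual setup:

    import random

    class Agent:
        """Toy stand-in for a game-playing model; `skill` is its whole state."""
        def __init__(self, skill):
            self.skill = skill

        def beats(self, opponent):
            # Win probability proportional to relative skill.
            return random.random() < self.skill / (self.skill + opponent.skill)

    def train_one_generation(agent, opponent, games=1000):
        # Cartoon "learning": skill grows with experience against the opponent.
        wins = sum(agent.beats(opponent) for _ in range(games))
        return Agent(agent.skill + 1.0 + wins / games)

    # Generation 0 seed. AlphaGo Zero used nothing but random play here;
    # seeding with a human-trained model instead (skill=10.0, say) changes
    # nothing downstream, because the seed is fungible.
    opponent = Agent(skill=1.0)
    agent = Agent(skill=1.0)

    for generation in range(20):
        agent = train_one_generation(agent, opponent)
        # The opponent is always a frozen earlier self; whatever the seed
        # knew, its particulars never touch the loop again.
        opponent = Agent(agent.skill)

    print(f"skill after 20 generations: {agent.skill:.1f}")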
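The compiler bootstrap has the same fixed-point shape. Another toy model (the dicts stand in for real binaries; an actual multi-stage bootstrap, like GCC's, compares the stage 2 and stage 3 binaries for identity):

    def compile_with(compiler, source):
        # Toy model: an emitted "binary" is determined by the meaning of the
        # source being compiled plus the compiling compiler's codegen style,
        # and that style is fixed by the compiler's own source, not by
        # whatever ancestry produced the compiler binary itself.
        return {"semantics": source, "codegen": compiler["semantics"]}

    host = {"semantics": "host-cc", "codegen": "host-cc"}  # foreign host compiler
    src = "selfhost-cc"  # our language's compiler, written in itself

    stage1 = compile_with(host, src)    # built by the foreign host
    stage2 = compile_with(stage1, src)  # built by ourselves
    stage3 = compile_with(stage2, src)  # rebuilt to check for a fixed point

    assert stage1 != stage2  # the host's fingerprints are still visible here...
    assert stage2 == stage3  # ...and here they are gone for good
    print("fixed point reached: the bootstrap host no longer matters")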
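Finally, the co-op's royalty mechanics are just pro-rata arithmetic. A sketch with invented members, token counts, and fees:

    def royalty_split(contributions, license_fee):
        """Each member's cut of a licensing fee, pro rata by contributed
        training data (crudely measured in tokens here; a real co-op would
        presumably weight quality, recency, and so on)."""
        total = sum(contributions.values())
        return {member: license_fee * tokens / total
                for member, tokens in contributions.items()}

    # Invented members and contribution sizes.
    members = {"alice": 120_000, "bob": 60_000, "carol": 20_000}
    print(royalty_split(members, license_fee=10_000.0))
    # {'alice': 6000.0, 'bob': 3000.0, 'carol': 1000.0}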