rdevilla 7 hours ago

I think it's only a matter of time before people start trying to optimize model performance and token usage by creating their own more technical dialect of English (LLMSpeak or something). It would reduce both ambiguity and token usage through a highly compressed vocabulary, where very precise concepts are packed into single words (monads are just monoids in the category of endofunctors, what's the problem?). Grammatically, expect conventions like the Oxford comma to emerge that reduce ambiguity and cut rounds of back-and-forth clarification with the agent.

The uninitiated can continue trying to clumsily refer to the same concepts, but with 100x the tokens, as they lack the same level of precision in their prompting. Anyone wanting to maximize their LLM productivity will start speaking in this unambiguous, highly information-dense dialect that optimizes their token usage and LLM spend...
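As a sketch of what that compression buys (a toy Python example; the lexicon and the word-count proxy for tokens are both made up for illustration):

```python
# Toy "LLMSpeak" lexicon: each dense term stands for a precise, verbose
# concept. All entries are hypothetical illustrations.
LEXICON = {
    "idempotent": "safe to apply more than once with the same result",
    "memoize": "cache the result of a pure function keyed by its arguments",
}

def compress(prompt: str) -> str:
    """Replace each verbose phrase with its single-word jargon term."""
    for term, phrase in LEXICON.items():
        prompt = prompt.replace(phrase, term)
    return prompt

verbose = (
    "Make the retry handler safe to apply more than once with the same "
    "result, and cache the result of a pure function keyed by its "
    "arguments for the parser."
)
terse = compress(verbose)

# Word count is only a crude stand-in for real tokenizer counts, but the
# direction of the effect is the point.
print(len(verbose.split()), "->", len(terse.split()))
```

Of course this only pays off if the model reliably maps each term to the same concept the author intended.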

grey-area 3 hours ago | parent | next [-]

Have you just reinvented programming languages and reinforced the author's point?

Setting aside the problem of training, why bother prompting if you’re going to specify things so tightly that it resembles code?

mike_hearn 2 hours ago | parent | prev | next [-]

Programming languages admit only unambiguous text. What he's proposing is more like EARS, Gherkin or Planguage.
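For anyone who hasn't seen it, here's the flavor of Gherkin: constrained natural language with a fixed keyword skeleton, readable by humans but structured enough to squeeze out a lot of ambiguity (scenario invented for illustration):

```gherkin
Feature: Password reset
  Scenario: Expired reset link is rejected
    Given a user requested a password reset more than 24 hours ago
    When the user follows the reset link
    Then the reset form is not shown
    And the user is told the link has expired
```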

rdevilla 2 hours ago | parent [-]

Not necessarily. I was intending it as a thought experiment illustrating why some kind of formal language (whether that means technical jargon, unambiguous syntax, unambiguous semantics, conlangs, specification languages, or some combination thereof) will eventually arise from natural language, as it has countless times in the past, within mathematics (as referenced in TFA) and elsewhere. Gherkin is kind of nice though.


majormajor 6 hours ago | parent | prev | next [-]

Unless you're training your own model, wouldn't you have to send this dialect in your context all the time? Since the model is trained on all the human language text of the internet, not on your specialized one? At which point you need to use human language to define it anyway? So perhaps you could express certain things with less ambiguity once you define that, but it seems like your token usage will have to carry around that spec.
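A back-of-the-envelope illustration of that point (Python, with whitespace word count as a crude stand-in for tokens; the glossary and requests are invented): carrying the dialect's definitions in every prompt can cost more than it saves.

```python
def words(text: str) -> int:
    """Crude token proxy; real tokenizers differ, but the trend holds."""
    return len(text.split())

# Hypothetical glossary that would have to ride along in context, since
# the base model was never trained on the private dialect.
glossary = (
    "DEDUPE: collapse duplicate records by primary key, keeping the newest. "
    "CATCHUP: replay events missed since the last successful sync."
)

plain_request = (
    "Collapse duplicate records by primary key, keeping the newest, then "
    "replay events missed since the last successful sync."
)
dialect_request = "DEDUPE then CATCHUP."

per_call_plain = words(plain_request)
per_call_dialect = words(glossary) + words(dialect_request)

# The dialect only breaks even once the glossary is amortized across many
# uses of the same terms in one context (or cached by the provider).
print(per_call_plain, per_call_dialect)
```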

nomel 7 hours ago | parent | prev | next [-]

Let's use an unambiguous language for this. May I suggest Lojban [1][2]?

[1] https://en.wikipedia.org/wiki/Lojban

[2] Someone speaking it: https://www.youtube.com/watch?v=lxQjwbUiM9w

mike_hearn 2 hours ago | parent | next [-]

Lojban allows you to speak ambiguously; it just disallows grammatical ambiguity, because in the 70s it was hypothesized that machine understanding of natural language was impossible, so humans would have to adapt instead of computers. That debate is over; understanding grammar is solved. The new debate is over semantic ambiguity.

dooglius 6 hours ago | parent | prev | next [-]

It looks like that's about syntactic ambiguity, whereas the parent is talking about semantic ambiguity.

kstenerud 4 hours ago | parent | prev | next [-]

Human language is already very efficient for conveying the ideas we have. Some languages are more efficient at conveying certain concepts, but all are able to handle the 90% case. I'd expect any attempts to build a "technical dialect of English" to go about as well as Esperanto.

nextaccountic 3 hours ago | parent [-]

We already speak in a "technical dialect of English". All we need is some jargon to talk about technical things. (Lawyers have their own jargon too, also chemists, etc)

Some languages don't have this kind of vocabulary, because there aren't enough speakers that deal with technical things in a given area (and those that do, use another language to communicate)

steve_adams_86 5 hours ago | parent | prev | next [-]

The thing is, doesn't the LLM need to be trained on this dialect? And if the training material we have is mostly ambiguous, how do we disambiguate it for the purpose of training?

In my mind this is solving different problems. We want it to parse out our intent from ambiguous semantics because that's how humans actually think and speak. The ones who think they don't are simply unaware of themselves.

If we create this terse and unambiguous language for LLMs, it seems likely to me that they would benefit most from using it with each other, not with humans. Further, they already kind of do this with programming languages which are, more or less, terse and unambiguous expression engines for working with computers. How would we meaningfully improve on this, with enough training data to do so?

I'm asking sincerely and not rhetorically because I'm under no illusion that I understand this or know any better.

manmal 6 hours ago | parent | prev | next [-]

Codex already has such a language. The specs it's been writing for me are full of "dedupe", "catch-up", and I often need to give feedback that it should use more verbose language. Some of that has been creeping into my lingo already. A colleague of mine suddenly says the word "today" all the time, and I suspect that's because he uses Claude a lot. Today, as in, the current state of the code.

vrighter 4 hours ago | parent | prev | next [-]

and then someone will come along and say "wouldn't it be nice if this highly specific dialect was standardized?" goto 1

anonzzzies 6 hours ago | parent | prev | next [-]

It was mentioned somewhere else on hn today, but why do I care about token usage? I prompt AI day and night for coding and other stuff via claude code max 200 and mistral; haven't had issues for many months now.

sda2 5 hours ago | parent [-]

It's a measure of efficiency. One might not care about tokens until vendors jack up the price and running your own comparable model is infeasible.

otabdeveloper4 6 hours ago | parent | prev | next [-]

> optimizes their token usage and LLM spend

Context pollution is a bigger problem.

E.g., those SKILL.md files that are tens of kilobytes long, as if being exhaustively verbose and rambling will somehow make the LLM smarter. (It won't, it will just dilute the context with irrelevant stuff.)

est 6 hours ago | parent | prev | next [-]

> by creating their own more technical dialect of English

Ah, the Lisp curse. Here we go again.

Coincidentally, the 80s AI bubble crashed partly because Lisp dialects aren't interchangeable.

Dylan16807 6 hours ago | parent | next [-]

Lisp doesn't get to claim all bad accidental programming languages are simply failing to be it, I don't care how cute that one quote is.

reverius42 6 hours ago | parent | prev [-]

I bet a modern LLM could interchange them pretty easily.

est 6 hours ago | parent [-]

trained on public data, yes.

But some random in-house DSL? Doubt it.

noosphr 5 hours ago | parent | prev [-]

Or they could look at the past few centuries of language theory and start crafting better tokenizers with inductive biases.

We literally have proof that an iron age ontology of meaning as represented in Chinese characters is 40% more efficient than naive statistical analysis over a semi-phonetic language and we still are acting like more compute will solve all our problems.

retsibsi 4 hours ago | parent | next [-]

> We literally have proof that an iron age ontology of meaning as represented in Chinese characters is 40% more efficient than naive statistical analysis over a semi-phonetic language

Can you elaborate? I think you're talking about https://github.com/PastaPastaPasta/llm-chinese-english , but I read those findings as far more nuanced and ambiguous than what you seem to be claiming here.

umanwizard 4 hours ago | parent | prev [-]

> We literally have proof that an iron age ontology of meaning as represented in Chinese characters is 40% more efficient than naive statistical analysis over a semi-phonetic language and we still are acting like more compute will solve all our problems.

Post a link because until you do, I’m almost certain this is pseudoscientific crankery.

Chinese characters are not an “iron age ontology of meaning” nor anything close to that.

Also please cite the specific results in centuries-old “language theory” that you’re referring to. Did Saussure have something to say about LLMs? Or someone even older?