noosphr 11 days ago

The Chinese alphabet is very much a dictionary. All the major tokenizers are far larger.

dpark 11 days ago | parent | next [-]

That doesn’t make any sense. An alphabet is a list of valid characters. A dictionary is not just a list. Even in a language like Chinese where individual characters carry meaning, a dictionary tells you what that meaning is. It’s not just a list of characters.

Or to echo the article, the dictionary is made out of weights.

simonh 11 days ago | parent | prev | next [-]

A list of words isn’t a dictionary. What a dictionary adds over a list of words is all the relationships between the words needed to interpret them and use them, and all of that is in the weights.

JdeBP 11 days ago | parent [-]

We should tell the Unix people that they've been giving /usr/share/dict the wrong name for over three decades. (-:

yencabulator 11 days ago | parent [-]

I mean, they did, and we have, and we've also stopped doing that.

https://en.wikipedia.org/wiki/Words_(Unix)

JdeBP 11 days ago | parent [-]

We should start telling them again, then. (-:

In the current versions of FreeBSD, NetBSD, DragonFlyBSD, Illumos, and Debian, it is still /usr/share/dict .

* https://cgit.freebsd.org/src/tree/share/dict/

* https://cvsweb.netbsd.org/bsdweb.cgi/src/share/dict/

* https://gitweb.dragonflybsd.org/?p=dragonfly.git;a=tree;f=sh...

* https://cvsweb.openbsd.org/src/share/dict

* https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch04s11.htm...

* https://packages.debian.org/sid/all/wbritish/filelist

Amusingly for https://en.wikipedia.org/wiki/Special:Diff/325776830 , the last place to use /usr/dict (Debian, which changed it in 1998; Berkeley having changed it in Net/2 in 1991) stopped doing so years before Wikipedia was invented.

simonh 8 days ago | parent [-]

Sure, but the fact that people are doing something isn't evidence that it isn't a mistake. Also they may be stuck due to concerns about backwards compatibility. There may be games and utilities they are shipping, that come from upstreams, that rely on these files.

canjobear 11 days ago | parent | prev | next [-]

A mapping of Chinese characters to integers (like a tokenizer) would not be a dictionary. You’d also need definitions. At best it’s an index to a hypothetical dictionary.
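
To make the distinction concrete, here is a minimal Python sketch. The characters, ids, and glosses are made-up toy data, and `vocab`, `definitions`, `encode`, and `define` are hypothetical names, not any real tokenizer's API.

```python
# Toy data for illustration only; not taken from any real tokenizer or dictionary.

# Character -> integer id. This is all a tokenizer-style vocabulary holds.
vocab = {"山": 0, "水": 1, "火": 2}

# Id -> meaning. Only with this second table does the index point at a dictionary.
definitions = {0: "mountain", 1: "water", 2: "fire"}


def encode(text):
    """Map each character to its integer id; the ids carry no meaning by themselves."""
    return [vocab[ch] for ch in text]


def define(text):
    """Look up a gloss for each character by going through the id index."""
    return [definitions[i] for i in encode(text)]


ids = encode("山水")
meanings = define("山水")
```

With only `vocab` you can turn text into numbers, but you need `definitions` (or something playing that role) before you can say what anything means.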

maxbond 11 days ago | parent | prev [-]

It's beside the point and so I only note it out of interest, but the Chinese writing system doesn't use an alphabet (or a syllabary like Japanese kana); it's a logography.