dpark 13 days ago

A tokenizer is not a dictionary any more than an alphabet is a dictionary.

noosphr 13 days ago | parent [-]

The Chinese alphabet is very much a dictionary. All the major tokenizers are far larger.

dpark 13 days ago | parent | next [-]

That doesn’t make any sense. An alphabet is a list of valid characters. A dictionary is not just a list. Even in a language like Chinese, where individual characters carry meaning, a dictionary tells you what that meaning is. It’s not just a list of characters.

Or, to echo the article, the dictionary is made out of weights.

simonh 13 days ago | parent | prev | next [-]

A list of words isn’t a dictionary. What a dictionary adds over a list of words is all the relationships between the words needed to interpret them and use them, and all of that is in the weights.

JdeBP 13 days ago | parent [-]

We should tell the Unix people that they've been giving /usr/share/dict the wrong name for over three decades. (-:

yencabulator 13 days ago | parent [-]

I mean, they did, and we have, and we've also stopped doing that.

https://en.wikipedia.org/wiki/Words_(Unix)

JdeBP 13 days ago | parent [-]

We should start telling them again, then. (-:

In the current versions of FreeBSD, NetBSD, DragonFlyBSD, Illumos, and Debian, it is still /usr/share/dict .

* https://cgit.freebsd.org/src/tree/share/dict/

* https://cvsweb.netbsd.org/bsdweb.cgi/src/share/dict/

* https://gitweb.dragonflybsd.org/?p=dragonfly.git;a=tree;f=sh...

* https://cvsweb.openbsd.org/src/share/dict

* https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch04s11.htm...

* https://packages.debian.org/sid/all/wbritish/filelist

Amusingly for https://en.wikipedia.org/wiki/Special:Diff/325776830 , the last place to use /usr/dict (Debian, which changed it in 1998; Berkeley having changed it in Net/2 in 1991) stopped doing so years before Wikipedia was invented.

simonh 10 days ago | parent [-]

Sure, but the fact that people are doing something isn't evidence that it isn't a mistake. Also, they may be stuck with the name for backwards-compatibility reasons: games and utilities that they ship from upstream projects may rely on these files.

canjobear 13 days ago | parent | prev | next [-]

A mapping of Chinese characters to integers (like a tokenizer) would not be a dictionary. You’d also need definitions. At best it’s an index to a hypothetical dictionary.
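
To make the distinction concrete, here is a minimal sketch in Python. The three-character vocabulary, the IDs, and the glosses are all made up for illustration and are not taken from any real tokenizer; the point is only that the first mapping alone carries no meanings.

```python
# Hypothetical toy vocabulary: the characters and IDs are illustrative only.
vocab = {"山": 0, "水": 1, "火": 2}

# Reverse mapping, so integer IDs can be turned back into characters.
id_to_char = {i: ch for ch, i in vocab.items()}


def encode(text):
    """Tokenizer-style step: characters in, integers out. No meaning involved."""
    return [vocab[ch] for ch in text]


def decode(ids):
    """Inverse step: integers in, characters out."""
    return "".join(id_to_char[i] for i in ids)


# What a dictionary adds: a definition attached to each entry.
# These glosses are the extra information the index alone does not have.
definitions = {0: "mountain", 1: "water", 2: "fire"}

ids = encode("山水")
glosses = [definitions[i] for i in ids]
```

Here `vocab` on its own is just the index; only once `definitions` is added does a lookup tell you anything about what an entry means.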

maxbond 13 days ago | parent | prev [-]

It's beside the point, so I only note it out of interest, but the Chinese writing system doesn't use an alphabet (or a syllabary like Japanese kana); it's a logography.