| ▲ | noosphr 11 days ago |
| The Chinese alphabet is very much a dictionary. All the major tokenizers are far larger. |
|
| ▲ | dpark 11 days ago | parent | next [-] |
| That doesn’t make any sense. An alphabet is a list of valid characters. A dictionary is not just a list. Even in a language like Chinese, where individual characters carry meaning, a dictionary tells you what that meaning is. It’s not just a list of characters. Or, to echo the article, the dictionary is made out of weights. |
|
| ▲ | simonh 11 days ago | parent | prev | next [-] |
| A list of words isn’t a dictionary. What a dictionary adds over a list of words is all the relationships between the words needed to interpret them and use them, and all of that is in the weights. |
|
| ▲ | JdeBP 11 days ago | parent [-] |
| We should tell the Unix people that they've been giving /usr/share/dict the wrong name for over three decades. (-: |
|
| ▲ | yencabulator 11 days ago | parent [-] |
| I mean, they did, and we have, and we've also stopped doing that. https://en.wikipedia.org/wiki/Words_(Unix) |
|
| ▲ | JdeBP 11 days ago | parent [-] |
| We should start telling them again, then. (-: In the current versions of FreeBSD, NetBSD, DragonFlyBSD, Illumos, and Debian, it is still /usr/share/dict. |
| * https://cgit.freebsd.org/src/tree/share/dict/ |
| * https://cvsweb.netbsd.org/bsdweb.cgi/src/share/dict/ |
| * https://gitweb.dragonflybsd.org/?p=dragonfly.git;a=tree;f=sh... |
| * https://cvsweb.openbsd.org/src/share/dict |
| * https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch04s11.htm... |
| * https://packages.debian.org/sid/all/wbritish/filelist |
| Amusingly for https://en.wikipedia.org/wiki/Special:Diff/325776830, the last place to use /usr/dict (Debian, which changed it in 1998; Berkeley having changed it in Net/2 in 1991) stopped doing so years before Wikipedia was invented. |
|
| ▲ | simonh 8 days ago | parent [-] |
| Sure, but the fact that people are doing something isn't evidence that it isn't a mistake. Also, they may be stuck due to concerns about backwards compatibility. There may be games and utilities they are shipping, that come from upstreams, that rely on these files. |
|
| ▲ | canjobear 11 days ago | parent | prev | next [-] |
| A mapping of Chinese characters to integers (like a tokenizer) would not be a dictionary. You’d also need definitions. At best it’s an index to a hypothetical dictionary. |
|
| ▲ | maxbond 11 days ago | parent | prev [-] |
| It's beside the point, so I only note it out of interest, but the Chinese writing system doesn't use an alphabet (or a syllabary like Japanese kana); it's a logography. |