staplung 5 days ago

Cool. One thing I found odd was that on export there are two listed formats, "ASCII" and "ASCII extended", but as far as I can tell, the ASCII version is actually outputting UTF-8. It's hard to tell for sure, though, because the output is just text that you cut and paste, so it's difficult to know what conversions the browser or OS might be doing behind the scenes. But when I paste it into a text editor on my Mac, it's definitely UTF-8, not ASCII encoded.

Which is probably more useful anyway given that if it really outputted ASCII encoded line drawing characters, you'd end up with gibberish on a system that assumed UTF-8 encoding.

numpad0 5 days ago | parent | next [-]

20 20 78 78 78 ... Looks ASCII to me, Firefox on Windows. Could be OS.
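
For reference, those bytes can be checked in a few lines of Python. This is only a sketch using the hex values quoted above; it says nothing about what the app or the clipboard actually does.

```python
# The bytes quoted above: two spaces followed by three "x" characters.
raw = bytes.fromhex("2020787878")

as_ascii = raw.decode("ascii")
as_utf8 = raw.decode("utf-8")

# Every byte is below 0x80, so both decoders produce the same text.
print(repr(as_ascii), as_ascii == as_utf8)
```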

16bytes 5 days ago | parent | next [-]

" xxx"? That's the same in ASCII and UTF-8.

OP is asking how the line-drawing characters, e.g. "┌" and "┐", are encoded.

Since the charset returned by the app is UTF-8, these will be interpreted and encoded as UTF-8 and not whatever "ASCII - Extended" means.
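
A small Python sketch of the difference, assuming "ASCII - Extended" means something like code page 437 (an assumption for illustration; the app doesn't say which 8-bit encoding it has in mind):

```python
corners = "┌┐"

utf8_bytes = corners.encode("utf-8")    # three bytes per character
cp437_bytes = corners.encode("cp437")   # one byte per character (assumed code page)

try:
    corners.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    # Plain 7-bit ASCII has no box-drawing characters at all.
    fits_in_ascii = False

print(utf8_bytes.hex(" "))
print(cp437_bytes.hex(" "))
print(fits_in_ascii)
```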

numpad0 5 days ago | parent [-]

that would be completely correct... sorry. the export options now read "ASCII Basic" and "ASCII Extended", and "Basic" generates plus signs for corners, as of now. I feel like the behavior might have changed. The Extended option seems to use the 0xE294xx byte range for lines, i.e. three-byte UTF-8 sequences.

1: https://gist.github.com/numpad0/7880ad1e3ed32b91d1ccf9c3374f...

craftkiller 5 days ago | parent | prev [-]

Firefox on Linux: I just copied and pasted into emacs and did a M-x describe-char and got 0xE29480 which is definitely not ASCII: https://www.compart.com/en/unicode/U+2500

Also confirmed with hexl-mode (hex editor in emacs)
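
The same check can be done outside emacs. A minimal Python sketch, using only the three bytes reported above:

```python
import unicodedata

# The three bytes reported by describe-char.
raw = bytes.fromhex("E29480")

# Decoding them as UTF-8 yields a single code point.
char = raw.decode("utf-8")

print(f"U+{ord(char):04X}", unicodedata.name(char))
```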

ilovetux 5 days ago | parent | prev [-]

Disclaimer, I just pulled this quote from Google ai which probably took it from somewhere else, but I just wanted to provide a little context. ASCII encoded text is also valid utf8.

> The first 128 characters of Unicode, which are the same as the ASCII character set (characters 0-127), are encoded in UTF-8 using a single byte with the exact same binary value as their ASCII representation. This means that any file containing only ASCII characters is also a valid UTF-8 file

staplung 5 days ago | parent | next [-]

Yes, but the box drawing characters in "ASCII" are all above 127 so they don't encode the same way. So that last AI generated sentence is basically false (or really misleading): ASCII files that consist only of characters in the range 0-127 will also be valid UTF-8. But "ASCII" files that use byte values above 127 will generally not be valid UTF-8.

Now, technically, ASCII only defines the lower 128 characters (0-127). There's no single standard definition as to what the upper half of the byte space represents in ASCII itself, so technically it's true that all valid ASCII files are valid UTF-8. By the same logic, however, the box drawing characters are not ASCII. They're actually part of something called code page 437, which maps those bit patterns to box drawing characters. With other code pages they map to something else, often non-Latin characters or ones with accents.
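
To make that concrete, here is a rough Python sketch. The three sample bytes are a hypothetical example (the top edge of a box in code page 437), not output captured from the app:

```python
# Hypothetical sample: three bytes, all above 127.
raw = bytes([0xDA, 0xC4, 0xBF])

as_cp437 = raw.decode("cp437")      # box drawing characters
as_latin1 = raw.decode("latin-1")   # accented letters and punctuation

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # 0xDA announces a two-byte sequence, but 0xC4 is not a continuation byte.
    valid_utf8 = False

print(as_cp437, as_latin1, valid_utf8)
```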

So, the name ASCII flow is misleading and the output options are too. ;-) Basically, if the high bit of a byte is set in UTF-8, that byte is part of a multi-byte sequence that encodes a single code point.
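
As a sketch of how the high bits work, the lead byte's bit pattern tells you how long the sequence is. The helper name `sequence_length` is made up for illustration, and it skips the stricter checks a real decoder does (overlong forms, bytes like 0xC0 or 0xF5 and up):

```python
def sequence_length(lead: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte."""
    if lead < 0x80:            # 0xxxxxxx: single byte, same as ASCII
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead.
    raise ValueError(f"0x{lead:02X} is not a lead byte")


# "x" is one byte; the box drawing characters start with 0xE2.
print(sequence_length(0x78), sequence_length(0xE2))
```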

ilovetux 5 days ago | parent [-]

Granted, all of that is true, but GP specifically differentiated between ASCII and ASCII Extended, then GP went on to say that after choosing the ASCII option and pasting the text in a text editor on Mac it was reported as UTF-8, which I was pointing out would be true because if the ASCII option is chosen as opposed to the ASCII Extended option then what he ends up with (ASCII) is valid UTF-8 as reported by the text editor.

em3rgent0rdr 5 days ago | parent | prev [-]

Indeed, UTF-8 "was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that a UTF-8-encoded file using only those characters is identical to an ASCII file."

https://en.wikipedia.org/wiki/UTF-8