koakuma-chan 4 days ago

> What if the file name is not valid UTF-8

Nothing? Neither Go nor the OS require file names to be UTF-8, I believe

zimpenfish 4 days ago | parent | next [-]

> Nothing?

It breaks. Which is weird because you can create a string which isn't valid UTF-8 (eg "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98") and print it out with no trouble; you just can't pass it to e.g. `os.Create` or `os.Open`.
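To make the "string which isn't valid UTF-8" part concrete, here is a minimal sketch using only the standard library. The `describe` helper is a name made up for illustration; it only inspects the string and says nothing about what a given kernel or filesystem will do with it.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// name is the byte sequence quoted in the comment above; it is not valid UTF-8.
const name = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// describe (hypothetical helper) reports the byte length of s and
// whether s is valid UTF-8.
func describe(s string) (int, bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	n, ok := describe(name)
	fmt.Printf("%d bytes, valid UTF-8: %v\n", n, ok)
	// %q escapes the bytes that do not decode as UTF-8 instead of failing.
	fmt.Printf("%q\n", name)
}
```

Go holds and prints the string without complaint; whether `os.Create` then succeeds is decided by the OS.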

(Bash and a variety of other utils will also complain about it not being valid UTF-8; neovim won't save a file under that name; etc.)

yencabulator 3 days ago | parent | next [-]

That sounds like your kernel refusing to create that file, nothing to do with Go.

  $ cat main.go
  package main

  import (
   "log"
   "os"
  )

  func main() {
   f, err := os.Create("\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98")
   if err != nil {
    log.Fatalf("create: %v", err)
   }
   _ = f
  }
  $ go run .
  $ ls -1
  ''$'\275\262''='$'\274'' ⌘'
  go.mod
  main.go

kragen 3 days ago | parent | next [-]

I've posted a longer explanation in https://news.ycombinator.com/item?id=44991638. I'm interested to hear which kernel and which filesystem zimpenfish is using that has this problem.

yencabulator 2 days ago | parent [-]

I believe macOS forces UTF-8 filenames and normalizes them to something near-but-not-quite Unicode NFD.

Windows doing something similar wouldn't surprise me at all. I believe NTFS internally stores filenames as UTF-16, so enforcing UTF-8 at the API boundary sounds likely.

kragen 2 days ago | parent [-]

That sounds right. Fortunately, it's not my problem that they're using a buggy piece of shit for an OS.

commandersaki 3 days ago | parent | prev | next [-]

I'm confused. Is Go restricted to UTF-8-only filenames? It can read and write arbitrary byte sequences (which is what a string can hold), and that should be sufficient for dealing with other encodings.

yencabulator 3 days ago | parent [-]

Go is not restricted: strings are UTF-8 only by convention and can hold arbitrary bytes.

commandersaki 3 days ago | parent [-]

Then I'm having a hard time understanding the issue in the post; it seems pretty vague. Is there any idea what specific issue is happening? Is it how they've used Go, or does Go have an inherent implementation issue? Specifically these lines:

> If you stuff random binary data into a string, Go just steams along, as described in this post.

> Over the decades I have lost data to tools skipping non-UTF-8 filenames. I should not be blamed for having files that were named before UTF-8 existed.

yencabulator 3 days ago | parent | next [-]

Let me translate: "I have decided to not like something so now I associate miscellaneous previous negative experiences with it"

kragen 3 days ago | parent | prev | next [-]

The post is wrong on this point, although it's mostly correct otherwise. Just steaming along when you have random binary data in a string, as Golang does, is how you avoid losing data to tools that skip non-UTF-8 filenames, or crash on them.
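A rough sketch of the difference, with made-up helper names (`keepAll`, `skipInvalid`) standing in for two styles of tool: one that treats names as opaque bytes, and one that insists on valid UTF-8 and silently drops the rest.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// keepAll (hypothetical) treats names as opaque bytes and passes every
// one through unchanged, which is the "steaming along" behavior.
func keepAll(names []string) []string {
	return append([]string(nil), names...)
}

// skipInvalid (hypothetical) drops any name that is not valid UTF-8,
// which is how files end up missing from a backup or a copy.
func skipInvalid(names []string) []string {
	var out []string
	for _, n := range names {
		if utf8.ValidString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{"notes.txt", "caf\xe9.txt"} // second name is Latin-1, not UTF-8
	fmt.Printf("%q\n", keepAll(names))
	fmt.Printf("%q\n", skipInvalid(names))
}
```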

comex 3 days ago | parent | prev [-]

Yeah, the complaint is pretty bizarre, or at least unclear.

zimpenfish 3 days ago | parent | prev [-]

> That sounds like your kernel refusing to create that file

Yes, that was my assumption when bash et al also had problems with it.

kragen 3 days ago | parent | prev [-]

It sounds like you found a bug in your filesystem, not in Golang's API, because you totally can pass that string to those functions and open the file successfully.

johncolanduoni 4 days ago | parent | prev | next [-]

Well, Windows is an odd beast when 8-bit file names are used. If done naively, you can’t express all valid filenames with even broken UTF-8 and non-valid-Unicode filenames cannot be encoded to UTF-8 without loss or some weird convention.

You can do something like WTF-8 (not a misspelling, alas) to make it bidirectional. Rust does this under the hood but doesn’t expose the internal representation.

jstimpfle 4 days ago | parent | next [-]

What do you mean by "when 8-bit filenames are used"? Do you mean the -A APIs, like CreateFileA()? Those do not take UTF-8, mind you -- unless you are using a relatively recent version of Windows that allows you to run your process with a UTF-8 codepage.

In general, Windows filenames are Unicode and you can always express those filenames by using the -W APIs (like CreateFileW()).

af78 4 days ago | parent | next [-]

I think it depends on the underlying filesystem. Unicode (UTF-16) is first-class on NTFS. But Windows still supports FAT, I guess, where multiple 8-bit encodings are possible: the so-called "OEM" code pages (437, 850 etc.) or "ANSI" code pages (1250, 1251 etc.). I haven't checked how recent Windows versions cope with FAT file names that cannot be represented as Unicode.

johncolanduoni 3 days ago | parent | prev [-]

Windows filenames in the W APIs are 16-bit (which the A APIs essentially wrap with conversions to the active old-school codepage), and are normally well formed UTF-16. But they aren’t required to be - NTFS itself only cares about 0x0000 and 0x005C (backslash) I believe, and all layers of the stack accept invalid UTF-16 surrogates. Don’t get me started on the normal Win32 path processing (Unicode normalization, “COM” is still a special file, etc.), some of which can be bypassed with the “\\?\” prefix when in NTFS.

The upshot is that since the values aren’t always UTF-16, there’s no canonical way to convert them to single byte strings such that valid UTF-16 gets turned into valid UTF-8 but the rest can still be roundtripped. That’s what bastardized encodings like WTF-8 solve. The Rust Path API is the best take on this I’ve seen that doesn’t choke on bad Unicode.
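For the curious, a minimal sketch of the idea in Go. `encodeWTF8` is a hypothetical helper written for this comment, not Rust's implementation and not a complete WTF-8 library: well-formed input comes out as ordinary UTF-8, and each unpaired surrogate is written as the three-byte sequence UTF-8 would use for that code point if surrogates were allowed.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodeWTF8 (hypothetical) converts 16-bit code units to bytes without
// losing unpaired surrogates.
func encodeWTF8(units []uint16) []byte {
	var out []byte
	for i := 0; i < len(units); i++ {
		r := rune(units[i])
		if !utf16.IsSurrogate(r) {
			out = utf8.AppendRune(out, r)
			continue
		}
		// A high surrogate followed by a low surrogate is a valid pair;
		// DecodeRune returns U+FFFD for anything else.
		if i+1 < len(units) {
			if c := utf16.DecodeRune(r, rune(units[i+1])); c != utf8.RuneError {
				out = utf8.AppendRune(out, c)
				i++ // consumed the low surrogate as well
				continue
			}
		}
		// Unpaired surrogate: emit the code unit itself in three bytes.
		out = append(out,
			byte(0xE0|(r>>12)),
			byte(0x80|((r>>6)&0x3F)),
			byte(0x80|(r&0x3F)))
	}
	return out
}

func main() {
	units := []uint16{'A', 0xD800} // 'A' followed by a lone high surrogate
	fmt.Printf("% x\n", encodeWTF8(units))
	// The standard decoder is lossy here: the lone surrogate becomes U+FFFD.
	fmt.Printf("%q\n", string(utf16.Decode(units)))
}
```

The point of the design is that the mapping stays reversible: a decoder can turn the three-byte form back into the original code unit, which a plain UTF-16 to UTF-8 conversion cannot do.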

andyferris 4 days ago | parent | prev [-]

I believe the same is true on linux, which only cares about 0x2f bytes (i.e. /)

johncolanduoni 3 days ago | parent | next [-]

Windows paths are not necessarily well-formed UTF-16 (UCS-2 by some people’s definition) down to the filesystem level. If they were always well formed, you could convert to a single byte representation by straightforward Unicode re-encoding. But since they aren’t, choices need to be made about what to do with malformed UTF-16 if you want to round trip them to single byte strings such that they match UTF-8 encoding if they are well formed.

In Linux, they’re 8-bit almost-arbitrary strings like you noted, and usually UTF-8. So they always have a convenient 8-bit encoding (i.e. leave them alone). If you hated yourself and wanted to convert them to UTF-16, however, you’d have the same problem Windows does but in reverse.
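A quick sketch of that reverse problem in Go, assuming a naive converter (the `roundTrip` helper is made up for illustration): each byte that isn't valid UTF-8 turns into U+FFFD on the way to UTF-16, so the original name can't be recovered.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// roundTrip (hypothetical) converts a byte-string name to UTF-16 code
// units and back, the way a naive converter would.
func roundTrip(name string) string {
	units := utf16.Encode([]rune(name)) // invalid bytes become U+FFFD here
	return string(utf16.Decode(units))
}

func main() {
	name := "\xbd\xb2=\xbc"
	back := roundTrip(name)
	fmt.Printf("%q -> %q, lossless: %v\n", name, back, back == name)
}
```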

orthoxerox 3 days ago | parent | prev | next [-]

And 0x00, if I remember correctly.

matt_kantor 3 days ago | parent | prev [-]

And 0x00.
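Put together, the rule described in this subthread can be sketched as a tiny check. `validComponent` is a hypothetical helper for a single path component; it deliberately ignores length limits and the reserved names "." and "..", and it does not model any particular filesystem's extra restrictions.

```go
package main

import (
	"fmt"
	"strings"
)

// validComponent (hypothetical) reports whether name could be a single
// Linux path component under the rule above: non-empty, no '/' (0x2f),
// and no NUL (0x00). Any other byte is allowed, UTF-8 or not.
func validComponent(name string) bool {
	return name != "" &&
		strings.IndexByte(name, '/') < 0 &&
		strings.IndexByte(name, 0) < 0
}

func main() {
	for _, n := range []string{"\xbd\xb2=\xbc \xe2\x8c\x98", "a/b", "a\x00b"} {
		fmt.Printf("%q: %v\n", n, validComponent(n))
	}
}
```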
