eru 7 days ago

Python 3 deals with this reasonably sensibly, too, I think. They use UTF-8 by default, but allow you to specify other encodings.

ynik 7 days ago | parent | next [-]

Python 3 internally uses UTF-32. When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8. "UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).
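You can inspect what your own interpreter picked (a quick sketch; the output depends on your OS and locale settings):

```python
import locale
import sys

# The "default encoding" Python derives from the environment; on most
# modern Linux/macOS systems this reports a UTF-8 variant.
print(locale.getpreferredencoding(False))

# PEP 540's UTF-8 mode: 1 if enabled (e.g. via `python -X utf8` or
# PYTHONUTF8=1), 0 otherwise.
print(sys.flags.utf8_mode)
```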

arcticbull 7 days ago | parent [-]

Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer.

It uses Latin-1 for ASCII strings, UCS-2 for strings that contain code points in the BMP and UCS-4 only for strings that contain code points outside the BMP.

It would be pretty silly for them to explode all strings to 4-byte characters.
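The three widths are observable from `sys.getsizeof` (a sketch; the exact object-header overheads vary by CPython version, but subtracting two sizes cancels the header and leaves the per-character cost of PEP 393's flexible representation):

```python
import sys

def per_char_bytes(ch):
    # The fixed object header cancels out in the subtraction, leaving
    # only the per-character storage cost.
    return (sys.getsizeof(ch * 1000) - sys.getsizeof(ch * 500)) // 500

print(per_char_bytes("a"))           # Latin-1 range: 1 byte per char
print(per_char_bytes("\u0100"))      # BMP, above Latin-1: 2 bytes
print(per_char_bytes("\U0001f600"))  # outside the BMP: 4 bytes
```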

jibal 7 days ago | parent | next [-]

You are correct. Discussions of this topic tend to be full of unvalidated but confidently stated assertions, like "Python 3 internally uses UTF-32." Also unjustified assertions, like the OP's claim that len(" ") == 5 is "rather useless" and that "Python 3’s approach is unambiguously the worst one". Unlike in many other languages, the code points in Python's strings are always directly O(1) indexable--which can be useful--and the subject string has 5 indexable code points. That may not be the semantics that someone is looking for in a particular application, but it certainly isn't useless. And given the Python implementation of strings, the only other number that would be useful would be the number of grapheme clusters, which in this case is 1, and that count can be obtained via the grapheme or regex modules.
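The five-code-point, one-grapheme situation can be reproduced with any ZWJ emoji sequence (a sketch using a three-person family; the OP's exact emoji is not preserved above):

```python
# A family emoji built from code points joined by U+200D (ZERO WIDTH
# JOINER): one grapheme cluster on screen, five indexable code points.
family = "\U0001f468\u200d\U0001f469\u200d\U0001f467"  # man ZWJ woman ZWJ girl

print(len(family))           # 5 code points
print(hex(ord(family[1])))   # O(1) indexing; this one is the ZWJ itself
# Counting grapheme clusters requires the third-party `grapheme` or
# `regex` modules; the stdlib does not provide it.
```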

account42 7 days ago | parent | prev [-]

It conceptually uses arrays of code points, which need up to 24 bits. Optimizing the storage to use smaller integers when possible is an implementation detail.

jibal 7 days ago | parent | next [-]

Python3 is specified to use arrays of 8, 16, or 32 bit units, depending on the largest code point in the string. As a result, all code points in all strings are O(1) indexable. The claim that "Python 3 internally uses UTF-32" is simply false.

zahlman 6 days ago | parent | prev [-]

> code points, which need up to 24 bits

They need at most 21 bits. The bits may only be available in multiples of 8, but the implementation also doesn't byte-pack them into 24-bit units, so that's moot.
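The arithmetic, given that the highest Unicode code point is U+10FFFF:

```python
# Unicode's highest code point fits in exactly 21 bits...
assert (0x10FFFF).bit_length() == 21
# ...and sits well below the 24-bit ceiling mentioned above.
assert 0x10FFFF < 2**21 < 2**24
```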

xigoi 7 days ago | parent | prev [-]

I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.

zahlman 6 days ago | parent | next [-]

Such languages do not have strings. Definitionally a string is a sequence of characters, and more than 256 characters exist. A byte sequence is just an encoding; if you are working with that encoding directly and have to do the interpretation yourself, you are not using a string.

But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.
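Round-tripping between the two is one method call each way, with the encoding named explicitly:

```python
s = "héllo"
b = s.encode("utf-8")      # str -> bytes, choosing the encoding yourself
print(b)                   # b'h\xc3\xa9llo'
print(b.decode("utf-8"))   # bytes -> str
print(s.encode("cp437"))   # or any other codec, e.g. the old IBM codepage
```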

capitainenemo 6 days ago | parent [-]

My experience personally with python3 (and repeated interactions with about a dozen python programmers, including core contributors) is that python3 does not let you trivially work with streams of bytes, especially if you need to do character set conversions: a tiny python2 script that I have used for decades for conversion of character streams in terminals has proved repeatedly unportable to python3. The last attempt was much larger, still failed, and they thought they could probably do it, but it would require far more code and was not worth their effort.

I'll probably just use rust for that script if python2 ever gets dropped by my distro. Reminds me of https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ...

zahlman 6 days ago | parent [-]

> a tiny python2 script that I have used for decades for conversion of character streams in terminals has proved repeatedly unportable to python3.

Show me.

capitainenemo 6 days ago | parent [-]

Heh. It always starts this way... then they confidently send me something that breaks on testing it, then half a dozen more iterations, then "python2 is doing the wrong thing" or, "I could get this working but it isn't worth the effort" but sure, let's do this one more time. Could be they were all missing something obvious - wouldn't know, I avoid python personally, apart from when necessary like with LLM glue. https://pastebin.com/j4Lzb5q1

This is a script created by someone on #nethack a long time ago. It works great with other things as well like old BBS games. It was intended to transparently rewrite single byte encodings to multibyte with an optional conversion array.

zahlman 6 days ago | parent [-]

> then they confidently send me something that breaks on testing it, then half a dozen more iterations, then "python2 is doing the wrong thing" or "I could get this working but it isn't worth the effort"

It almost works as-is in my testing. (By the way, there's a typo in the usage message.) Here is my test process:

  #!/usr/bin/env python
  import random, sys, time
  
  
  def out(b):
      # ASCII 48..55 are the digits '0'..'7': the second digit of the SGR colour code
      color = random.randint(48, 55)
      sys.stdout.buffer.write(bytes([27, 91, 51, color, 109, b]))
      sys.stdout.flush()
  
  
  for i in range(32, 256):
      out(i)
      time.sleep(random.random()/5)
  
  
  while True:
      out(random.randint(32, 255))
      time.sleep(0.1)
I suppressed random output of C0 control characters to avoid messing up my terminal, but I added a test that basic ANSI escape sequences can work through this.

(My initial version of this didn't flush the output, which mistakenly led me to try a bunch of unnecessary things in the main script.)

After fixing the `print` calls, the only thing I was forced to change (although I would do the code differently overall) is the output step:

  # sys.stdout.write(out.encode("UTF-8"))
  sys.stdout.buffer.write(out.encode("UTF-8"))
  sys.stdout.flush()
I've tried this out locally (in gnome-terminal) with no issue. (I also compared to the original; I have a local build of 2.7 and adjusted the shebang appropriately.)

There's a warning that `bufsize=1` no longer actually means a byte buffer of size 1 for reading (instead it's magically interpreted as a request for line buffering), but this didn't cause a failure when I tried it. (And setting the size to e.g. `2` didn't break things, either.)

I also tried having my test process read from standard input; the handling of ctrl-C and ctrl-D seems to be a bit different (and in general, setting up a Python process to read unbuffered bytes from stdin isn't the most fun thing), but I generally couldn't find any issues here, either. Which is to say, the problems there are in the test process, not in `ibmfilter`. The input is still forwarded to, and readable from, the test process via the `Popen` object. And any problems of this sort are definitely still fixable, as demonstrated by the fact that `curses` is still in the standard library.

Of course, keys in the `special` mapping need to be defined as bytes literals now. Although that could trivially be adapted if you insist.

capitainenemo 5 days ago | parent [-]

Sorry, I'm not a python guy. Do you have a script you'd like me to run against python3? Just toss me a pastebin link, and ideally the version of python3 to run, since half the python3 scripts on my system seem to require a different version of python3 from the other half, plus a variety of isolated sets of python libs in virtual environments (heck, pip even warns you not to try installing libs globally so everyone can use the same set these days). I'd rather not try to follow a set of suggestions and then be told I did it wrong.

As for typo, yep. But then, I've left this script essentially untouched for a couple of decades since I was given it.

zahlman 5 days ago | parent [-]

> do you have a script you'd like me to run against python3? Just toss me a pastebin link, and ideally the version of python3 to run

Here's a diff:

  diff --git a/ibmfilter b/ibmfilter
  index 245d32c..2633335 100755
  --- a/ibmfilter
  +++ b/ibmfilter
  @@ -1,6 +1,5 @@
  -#!/usr/bin/python2 -tt
  -# vim:set fileencoding=utf-8
  - 
  +#!/usr/bin/python3
  +
   from subprocess import *
   import sys 
   import os, select
  @@ -10,8 +9,8 @@ special = {
   }
    
   if len(sys.argv) < 2:
  -    print "usage: ibmfilter [command]"
  -    print "Runs command in a subshell and translates its output from ibm473 codepage to UTF-8."
  +    print("usage: ibmfilter [command]")
  +    print("Runs command in a subshell and translates its output from ibm473 codepage to UTF-8.")
       sys.exit(0)
    
   handle = Popen(sys.argv[1:], stdout=PIPE, bufsize=1)
  @@ -26,8 +25,10 @@ while buf != '':
           os.kill(handle.pid)
           os.system('reset')
           raise Exception("Timed out while waiting for stdout to be writeable...")
  -    sys.stdout.write(out.encode("UTF-8"))
  - 
  +    sys.stdout.buffer.write(out.encode("UTF-8"))
  +    sys.stdout.flush()
  +
       buf = handle.stdout.read(1)
    
   handle.wait()
I have already tested it and it works fine as far as I can tell on every version from at least 3.3 through 3.13 inclusive. There's really nothing version specific here, except the warning I mentioned, which is introduced in 3.8. If you encounter a problem, some more sophisticated diagnostics would be needed, and honestly I'm not actually sure where to start with that. (Although I'm mildly impressed that you still have access to a 2.7 interpreter in /usr/bin without breaking anything else.)

If you want to add overrides, you must use bytes literals for the keys. That looks like:

  b'\xff': 'X'
> (heck, pip even warns you not to try installing libs globally so everyone can use same set these days)

Some Python programs have mutually incompatible dependencies, and you can't really have two versions of the same dependency loaded in the same runtime. This has always been a problem; you're just looking at the current iteration of pip trying to cooperate with Linux distros to help you not break your system as a result.

"Using the same set" is not actually desirable for development.

capitainenemo 5 days ago | parent [-]

So, the patch failed with both my original file and the pastebin one, perhaps due to the indentation Hacker News adds, so I applied it manually since it did seem pretty straightforward. Honestly, given how short the file is, it would have taken up about the same amount of space here as the diff, but hopefully I applied it correctly. Manual copy/paste in python always worries me with the significant-whitespace thing (one of our friends accidentally DoS'd our server with his first python script due to that, but that was back when mixing tabs and spaces didn't throw an error by default), but I probably did it right.

And with that out of the way. This one seems to mostly work!

So python3 did not significantly change handling this sort of byte stream and while Mercurial folks might well have had their own woes, I have no idea what the issues were in all those prior attempts with this file.

... that said, it does do one odd thing (following is output on launching):

    /usr/lib/python3.12/subprocess.py:1016: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
        self.stdout = io.open(c2pread, 'rb', bufsize)
And yet, I can't spot any issues in gameplay, yet, caused by this, so I'm inclined to let it pass? But, it does make me wonder if later on, I might hit issues...

At least for now, I'm going to tentatively say it seems fine. Hm. You know what. Let me try with some more obvious things that might fail if the buffer size is wrong.

So. Now I'm wondering: given how relatively minor this change is (aside from the odd warning message and the typical python3 changes, just one slightly modified line and one inserted line), why did so many pythonistas have so much difficulty over the many years I asked about this? I mean, I only formed my opinion that maybe there was a problem with python3 byte/string handling due to just how many attempts there were... Were they trying to do things in a more idiomatic python3 fashion? Did the python3 APIs change? Does the error hint at something more concerning? Well, whatever. Clearly it's (mostly) fine now. And my carefully tweaked nethack profile is safe if python2 is removed, without needing to make my own stream filter. Yay! Thanks!

... further updates.. ok there are a few issues.

1) the warning

2) there's an odd ghost flicker that jumps around the nethack level as if a cursor is appearing - does not happen in the python2 one.

3) on quitting it no longer exits gracefully and I have to ctrl-c the script.

4) It is much slower to render. The python2 one draws a screen almost instantly for most uses (although still a bit slower than not filtered, at least on this computer, for things that change a lot, like video). This one ripples down - that might explain the ghost flickering in ② and might be related to the buffer warning. This becomes much more noticeable with BBSes, although it is usually fine in nethack. You can see the difference on a simpler testcase, without setting up a BBS account, by streaming a bit more data at once, say by running: ibmfilter curl ascii.live/nyan

So, clearly not perfect but.. eh. functional? Still far better than prior attempts, and at least it mostly works with nethack.

zahlman 5 days ago | parent [-]

> perhaps due to indentation of Hacker News, so I manually applied since it did seem pretty straightforward

Yes, that would be exactly why. You can use e.g. `sed` to remove leading whitespace from each line (I used it to add the leading whitespace for posting).

> ... that said, it does do one odd thing (following is output on launching):

Yes, that's the warning I mentioned. The original code requests to use a buffer size of 1, which is no longer supported (it now means to use line buffering).

> It is much slower to render.

Avoiding line buffering (by requiring a buffer size of 2 or more) might fix that. Actually, it might be a good idea to use a significantly larger buffer, so that e.g. an entire ANSI colour code can be read all at once.
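A minimal sketch of that change in isolation (the `ibmfilter` internals above are assumed; `bufsize=0` requests a fully unbuffered byte pipe, sidestepping the line-buffering reinterpretation of `bufsize=1`):

```python
import subprocess
import sys

# Spawn a child that emits raw bytes (an ANSI colour code here) and read
# them back unbuffered, byte by byte, as the filter script does.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write(b'\\x1b[31mX')"],
    stdout=subprocess.PIPE,
    bufsize=0,  # unbuffered: read(1) returns as soon as a byte arrives
)
data = b""
while chunk := child.stdout.read(1):
    data += chunk
child.wait()
print(data)  # b'\x1b[31mX'
```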

The other issues are, I'm pretty sure, because of other things that changed in how `subprocess` works. Fixing things at this level would indeed require quite a bit more hacking around with the low-level terminal APIs.

> I mean, I only formed my opinion that maybe there was a problem with python3 byte/string handling to just how many attempts there were... Were they trying to do things in a more idiomatic python3 fashion? Did the python3 APIs change? Does the error hint at something more concerning?

Most likely, other attempts either a) didn't understand what the original code was doing in precise enough detail, or b) didn't know how to send binary data to standard output properly (Python 3 defaults to opening standard output as a text stream).

All of that said: I think that nowadays you should just be able to get a build of NetHack that just outputs UTF-8 characters directly; failing that, you can use the `locale` command to tell your terminal to expect cp437 data.

capitainenemo 4 days ago | parent [-]

Well, I also use it for other old terminal apps (BBS games), and the remapping of characters was fun too, using unicode to make certain features more distinguishable. (Only downside is it messes up the crystal ball, if you remapped that char, since it wants the standard values, but you can just use memorised values or turn it off temporarily)

The unfortunate thing is the "lag" is a bit annoying with some apps, so I'll probably still use the python2 one for now.

capitainenemo 4 days ago | parent [-]

Oh, and using 2 did silence the warning, but performance was still bad compared to python2, and it still requires ctrl-c to exit.

afiori 7 days ago | parent | prev | next [-]

I would like a utf-8 optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid utf-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to revalidate). Then utf-8 encoding/decoding becomes a noop, and utf-8 specific apis can quickly check whether the string is malformed or not.
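A rough sketch of the idea (a hypothetical `Utf8Bag` class, not an existing library): since UTF-8 sequences are at most 4 bytes long, an edit only needs the few bytes on either side rechecked, provided the rest of the buffer was already known to be valid.

```python
class Utf8Bag:
    """A mutable byte buffer that tracks whether it is valid UTF-8."""

    def __init__(self, data=b""):
        self.buf = bytearray(data)
        self.valid = self._window_valid(0, len(self.buf))

    def _window_valid(self, lo, hi):
        # Widen to sequence boundaries: continuation bytes are 0b10xxxxxx.
        while lo > 0 and self.buf[lo] & 0xC0 == 0x80:
            lo -= 1
        while hi < len(self.buf) and self.buf[hi] & 0xC0 == 0x80:
            hi += 1
        try:
            bytes(self.buf[lo:hi]).decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def splice(self, start, end, new):
        was_valid = self.valid
        self.buf[start:end] = new
        lo = max(0, start - 3)
        hi = min(len(self.buf), start + len(new) + 3)
        if was_valid:
            # Only the edit window can have broken validity.
            self.valid = self._window_valid(lo, hi)
        else:
            # The damage may lie elsewhere; recheck everything.
            self.valid = self._window_valid(0, len(self.buf))
```

With this, a UTF-8-aware API can branch on `bag.valid` in O(1) instead of rescanning, which is the noop-decode property described above.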

account42 7 days ago | parent [-]

But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string? Keep in mind that UTF-8 is self-synchronizing, so even if you encode strings into a larger text-based format without verifying them, it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes, many programs can support arbitrary encodings, or at least arbitrary ASCII supersets, without any additional effort.

afiori 7 days ago | parent [-]

The main issue I can see is not garbage bytes in text but the mixing of incompatible encodings, e.g. splicing latin-1 bytes into a utf-8 string.

My understanding of the current "always and only utf-8/unicode" zeitgeist is that it comes mostly from encoding issues, among which is the complexity of detecting encodings.

I think that the current status quo is better than what came before, but I also think it could be improved.

bawolff 7 days ago | parent | prev | next [-]

Me too.

The languages that I really don't get are those that force valid utf-8 everywhere but don't enforce NFC. That's most of them, and it seems like the worst of both worlds.

Non-normalized unicode is just as problematic as non-validated unicode imo.
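The normalization point is easy to demonstrate with the stdlib's `unicodedata` module: two strings that render identically can compare unequal until one is normalized.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

# Both display as "é", but they are different code point sequences:
print(precomposed == decomposed)                                # False
print(len(precomposed), len(decomposed))                        # 1 2
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```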

jibal 7 days ago | parent | prev | next [-]

Python has byte arrays that allow for that, in addition to strings consisting of arrays of Unicode code points.

account42 7 days ago | parent | prev [-]

Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway.

astrange 6 days ago | parent [-]

C strings are not bags of bytes because they can't contain 0x00.