| ▲ | baq 7 days ago |
| ASCII is very convenient when it fits in the solution space (it’d better be, it was designed for a reason), but in the global, international, connected computing world it doesn’t fit at all. The problem is that all the tutorials, especially low-level ones, assume ASCII 1) so you can print something to the console and 2) to avoid admitting that strings are hard, so folks don’t get discouraged. Notably, Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library, and it regularly gets flak for it. |
|
| ▲ | craftkiller 6 days ago | parent | next [-] |
| > Notably Rust did the correct thing

In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:

    String.len()                   == number of bytes
    String.bytes().count()         == number of bytes
    String.chars().count()         == number of Unicode scalar values
    String.graphemes(true).count() == number of graphemes (requires unicode-segmentation, which is not in the stdlib)
    String.lines().count()         == number of lines

Really my only complaint is that I don't think String.len() should exist; it's too ambiguous. We should have to state explicitly what we want/mean via the iterators. |
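For comparison, the same distinctions can be reproduced in Python with the stdlib alone (graphemes excepted; a hedged sketch, using a combining-accent string as the example):

```python
s = "e\u0301"  # "é" written as base letter + U+0301 combining acute accent

# number of UTF-8 bytes (what Rust's String.len() reports)
assert len(s.encode("utf-8")) == 3

# number of Unicode scalar values (what .chars().count() reports)
assert len(s) == 2

# the grapheme count here would be 1, but Python's stdlib, like Rust's,
# has no grapheme iterator; you need a third-party package
# (e.g. the regex module's \X or the grapheme package)
```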
| |
| ▲ | pron 6 days ago | parent | next [-] | | Similar to Java: String.chars().count(), String.codePoints().count(), and, for historical reasons, String.getBytes(UTF-8).length
| |
| ▲ | westurner 6 days ago | parent | prev [-] | | String.graphemes().count()
That's a real nice API. (Similarly, Python has @ for matmul but there is no implementation of matmul in the stdlib. NumPy has a matmul implementation so that the `@` operator works.) ugrapheme and ucwidth are one way to get the grapheme count from a string in Python. It's probably possible to get the grapheme cluster count from a string containing emoji characters with ICU? | | |
| ▲ | dhosek 6 days ago | parent [-] | | Any correctly designed grapheme cluster iterator handles emoji characters. It’s part of the spec (says the guy who wrote a Unicode segmentation library for Rust). |
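A concrete illustration of why emoji need spec-compliant segmentation: a ZWJ family sequence is a single grapheme cluster but several scalar values. A Python sketch (counts follow from the code point definitions; the grapheme-cluster count of 1 comes from UAX #29, which the stdlib cannot verify on its own):

```python
# U+1F468 MAN + U+200D ZWJ + U+1F469 WOMAN + U+200D ZWJ + U+1F467 GIRL
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

assert len(family) == 5                       # Unicode scalar values
assert len(family.encode("utf-8")) == 18      # bytes: 4 + 3 + 4 + 3 + 4
assert len(family.encode("utf-16-le")) == 16  # UTF-16 code units: 8 (2+1+2+1+2)
# per UAX #29 extended grapheme clusters, this renders as ONE glyph
```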
|
|
|
| ▲ | account42 7 days ago | parent | prev | next [-] |
| > in the global international connected computing world it doesn’t fit at all. I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code, and there are many downsides to allowing non-ASCII characters outside string constants or comments. |
| |
| ▲ | andriamanitra 6 days ago | parent | next [-] | | > For example, there is nothing wrong with a programming language that only allows ASCII in the source code and many downsides to allowing non-ASCII characters outside string constants or comments.

That's a tradeoff you should consider carefully, because there are also downsides to disallowing non-ASCII characters. The downsides of allowing non-ASCII mostly stem from assigning semantic significance to upper/lowercase (which is itself a tradeoff to weigh when designing a language). The other issue I can think of is homographs, but that seems to be more of a theoretical concern than a problem you'd run into in practice.

When I first learned programming I used my native language (Finnish, which uses 3 non-ASCII letters: åäö) not only for strings and comments but also for identifiers. Back then UTF-8 was not yet universally adopted (the ISO 8859-1 character set was still relatively common), so I occasionally encountered issues that I had no means to understand at the time. As programming is taught to younger and younger audiences, it's not reasonable to expect kids from (insert your favorite non-English-speaking country) to know enough English to use it for naming. Naming and, to an extent, thinking in English requires a vocabulary orders of magnitude larger than knowing the keywords.

By restricting source code to ASCII only, you also lose the ability to use domain-specific notation like mathematical symbols/operators and Greek letters. For example, in Julia you may use some mathematical operators (e.g. ÷ for Euclidean division, ⊻ for exclusive or, ∈/∉/∋ for checking set membership), and I find it really makes code more pleasant to read. | |
| ▲ | eviks 6 days ago | parent | prev | next [-] | | The "nothing wrong" is, of course, the huge issue of not being able to use your native language, which matters most when learning, when this extra language barrier is stacked on top of another language barrier. Now name anything from your list of downsides that is as important and as unfixable. | |
| ▲ | simonask 7 days ago | parent | prev [-] | | This is American imperialism at its worst. I'm serious. Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job. Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German? It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws. | | |
| ▲ | 0x000xca0xfe 7 days ago | parent | next [-] | | Well I'm not American and I can tell you that we do not see English source code as imperialism. In fact it's awesome that we have one common very simple character set and language that works everywhere and can do everything. I have only encountered source code using my native language (German) in comments or variable names in highly unprofessional or awful software and it is looked down upon. You will always get an ugly mix and have to mentally stop to figure out which language a name is in. It's simply not worth it. Please stop pushing this UTF-8 everywhere nonsense. Make it work great on interactive/UI/user facing elements but stop putting UTF-8-only restrictions in low-level software. Example: Copied a bunch of ebooks to my phone, including one with a mangled non-UTF-8 name. It was ridiculously hard to delete the file as most Android graphical and console tools either didn't recognize it or crashed. | | |
| ▲ | flohofwoe 7 days ago | parent | next [-] | | > Please stop pushing this UTF-8 everywhere nonsense. I was with you until this sentence. UTF-8 everywhere is great exactly because it is ASCII-compatible (e.g. all ASCII strings are automatically also valid UTF-8 strings, so UTF-8 is a natural upgrade path from ASCII) - both are just encodings for the same UNICODE codepoints; ASCII just cannot go beyond the first 128 codepoints, but that's where UTF-8 comes in, and in a way that's backward compatible with ASCII - which is the one ingenious feature of the UTF-8 encoding. | | |
| ▲ | 0x000xca0xfe 7 days ago | parent | next [-] | | I'm not advocating for ASCII-everywhere, I'm for bytes-everywhere. And bytes can conveniently fit both ASCII and UTF-8. If you want to restrict your programming language to ASCII for whatever reason, fine by me. I don't need "let wohnt_bei_Böckler_STRAẞE = ..." that much. But if you allow full 8-bit bytes, please don't restrict them to UTF-8. If you need to gracefully handle non-UTF-8 sequences graphically show the appropriate character "�", otherwise let it pass through unmodified. Just don't crash, show useless error messages or in the worst case try to "fix" it by mangling the data even more. | | |
| ▲ | flohofwoe 7 days ago | parent [-] | | > "let wohnt_bei_Böckler_STRAẞE" This string cannot be encoded as ASCII in the first place.

> But if you allow full 8-bit bytes, please don't restrict them to UTF-8

UTF-8 has no 8-bit restrictions... You can encode any 21-bit UNICODE codepoint with UTF-8. It sounds like you're confusing ASCII, Extended ASCII, and UTF-8:

- ASCII: 7 bits per "character" (e.g. not able to encode international characters like äöü), but maps to the lowest 128 UNICODE codepoints (e.g. all ASCII character codes are also valid UNICODE code points).

- Extended ASCII: 8 bits per "character", but the interpretation of the upper 128 values depends on a country-specific codepage (e.g. the interpretation of a byte value in the range 128..255 differs between countries, and this is what causes all the mess that's usually associated with "ASCII". ASCII did nothing wrong - the problem is Extended ASCII: it allows you to 'encode' äöü with the German codepage, but then shows different characters when displayed with a non-German codepage).

- UTF-8: a variable-width encoding for the full range of UNICODE codepoints; it uses 1..4 bytes to encode one 21-bit UNICODE codepoint, and the 1-byte encodings are identical to 7-bit ASCII (e.g. when the MSB of a byte in a UTF-8 string is not set, you can be sure that it is a character/codepoint in the ASCII range).

Of those three, only Extended ASCII with codepages is 'deprecated' and should no longer be used, while ASCII and UTF-8 are both fine, since any valid ASCII-encoded string is indistinguishable from that same string encoded as UTF-8, e.g. ASCII has been 'retconned' into UTF-8. | | |
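The three cases can be demonstrated in a few lines of Python (a sketch; cp437 and cp1252 stand in for "two different codepages"):

```python
# ASCII: a subset of UTF-8, byte for byte
assert "hello".encode("ascii") == "hello".encode("utf-8")

# Extended ASCII: the byte 0x84 is 'ä' in the IBM PC codepage (cp437)...
assert b"\x84".decode("cp437") == "ä"
# ...but a different character under another codepage - classic mojibake
assert b"\x84".decode("cp1252") == "\u201e"  # U+201E, a low double quote

# UTF-8: variable width, 1..4 bytes per code point
assert len("a".encode("utf-8")) == 1
assert len("ä".encode("utf-8")) == 2
assert len("€".encode("utf-8")) == 3
assert len("\U0001F600".encode("utf-8")) == 4
```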
| ▲ | dhosek 6 days ago | parent | next [-] | | I’d go farther and say that extended ASCII was an unmitigated disaster of compatibility issues (not to mention that more than a few scripts still don’t fit in the available spaces of an 8-bit encoding). Those of us who were around for the pre-Unicode days understand what a mess it was (not to mention the lingering issues thanks to the much vaunted backwards compatibility of some operating systems). | |
| ▲ | moefh 6 days ago | parent | prev [-] | | I'm not GP, but I think you're completely missing their point. The problem they're describing happens because file names (in Linux and Windows) are not text: in Linux (so Android) they're arbitrary sequences of bytes, and in Windows they're arbitrary sequences of UTF-16 code points not necessarily forming valid scalar values (for example, surrogates can be present alone). And yet, a lot of programs ignore that and insist on storing file names as Unicode strings, which mostly works (because users almost always name files by inputting text) until somehow a file gets written as a sequence of bytes that doesn't map to a valid string (i.e., it's not UTF-8 or UTF-16, depending on the system). So what's probably happening in GP's case is that they managed somehow to get a file with a non-UTF-8-byte-sequence name in Android, and subsequently every App that tries to deal with that file uses an API that converts the file name to a string containing U+FFFD ("replacement character") when the invalid UTF-8 byte is found. So when GP tries to delete the file, the App will try to delete the file name with the U+FFFD character, which will fail because that file doesn't exist. GP is saying that showing the U+FFFD character is fine, but the App should understand that the actual file name is not UTF-8 and behave accordingly (i.e. use the original sequence-of-bytes filename when trying to delete it). Note that this is harder than it should be. For example, with the old Java API (from java.io[1]) that's impossible: if you get a `File` object from listing a directory and ask if it exists, you'll get `false` for GP's file, because the `File` object internally stores the file name as a Java string. To get the correct result, you have to use the new API (from java.nio.file[2]) using `Path` objects. [1] https://developer.android.com/reference/java/io/File [2] https://developer.android.com/reference/java/nio/file/Path | | |
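Python's own answer to this problem is PEP 383's surrogateescape error handler, which lets a non-UTF-8 filename round-trip through a str without loss. A sketch with a made-up Latin-1-encoded filename:

```python
raw = b"caf\xe9.txt"  # hypothetical Latin-1 filename; 0xE9 is not valid UTF-8

# surrogateescape (what os.fsdecode uses) maps the bad byte to a lone
# surrogate (U+DCE9) instead of raising or substituting U+FFFD
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "caf\udce9.txt"

# encoding back recovers the original bytes exactly, so the file can
# still be opened or deleted by its real on-disk name
assert name.encode("utf-8", errors="surrogateescape") == raw
```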
| ▲ | jibal 6 days ago | parent [-] | | This is correct, both about what happens and about what your parent is not understanding. Your parent's claim that "UTF-8 has no 8-bit restrictions" is nonsense. It's too bad that your grandparent wasn't clearer about what the problem is: not all byte strings are valid UTF-8. This is why people have had to invent hacks like WTF-8 (https://news.ycombinator.com/item?id=9611710) |
|
|
| |
| ▲ | numpad0 7 days ago | parent | prev [-] | | UTF-8 everywhere is not great, and UTF-8 in practice is hardly ASCII-compatible. UTF-8 in source code and file paths outside the pure ASCII range breaks a lot of things, especially on non-English systems, due to legacy dependencies, ironically. Sure, it's backward compatible, as in ASCII handling codes work on systems with UTF-8 locales, but how important is that? | | |
| ▲ | flohofwoe 6 days ago | parent [-] | | > as in ASCII handling codes work on systems with UTF-8 locales, but how important is that? It's only Windows which is stuck in the past here, and Microsoft has had 3 decades to fix that problem and migrate away from codepages to locale-agnostic UTF-8 (UTF-8 was invented in 1992). |
|
| |
| ▲ | BobbyTables2 5 days ago | parent | prev | next [-] | | I once saw an electrical schematic from a non-English speaking designer. None of the signals were intuitive because they weren’t the typical English abbreviations! | |
| ▲ | sussmannbaka 6 days ago | parent | prev [-] | | You say this because your native language broadly fits into ascii and you would sing a different tune if it didn’t. |
| |
| ▲ | jibal 7 days ago | parent | prev | next [-] | | It's neither American nor imperialism -- those are both category mistakes. Andreas Rumpf, the designer of Nim, is Austrian. All the keywords of Nim are in English, the library function names are in English, the documentation is in English, Rumpf's book Mastering Nim is in English, the other major book for the language, Nim In Action (written by Dominik Picheta, nationality unknown but not American) is in English ... this is not "American imperialism" (which is a real thing that I don't defend), it's for easily understandable pragmatic reasons. And the language parser doesn't disallow non-ASCII characters but it doesn't treat them linguistically, and it has special rules for casefolding identifiers that only recognize ASCII letters, hobbling the use of non-ASCII identifiers because case distinguishes between types and other identifiers. The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster. | | |
| ▲ | rurban 6 days ago | parent | next [-] | | > The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster. No, it is actually for security reasons. Once you allow non-ASCII identifiers, identifiers become non-identifiable. Only Zig recognized that. Nim allows insecure identifiers. https://github.com/rurban/libu8ident/blob/master/doc/c11.md#... | |
| ▲ | jibal 5 days ago | parent [-] | | Reading is fundamental. I was referring to the Nim lexer. Obviously the reason that it "allows insecure identifiers" is not "actually for security reasons". It is, as I stated, for reasons of performance ... I know this from reading the code and the author's statements. | | |
| ▲ | rurban 5 days ago | parent [-] | | Yes, you are right. Andi didn't care at all, same as PHP. |
|
| |
| ▲ | jibal 5 days ago | parent | prev | next [-] | | P.S. The response is a https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy The motte: non-ASCII identifiers should be allowed The bailey: disallowing non-ASCII identifiers is American imperialism at its worst | |
| ▲ | simonask 6 days ago | parent | prev [-] | | I mean, the keywords of a programming language have to be in some language (unless you go the cursed route of Excel). I'm arguing against the position that non-ASCII identifiers should be disallowed. | | |
| ▲ | lsaferite 4 days ago | parent [-] | | > I'm arguing against the position that non-ASCII identifiers should be disallowed. Maybe I'm tired, but I've read this multiple times and can't quite figure out your desired position. I *think* you are in favor of non-ASCII identifiers? Like I said, I must be tired. | |
| ▲ | jibal 10 hours ago | parent [-] | | He says that disallowing non-ASCII identifiers is "American imperialism at its worst". |
|
|
| |
| ▲ | account42 7 days ago | parent | prev | next [-] | | Actually, it would be great to have a lingua franca in every field that all participants can understand. Are you also going to complain that biologists and doctors are expected to learn some rudimentary Latin? English being dominant in computing is absolutely a strength and we gain nothing by trying to combat that. Having support for writing your code in other languages is not going to change that most libraries will use English, most documentation will be in English, and most people you can ask for help will understand English. If you want to participate and refuse to learn English you are only shooting yourself in the foot - and if you are going to learn English you may as well do it from the beginning. Also, due to the dominance of English and ASCII in computing history, most languages already have ASCII alternatives for their writing, so even if you need to refer to non-English names you can do that using only ASCII. | |
| ▲ | simonask 7 days ago | parent [-] | | Well, the problem is that what you are advocating would make knowing Latin a prerequisite for studying medicine, which it isn't anywhere. That's the equivalent. Doctors learn a (very limited) Latin vocabulary as they study and work. You severely underestimate how far you can get without any real command of the English language. I agree that you can't become really good without it, just like you can't do haute cuisine without some French, but the English language is a huge and unnecessary barrier to entry that you would put in front of everyone in the world who isn't immersed in the language from an early age. Imagine learning programming using only your high school Spanish. Good luck. | |
| ▲ | nitwit005 6 days ago | parent | next [-] | | You don't need to become fluent in Greek and Latin, but if you want to be able to read your patient's diagnosis, you're absolutely going to need to know the terms used. The standard names are in those languages. And frequently, there is no other name. There are a lot of diseases, and no language has names for all of them. | |
| ▲ | simonask 6 days ago | parent [-] | | Sure, you can also look them up though, because it is a limited vocabulary. Identifiers in code are not a limited vocabulary, and understanding the structure of your code is important, especially so when you are in the early stages of learning. |
| |
| ▲ | numpad0 6 days ago | parent | prev [-] | | > Imagine learning programming using only your high school Spanish. Good luck. This + translated materials + locally written books is how STEM fields work in East Asia, the odds of success shouldn't be low. There just needs to be enough population using your language. |
|
| |
| ▲ | flohofwoe 7 days ago | parent | prev | next [-] | | Calm down, ASCII is a UNICODE-compatible encoding for the first 128 UNICODE code points (which map directly to the entire ASCII range). If you need to go beyond that, just 'upgrade' to the UTF-8 encoding. UNICODE is essentially a superset of ASCII, and the UTF-8 encoding also contains ASCII as a compatible subset (e.g. for the first 128 UNICODE code points, a UTF-8 encoded string is byte-by-byte compatible with the same string encoded in ASCII). Just don't use any of the Extended ASCII flavours (e.g. "8-bit ASCII with codepages") - or any of the legacy 'national' multibyte encodings (Shift-JIS etc...) because that's how you get the infamous `?????` or `♥♥♥♥♥` mismatches which are commonly associated with 'ASCII' (but this is not ASCII, it's some flavour of Extended ASCII decoded with the wrong codepage). | |
| ▲ | ksenzee 6 days ago | parent | prev | next [-] | | I don’t see much difference between the amount of Italian you need for music and the amount of English you need for programming. You can have a conversation about it in your native language, but you’ll be using a bunch of domain-specific terms that may not be in your native language. | | |
| ▲ | simonask 6 days ago | parent [-] | | I agree, but we're talking about identifiers in code you write yourself here. Not the limited vocabulary of keywords, which are easy to memorize in any language. Standard libraries may trip you up, but documentation for those may be available in your native language. |
| |
| ▲ | nkrisc 6 days ago | parent | prev | next [-] | | There was a time when most scientific literature was written in French. People learned French. Before that it was Latin. People learned Latin. | | |
| ▲ | tehjoker 6 days ago | parent [-] | | This is true, but it's important to recognize that this was because of the French (Napoleonic) and Roman empires and Christianity, just as the brutal American and UK empires created these circumstances today | |
| ▲ | wredcoll 6 days ago | parent [-] | | The napoleonic empire lasted about 15 years, so that's a bit of a stretch. More relevantly though, good things can come from people who also did bad things; this isn't to justify doing bad things in hopes something good also happens, but it doesn't mean we need to ideologically purge good things based on their creators. |
|
| |
| ▲ | schrodinger 5 days ago | parent | prev [-] | | American Imperialism has absolutely resulted in some horrible things, but I hardly think that ASCII is one of them. ASCII wasn't "imperialism," it was pragmatism. Yes, it privileged English -- but that's because the engineers designing it _spoke_ English and the US was funding + exporting most of the early computer and networking gear. The US Military essentially gave the world TCP/IP (via DARPA) for free! Maybe "cultural dominance", but "imperialism at its worst" is a ridiculous take. |
|
|
|
| ▲ | flohofwoe 7 days ago | parent | prev | next [-] |
| ASCII is totally fine as an encoding for the first 128 UNICODE code points. If you need to go above those 128 code points, use a different encoding like UTF-8. Just never ever use Extended ASCII (8 bits with codepages). |
|
| ▲ | bigstrat2003 6 days ago | parent | prev | next [-] |
| > in the global international connected computing world it doesn’t fit at all. Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to. |
| |
| ▲ | arp242 6 days ago | parent | next [-] | | No one is "obsessing" over anything. Reality is there are very few cases where you can use a single 8-bit character set and not run into problems sooner or later. Say your software is used only in Greece, so you use ISO-8859-7 for Greek. That works fine, but now you want to talk to your customer Günther from Germany who has been living in Greece for the last five years, or Clément from France, or Seán from Ireland, and oops, you can't. Even plain English text can't always be represented with plain ASCII (although ISO-8859-1 goes a long way). There are some cases where just plain ASCII is okay, but there are quite few of them (and even those are somewhat controversial). The solution is to just use UTF-8 everywhere. Or maybe UTF-16 if you really have to. | |
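The Günther-in-Greece problem is easy to reproduce (a Python sketch; the names are just examples):

```python
# Greek text fits in the Greek codepage, one byte per character...
greek = "Γειά"
encoded = greek.encode("iso8859_7")
assert len(encoded) == 4
assert encoded.decode("iso8859_7") == greek

# ...but a German name does not: ü has no slot in ISO-8859-7
try:
    "Günther".encode("iso8859_7")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# UTF-8 handles both scripts without picking a codepage at all
mixed = "Günther Γειά"
assert mixed.encode("utf-8").decode("utf-8") == mixed
```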
| ▲ | rileymat2 6 days ago | parent | prev | next [-] | | Except, this is a response to emoji support, which does have encoding issues even if your user base is in the US and only speaks English. Additionally, it is easy to have issues with data that your users use from other sources via copy and paste. | |
| ▲ | raverbashing 6 days ago | parent | prev | next [-] | | This is naive at best. Here's a better analogy: in the 70s "nobody planned" for names with 's in them. SQL injections, separators, "not in the alphabet", whatever. In the US. Where a lot of people with 's in their names live... Or double-barrelled names. It's a much simpler problem and it still tripped up a lot of people. And then you have to support a user with a "funny name" or a business with "weird characters", or you expand your startup to Canada/Mexico and lo and behold... | |
| ▲ | ryandrake 6 days ago | parent [-] | | Yea, I cringe when I hear the phrase "special characters." They're only special because you, the developer, decided to treat them as special, and that's almost surely going to come back to haunt you at some point in the form of a bug. |
| |
| ▲ | wat10000 6 days ago | parent | prev [-] | | Which audience makes it so you don’t have to worry about text encodings? |
|
|
| ▲ | 7 days ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | eru 7 days ago | parent | prev [-] |
| Python 3 deals with this reasonably sensibly, too, I think. It uses UTF-8 by default, but allows you to specify other encodings. |
| |
| ▲ | ynik 7 days ago | parent | next [-] | | Python 3 internally uses UTF-32.
When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8.
"UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15). | | |
| ▲ | arcticbull 7 days ago | parent [-] | | Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer. It uses Latin-1 for ASCII strings, UCS-2 for strings that contain code points in the BMP and UCS-4 only for strings that contain code points outside the BMP. It would be pretty silly for them to explode all strings to 4-byte characters. | | |
| ▲ | jibal 7 days ago | parent | next [-] | | You are correct. Discussions of this topic tend to be full of unvalidated but confidently stated assertions, like "Python 3 internally uses UTF-32." Also unjustified assertions, like the OP's claim that len(" ") == 5 is "rather useless" and that "Python 3’s approach is unambiguously the worst one". Unlike in many other languages, the code points in Python's strings are always directly O(1) indexable--which can be useful--and the subject string has 5 indexable code points. That may not be the semantics that someone is looking for in a particular application, but it certainly isn't useless. And given the Python implementation of strings, the only other number that would be useful would be the number of grapheme clusters, which in this case is 1, and that count can be obtained via the grapheme or regex modules. | |
| ▲ | account42 7 days ago | parent | prev [-] | | It conceptually uses arrays of code points, which need up to 24 bits. Optimizing the storage to use smaller integers when possible is an implementation detail. | | |
| ▲ | jibal 7 days ago | parent | next [-] | | Python3 is specified to use arrays of 8, 16, or 32 bit units, depending on the largest code point in the string. As a result, all code points in all strings are O(1) indexable. The claim that "Python 3 internally uses UTF-32" is simply false. | |
| ▲ | zahlman 6 days ago | parent | prev [-] | | > code points, which need up to 24 bits They need at most 21 bits. The bits may only be available in multiples of 8, but the implementation also doesn't byte-pack them into 24-bit units, so that's moot. |
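CPython's flexible representation (PEP 393) is observable from the interpreter; exact object sizes vary by version, but the per-character width ordering holds:

```python
import sys

ascii_s  = "a" * 1000           # stored 1 byte per code point
bmp_s    = "\u20ac" * 1000      # € is in the BMP: 2 bytes per code point
astral_s = "\U0001F600" * 1000  # outside the BMP: 4 bytes per code point

assert sys.getsizeof(bmp_s) > sys.getsizeof(ascii_s)
assert sys.getsizeof(astral_s) > sys.getsizeof(bmp_s)

# whatever the storage width, indexing stays O(1) and code-point-exact
assert astral_s[500] == "\U0001F600"
```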
|
|
| |
| ▲ | xigoi 7 days ago | parent | prev [-] | | I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them. | | |
| ▲ | zahlman 6 days ago | parent | next [-] | | Such languages do not have strings. Definitionally a string is a sequence of characters, and more than 256 characters exist. A byte sequence is just an encoding; if you are working with that encoding directly and have to do the interpretation yourself, you are not using a string. But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python. | | |
| ▲ | capitainenemo 6 days ago | parent [-] | | My personal experience with python3 (and repeated interactions with about a dozen python programmers, including core contributors) is that python3 does not let you trivially work with streams of bytes, especially if you need to do character set conversions, since a tiny python2 script that I have used for decades for conversion of character streams in terminals has repeatedly proved unportable to python3. The last attempt was much larger, still failed, and they thought they could probably do it, but it would require far more code and was not worth their effort. I'll probably just use rust for that script if python2 ever gets dropped by my distro.
Reminds me of https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ... | | |
| ▲ | zahlman 6 days ago | parent [-] | | > a tiny python2 script that I have used for decades for conversion of character streams in terminals has proved to be repeated unportable to python3. Show me. | | |
| ▲ | capitainenemo 6 days ago | parent [-] | | Heh. It always starts this way... then they confidently send me something that breaks on testing it, then half a dozen more iterations, then "python2 is doing the wrong thing" or, "I could get this working but it isn't worth the effort" but sure, let's do this one more time. Could be they were all missing something obvious - wouldn't know, I avoid python personally, apart from when necessary like with LLM glue.
https://pastebin.com/j4Lzb5q1 This is a script created by someone on #nethack a long time ago. It works great with other things as well like old BBS games. It was intended to transparently rewrite single byte encodings to multibyte with an optional conversion array. | | |
| ▲ | zahlman 6 days ago | parent [-] | | > then they confidently send me something that breaks on testing it, then half a dozen more iterations, then "python2 is doing the wrong thing" or "I could get this working but it isn't worth the effort"

It almost works as-is in my testing. (By the way, there's a typo in the usage message.) Here is my test process:

    #!/usr/bin/env python
    import random, sys, time

    def out(b):
        # ASCII 0..7 for the second digit of the color code in the escape sequence
        color = random.randint(48, 55)
        sys.stdout.buffer.write(bytes([27, 91, 51, color, 109, b]))
        sys.stdout.flush()

    for i in range(32, 256):
        out(i)
        time.sleep(random.random()/5)

    while True:
        out(random.randint(32, 255))
        time.sleep(0.1)

I suppressed random output of C0 control characters to avoid messing up my terminal, but I added a test that basic ANSI escape sequences can work through this. (My initial version of this didn't flush the output, which mistakenly led me to try a bunch of unnecessary things in the main script.)

After fixing the `print` calls, the only thing I was forced to change (although I would do the code differently overall) is the output step:

    # sys.stdout.write(out.encode("UTF-8"))
    sys.stdout.buffer.write(out.encode("UTF-8"))
    sys.stdout.flush()

I've tried this out locally (in gnome-terminal) with no issue. (I also compared to the original; I have a local build of 2.7 and adjusted the shebang appropriately.)

There's a warning that `bufsize=1` no longer actually means a byte buffer of size 1 for reading (instead it's magically interpreted as a request for line buffering), but this didn't cause a failure when I tried it. (And setting the size to e.g. `2` didn't break things, either.)

I also tried having my test process read from standard input; the handling of ctrl-C and ctrl-D seems to be a bit different (and in general, setting up a Python process to read unbuffered bytes from stdin isn't the most fun thing), but I generally couldn't find any issues here, either. Which is to say, the problems there are in the test process, not in `ibmfilter`. The input is still forwarded to, and readable from, the test process via the `Popen` object.

And any problems of this sort are definitely still fixable, as demonstrated by the fact that `curses` is still in the standard library.

Of course, keys in the `special` mapping need to be defined as bytes literals now. Although that could trivially be adapted if you insist. | | |
| ▲ | capitainenemo 5 days ago | parent [-] | | Sorry, I'm not a python guy, do you have a script you'd like me to run against python3? Just toss me a pastebin link, and ideally the version of python3 to run, since half the python3 scripts on my system seem to require a different version of python3 from the other half and a variety of isolated sets of python libs in virtual environments (heck, pip even warns you not to try installing libs globally so everyone can use the same set these days). I'd rather not try to follow a set of suggestions and then be told I did it wrong. As for the typo, yep. But then, I've left this script essentially untouched for a couple of decades since I was given it. | | |
| ▲ | zahlman 5 days ago | parent [-] | | > do you have a script you'd like me to run against python3? Just toss me a pastebin link, and ideally the version of python3 to run Here's a diff: diff --git a/ibmfilter b/ibmfilter
index 245d32c..2633335 100755
--- a/ibmfilter
+++ b/ibmfilter
@@ -1,6 +1,5 @@
-#!/usr/bin/python2 -tt
-# vim:set fileencoding=utf-8
-
+#!/usr/bin/python3
+
from subprocess import *
import sys
import os, select
@@ -10,8 +9,8 @@ special = {
}
if len(sys.argv) < 2:
- print "usage: ibmfilter [command]"
- print "Runs command in a subshell and translates its output from ibm473 codepage to UTF-8."
+ print("usage: ibmfilter [command]")
+ print("Runs command in a subshell and translates its output from ibm473 codepage to UTF-8.")
sys.exit(0)
handle = Popen(sys.argv[1:], stdout=PIPE, bufsize=1)
@@ -26,8 +25,10 @@ while buf != '':
         os.kill(handle.pid)
         os.system('reset')
         raise Exception("Timed out while waiting for stdout to be writeable...")
-    sys.stdout.write(out.encode("UTF-8"))
-
+    sys.stdout.buffer.write(out.encode("UTF-8"))
+    sys.stdout.flush()
+
     buf = handle.stdout.read(1)
 handle.wait()
I have already tested it and it works fine as far as I can tell on every version from at least 3.3 through 3.13 inclusive. There's really nothing version-specific here, except the warning I mentioned, which was introduced in 3.8. If you encounter a problem, some more sophisticated diagnostics would be needed, and honestly I'm not actually sure where to start with that. (Although I'm mildly impressed that you still have access to a 2.7 interpreter in /usr/bin without breaking anything else.) If you want to add overrides, you must use bytes literals for the keys. That looks like: b'\xff': 'X'
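(And the codepage translation at the heart of the script can be reproduced on its own. Python's name for the IBM PC codepage is `cp437` — the usage string's "ibm473" reads like a typo for 437 — and the byte values below are just an example:)

```python
# Round trip: IBM codepage bytes -> Python str -> UTF-8 bytes.
raw = bytes([0xC9, 0xCD, 0xBB])      # cp437 double-line box characters
text = raw.decode("cp437")           # decodes to "╔═╗"
utf8 = text.encode("UTF-8")
assert utf8.decode("UTF-8") == text
```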
> (heck, pip even warns you not to try installing libs globally so everyone can use the same set these days) Some Python programs have mutually incompatible dependencies, and you can't really have two versions of the same dependency loaded in the same runtime. This has always been a problem; you're just looking at the current iteration of pip trying to cooperate with Linux distros to help you not break your system as a result. "Using the same set" is not actually desirable for development. | | |
| ▲ | capitainenemo 5 days ago | parent [-] | | So, the patch failed with both my original file and the pastebin one - perhaps due to indentation of Hacker News, so I manually applied since it did seem pretty straightforward - honestly, given how short the file was, it would have taken up the same amount of space here as the diff I think, but hopefully I applied it correctly. Manual copy/paste in python always worries me w/ the significant white-space thing (one of our friends accidentally DoS'd our server with his first python script due to that, but that was back when mixing tabs and spaces didn't throw an error by default), but I probably did it right. And with that out of the way: this one seems to mostly work! So python3 did not significantly change handling this sort of byte stream, and while Mercurial folks might well have had their own woes, I have no idea what the issues were in all those prior attempts with this file. ... that said, it does do one odd thing (following is output on launching): /usr/lib/python3.12/subprocess.py:1016: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
self.stdout = io.open(c2pread, 'rb', bufsize)
And yet, I can't spot any issues in gameplay, yet, caused by this, so I'm inclined to let it pass? But, it does make me wonder if later on, I might hit issues... At least for now, I'm going to tentatively say it seems fine. Hm. You know what. Let me try with some more obvious things that might fail if the buffer size is wrong. So. Now I'm wondering, given how relatively minor this change is (aside from the odd error message, the typical python3 changes amount to just one slightly modified line and one inserted line), why did so many pythonistas have so much difficulty over the many years I asked about this? I mean, I only formed my opinion that maybe there was a problem with python3 byte/string handling from just how many attempts there were... Were they trying to do things in a more idiomatic python3 fashion? Did the python3 APIs change? Does the error hint at something more concerning? Well, whatever. Clearly it's (mostly) fine now. And my carefully tweaked nethack profile is safe if python2 is removed, without needing to make my own stream filter. Yay! Thanks! ... further updates.. ok there are a few issues. 1) the warning 2) there's an odd ghost flicker that jumps around the nethack level as if a cursor is appearing - does not happen in the python2 one. 3) on quitting it no longer exits gracefully and I have to ctrl-c the script. 4) It is much slower to render. The python2 one draws a screen almost instantly for most uses (although still a bit slower than not filtered, at least on this computer, for things that change a lot, like video). This one ripples down - that might explain the ghost flickering in ② and might be related to the buffer warning. This becomes much more noticeable with BBSes although it is usually fine in nethack. You can see the difference on a simpler testcase without setting up a BBS account by streaming a bit more data at once, say by running: ibmfilter curl ascii.live/nyan So, clearly not perfect but.. eh. functional? 
Still far better than prior attempts, and at least it mostly works with nethack. | | |
| ▲ | zahlman 5 days ago | parent [-] | | > perhaps due to indentation of Hacker News, so I manually applied since it did seem pretty straightforward Yes, that would be exactly why. You can use e.g. `sed` to remove leading whitespace from each line (I used it to add the leading whitespace for posting). > ... that said, it does do one odd thing (following is output on launching): Yes, that's the warning I mentioned. The original code requests to use a buffer size of 1, which is no longer supported (it now means to use line buffering). > It is much slower to render. Avoiding line buffering (by requiring a buffer size of 2 or more) might fix that. Actually, it might be a good idea to use a significantly larger buffer, so that e.g. an entire ANSI colour code can be read all at once. The other issues are, I'm pretty sure, because of other things that changed in how `subprocess` works. Fixing things at this level would indeed require quite a bit more hacking around with the low-level terminal APIs. > I mean, I only formed my opinion that maybe there was a problem with python3 byte/string handling to just how many attempts there were... Were they trying to do things in a more idiomatic python3 fashion? Did the python3 APIs change? Does the error hint at something more concerning? Most likely, other attempts either a) didn't understand what the original code was doing in precise enough detail, or b) didn't know how to send binary data to standard output properly (Python 3 defaults to opening standard output as a text stream). All of that said: I think that nowadays you should just be able to get a build of NetHack that just outputs UTF-8 characters directly; failing that, you can use the `locale` command to tell your terminal to expect cp437 data. | | |
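(On the performance point above: `BufferedReader.read1()` returns whatever is currently available, up to a limit, instead of blocking per byte. A sketch of the idea, not a drop-in patch for `ibmfilter`:)

```python
import sys
from subprocess import Popen, PIPE

# Chunked read loop: read1(n) hands back whatever bytes are available
# right now (up to n), so a burst of output arrives together instead
# of one byte per read() call.
handle = Popen([sys.executable, "-c", "print('hello world', end='')"],
               stdout=PIPE)
chunks = []
while True:
    chunk = handle.stdout.read1(4096)
    if not chunk:
        break
    chunks.append(chunk)
handle.wait()
```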
| ▲ | capitainenemo 4 days ago | parent [-] | | Well, I also use it for other old terminal apps (BBS games), and the remapping of characters was fun too, using unicode to make certain features more distinguishable. (Only downside is it messes up the crystal ball, if you remapped that char, since it wants the standard values, but you can just use memorised values or turn it off temporarily) The unfortunate thing is the "lag" is a bit annoying with some apps, so I'll probably still use the python2 one for now. | | |
| ▲ | capitainenemo 4 days ago | parent [-] | | oh. and, using 2 did silence the warning, but performance was still bad compared to python2, and it still requires ctrl-c to exit |
|
|
|
|
|
|
|
|
|
| |
| ▲ | afiori 7 days ago | parent | prev | next [-] | | I would like a utf-8 optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid utf-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to validate); then utf-8 encoding/decoding becomes a noop and utf-8 specific apis can quickly check whether the string is malformed or not. | |
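(A rough sketch of that windowed re-check in Python — the function name and exact window arithmetic are mine, and it assumes the buffer was valid UTF-8 outside the edited region:)

```python
def revalidate_window(buf: bytes, start: int, end: int) -> bool:
    """Re-check UTF-8 validity only around an edit of buf[start:end].

    A UTF-8 sequence is at most 4 bytes, so backing up over continuation
    bytes before `start` and extending a few bytes past `end` covers
    every sequence the edit could have touched.
    """
    lo = start
    while 0 < lo < len(buf) and buf[lo] & 0xC0 == 0x80:
        lo -= 1                      # step back to a lead (or ASCII) byte
    hi = min(len(buf), end + 4)
    while hi < len(buf) and buf[hi] & 0xC0 == 0x80:
        hi += 1                      # don't cut a trailing sequence short
    try:
        buf[lo:hi].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, splicing a raw latin-1 0xE9 into `b"h\xe9llo"` is caught by checking only the window around the edit, not the whole buffer.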
| ▲ | account42 7 days ago | parent [-] | | But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string? Keep in mind that UTF-8 is self-synchronizing so even if you encode strings into a larger text-based format without verifying them it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes many programs can support arbitrary encodings or at least arbitrary ASCII-supersets without any additional effort. | |
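(Both properties — self-synchronization and lossless pass-through — are easy to demonstrate; the byte string here is my own example:)

```python
# UTF-8 is self-synchronizing: lead bytes and continuation bytes occupy
# disjoint ranges, so a decoder recovers right after a bad sequence.
data = b"caf\xc3\xa9 \xff ok"        # valid "café", one stray 0xFF, valid " ok"
assert data.decode("UTF-8", errors="replace") == "café \ufffd ok"

# And surrogateescape lets a program pass unknown bytes through unmangled
# instead of throwing a fatal error.
text = data.decode("UTF-8", errors="surrogateescape")
assert text.encode("UTF-8", errors="surrogateescape") == data
```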
| ▲ | afiori 7 days ago | parent [-] | | The main issue I can see is not garbage bytes in text but mixing of incompatible encodings, e.g. splicing latin-1 bytes into a utf-8 string. My understanding of the current "always and only utf-8/unicode" zeitgeist is that it comes mostly from encoding issues, among which the complexity of detecting encodings. I think that the current status quo is better than what came before, but I also think it could be improved. |
|
| |
| ▲ | bawolff 7 days ago | parent | prev | next [-] | | Me too. The languages that I really don't get are those that force valid utf-8 everywhere but don't enforce NFC. Which is most of them, but it seems like the worst of both worlds. Non-normalized unicode is just as problematic as non-validated unicode imo. | |
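(The problem described above, in a few lines of Python — both strings below are perfectly valid UTF-8, yet they compare unequal until normalized:)

```python
import unicodedata

composed = "\u00e9"        # "é" as one precomposed code point
decomposed = "e\u0301"     # "e" + combining acute accent
assert composed != decomposed                        # valid, but unequal
assert unicodedata.normalize("NFC", decomposed) == composed
```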
| ▲ | jibal 7 days ago | parent | prev | next [-] | | Python has byte arrays that allow for that, in addition to strings consisting of arrays of Unicode code points. | |
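(The `bytes`/`str` split mentioned above, sketched:)

```python
s = "héllo"
b = s.encode("UTF-8")
assert len(s) == 5         # str: counts Unicode code points
assert len(b) == 6         # bytes: "é" encodes to two bytes
assert s[1] == "é"         # indexing a str yields a 1-character str
assert b[1] == 0xC3        # indexing bytes yields an int
```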
| ▲ | account42 7 days ago | parent | prev [-] | | Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway. | | |
| ▲ | astrange 6 days ago | parent [-] | | C strings are not bags of bytes because they can't contain 0x00. |
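(The NUL caveat is easy to see from Python via `ctypes`; the byte values are my own example:)

```python
import ctypes

data = b"a\x00b"           # a Python bytes object holds NUL bytes freely
assert len(data) == 3

# A C char* treats the first NUL as the terminator, so everything after
# it is invisible to C string functions.
assert ctypes.c_char_p(data).value == b"a"
```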
|
|
|