Remix.run Logo
flexagoon 13 hours ago

Wouldn't at least the first issue be solved by using Unicode case folding instead of lowercase? Python, for example, has separate .casefold() and .lower() methods, and AFAIK casefold would always turn I into i, and is much more appropriate for this use case.

rhdunn 5 hours ago | parent | next [-]

There are 3 types of case folding:

1. Simple one-to-one mappings -- E.g. `T` to `t`. These are typically the ones handled by `lower()` or similar methods as they work on single characters so can modify a string in place (the length of the string doesn't change).

2. More complex one-to-many mappings -- E.g. German `ß` to `ss`. These are covered by functions like `casefold()`. You can't modify the string in place so the function needs to always write to a new string buffer.

3. Locale-specific mappings -- This is what this bug is about. In Turkish `I` maps to `ı` whereas other languages/locales it maps to `i`. You can only implement this by passing the locale to the case function, irrespective of whether you are also doing (1) or (2).

zahlman 3 hours ago | parent [-]

This is not quite right, at least for Python. .upper() and .lower() (and .casefold() as well) implement the default casing algorithms from the Unicode specification, which are one-to-many (but still locale-naive). Other languages, meanwhile, might well implement locale-aware mapping that defaults to the system locale rather than requiring a locale to be passed.

zahlman 3 hours ago | parent | prev [-]

Both .casefold() and .lower() in Python use the default Unicode casing algorithms. They're unicode-aware, but locale-naive. So .lower() also works for this purpose; the point of .casefold() is more about the intended semantics.

See also: https://stackoverflow.com/questions/19030948 where someone sought the locale-sensitive behaviour.