binaryturtle 6 hours ago

Isn't the real issue here that tons of projects that depend on "chardet" now drag in some crappy, still-unverified AI slop? AI forgery poisoning, IMHO.

Why did this new project need to replace the original in this dishonourable way? The proper way would have been to create a new project.

Note: even Python's own pip seems to drag this in as a dependency (hopefully they'll stick to a proper version).

robinsonb5 5 hours ago | parent | next [-]

This is indeed the real issue (not the AI angle per se, but the wholesale replacement; the licensing issue is real, but less important IMO).

Half a million lines of code have been deleted and replaced over the course of four days, directly to the main branch with no opportunity for community review and testing. (I've no idea whether depending projects use main or the stable branch, but stable is nearly 4 years old at this point, so while I hope it's the version depending projects use, I wouldn't put money on it.)

The whole thing smells a lot like a supply chain attack - and even if it's in good faith, that's one hell of a lot of code to be reviewed in order to make sure.

duskdozer 3 hours ago | parent | next [-]

The test coverage is going to be entirely different, unless of course they copied the tests, which would then preclude them from changing the license. They didn't even bother to make sure CI passed when merging a major version release: https://github.com/chardet/chardet/actions/runs/22563903687/...

earthscienceman 3 hours ago | parent | prev [-]

Woah. As someone not in this particular community but dependent on these tools, this is exactly the terrifying underbelly we've all discussed in the trust architecture of tools like pip and npm. It's horrifying that a major component just got torn apart, rebuilt, and deployed to everyone who uses these Python ecosystems (... many millions? ... billions of people?)

adrian17 3 hours ago | parent | prev | next [-]

The "drop-in" compatibility claims are also just wrong? I ran it against the old test suite from 6.0 (which is completely absent now), and on a quick check:

- the outputs, even if correctly deduced, are often incompatible: "utf-16be" turns into "utf-16-be", "UTF-16" turns into "utf-16-le" etc. FWIW, the old version appears to have been a bit of a mess (having had "UTF-16", "utf-16be" and "utf-16le" among its outputs) but I still wouldn't call the new version _compatible_,

- similarly, all `ascii` turn into `Windows-1252`

- sometimes it really does appear more accurate,

- but sometimes it appears to flip between wider families of closely related encodings, like one SHIFT_JIS test (confidence 0.99) turns into cp932 (confidence 0.34), or the whole family of tests that were determined as gb18030 (chinese) are now sometimes determined as gb2312 (the older subset of gb18030), and one even as cp1006, which AFAIK is just wrong.
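For anyone reproducing this comparison: the pure spelling mismatches ("utf-16be" vs "utf-16-be") can be separated from genuine misdetections (SHIFT_JIS vs cp932) by canonicalizing labels through the stdlib codec registry. A minimal sketch using the example labels from above:

```python
import codecs

def same_encoding(a: str, b: str) -> bool:
    """True if two encoding labels resolve to the same codec in Python's registry."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # Label unknown to the registry; fall back to a case-insensitive match.
        return a.lower() == b.lower()

print(same_encoding("utf-16be", "utf-16-be"))  # spelling difference only -> True
print(same_encoding("SHIFT_JIS", "cp932"))     # genuinely different codecs -> False
```

With this, "utf-16be" vs "utf-16-be" and "UTF-16" vs "utf-16-le" collapse into label noise, while the ascii/Windows-1252 and gb18030/gb2312 flips remain real behavioural differences.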

As for the performance claims, they appear not entirely false: analyzing all files took 20s, versus 150s with v6.0. However, the library sometimes takes 2s to lazily initialize something, which means that if you use the `chardetect` CLI instead of the Python API, you pay that cost on every invocation and end up several times slower instead.
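Back-of-the-envelope on why the CLI loses: with a ~2s one-time lazy init (the figure above) and some per-file detection cost, spawning `chardetect` per file pays the init every time, while one long-lived process pays it once. A sketch with a made-up per-file cost:

```python
INIT_S = 2.0  # approximate one-time lazy-init cost, from the observation above

def per_invocation_total(n_files: int, per_file_s: float) -> float:
    """One CLI process per file: the init cost is paid n times."""
    return n_files * (INIT_S + per_file_s)

def single_process_total(n_files: int, per_file_s: float) -> float:
    """One Python process using the API: the init cost is paid once."""
    return INIT_S + n_files * per_file_s

print(per_invocation_total(100, 0.05))   # ~205 s
print(single_process_total(100, 0.05))   # ~7 s
```

The per-file figure is invented for illustration; the point is that the fixed init cost dominates once you shell out per file.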

Oh, and this "Negligible import memory (96 B)" is just silly and obviously wrong.
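That 96 B figure is easy to falsify: `tracemalloc` can bound the Python-level allocations an import makes, and even a small pure-Python stdlib module allocates far more than 96 bytes. A sketch (swap in "chardet" for the stand-in module if you have it installed):

```python
import importlib
import tracemalloc

def import_peak_bytes(module_name: str) -> int:
    """Peak bytes of Python allocations traced while importing module_name."""
    tracemalloc.start()
    try:
        importlib.import_module(module_name)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

# ftplib is just a stand-in: even a modest stdlib module is orders of
# magnitude above 96 bytes on import.
print(import_peak_bytes("ftplib"))
```

Note that `tracemalloc` only sees Python-level allocations, so this is a lower bound on the real footprint.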

duskdozer 4 hours ago | parent | prev [-]

Yeah, there's some really low-quality code in there if you take a look.