The problem with regex is multi-language support, and how much the regex will bloat if you want to support even 10 languages.

doublesocket 4 hours ago | parent | next [-]

Supporting 10 different languages in regex is a drop in the ocean. The regex can be generated programmatically and you can compress regexes easily. We used to have a compressed regex that could match any placename or street name in the UK in a few MB of RAM. It was silly quick.
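
A rough sketch of what "generated programmatically" could look like, in Python. The word lists, the language codes, and the `build_pattern` / `find_hits` helpers are all made up for illustration; this is not the setup described above, just one plausible shape for it.

```python
import re

# Hypothetical per-language word lists (assumption: real lists would be
# loaded from data files and be far larger).
WORDS_BY_LANGUAGE = {
    "en": ["damn", "broken", "useless"],
    "es": ["maldito", "roto", "inútil"],
    "de": ["verdammt", "kaputt", "nutzlos"],
}

def build_pattern(words_by_language):
    """Merge every word list into one case-insensitive alternation."""
    words = {w for ws in words_by_language.values() for w in ws}
    # Longest first, so a longer word is tried before any word that is its prefix.
    ordered = sorted(words, key=lambda w: (-len(w), w))
    alternation = "|".join(re.escape(w) for w in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

PATTERN = build_pattern(WORDS_BY_LANGUAGE)

def find_hits(text):
    """Return every listed word found in text, lowercased."""
    return [m.group(0).lower() for m in PATTERN.finditer(text)]
```

Adding an eleventh language here is one more dictionary entry; the pattern itself is never edited by hand.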

astrocat 2 hours ago | parent | next [-]

Whoa. This is a regex use I've never heard of. I'd absolutely love to see a writeup on this approach: how it's done and when it's useful.

benlivengood an hour ago | parent [-]

You can literally | together every street address or other string you want to match into one giant disjunction, and then run DFA/NFA minimization over that to get it down to a reasonable size. Maybe there are some fast regex simplification algorithms as well, but working directly with the finite automata has decades of research behind it and can probably be optimized more fully.
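
For a finite list of strings, the idea can be sketched without a regex engine at all: a trie is already a DFA for the disjunction, and merging states that accept exactly the same suffixes yields the minimal acyclic DFA. The Python below is a minimal sketch under that assumption; `Node`, `build_trie`, `minimize`, `count_states`, and `matches` are illustrative names, and general minimization for patterns with loops (for example Hopcroft's algorithm) is not covered.

```python
class Node:
    """One DFA state: outgoing edges plus an accepting flag."""
    def __init__(self):
        self.edges = {}      # char -> Node
        self.final = False

def build_trie(words):
    """Build the trie, i.e. the unminimized DFA for word1|word2|..."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root

def minimize(root):
    """Merge states that accept the same set of suffixes, bottom-up."""
    registry = {}  # state signature -> canonical Node

    def canon(node):
        # Canonicalize children first, so equal subtrees share one object.
        for ch, child in list(node.edges.items()):
            node.edges[ch] = canon(child)
        key = (node.final,
               tuple(sorted((ch, id(c)) for ch, c in node.edges.items())))
        return registry.setdefault(key, node)

    return canon(root)

def count_states(root):
    """Count the distinct states reachable from root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)

def matches(root, text):
    """Exact-match text against the automaton."""
    node = root
    for ch in text:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final

STREETS = ["high street", "park street", "mill street",
           "high road", "park road", "mill road"]
trie = build_trie(STREETS)
states_before = count_states(trie)
dfa = minimize(trie)
states_after = count_states(dfa)
```

The shared " street" and " road" tails collapse into a single copy each, which is where most of the savings on real place-name data would presumably come from.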

cogman10 2 hours ago | parent | prev [-]

I think it will depend on the language. There are a few non-Latin languages where a simple word search likely won't be enough for a regex to apply properly.
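
One concrete way this can bite, as an illustration rather than a claim about what was meant above: a whole-word search built on `\b` relies on non-word characters around the term, and scripts written without spaces between words don't provide them. The helper names and the sample sentences are made up for the sketch.

```python
import re

def whole_word_hit(word, text):
    """Whole-word search using \\b, the usual approach for space-separated text."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

def substring_hit(word, text):
    """Plain substring search, with no notion of word boundaries."""
    return word in text

english = "my cat is cute"
chinese = "我的猫很可爱"   # same sentence, written without spaces

english_whole = whole_word_hit("cat", english)   # boundaries exist around "cat"
chinese_whole = whole_word_hit("猫", chinese)     # neighbours are word characters too
chinese_sub = substring_hit("猫", chinese)        # the term is present as a substring
```

In Python's `re`, the characters on both sides of 猫 count as word characters, so there is no `\b` position around it and the whole-word search fails even though the term is there. Dropping `\b` makes it match, but a proper fix would likely need a word segmenter for that language.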

TeMPOraL 4 hours ago | parent | prev | next [-]

We're talking about Claude Code. If you're coding and not writing or thinking in English, the agents and people reading that code will have bigger problems than a regexp missing a swear word :).

MetalSnake 4 hours ago | parent | next [-]

I talk to it in a language other than English, but I have rules that everything in code and documentation must be in English. Only conversation with me should use my native language. Why would that be a problem?

ekropotin 3 hours ago | parent [-]

Because 90% of the training data was in English, and therefore the model performs best in that language.

foldr 3 hours ago | parent [-]

In my experience these models work fine using another language, if it’s a widely spoken one. For example, sometimes I prompt in Spanish, just to practice. It doesn’t seem to affect the quality of code generation.

ekropotin an hour ago | parent | next [-]

That’s just a subjective observation. It can’t be the case, simply because of how ML works. In short, the more diverse, high-quality texts with reasoning-rich examples there were in the training set, the better the model performs in a given language. So unless the Spanish subset had much more quality-dense examples to make up for the volume, there is no way the quality of reasoning in Spanish is on par with English. I apologise for the rambling explanation; I’m sure someone with ML expertise here can explain it better.

adamsb6 3 hours ago | parent | prev [-]

They literally just have to subtract the vector for the source language and add the vector for the target. It’s the original use case for LLMs.

cryptonector an hour ago | parent | prev | next [-]

Claude handles human languages other than English just fine.

formerly_proven 4 hours ago | parent | prev [-]

In my experience agents tend to (counterintuitively) perform better when the business language is not English / does not match the code's language. I'm assuming the increased attention mitigates the higher "cognitive" load.

crimsonnoodle58 4 hours ago | parent | prev | next [-]

They only need to look at one language to get a statistically meaningful picture of common flaws in their model(s) or application.

If they want to drill down to flaws that only affect a particular language, then they could add a regex for that as well/instead.

b112 4 hours ago | parent | prev [-]

Did you just complain about bloat, in anything using npm?