bayesnet 4 hours ago
I know this is grumpy, but I've never liked this answer. It is a perfect encapsulation of the elitism in the SO community—if you're new, your questions are closed and your answers are edited and downvoted. Meanwhile, this is tolerated only because it's posted by a member with high rep and username recognition.
1718627440 3 hours ago
I think this answer was tolerated back when SO wasn't as bad as it is now, and today it wouldn't be tolerated from anyone.
throwaway_61235 3 hours ago
As someone who used to write custom crawlers 20 years ago, I can confirm that regular expressions worked great. All my crawlers were custom-designed for a page, and the sites were mostly generated by some CMS and had consistent HTML. I don't remember having to do much bug-fixing related to regular expression issues. I don't suggest writing a generic HTML parser that works with any site, but for custom crawlers they work great. That's not to say the tools available now are the same as 20 years ago. Today I would probably use Puppeteer or some similar tool and query the DOM instead.
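To illustrate the point about site-specific crawlers (this is my own minimal sketch, not the poster's actual code; the HTML snippet and pattern are hypothetical): when one CMS template generates every page, a single regex tuned to that template can extract links reliably, even though it would be hopeless against arbitrary HTML.

```python
import re

# Hypothetical CMS-generated listing: every item follows the exact
# same template, which is what makes a per-site regex viable.
html = """
<div class="item"><a href="/post/1">First post</a></div>
<div class="item"><a href="/post/2">Second post</a></div>
"""

# Site-specific pattern: fragile for HTML in general, but stable as
# long as this one template doesn't change.
pattern = re.compile(r'<div class="item"><a href="([^"]+)">([^<]+)</a></div>')

links = pattern.findall(html)
for href, title in links:
    print(href, title)
```

The trade-off is exactly the one described above: this breaks the moment the site's markup changes, which is why a headless-browser approach (e.g. querying the DOM with Puppeteer's `page.$$eval`) is the more robust choice today.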