pfdietz 14 hours ago

Well, if you're programming in C or C++, there may not be a parse tree. Tree-sitter makes a best-effort attempt to parse, but in general it can't succeed, because the preprocessor can rewrite the source text before the real parser ever sees it.
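
To illustrate the point, here is a minimal sketch, assuming a made-up C fragment and a toy `select_branch` helper (both hypothetical, and handling only a single non-nested `#ifdef`/`#else`/`#endif`). The raw text has unbalanced braces, so no single parse tree covers it, while each preprocessed configuration is balanced:

```python
# Hypothetical C fragment: each configuration is well-formed on its own,
# but the unpreprocessed text contains two "{" and only one "}".
C_SOURCE = """\
#ifdef USE_FOO
if (foo(x)) {
#else
if (bar(x)) {
#endif
    do_thing();
}
"""


def brace_balance(text):
    """Return the number of unmatched opening braces in the text."""
    return text.count("{") - text.count("}")


def select_branch(source, defined):
    """Toy stand-in for a preprocessor (an assumption, not a real cpp).

    Handles only non-nested #ifdef / #else / #endif and drops the
    directive lines themselves.
    """
    out, keep = [], True
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#ifdef"):
            keep = stripped.split()[1] in defined
        elif stripped.startswith("#else"):
            keep = not keep
        elif stripped.startswith("#endif"):
            keep = True
        elif keep:
            out.append(line)
    return "\n".join(out)


raw_balance = brace_balance(C_SOURCE)
with_foo = select_branch(C_SOURCE, {"USE_FOO"})
without_foo = select_branch(C_SOURCE, set())
```

A parser working on the raw file has to guess which branch is meant, or recover from the extra brace; only the per-configuration text is ordinary C.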

rs545837 14 hours ago

Great point. C/C++ with macros and preprocessor directives is where tree-sitter's error recovery gets stretched. We support both C and C++ in sem-core (https://github.com/Ataraxy-Labs/sem), but the entity extraction is best-effort for heavily macro'd code. For most application-level C++ it works well, but something like the Linux kernel would be rough. Honestly, that's an argument for gritzko's AST-native storage approach, where the parser can be more tightly integrated.

pfdietz 9 hours ago

It's an argument against preprocessors for programming languages.

Tree-sitter's error handling is constrained by its intended use in editors, where incrementality and efficiency are important. For diffing/merging, a more elaborate parsing algorithm might be better, for example one that uses an Earley/CYK-like algorithm but attempts to minimize some error term (an extension that fits a dynamic programming algorithm naturally).
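
A rough sketch of that idea, under loud assumptions: a toy balanced-parentheses grammar in Chomsky normal form (the `BINARY` and `TERMINAL` tables are hypothetical), and an error term that counts only token substitutions. A real tool would also need insertions and deletions, so inputs that cannot be fixed by substitution alone come back as infinite cost here:

```python
import math

# Hypothetical toy grammar in Chomsky normal form for balanced parentheses:
#   S -> S S | L R | L X      X -> S R      L -> "("      R -> ")"
BINARY = [("S", "S", "S"), ("S", "L", "R"), ("S", "L", "X"), ("X", "S", "R")]
TERMINAL = [("L", "("), ("R", ")")]


def min_error_parse_cost(tokens, start="S"):
    """CYK table where each cell holds a minimum error cost, not a boolean.

    best[(i, j)][A] is the fewest token substitutions needed for
    nonterminal A to derive tokens[i:j]. Returns math.inf when no
    substitution-only repair exists (for example, odd-length input).
    """
    n = len(tokens)
    best = {}

    # Length-1 spans: a terminal rule matches for free, or at cost 1
    # if the token has to be replaced.
    for i, tok in enumerate(tokens):
        cell = {}
        for lhs, term in TERMINAL:
            cost = 0 if tok == term else 1
            cell[lhs] = min(cell.get(lhs, math.inf), cost)
        best[(i, i + 1)] = cell

    # Longer spans: try every split point and keep the cheapest derivation.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):
                left, right = best[(i, k)], best[(k, j)]
                for lhs, b, c in BINARY:
                    if b in left and c in right:
                        cost = left[b] + right[c]
                        if cost < cell.get(lhs, math.inf):
                            cell[lhs] = cost
            best[(i, j)] = cell

    return best.get((0, n), {}).get(start, math.inf)
```

For example, `min_error_parse_cost(list("(()("))` finds that one substitution (the last token) is enough. The table is the usual O(n^3) CYK one; the only change is taking a minimum over costs instead of an "or" over booleans.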

rs545837 4 hours ago

Interesting idea. Tree-sitter's trade-off (speed + incrementality over completeness) makes sense for editors, but you're right that for merge/diff a more thorough parser could be worth the cost, since it's a cold path, not real-time. We only parse three file versions at merge time, so spending an extra 50ms on a better parse would be fine. Worth exploring, thanks for the pointer.