Would standard HDL synthesis engines be better at this, in terms of schematic capture? They could do optimizations that, if I'm reading this right, weren't done here.
Well, stuff like this: if one block inverts a signal and a later block inverts it again, those optimization engines can cancel the pair out. Possibly they could also do some fanciness with complicated logical operations to reduce transistor count too. But I'm not entirely sure the benefit they give on FPGAs carries over to transistors.
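To make the double-inversion case concrete, here's a rough Python sketch of the kind of rewrite I mean. This is not how any real synthesis tool is implemented. The nested-tuple expression format and the names `cancel_double_inversions` and `gate_count` are made up for illustration.

```python
# Toy logic expressions (hypothetical format, for illustration only):
#   a bare string is a signal name, e.g. "a"
#   a tuple is a gate, e.g. ("not", x), ("and", x, y), ("or", x, y)

def cancel_double_inversions(expr):
    """Recursively rewrite NOT(NOT(x)) to x."""
    if isinstance(expr, str):  # bare signal, nothing to simplify
        return expr
    op, *args = expr
    # Simplify the inputs first so chains of inverters collapse bottom-up.
    args = [cancel_double_inversions(a) for a in args]
    if op == "not" and isinstance(args[0], tuple) and args[0][0] == "not":
        return args[0][1]  # the two inverters cancel
    return (op, *args)


def gate_count(expr):
    """Count gates in an expression (signals count as zero)."""
    if isinstance(expr, str):
        return 0
    return 1 + sum(gate_count(a) for a in expr[1:])


# One block inverts "a", a later block inverts it again, then ANDs with "b".
before = ("and", ("not", ("not", "a")), "b")
after = cancel_double_inversions(before)
print(after)                                  # ('and', 'a', 'b')
print(gate_count(before), gate_count(after))  # 3 1
```

Whether dropping those two gates actually saves transistors depends on how each gate is built, which is the part I'm unsure about.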