I wrote an MCP based on that technique - https://github.com/whs/mcp-chinesewall

Basically, to avoid the ambiguity of LLMs trained on unlicensed code, I use it to generate a description of the code and hand that description to another LLM trained on permissively licensed code. (I haven't found any usable public domain models.)

I use it in the real world, and the codegen model seems to work only 10-20% of the time (the description is not detailed enough, which is good for "clean room" purposes, but a base model can't follow it). Any model can review the code, retry, and write its own implementation based on the codegen result, though.
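
Roughly, the describe / generate / review / retry loop could be sketched like this. This is only an illustration of the idea, not the actual mcp-chinesewall code: `describe`, `generate`, `review`, and `clean_room_rewrite` are hypothetical names, and the three callables stand in for whatever model calls you wire up.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for model calls; none of these names come from
# mcp-chinesewall itself.
Describe = Callable[[str], str]      # reads the original code, returns a prose description
Generate = Callable[[str], str]      # permissively trained model: description -> code
Review = Callable[[str, str], bool]  # does the candidate satisfy the description?


def clean_room_rewrite(
    source: str,
    describe: Describe,
    generate: Generate,
    review: Review,
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate that passes review, or None if every attempt fails."""
    # Only the description crosses the wall; the codegen model never sees `source`.
    description = describe(source)
    for _ in range(max_attempts):
        candidate = generate(description)
        if review(description, candidate):
            return candidate
    return None
```

With a success rate in the 10-20% range, `max_attempts` matters: a single call will usually come back empty, so the caller has to either retry or fall back to writing its own implementation from the failed candidates.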

ghuntley 7 days ago | parent [-]

Nice. Any chance you could put in some attributions and credits in your paper? https://orcid.org/0009-0007-3955-9994

	whs 7 days ago | parent [-]
		I never read your work, though (and still haven't, since it's paywalled). I only found out today that we independently discovered the same thing.