jp57 4 days ago

"git doesn't really work ... because docx is a binary blob."

Well, yes, but the binary blob is a zip archive of a directory of text XML files, and one could imagine tooling that wraps the git interaction in an unzip/zip bracket.
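A minimal sketch of that unzip/zip bracket in Python, using only the standard library. The function names `unpack_docx` and `pack_docx` are hypothetical, chosen for illustration. A real wrapper would also have to write the parts into a directory that git tracks and hook into commit/checkout, none of which is shown here.

```python
import io
import zipfile
from typing import Dict


def unpack_docx(docx_bytes: bytes) -> Dict[str, bytes]:
    """Return every part of a .docx archive, keyed by its path inside the zip."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


def pack_docx(parts: Dict[str, bytes]) -> bytes:
    """Rebuild a .docx archive from parts produced by unpack_docx."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in parts.items():
            archive.writestr(name, data)
    return buffer.getvalue()
```

The repacked archive holds the same parts, but it is not guaranteed to be byte-for-byte identical to the original, since timestamps and compression details can differ. That is one reason to commit the unpacked XML rather than the zip.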

The real problem is that lawyers, like basically all other non-programmers, neither know nor care about the sequence of bytes that makes a file in the minds of programmers. In their minds the file IS what they see when they open it in Word: a sequence of white rectangles with text laid out on it in specific ways, including tables with borders, etc. The fact that a lot of really complicated stuff goes on inside the file to get the WYSIWYG rendering is not only irrelevant to them, it's unknown.

Maybe the answer here will be along the lines of Karpathy's musings about making LLMs work directly with pixels (images of text), instead of encoded text and tokenizers [1]. An AI tool would take the document in its visually standard legal form, read it, and produce output with edits, redlines, etc. as directed by the user.

[1] https://x.com/karpathy/status/1980397031542989305

jpbryan 4 days ago | parent | next [-]

Diffing the XML is a complete nonstarter. I've spent years working with the OpenXML format and can assure you it is very complex even for a professional software engineer with 10 years of experience.

The diff of the document (referred to as a "redline") is what lawyers send to the client and their counterparties. It's essential that the redline is legible for all parties and reflects their professionalism.

Moreover, it is not enough to see the structural changes between the versions. A lawyer needs to see the formatting changes between the versions as well, which cannot be accomplished by diffing XML files.

HPsquared 4 days ago | parent | next [-]

And, importantly, there already is an official diff tool: the "Compare" button.

jpbryan 4 days ago | parent [-]

Correct. Relying solely on the built-in Word Compare tool, however, results in a whole host of version control issues, which I outline in detail in my post "On Building Git for Lawyers."

https://theredline.versionstory.com/p/on-building-git-for-la...

xg15 3 days ago | parent | next [-]

Git supports registering custom diff tools for specific file types [1]

Wouldn't the obvious solution then be to take the tool they already use for redlining (e.g. Word's compare function) and integrate it into a git workflow?

[1] https://stackoverflow.com/questions/12356917/how-to-set-diff...
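As a rough illustration of the hook git offers here: this is not Word's Compare, only a plain-text stand-in that drops all formatting. Git's `textconv` mechanism runs a program with the file path as its single argument and diffs whatever that program prints. The script below is a sketch under that assumption; the name `docx_to_text` and the file name `docx_to_text.py` are hypothetical. It would be wired up with a `.gitattributes` line `*.docx diff=docx` and `git config diff.docx.textconv "python3 docx_to_text.py"`.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main WordprocessingML document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path: str) -> str:
    """Return the text of a .docx file, one line per paragraph (formatting is lost)."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # Join the text runs that make up this paragraph.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)


# Git passes the path of the file to convert as the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(docx_to_text(sys.argv[1]))
```

Plugging in Word's own comparison would need an external diff driver that launches Word, which is a different and heavier integration than this sketch.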

Imustaskforhelp 3 days ago | parent | prev | next [-]

Pardon me, but is there any way that OpenXML can be converted to a format similar to recutils (https://www.gnu.org/software/recutils/)?

Perhaps OpenXML could be converted to CSV or something similar, which could then be converted to recutils.

Recutils ships converters for CSV in both directions (csv2rec, rec2csv) and an importer for mdb (Microsoft Access database) files (mdb2rec).

I saw this project in a recent Hacker News comment, and I had seen some comments there about how it does / can work decently with git features, iirc (https://news.ycombinator.com/item?id=46265811)

I am interested to hear your thoughts on recutils, and whether we could perhaps have a Microsoft Word-like workflow built on git + recutils.

I thought about it, and a tarred/zipped git folder, which could also contain images and other content referenced from recutils, feels like an interesting alternative to an OpenXML/Word document.

I am not sure, but I think OpenXML directly embeds data like pictures, which can definitely make it hard for git to work with. Basically, I am interested in what you think about this, and in any feedback.

conartist6 3 days ago | parent | prev [-]

You don't seem to be aware of any of the work I'm doing on CSTML (built to replace HTML and XML, and yes, built to be useful for legal documents (even though IANAL)). If you're interested in collaborating to go after the law market, let's talk! You're trying to sneak in a side door. I'm planning to smash down the main gates, the ones you say are impregnable. My investigation says they're not unbreakable, but instead strong and brittle. Many attacks will bounce off, yes, but brittleness means that these are defenses that will shatter before they bend.

jiggawatts 4 days ago | parent | prev [-]

Something I've started doing in my workflow is using Pandoc to convert between Markdown and DOCX when authoring long documents. This lets me put the Markdown into Git and apply the Gemini CLI to it. When referencing other documents, I'll also convert them to MD and drop them into a folder so I can tell the AI to read them and cross-reference things.

At the start of the project the Markdown is authoritative, and the DOCX is just for previewing the styling. (Pandoc can insert the text into a layout template with placeholders.)

Towards the end of a project I'll start treating the DOCX as authoritative but continue generating Markdown from it, so I can run the AI over it as a final proof-read or whatever.

This is similar to what people used to do with DocBook, but with a more friendly text format and a more AI-friendly "modern" workflow with Git, etc...
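For anyone wanting to try the round trip described above, here is a hedged sketch of the two Pandoc invocations, driven from Python. The function names and file paths are hypothetical. `--reference-doc` is one real Pandoc option for applying a styled DOCX template, but the comment does not say which templating mechanism was actually used, so treat that part as an assumption.

```python
import subprocess
from typing import List, Optional


def markdown_to_docx_cmd(md_path: str, docx_path: str,
                         reference_doc: Optional[str] = None) -> List[str]:
    """Build the pandoc command that renders Markdown into a DOCX preview."""
    cmd = ["pandoc", md_path, "--from", "markdown", "--to", "docx",
           "--output", docx_path]
    if reference_doc:
        # Assumption: styling comes from a reference DOCX template.
        cmd.append("--reference-doc=" + reference_doc)
    return cmd


def docx_to_markdown_cmd(docx_path: str, md_path: str) -> List[str]:
    """Build the pandoc command that regenerates Markdown from an edited DOCX."""
    return ["pandoc", docx_path, "--from", "docx", "--to", "markdown",
            "--output", md_path]


def run(cmd: List[str]) -> None:
    """Run a command, raising CalledProcessError if pandoc exits non-zero."""
    subprocess.run(cmd, check=True)
```

Early in a project one would call `run(markdown_to_docx_cmd("contract.md", "contract.docx", "template.docx"))`; once the DOCX becomes authoritative, `run(docx_to_markdown_cmd("contract.docx", "contract.md"))` regenerates the Markdown that goes into Git.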

conception 4 days ago | parent [-]

I do this with AsciiDoc instead: same advantages with Git and LLMs, but you get a tremendous amount more styling and functionality.