Thank you!
Yea absolutely, but man, where to even start, it is very specific.
Fundementally I didn't use any wrappers like unsloth or axolotl, although I have used the latter before a year or two back and it was good, but I needed something very very custom. I also wanted the whole fine tuning pipeline to exported OpenVino model to be seamless.
I heavily leaned on codex, claude and some manual sleuthing around the internet to understand what I needed. I'd played about with QLoRA finetuning with axolotl before and felt most comfortable with that. So I needed to keep everything as stripped down as possible and figured I can just utilise the 3 main huggingface libraries (transformers, peft and datasets) and also bitsandbytes (as suggested by claude to quantize the model to keep this working on my GPU) along with some custom scripts generated by claude/codex (each cross referencing each other) that will do the different stages of the training run.
The next part was the data. Obviously didn't have access to thousands of meetings and associated output documents but I did have a 3090ti sitting there and a codex subscription. So I set about working out what format I needed the data in (many thanks again, to claude/codex) and started generating hundreds of different transcripts, different amounts of speakers, content, tones, subjects, spelling mistakes - like all the different things you could think a meeting would have. Then it's a case of actually generating a good meeting document off the back of the transcripts and creating the "gold standard" that we'd use.
I'm going to gloss over a lot here as I'd rather not detail it as it relates to some propriatary stuff that I had to work through, but you basically pair the transcripts together and run the training.
At the verification stage, there was pretty much 3 things:
1. "just" do some regex string matching to see if there's any of the source transcript key facts in the output to ensure fact preservation. Same with owner fabrication (who said what), I don't want something attributed to someone when it wasn't them that said it and then finally markdown validation.
2. Using codex/claude to validate the transcript and output from the model - I used the latest frontier models, probably overkill for my task, but they were good at the job
3. Finally me going through some actual recordings of myself, groups, meetings and manually verifiying the output
So a fair bit of work, and for context I'm on version 10 now, so it's been a journey!