schyzomaniac 6 days ago

hi, congrats on the amazing work!

i love the 27b model, and i use it basically daily. however, when i tried to finetune it for a task in a low-resource language, i unfortunately did not succeed: LoRA just did not pick up the gist of the task, and a full finetune led to catastrophic forgetting.

may i ask for your advice, or do you have any general tips on how to do this properly?

thanks in advance for your help :)

canyon289 6 days ago

Without seeing the full experiment and data it's hard to tell, sort of like guessing why a soup tastes bad without trying it, but here are my guesses!

1. Good instinct with LoRA and PEFT. As others suggested below, try changing the hyperparameters: make the LoRA adapter bigger (higher rank), raise the learning rate, or train for more epochs. See where things start to shift from "nothing" to closer to what you want (a config sketch follows after this list).

2. For the full finetune, track earlier checkpoints to see where the forgetting is happening. For instance, if you're training for 1000 steps, check steps 100, 200, 300, etc. You'll see where the shift starts and where it becomes too much (a checkpointing sketch follows after this list). Here is an example where you can see the LLM start to pick up "words" and then sentences as it goes through training: https://ravinkumar.com/GenAiGuidebook/deepdive/GPTFromScratc...

3. Use smaller models for testing before moving up. Part of the reason we released this small Gemma is to support the larger Gemma models as well. Testing changes on small models lets you see more quickly and cheaply what's working and what isn't before scaling up to fine-tuning the bigger models.
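
A minimal sketch of point 1, assuming a Hugging Face PEFT + transformers setup (not the exact stack used for Gemma); the rank, alpha, learning rate, module names, and paths are illustrative values to sweep, not a recommended recipe:

    # Sketch only: give the LoRA adapter more capacity and training signal, then sweep one knob at a time.
    from peft import LoraConfig
    from transformers import TrainingArguments

    lora_config = LoraConfig(
        r=64,                  # try 16 -> 32 -> 64; a bigger rank gives the adapter more capacity
        lora_alpha=128,        # commonly set to roughly 2x the rank
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumes Gemma-style module names
        task_type="CAUSAL_LM",
    )

    training_args = TrainingArguments(
        output_dir="gemma-lora-sweep",      # placeholder path
        learning_rate=2e-4,                 # LoRA usually tolerates a higher LR than full finetuning
        num_train_epochs=3,                 # more passes over a small low-resource dataset
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        logging_steps=10,
    )

Wrap the base model with get_peft_model(model, lora_config) and train as usual; the point is to change one knob at a time and watch a held-out eval, not just the training loss.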
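
And a minimal sketch of point 2, again assuming a transformers Trainer run: save a checkpoint every 100 steps, then probe each one with a general-knowledge prompt so you can see roughly where forgetting sets in. Paths, the probe prompt, and the step count are placeholders.

    # Sketch only: keep intermediate checkpoints, then eyeball general-capability drift across them.
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="full-ft-run",
        max_steps=1000,
        save_strategy="steps",
        save_steps=100,           # writes checkpoint-100, checkpoint-200, ...
        save_total_limit=10,
    )

    def probe(checkpoint_dir, prompt="Briefly explain why the sky is blue."):
        # Generate from one checkpoint; compare outputs across steps to spot forgetting.
        tok = AutoTokenizer.from_pretrained(checkpoint_dir)
        model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
        inputs = tok(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64)
        return tok.decode(out[0], skip_special_tokens=True)

    for step in range(100, 1100, 100):
        print(step, probe(f"full-ft-run/checkpoint-{step}")[:200])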

Hope these tips help, and thanks for using LLMs for localization, for what sounds like tasks to help your specific community, and for sharing here. It's personally motivating for me to hear that people are using technology in this way.

ActorNightly 6 days ago

Feed in context with documentation for that language?
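
If it helps, one minimal sketch of that idea, assuming a transformers text-generation pipeline; the model id, file name, and prompt are placeholders:

    # Sketch only: put the language documentation in the context window instead of training on it.
    from transformers import pipeline

    with open("grammar_notes.md") as f:   # hypothetical reference doc for the language
        docs = f.read()

    generator = pipeline("text-generation", model="google/gemma-3-27b-it")  # placeholder model id

    prompt = (
        "Here is reference documentation for the language:\n\n"
        f"{docs}\n\n"
        "Using only the reference above, translate into that language: 'Where is the library?'"
    )
    print(generator(prompt, max_new_tokens=128)[0]["generated_text"])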

namibj 6 days ago

LoRA hyperparameter change? Defaults may not be tuned for knowledge insertion, but rather for style imprinting.
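
For example, a hypothetical contrast with Hugging Face PEFT, assuming Gemma-style module names: a small, attention-only adapter tends to imprint style, while a higher-rank adapter that also covers the MLP layers has more room to absorb new knowledge.

    # Sketch only: two LoRA configs with different capacity and coverage.
    from peft import LoraConfig

    style_config = LoraConfig(
        r=8, lora_alpha=16,
        target_modules=["q_proj", "v_proj"],           # attention projections only
        task_type="CAUSAL_LM",
    )

    knowledge_config = LoraConfig(
        r=64, lora_alpha=128,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],  # include the MLP layers too
        task_type="CAUSAL_LM",
    )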