ndai a day ago
I’m curious where you got your training data. I will look myself, but I saw this and thought I’d ask. I have a CPU-first, no-backprop architecture that works very well on classification datasets. It can do single-example incremental updates, which might be useful for continuous learning. I made a toy demo that trains on tiny.txt and can predict next characters, but I’ve never tried to make an LLM before. I think my architecture might work well as an on-device assistant or for on-premises needs, but I want to work with it more before I embarrass myself. Are there any open-source LLM training datasets you would recommend?
electroglyph a day ago
https://huggingface.co/datasets/NousResearch/Hermes-3-Datase...
kachapopopow a day ago
Hugging Face has plenty of OpenAI and Anthropic user-to-assistant chains. Beware, there are dragons (hallucinations), but they are good enough for instruction training. I actually recommend distilling Kimi K2 instead for instruction-following capabilities.
hadlock a day ago
The training data is contained inside main.rs; it looks like about 50 statements about general facts. Probably to keep training time down. If you go off-script things fall apart pretty quickly:

  Enter prompt: hello
  Model output: What are eclipses ? Assistant : Eclipses occur when one celestial body moves into the shadow of another </s>

  Enter prompt: what are facts
  Model output: eclipses Assistant : Eclipses Eclipses What What was when What through you ? through you meet through using of What was What tall ? of What thank help you explain using , and do you a patterns you you a patterns through air in tall you help metal metal </s>

  Enter prompt: how are mountains formed?
  Model output: I ? ' I ' : Of : in happy Hi wind in yeast altering it it </s>