ashirviskas, a day ago:
Wanted to save a few tokens when passing data to LLMs and didn't like anything on the market, so I made minemizer. Minemizer is a data formatter that produces CSV-like output but supports nested and sparse data, is human readable, and is super simple. It produces even fewer tokens than CSV for flat data: most tokenizers encode a word more efficiently when it is preceded by a space, so space-separated values lead to less fragmentation. I discovered many cool things while running tons of testing and benchmarking, but it's getting late here. Code, benchmarks, tokenization examples and everything else can be found in the repo, though it is still very WIP: https://github.com/ashirviskas/minemizer Or here: https://ashirviskas.github.io

EDIT: Ignore the latency timings and token counts in the "LLM Accuracy Summary" in the benchmarks; different-size datasets were used to generate the accuracy numbers while I was running tons of experiments. For accurate compression numbers, see the compression benchmark results, or each benchmark one by one. I will eventually fix all the benchmark numbers to be representative.
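The space-prefix effect mentioned above can be sketched with a toy example. This is not minemizer's implementation, just an illustration of the mechanism: GPT-style BPE vocabularies contain many " word" entries (leading space included), so a value preceded by a space often matches a single vocabulary entry, while the same value glued to a comma fragments into sub-word pieces. The vocabulary below is a made-up slice for demonstration only.

```python
# Hypothetical slice of a BPE-style vocabulary (assumption for illustration):
# whole words with a leading space exist as single entries, but the bare
# words do not, so they must be assembled from smaller pieces.
VOCAB = {" hello", " world", ",", " ",
         "he", "llo", "wor", "ld",
         "h", "e", "l", "o", "w", "r", "d"}

def greedy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization against VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"untokenizable character: {text[i]!r}")
    return tokens

# CSV-style: values glued to a comma fragment into sub-word pieces.
print(greedy_tokenize("hello,world"))   # ['he', 'llo', ',', 'wor', 'ld']
# Space-separated: each value matches one " word" vocabulary entry.
print(greedy_tokenize(" hello world"))  # [' hello', ' world']
```

With a real tokenizer the effect is the same in spirit, though the exact splits depend on the model's vocabulary.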
Minor49er, 18 hours ago (reply):
Why the name Minemizer instead of something like Minimizer?