| ▲ | julianlam 9 hours ago | |||||||
Can you share more about what adaptations you made when using smaller models? I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it. | ||||||||
| ▲ | ukuina 8 hours ago | parent | next [-] | |||||||
You'd do most of the planning/cognition yourself, down to the module/method signature level, and then have it loop through the plan to "fill in the code". Need a strong testing harness to loop effectively. | ||||||||
| ▲ | adrian_b 8 hours ago | parent | prev [-] | |||||||
It is very unlikely that general claims about a model are useful, but only very specific claims, which indicate the exact number of parameters and quantization methods that are used by the compared models. If you perform the inference locally, there is a huge space of compromise between the inference speed and the quality of the results. Most open weights models are available in a variety of sizes. Thus you can choose anywhere from very small models with a little more than 1B parameters to very big models with over 750B parameters. For a given model, you can choose to evaluate it in its native number size, which is normally BF16, or in a great variety of smaller quantized number sizes, in order to fit the model in less memory or just to reduce the time for accessing the memory. Therefore, if you choose big models without quantization, you may obtain results very close to SOTA proprietary models. If you choose models so small and so quantized as to run in the memory of a consumer GPU, then it is normal to get results much worse than with a SOTA model that is run on datacenter hardware. Choosing to run models that do not fit inside the GPU memory reduces the inference speed a lot, and choosing models that do not fit even inside the CPU memory reduces the inference speed even more. Nevertheless, slow inference that produces better results may reduce the overall time for completing a project, so one should do a lot of experiments to determine an appropriate compromise. When you use your own hardware, you do not have to worry about token cost or subscription limits, which may change the optimal strategy for using a coding assistant. Moreover, it is likely that in many cases it may be worthwhile to use multiple open-weights models for the same task, in order to choose the best solution. For example, when comparing older open-weights models with Mythos, by using appropriate prompts all the bugs that could be found by Mythos could also be found by old models, but the difference was that Mythos found all the bugs alone, while with the free models you had to run several of them in order to find all bugs, because all models had different strengths and weaknesses. (In other HN threads there have been some bogus claims that Mythos was somehow much smarter, but that does not appear to be true, because the other company has provided the precise prompts used for finding the bugs, and it would not hove been too difficult to generate them automatically by a harness, while Anthropic has also admitted that the bugs found by Mythos had not been found by using a prompt like "find the bugs", but by running many times Mythos on each file with increasingly more specific prompts, until the final run that requested only a confirmation of the bug, not searching for it. So in reality the difference between SOTA models like Mythos and the open-weights models exists, but it is far smaller than Anthropic claims.) | ||||||||
| ||||||||