howardyou 2 days ago

Touching on what you were saying about accuracy converging like a log-like curve while computation increases exponentially, do you have an example where increasing computational resources by ten times leads to, say, only a 20% improvement in accuracy?

godelski a day ago

What I said before is a bit handwavy so I want to clarify it first. If we assume there is something that's 100% accurate, I'm saying that your curve will typically make the most gains at the start and much smaller ones at the end. There are additional nuances here when discussing the limitations of metrics, but I'd table that for your current stage (it is an incredibly important topic, so make sure you come back to it; you just need some pre-reqs to get a lot out of it[0]).
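
To answer your 10x question directly, here's a toy sketch. The saturating curve and its constants are entirely made up by me (not any real model); the point is only the shape: each 10x in compute buys a smaller accuracy gain than the last.

    import math

    ceiling = 1.0   # assume something 100% accurate exists
    rate = 0.3      # arbitrary constant I picked; controls how fast we saturate

    def accuracy(compute):
        # log-like convergence: gains shrink as compute grows
        return ceiling * (1 - math.exp(-rate * math.log10(compute)))

    prev = accuracy(1e3)
    for compute in (1e4, 1e5, 1e6, 1e7):
        acc = accuracy(compute)
        print(f"{compute:>9.0e} compute: acc={acc:.3f} (+{acc - prev:.3f} for 10x more)")
        prev = acc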

So maybe a classic example of this is the infamous 80/20 rule. You can read about the Pareto Principle[1], which really stems from the Pareto Distribution, a form of power-law distribution. If you look at the wiki page for the Pareto Distribution (or for power laws), you'll see the shapes I'm talking about.
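
If you want a quick numerical feel for that shape, here's a rough sketch (the alpha parameter and sample count are my choices, picked so the split comes out near 80/20): sample from a Pareto distribution and check how much of the total mass the top 20% of samples hold.

    import numpy as np

    rng = np.random.default_rng(0)
    # alpha ~1.16 gives roughly the classic 80/20 split; "+1" shifts to a classical Pareto
    samples = rng.pareto(a=1.16, size=100_000) + 1
    samples.sort()

    top20 = samples[int(0.8 * len(samples)):].sum()
    print(f"top 20% of samples hold ~{100 * top20 / samples.sum():.0f}% of the total")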

A real-life example of this is training a machine learning model. Let's take accuracy, just for simplicity. Look at PyTorch's tutorial on using TensorBoard, since it includes a plot at the very end[2]. Their metric is loss, which for our purposes you can treat as the inverse of accuracy: accuracy runs from 0 to 1 where higher is better, loss is roughly 1 - accuracy, so 0 means perfectly accurate. From 0 to 2k iterations they went from 1 to 0.6 (a 0.4 gain). At 4k iterations they are at a 0.4 loss (a 0.2 gain over the next 2k iterations). You see how this continues? It is converging towards a loss of 0.2 (accuracy = 80%). This is exactly what I'm talking about: look at your improvement over some fixed delta (in our case, loss per 2k iterations). The improvement itself shrinks, which is a second-order effect, meaning it's non-linear.
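
To make those deltas concrete, here's a toy curve I fit by eye to the numbers above (not the tutorial's actual training code): it starts at 1.0, converges to 0.2, and halves the remaining gap every 2k iterations, which reproduces the 1 -> 0.6 -> 0.4 pattern.

    import math

    def loss(it):
        # toy fit: floor of 0.2, gap of 0.8 halved every 2000 iterations
        return 0.2 + 0.8 * math.exp(-math.log(2) * it / 2000)

    prev = loss(0)
    for it in range(2000, 14001, 2000):
        cur = loss(it)
        print(f"iter {it:>5}: loss={cur:.3f}  gain over last 2k iters={prev - cur:.3f}")
        prev = cur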

This nonlinearity shows up everywhere. Going back to the 80/20 rule, it is often applied to coding: 80% of the code is written in 20% of the time, but the remaining 20% of the code takes 80% of the time. This should make sense, as there are different bottlenecks, and we'd be naive to measure just by lines of code (see [0]). A lot of time is spent on debugging, right? And mostly debugging just a few key areas. The reason this is true derives from a simple fact: not all lines of code are equally important.

The other example I mentioned in the previous comment is Fourier Series[3]. That wiki has some nice visualizations and you'll be able to grasp what I'm talking about from them. Pay close attention to the first figure, the middle plot (image 2/17). Those are different-order approximations to a square wave. It might be hard to see, but the more complex the wave (the higher the order), the better the approximation to that square wave. Pay close attention to the calculations, and do a few yourself! How much work goes into calculating each term? Or rather, each order of approximation? I think you'll get the sense pretty quickly that every higher-order calculation requires you to also do the lower-order ones.
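
Here's a small sketch of the same idea in code, using the standard square-wave series (a sum of odd sine harmonics). Mean absolute error is just one convenient way I chose to score each partial sum; the point is that every higher-order approximation reuses all the lower terms, yet buys a smaller improvement.

    import numpy as np

    x = np.linspace(0, 2 * np.pi, 10_000)
    square = np.sign(np.sin(x))

    approx = np.zeros_like(x)
    prev_err = np.abs(square - approx).mean()
    for k in range(1, 8):
        n = 2 * k - 1
        approx += (4 / np.pi) * np.sin(n * x) / n   # add the next odd harmonic
        err = np.abs(square - approx).mean()
        print(f"{k} terms: mean error={err:.3f}  improvement={prev_err - err:.3f}")
        prev_err = err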

As a more realistic example: I am the creator of a state-of-the-art image generator (I won't say which one, to maintain some anonymity). When training my model, the score improves quickly, and really only over a small amount of time. This training run took approximately 2 weeks of wall time (what the clock says, not GPU time). Most of the improvement (by the metric) took place in the first 6 hours, and I was >90% of the way to my final score within the first day. If you look at the loss curve in full, almost everything looks flat. But if you window it to exclude the first 24 hours, the shape reappears! There's a fractal nature to this (Power Distribution!). To put numbers to it: my whole run took 1M iterations and my final score was ~4.0. My first measurement was at 5k iterations and was 180. My next was at 25k and was 26. Then 15@50k, 9@100k, 6.8@200k, 5@500k, and so on. This is very normal and expected.

(Then there's the complexity of [0]. Visually the images improved too. At 5k they were meaningless blobs. By 100k they had the general desired shape and even some detail appeared. By 500k most images resembled my target. At 800k I had SOTA but could tell things were off. By 1M I thought there was a huge visual improvement over 800k, but this is all down to subtle details and there are no measurements that can accurately reflect it.)
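
Just to make the shape explicit, here are the same figures I quoted, tabulated (nothing new, only the numbers above): each checkpoint costs far more iterations than the last, yet buys a smaller score drop.

    # score checkpoints from the run described above (iterations, score)
    checkpoints = [(5_000, 180), (25_000, 26), (50_000, 15), (100_000, 9),
                   (200_000, 6.8), (500_000, 5), (1_000_000, 4.0)]

    for (it0, s0), (it1, s1) in zip(checkpoints, checkpoints[1:]):
        extra = it1 - it0
        print(f"{it0:>9,} -> {it1:>9,} iters: +{extra:>7,} iters for a score drop of {s0 - s1:g}")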

I am happy to answer more but you're also asking about a complex topic with a lot of depth. One I absolutely love, but just giving you a warning :)

[0] The super short version: no matter what measurement you take, you are using a proxy. Even a ruler is a proxy for a meter; it isn't exact. When measuring, you approximate the measurement of the ruler, which is itself an approximation of the meter. In this case the two are typically very well aligned, so the fact that it's a proxy doesn't matter much (if you include your uncertainties). This isn't so simple when you move to more complex metrics, like every single one you see in ML. Even something like "accuracy" is not super well defined. Go through a simple dataset like CIFAR-10 and you'll find some errors in the labels. You'll also find some more things to think about ;) Table this for now, but keep it in the back of your head and let it mature.

[1] https://en.wikipedia.org/wiki/Pareto_principle

[2] https://docs.pytorch.org/tutorials/intermediate/tensorboard_...

[3] https://en.wikipedia.org/wiki/Fourier_series

howardyou 16 hours ago

Thanks for all of that!

If you don't mind, could I talk about it with you more over email? My email address is listed in my profile.