1B nested loop iterations (benjdd.com)
74 points by behnamoh 10 hours ago | 71 comments
mcdeltat 7 hours ago | parent | next [-]

You have to be extremely careful with benchmarks like these. It's very easy to measure one thing and conclude another. In this case, it really only measures the compiler's ability to optimise the `for` loop construct - nothing else. It doesn't say much about a language's general "speed", whatever that means (which is usually the implication with tests like this).

Some points to consider:

- `for` is not how you'd write an optimised loop in all languages.

- Be VERY suspicious of microbenchmarks on C, C++, and similar - the compilers are smarter than you think and you won't get the code you wrote.

- Languages do arrays differently, which adds memory effects on performance.

- Languages do numbers differently.

- Summing plus modulo is a somewhat contrived example. Modulo is one of the slower instructions out there.
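
For context, the hot path being measured is, piecing together the author's comments and the Godbolt discussion below, roughly the following shape in C (a paraphrase for illustration, not the repo's exact source):

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int u = atoi(argv[1]);         // taking input at runtime defeats constant folding
        int r = rand() % 10000;        // random value defeats precomputation
        int a[10000] = {0};
        for (int i = 0; i < 10000; i++) {       // 10k outer iterations
            for (int j = 0; j < 100000; j++)    // 100k inner iterations = 1B total
                a[i] += j % u;                  // the modulo keeps the loop "live"
            a[i] += r;
        }
        printf("%d\n", a[r]);          // printing one element keeps the work observable
        return 0;
    }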

maxmcd 8 hours ago | parent | prev | next [-]

Another fave of mine by this person: https://benjdd.com/az/

Waterluvian 3 hours ago | parent [-]

Wow what a creative way to visualize the concept.

I think the curves make it tricky to maintain objective perception, but the layout contributes significantly to the idea that internal latency is on a different order from external.

superjan 9 hours ago | parent | prev | next [-]

It's a billion integer modulus operations, summed. Div/mod is by far the slowest arithmetic instruction on a CPU, so this underestimates C/Rust performance. I guess the author chose it to prevent the C compiler from optimizing the loop away.

bddicken 7 hours ago | parent [-]

Author here: Yeah, the mod was added to reduce risk of loop(s) getting optimized away.

remcob 7 hours ago | parent [-]

Did you verify this through disassembly? These loops can still be replaced by a closed form expression and I wouldn’t be surprised if LLVM figured this out.

Dylan16807 6 hours ago | parent [-]

If it turned the inner loop into a closed form expression, I would expect the outer loop to go through 10k iterations a lot faster than needing half a second.

zamadatix 5 hours ago | parent [-]

Running the C code through Godbolt with the same Clang and -O3 optimizations, it looks like you're right: there are still 2 loops being performed. The loops are unrolled to perform 4 iterations at a time, but they're otherwise the same; beyond that it didn't do much with the loops. Hilariously, there is a god-awful set of shifts and magic numbers... to save a single modulo operation from running once on the line declaring 'r'.
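
For the curious, that pile of shifts and magic numbers is the classic division-by-invariant-multiplication trick: 10000 is a compile-time constant, so the compiler computes the quotient with a multiply and a shift, then derives the remainder from it, avoiding a div instruction. A hand-written sketch of the idea (not Clang's exact output):

    #include <stdint.h>

    // x % 10000 without a div instruction. 3518437209 = ceil(2^45 / 10000),
    // a multiplier that gives the exact quotient for every 32-bit x.
    uint32_t mod10000(uint32_t x) {
        uint32_t q = (uint32_t)(((uint64_t)x * 3518437209u) >> 45);  // x / 10000
        return x - q * 10000;                                        // x % 10000
    }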

See here for why the Go version performs worse https://news.ycombinator.com/item?id=42251571

catapart 5 hours ago | parent | prev | next [-]

I'm surprised to see JavaScript so high up on the list! I would have thought PHP was faster, honestly. But I did run the benchmark in my browser and saw that browser JS in Chrome is slightly slower than Deno at ~1.2-1.3s. Even though Firefox runs closer to 3 seconds, I'm still surprised that browser JS is that much faster than some of those closer to the metal.

paulbgd 4 hours ago | parent [-]

It's impressive, but even more interesting: it could still be faster! Java is JIT-compiled and sits at #3, so theoretically any of these languages could get there.

ayaros 7 hours ago | parent | prev | next [-]

I'm working on a web app in JS, the sort of thing I think people on here will enjoy and I'll definitely share it when it's presentable. It's all in JS, including the graphical aspects of it (I'm drawing to a canvas). For per-pixel rendering I have 4 levels of nested loops. I feel like I've done everything in my power to optimize the process, and it's still not at the ideal speed. Really, it ought to be about 5-10 times faster than it is now.

I've been thinking of getting it into WASM somehow - it's really just one function that needs to be replaced by a WASM module - but one of the things that's given me pause is benchmarks like this, as well as other anecdotes, demonstrating that I'd only get a minor speedup from WASM. I guess part of this is due to modern JS engines being so optimized.

Just about all of the relevant code is adding and subtracting integers, but there's some Boolean logic. There's also some multiplication in there that I could eliminate if I really had to, although at this point I'm not certain it would make a significant difference. Most of the data that contributes to the value of each pixel is spread across a tree of objects, and there's some polymorphism involved too. The point is it's not like I'm just dealing with a bunch of contiguous data in memory that is easily cached by the CPU, which might speed things up considerably. I guess I'm just venting now, lamenting a nightmare of my own creation...

catapart 5 hours ago | parent [-]

For the record - I can't imagine any tangible value in writing a graphics engine (and form library) for an environment that was literally designed as a graphics engine for forms. But, that disclaimer aside, I've considered WASM for graphics for a game engine, and I have had some trial success with allocating a shared array buffer, writing to it in WASM, and reading from it at a JS-friendly rate of 30-60 fps (whatever requestAnimationFrame is serving). The idea is that you treat the shared array buffer as your "screen" and write to it from the WASM as if it were an image buffer, while on the JS side you read it as an image buffer onto the canvas. This gives you a nice, finite bandwidth, an easy-to-scale maximum for how much data you can pass through, and neither JS nor WASM ever has to pass any time-critical data.
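
A minimal sketch of the WASM half of that pattern in C (the buffer size, the names screen_ptr/render, and the build flags are illustrative assumptions, not from an actual project):

    // Build with e.g.: clang --target=wasm32 -O2 -nostdlib -Wl,--no-entry -Wl,--export-all
    #include <stdint.h>

    #define W 640
    #define H 480
    static uint8_t screen[W * H * 4];  // RGBA "screen" living in linear memory

    // JS locates the buffer once via this export, wraps it in a
    // Uint8ClampedArray, and blits it with putImageData each rAF tick.
    uint8_t *screen_ptr(void) { return screen; }

    // Called from JS once per frame; all per-pixel work stays in WASM.
    void render(int frame) {
        for (int y = 0; y < H; y++) {
            for (int x = 0; x < W; x++) {
                uint8_t *p = &screen[(y * W + x) * 4];
                p[0] = (uint8_t)(x ^ frame);  // R: placeholder pattern
                p[1] = (uint8_t)(y ^ frame);  // G
                p[2] = 0;                     // B
                p[3] = 255;                   // A: opaque
            }
        }
    }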

Of course, this setup leaves a very thin JS client - it's essentially just rendering the same image over and over. So if you've already got everything done in JS, it would mean porting all of your functional code over to something that can compile into WASM. Not a huge lift, but not fun either.

And then, on the other hand, if it's just rendering you're seeing as a bottleneck, JS works pretty well with WebGL and WebGPU for simple form rendering. A little loop unrolling might be a lot easier than porting the rest of the app.

Anyway, best of luck on it!

dxbydt 4 hours ago | parent | prev | next [-]

Aside - a long time ago, I worked on a high school math problem which involved summing the thousand fractions 0/1, 1/2, 2/3, ... up to 999/1000, so I did the obvious iteration.

    (display (foldl + 0 (map (lambda (x) (/ x (+ x 1))) (range 0 1000))))

The solution matched and I was happy. Then I thought wouldn't it be neat to do this in a dozen languages and benchmark... but got nowhere with that idea. None of the languages supported fraction addition without jumping through a whole bunch of hoops. I wonder if it's still the same situation. If someone wants to give this a shot, the answer if you do this for the first ten is 17819/2520.
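
(For what it's worth, the sum telescopes, since each term is 1 - 1/(k+1), which gives a closed form to check any implementation against:

    \sum_{k=0}^{n-1} \frac{k}{k+1} \;=\; n - H_n, \qquad H_n = \sum_{m=1}^{n} \frac{1}{m}

For n = 10 that's 10 - 7381/2520 = 17819/2520, matching the answer above.)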

rotifer 2 hours ago | parent [-]

In Haskell the numerator and denominator of the Rational type are unbounded integers, so (with Data.Ratio imported for %) one of the many equivalent ways of writing it is:

    ghci> sum [i % (i+1) | i <- [0..9]]
    17819 % 2520
    ghci> 
% is the constructor for Rational numbers. I.e. one half is written (and shown as) 1 % 2.

I'll spare people the full result, but:

    ghci> sum [i % (i+1) | i <- [0..999]]
    70755...77483 % 71288...20000
    ghci>
wanderingmind 7 hours ago | parent | prev | next [-]

How in the world is PyPy faster than CPython by almost 2 orders of magnitude? Is PyPy normally faster in other benchmarks? If so, why? Can some Python expert pitch in?

lights0123 6 hours ago | parent | next [-]

PyPy uses just-in-time compilation, so for arithmetic-heavy computations not involving the GC, it should absolutely be faster.

cuchoi 3 hours ago | parent | prev [-]

This is exactly the kind of code you would expect PyPy to run faster.

iambvk 7 hours ago | parent | prev | next [-]

Golang folks, after looking at the code, I am surprised and don't understand why the Go version is slow compared to C/C++/Rust. Can anyone please explain? Thank you.

zamadatix 4 hours ago | parent | next [-]

Go's int defaults to 64 bits on a 64-bit machine, while C/Rust/Java all default to 32 bits on the author's machine.

Editing the code and running local to get a difference factor, it looks like the same Go code with int takes 1.5x the amount of time as the int32 code on my machine. That puts it ~right next to Java (assuming the perf change maps 1:1). Remove the GC and it comes in ~right next to C++.
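
The width effect is easy to reproduce from C, too: on many x86-64 cores a 64-bit idiv is notably slower than a 32-bit one, so merely widening the types changes the timing. A sketch of the two flavours (not the benchmark's actual code):

    #include <stdint.h>

    // What C's `int` gives you on most 64-bit platforms: 32-bit modulo.
    int32_t sum_mod32(int32_t n, int32_t u) {
        int32_t s = 0;
        for (int32_t j = 0; j < n; j++) s += j % u;  // 32-bit division
        return s;
    }

    // What Go's `int` gives you on a 64-bit machine: 64-bit modulo.
    int64_t sum_mod64(int64_t n, int64_t u) {
        int64_t s = 0;
        for (int64_t j = 0; j < n; j++) s += j % u;  // 64-bit division, often slower
        return s;
    }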

Looking at the compiler assembly output on Godbolt it's fun to see all the extra panic info and checks Go adds before going to the loop whereas the C just segfaults from blasting forward anyways, sometimes with no detail.

Anyways, that's why I don't like these types of "benchmarks" (micro tests made into a media post). By the time someone actually looks into them to see what's up everyone has already seen the results and moved on.

bddicken 7 hours ago | parent | prev [-]

Author here: Yeah, I was surprised that there don't seem to be many options for extra optimizations in the Go compiler. Would be curious if Go experts have more insight or other compiler recommendations.

boyter 7 hours ago | parent [-]

I doubt it's the GC kicking in, but you could run it with the following environment variable set just to ensure that it's not doing anything.

    GOGC=-1
EDIT: Trying it out quickly shows a small improvement actually, but so small as to likely be noise as I was doing other things on the machine.

    Summary
      ./mainnogc 1000 ran
        1.01 ± 0.06 times faster than ./maingc 1000
sysread 6 hours ago | parent | prev | next [-]

I tried this with Elixir, and on my M3 Mac, it ran in 6.5s. Perl took over 40s :P

  #!/usr/bin/env elixir
  u =
    case System.argv() do
      [arg] -> String.to_integer(arg)
      _ -> raise "Please provide a valid integer as the first argument."
    end

  r = :rand.uniform(10_000) - 1

  a = :array.new(10_000, default: 0)

  a =
    Enum.reduce(0..9_999, a, fn i, acc ->
      inner_sum =
        Enum.reduce(0..99_999, 0, fn j, sum ->
          sum + rem(j, u)
        end)

      updated_value = :array.get(i, acc) + inner_sum + r
      :array.set(i, updated_value, acc)
    end)

  IO.puts(:array.get(r, a))
tmtvl 6 hours ago | parent [-]

Oh, the programs are supposed to take an integer from the command line? Then I wonder how many of them would fail when given an integer that's just a wee bit too large (like, say (2^129)-1).

hatthew 5 hours ago | parent [-]

I don't think that's a concern. My assumption is that the point of taking user input is to prevent a compiler from precomputing the result.

xutopia 9 hours ago | parent | prev | next [-]

It blows my mind to see that speed difference between different languages. Even in the slowest language there, we have lots of ways to optimize this contrived code, but that is not something we can always do in the real world. Seeing the visualization here really underscores how fast some languages are.

swatcoder 8 hours ago | parent [-]

> Seeing the visualization here really underscores how fast some languages are.

Yes and no.

It emphasizes that there are some scenarios where language choice can have an outsized effect on performance when running on modern architectures, which is a reminder some people could use or have never really witnessed.

But it's a very particular scenario that relatively few projects will encounter at a scale that meaningfully impacts them.

In particular, C and Rust are much better positioned to benefit from instruction pipelining and avoid cache busting because there are no conditionals and little-to-no out-of-loop code to execute. Even with a massive loop like this, that can be a real-world scenario in things like data/signal processing and numerical computation. When you're doing those things, language choice can make a huge difference that may not be apparent without understanding how they work or at least seeing visualizations like this.

But even in C and Rust, most applications don't have to do this kind of thing. Most are dominated by smaller loops, with branching that breaks pipelining, complexity that busts cpu cache, calls and jumps that might interfere with both etc. In those much more common cases, the practical performance difference is going to be quite a lot less significant, and the fluency/ergonomics/safety/convenience of working in some other language can easily matter more.

liontwist an hour ago | parent | next [-]

Programming style is as much a determiner of pipeline- and cache-friendly code as the domain is. It's a set of techniques that can be learned.

PaulHoule 7 hours ago | parent | prev | next [-]

I like the extreme boost that PyPy gets here, which is much more than you usually get with PyPy. On one hand it shows the potential of PyPy, but it is testing just one area where PyPy is strong; other workloads depend on performance that is not as well optimized.

My take, though, is that branchy, old-AI, heavy code (say, keeping every fact in a triple store and not as a Python object, with the benefit of having an index for every permutation of ?s-?p-?o) that cannot be handled with numpy or similar tools benefits greatly from PyPy, but not 100x.

omoikane 5 hours ago | parent | prev [-]

I thought the emphasis was on the visualization part, and not so much on the operations that are being benchmarked. Most benchmarks are presented as static bar charts or similar, but this particular site presented them as moving circles, which I thought was a very effective illustration of speed.

ramshanker 3 hours ago | parent | prev | next [-]

Personally I have gone at most 5 levels deep. It worked reasonably well with aggressive break-condition checking in my specific case.

How deep have people actually written useful nested loops?

davikr 9 hours ago | parent | prev | next [-]

This does not seem like a realistic benchmark target.

mrkeen 9 hours ago | parent | next [-]

Perhaps it's unrealistic in the sense that this is unrealistically-optimisable code.

Real-world code has closures, allocations, and pointer dereferencing. This code is just iterations, additions, and modulus.

behnamoh 8 hours ago | parent [-]

> This code is just iterations...

Which is what the average programmer out there writes. Therefore, I think this is actually a good realistic benchmark.

steveBK123 8 hours ago | parent | next [-]

Precisely - it's bad code, which is what most code is!

handfuloflight 8 hours ago | parent [-]

What's a Github repository with good code?

intelVISA an hour ago | parent | next [-]

Nobody's ever seen it!

steveBK123 4 hours ago | parent | prev [-]

All code is bad, some is useful

handfuloflight 3 hours ago | parent [-]

Useful is good. Ergo transitive property.

elzbardico 7 hours ago | parent | prev [-]

What the average programmer out there writes is just the tip of the iceberg of what gets executed. And an incredibly small tip.

moralestapia 7 hours ago | parent | prev | next [-]

Nor does it pretend to be.

It's literally just "1 Billion nested loop iterations", methods and results.

You make of that whatever you want.

deanCommie 8 hours ago | parent | prev [-]

The other problem with it is that there are other variables involved besides pure speed - such as CONSISTENCY of speed.

Folks are always surprised to see just how fast Java is able to perform on benchmarks - it has a reputation as such a slow language, so how come it's able to execute almost as fast as C++?

But Java's problem isn't execution performance. It's:

* Startup performance. It takes time to instantiate the JVM. That matters for some, not for others.

* Consistent performance. This is the nature of garbage collection. It can surprise you when it executes and drop a performance blip when you least expect it.

Most developers who think they need the fastest performance actually need CONSISTENT performance more than the fastest.

PaulHoule 7 hours ago | parent | next [-]

I even see some video games (say, Dome Keeper) that periodically go out to lunch even on a top-of-the-line gaming PC, and I understand that kind of game is often written in C#, which has a garbage collector. Thing is, I remember playing games on the much weaker PS Vita that were written in C#, but I never remember pauses like that.

VR games are now the frontier of gaming (in terms of the industry outdoing itself), and there you're approaching hard real-time requirements because it is so awful to get seasick. I appreciate it.

bob1029 7 hours ago | parent | next [-]

I think it might have more to do with the specific implementation than the tool.

Most incarnations of C# offer a GC.Collect method and an ability to configure the threading model around collections. Used properly, you can keep maximum delays well-bounded. You still have to work your ass off to minimize allocations, but when some do inevitably occur due to framework internals, etc., you don't want to have to clean up 3 hours' worth of them at once. Do it every frame or scene change.

MrLeap 3 hours ago | parent | prev [-]

These hitches (in Unity-land anyway) usually mean the dev is allocating a lot of short-lived objects in their render loop. Even little heuristics like keeping the new keyword out of Update() and using object pools prevent the majority of things like this for us over on the Unity side.

Dome Keeper's dev Bippinbits is a Godot studio. Some comments on their opinions that I can't read until I dig out my twitter login lol. https://x.com/Bippinbits/status/1702351798302851398

Godot has some perf warts still being churned out, so I don't know what all they've got to do to finesse around them. Even if GDScript was greasy gcc lightning, it's really easy to crunch yourself into some compromises when you're trying to eat your way out of the bottom of ticket mountain before Nextfest or you starve or whatever. Small team gamedev is wild.

vitus 4 hours ago | parent | prev [-]

> * Consistent performance. This is the nature of Garbage Collection. It can surprise you when it executes you and drop a performance blip when you least expect it.

This has been getting much better in Java in recent years. Shenandoah has been production-ready since Java 15, and we've been seeing further improvements since then.

> Processing thread stacks concurrently gives us reliable sub-millisecond pauses in JDK 17.

https://developers.redhat.com/articles/2021/09/16/shenandoah...

obituary_latte 7 hours ago | parent | prev | next [-]

The animation is confusing...is one trip to the end and back the 1B iterations? I thought maybe they were counting up and once the slowest reached the end they would all stop but that didn't seem to be it. Maybe if it did (and there was a counter on each showing how many bounces they made), it'd be a little more clear.

Interesting nonetheless.

ayaros 7 hours ago | parent [-]

Looks like they are just bouncing back and forth at speeds relative to each other; the speed of each ball is all that matters. I also found this diagram confusing at first.

JohnMakin 8 hours ago | parent | prev | next [-]

Looking at the source code, is the random value generated before the main loop iterations to stop the compiler from shortcutting all the iterations? Does that work on some of the better compilers out there? The random value is only generated once. I would like to see a similar benchmark on creating/destroying objects in memory.

neonsunset 8 hours ago | parent | next [-]

If you are interested in looking into the details of microbenchmarks of this type, there is always https://benchmarksgame-team.pages.debian.net/benchmarksgame/... which has a clearly defined methodology and allows you to understand wall-clock time vs CPU cost too.

The benchmark code linked in the submission unfortunately suffers from the mistakes that happen whenever you write a benchmark for the first time. The visuals are pretty but might as well have been based on something more comprehensive.

(also, please stop measuring performance with loops that perform no work and fibonacci sequences, this was bad 30 years ago, this is bad today too)

JohnMakin 8 hours ago | parent [-]

Cool, thanks! In another lifetime I had an internship at an embedded software company, where my job was to write tests and optimize compute or throughput for various chips and chip firmware, and reading the functions here set off some alarm bells in my head. Still a fun idea though, and cool graphics.

swatcoder 8 hours ago | parent | prev [-]

Yeah, the random value prevents compile-time pre-calculation and the print makes sure the loops actually get run and values actually get calculated (as modern compilers are wont to simply omit code that clearly doesn't have consequence).

MrLeap 3 hours ago | parent | prev | next [-]

I would have loved to see C# on this list. Wonder if he's accepting PRs.

neonsunset 3 hours ago | parent [-]

There are... a lot of PRs (4 of which, as of this moment, add C#).

o11c 8 hours ago | parent | prev | next [-]

Note that these benchmarks ignore object-related stuff, which is usually the expensive part.

fbn79 8 hours ago | parent | prev | next [-]

Would be nice to know max RAM usage by each implementation.

pbw 7 hours ago | parent | prev | next [-]

I wrote this to probe the question of why "slow" languages are popular and valid when used in conjunction with fast ones: https://tobeva.com/articles/beyond/

eiathom 6 hours ago | parent | prev | next [-]

It's interesting to imagine the head scratching going on in this thread. All these clever people, stumped. 'How can this be!???'. Quite the post, bravo OP.

But seriously, other comments have this nailed. This 'benchmark' isn't comparing apples with apples. There are too many quirks, optimisations, and design decisions (of a compiler) that have not been taken into account.

Also, as others have repeated, this isn't indicative of actual usage - real world usage. In that regard, this type of 'benchmark' isn't scientific at all, and worse, is misleading a naive reader.

Dylan16807 6 hours ago | parent [-]

While it's not (necessarily) indicative of the languages as a whole, it looks pretty apples to apples to me.

These are very simple programs with no fancy tricks going on.

hatthew 5 hours ago | parent [-]

It's comparing a Granny Smith apple to a McIntosh apple by hitting them with a baseball bat and inferring the texture of the apple from that. It uses them in a way that almost nobody does in real life, and it's a vague proxy for only one of several important factors. If you use this test as a basis for deciding which you would like, you're going to have a bad time, whereas you'll be fine if you merely look at what other people like.

Dylan16807 4 hours ago | parent [-]

Is the "way" it's using things affecting performance significantly?

If not then it's a pretty good proxy to one or two dimensions of performance, and a mix of similar benchmarks could paint a good fraction of a reasonable picture.

hatthew 3 hours ago | parent [-]

If you want to buy a bushel of apples and hit them with an array of objects at varying velocities and then try to construct a model that can use the resulting splatter patterns to infer the texture of the apple as experienced by humans, you're welcome to do so.

Personally, I'm going to bite the apple.

Dylan16807 3 hours ago | parent [-]

I do not accept your analogy.

This program is a perfectly reasonable "bite". It's not doing weird stuff.

hatthew an hour ago | parent [-]

The program is a useless toy that nobody would write and expect to be performant in real life. I can write an optimized version of the algorithm that prints the exact same output (barring edge cases) without looping. With this new algorithm (really just a single expression), the C program is 500x faster than before and the python script is 5000x faster than before. So python is only 10x slower than C in this example (and I'd imagine that this is measuring more overhead and less arithmetic).
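
To make that concrete, here is a sketch of the no-loop version in C, assuming the inner loop sums j % u for j in [0, 100000) as described upthread (an illustration, not hatthew's actual code): the sum is just complete cycles of 0..u-1 plus one partial cycle.

    #include <stdint.h>

    // Closed form for the sum of (j % u) over j in [0, n):
    // n/u complete runs of 0..u-1, each summing to u*(u-1)/2,
    // plus a partial run 0, 1, ..., (n % u) - 1.
    int64_t sum_mod(int64_t n, int64_t u) {
        int64_t q = n / u, rem = n % u;
        return q * (u * (u - 1) / 2) + rem * (rem - 1) / 2;
    }

Each array element is then sum_mod(100000, u) + r, and the program just prints one of them.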

Presumably the C compiler has a relative advantage at optimizing the looped algorithm vs the arithmetic algorithm. How much? Is C 100x faster than python, or was the compiler able to get closer to the non-looping algorithm than python? How close? Is it something about the arithmetic or the looping that's natively faster in C? What part of it was optimized by the compiler? Since nobody does large loops in python, would python be more comparable if we introduced a C-backed library like numpy?

What other objects do you think we should bash this apple with?

Dylan16807 27 minutes ago | parent [-]

Do you think a slightly more complex piece of math that can't be replaced with an equation would have significantly different performance from the benchmark?

I don't. So the simple math just makes things easier, it doesn't harm results.

Your suggestion to skip the math entirely is the apple-smash.

> Since nobody does large loops in python

Oh I doubt that.

Numpy would be fine as a different entry.

hatthew 5 minutes ago | parent [-]

I don't know whether or not a more complex piece of math would change the outcome, so I would err on the side of not making assumptions until I had done more benchmarks and investigated the inner workings more. You probably know more about C than I do and might be able to make that assumption more confidently, but I think most people know less than either of us, and will read too much into a single statistic without context.

Nobody who cares about performance enough to look at a benchmark like this would do large loops in pure python.

Dylan16807 2 minutes ago | parent [-]

> Nobody who cares about performance enough to look at a benchmark like this would do large loops in pure python.

Because they already know it would be very slow, and putting specific numbers on that is helpful for them and for non-veterans too. Even with some wiggle room for "microbenchmarks are weird".

fifilura 8 hours ago | parent | prev [-]

Doing Python or R right means using vector/matrix operations in e.g. numpy. Those will be faster.

Just think twice before using a for loop. Most of the time it is not needed.

nickforr 5 hours ago | parent | next [-]

If you write the R code using vectorised operations, it's orders of magnitude faster (148 seconds down to 1.5 milliseconds on posit.cloud).

The nested loop becomes:

    j <- sum((seq_len(100000) - 1) %% u)
    a <- a + j + r

MooseBurger 8 hours ago | parent | prev [-]

I disagree w.r.t. python. You shouldn't need a numerical computing library to do a for loop generally.

fifilura 7 hours ago | parent [-]

Kick the habit. For loops are what "goto" was 20 years ago: error-prone and not possible to abstract, for example for parallel execution.

hatthew 4 hours ago | parent [-]

I think the comparison to goto is a little extreme, but actually not too crazy. I could see a future where the accepted best practice is to use parallelized operations for everything, and only use sequential loops in cases where the underlying concept is inherently sequential (gradient descent, iterative hashing, etc.).