▲ | Optimizing FizzBuzz in Rust(github.com) | ||||||||||||||||||||||
30 points by Bogdanp 3 days ago | 16 comments | |||||||||||||||||||||||
▲ | joshka 2 days ago | parent | next [-] | ||||||||||||||||||||||
If you're going to the effort of writing a procmacro, you may as well output a string from the macro instead of code. If you're going idiomatic rust, then you might instead output a type that has a display impl rather than generating code that writes to stdout. | |||||||||||||||||||||||
▲ | jasonjmcghee 3 days ago | parent | prev | next [-] | ||||||||||||||||||||||
Reminds me of the famous thread on stack overflow. I'll link the rust one directly, but one cpp answer claims 283 GB/s - and others are in the ballpark of 50GB/s. The rust one claims around 3GB/s https://codegolf.stackexchange.com/a/217455 You can take this much further! I think throughput is a great way to measure it. Things like pre-allocation, no branching, constants, simd, etc | |||||||||||||||||||||||
▲ | ainiriand 3 days ago | parent | prev | next [-] | ||||||||||||||||||||||
In my opinion a more accurate measure when you go down to the micro seconds level is TSC directly from the CPU. I've built a benchmark tool for that https://github.com/sh4ka/hft-benchmarks Also I think that CPU pining could help in this context but perhaps I need to check the code in my machine first. | |||||||||||||||||||||||
| |||||||||||||||||||||||
▲ | Arnavion 2 days ago | parent | prev | next [-] | ||||||||||||||||||||||
If OP is looking for ideas, there are two intermediate steps between the extremes of "write every line to stdout" and "build up a buffer of the whole output and then write it to stdout". 1. `stdout().lock()` and `writeln!()` to that. By default using `print*!()` will write to `stdout()` which takes a process-wide lock each time. (Funnily enough they use .lock() in the "build up a buffer of the whole output" section, just to do one .write_all() call, which is the one time they don't need to use .lock() because Stdout's impl of write_all() will only take the lock once anyway.) 2. Wrap the locked stdout in a `BufWriter` and `writeln!()` to that. It won't flush on every line, but it also won't buffer the entire output, so it's a middle point between speed and memory usage. --- For the final proc macro approach, there is the option to unroll the loop in the generated code, and the option to generate a &'static str literal of the output. | |||||||||||||||||||||||
▲ | atiedebee 2 days ago | parent | prev | next [-] | ||||||||||||||||||||||
> But the obvious possibilities almost certainly won't be performant: integer modulo is a single CPU instruction [...] Yes it is a single instruction, but that is not indicative of the actual performance. Modulo on x86 is done through the div instruction which takes tens of cycles. When you compile the code you'll likely see a multiply + shift instead because you modulo by a constant. | |||||||||||||||||||||||
▲ | hyperhello 3 days ago | parent | prev | next [-] | ||||||||||||||||||||||
Maybe I’m missing something but can’t you unroll it very easily by 15 prints at a time? That would skip the modulo checks entirely, and you could actually cache everything but the last two or three digits. | |||||||||||||||||||||||
| |||||||||||||||||||||||
▲ | bjoli 2 days ago | parent | prev | next [-] | ||||||||||||||||||||||
I remember writing it in high school end ending up using a wheel (circular data structure) to avoid any modulo at all. Then my HS teacher said that it should be extensible so I wrote a wheel generator. Despite writing things in scheme I ended up being the fastest. It is no magic bullet, but if you only want the regular fizz buzz it is a simple way to just about double the speed. | |||||||||||||||||||||||
| |||||||||||||||||||||||
▲ | Etherlord87 2 days ago | parent | prev [-] | ||||||||||||||||||||||
> At this point, I'm out of ideas. The most impactful move would probably be to switch to a faster terminal... but I'm already running Ghostty! I thought it was a pretty performant terminal to being with! But what is the point? Why do you want to optimize the display? If you want to be able to fizz-buzz for millions of numbers, then you want to... Well realistically only compute them just before they are displayed. | |||||||||||||||||||||||
|