| ▲ | MathMonkeyMan 5 days ago |
| Chandler Carruth told a similar story in one of his talks. He met Ken Thompson and saw beautiful C code for the first time because he had encountered a performance problem in a service. The service had to choose a policy to enforce (or something) based on the incoming request. It was taking too long to match the criteria of each potential policy against the request. Ken wrote a finite-automaton-based pattern matcher that would simultaneously advance through all of the policies' predicates. It was perfect, and it was much faster than the existing code. Then somebody noticed that 99.9% of requests matched a particular policy, so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution. |
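A minimal Python sketch of the shortcut described above: check the overwhelmingly common policy first and fall back to a general matcher only for the rare remainder. The `policies`, `common_policy`, and `criteria` names are hypothetical stand-ins, not the real system.

    def matches(policy, request):
        # Hypothetical predicate check: every criterion must match the request.
        return all(request.get(k) == v for k, v in policy["criteria"].items())

    def choose_policy(request, policies, common_policy):
        # Fast path: the policy that ~99.9% of requests hit.
        if matches(common_policy, request):
            return common_policy
        # Slow path for the remaining ~0.1%; this is where a general matcher
        # (like Ken's simultaneous-advance automaton) would live.
        for policy in policies:
            if matches(policy, request):
                return policy
        return None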
|
| ▲ | snvzz 4 days ago | parent | next [-] |
My take is that code is disposable, and often needs to be written to figure out what is really needed, then disposed of. Learning this is worth writing the otherwise useless code.
| |
| ▲ | BigJ1211 4 days ago | parent | next [-] | | This is why I strongly adhere to WET¹: you discover so much about the domain the first time you write something. By the time you have it 'working', you realize you should've written vastly different code to solve or cut off edge cases. That's why these days I just start writing code and get it to the 'it works!' level, then throw it all away and start again. It takes a little more effort, but the payoff is far less problematic code and tech debt. ¹: Write Everything Twice | | |
| ▲ | wizzwizz4 4 days ago | parent [-] | | Don't do WET with your entire system, though, or you're likely to end up with the Second System Effect. | | |
| |
| ▲ | 4 days ago | parent | prev | next [-] | | [deleted] | |
| ▲ | 5pl1n73r 4 days ago | parent | prev | next [-] | | My reaction to the article title was recalling how old assembly teachers would drill into our heads that we should have proof of a 200% gain before implementing an optimization. However, the article shows how sometimes such planning doesn't work. | |
| ▲ | Cthulhu_ 4 days ago | parent | prev | next [-] | | Or one should spend a little more time up front to fully understand the problem; in both the mentioned case and the OP, that means using realistic data. | | |
| ▲ | TillE 4 days ago | parent | next [-] | | Writing code is very very often a big part of understanding any new problem. It's just that you should be quite comfortable discarding that code. Refactor mercilessly and all that. | |
| ▲ | scott_w 4 days ago | parent | prev | next [-] | | Writing code that solves the problem is a huge way of proving you understand the problem. | |
| ▲ | yetihehe 4 days ago | parent | prev [-] | | Sometimes you can't have realistic data without writing some code first, that will allow you to gather that data. |
| |
| ▲ | ddalex 4 days ago | parent | prev [-] | | I came to the same conclusion months ago with the advent of coding LLMs.... I used to treat code as precious, carefully saving, cataloguing and referencing git commits and SQL scripts and command-line histories, because they were painful to write, and thus they were valuable. I mean, they were, because when I had the same need again, I could find previous instances and reuse stuff. Now? I just dump the problem into an LLM and it spits out a script that solves it, and then I throw away the script because it's horribly buggy for any case other than my particular problem. But hey, I just solved my problem. |
|
|
| ▲ | tasn 5 days ago | parent | prev | next [-] |
| This is such a great anecdote, thanks for sharing! Somehow relatedly, I still remember the first time I heard about profile-guided optimization which is essentially the same but for all of your code at once (well, same idea, not sure it's aggressive enough to reach the same result as the anecdote you shared). |
| |
| ▲ | scottlamb 4 days ago | parent [-] | | > Somehow relatedly, I still remember the first time I heard about profile-guided optimization which is essentially the same but for all of your code at once (well, same idea, not sure it's aggressive enough to reach the same result as the anecdote you shared). Exactly: profile-guided optimization is pretty awesome if you have the right infrastructure. You can get maybe 15% speedup on your program without going deep into all the code. But it has limits. IIUC, it gathers information about which code is hot, including which branches are taken, and enables certain compiler optimizations based on that. [1] But it's not going to make huge algorithm changes, it's not going to change data structure definitions [2], and it's not going to prepare good microbenchmark input for you to compare your new algorithm against the current one. Also, this article is about Java, and profile-guided optimization is something Java just has baked into its JIT already. [1] I don't know exactly which, and they likely vary by compiler version. But I think a variety of stuff to better use cache: breaking code apart into likely-"hot" vs likely-"cold" pages, deciding which loops are worth aligning, and where inlining is worthwhile. Also where branches should be optimized for likely vs unlikely vs unpredictable (conditional instructions). When it's worth autovectorizing. But it's not as good at SIMD as a human who puts two weeks into it, and SIMD usage is constrained by the inability to change data structures. [2] In theory, in Rust's default #[repr(Rust)] it could reorder a struct's members, but it's not going to, say, change from array-of-structs to struct-of-arrays format or hashmap to btree. |
|
|
| ▲ | RedNifre 4 days ago | parent | prev | next [-] |
| Okay, but how about checking the 99.9% policy first and if it doesn't match, run Ken's solution? |
| |
| ▲ | fifilura 4 days ago | parent | next [-] | | You'd be better off with some stupid code that junior devs (or Grug brained senior devs) could easily understand for the 0.1% cases. | | |
| ▲ | smcin 4 days ago | parent | next [-] | | (In that specific case, with that specific very skewed data distribution, yes. But not necessarily in the general case. So "they" would be better off with that solution, but not a generalized "you"). | |
| ▲ | username135 4 days ago | parent | prev [-] | | Then how do they learn? | |
| ▲ | mannykannot 4 days ago | parent | next [-] | | My guess is that they would be more likely to “fix” and “improve” Ken’s algorithm than understand it (Jon Bentley mentioned, in passing, a similar case where this was the outcome.) | |
| ▲ | chii 4 days ago | parent | prev [-] | | They can learn finite automata from computer science textbooks. | |
| ▲ | username135 2 days ago | parent [-] | | True, I suppose, but there's simply no substitute for contextual, on-the-job learning, IMO. |
|
|
| |
| ▲ | TrapLord_Rhodo 4 days ago | parent | prev | next [-] | | That's what he said they did; that's the joke. >so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution. But you've now spent all this time developing a beautiful solution for 0.1% of the actual data flow. Not the best use of dev time, essentially. | |
| ▲ | lispitillo 4 days ago | parent | prev | next [-] | | If a case doesn't match, then speeding up the remaining 0.1% is not going to change the overall speed much. Hence, a non-optimized algorithm may be enough. But when speed and speed variance are critical, optimization is a must. | |
| ▲ | hinkley 4 days ago | parent | prev [-] | | The devil is in the details. A pair of my coworkers spent almost two years working on our page generation to split it into a static part and a dynamic part. Similar to Amazon’s landing pages. Static skeleton, but lots of lazy-loaded blocks. When they finally finished it, the work was mediocre for two reasons. One, they ended up splitting one request into several (but usually two), and they showed average response times dropping 20% but neglected to show the overall request rate also went up by more than 10%. They changed the denominator. Rookie mistake. Meanwhile, one of my favorite coworkers smelled a caching opportunity in one of our libraries that we use absolutely everywhere, and I discovered a whole can of worms around it. The call was too expensive, which was most of the reason a cache seemed desirable. My team had been practicing Libraries Over Frameworks, and we had been chipping away for years at a NIH framework a few of them and a lot of people who left had created in-house. At some point we lost sight of how we had sped up the “meat” of workloads at the expense of higher setup time initializing these library functions over, and over, and over again. Example: the logging code depended on the feature toggle code, which depended on a smaller module that depended on a smaller one still. Most modules depended on the logging and the feature toggle modules. Those leaf-node modules were being initialized more than fifty times per request. Three months later I had legitimately cut page load time by 20% by running down bad code patterns. 1/3 of that gain was two dumb mistakes. In one we were reinitializing a third-party library once per instantiation instead of once per process. So on the day they shipped, they claimed a 20% response-time reduction (which was less than the proposal expected to achieve), but a lot of their upside came from avoiding work I’d already made cheaper. They missed that the last of my changes went out in the same release, so their real contribution was 11%, but because of the extra requests it only resulted in a 3% reduction in cluster size, after three man-years of work. Versus 20% and 7% respectively for my three months of work spread over five. Always pay attention to the raw numbers and not just the percentages in benchmarks, kids. |
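A small Python sketch of the "once per process instead of once per instantiation" fix mentioned above; `ExpensiveThirdPartyClient` is a hypothetical stand-in for the third-party library, not the actual code from that system.

    import functools

    class ExpensiveThirdPartyClient:
        # Stand-in for a third-party library with costly one-time setup.
        def __init__(self):
            pass  # imagine config parsing, connection pools, etc.

    @functools.lru_cache(maxsize=1)
    def get_client():
        # Initialized once per process, no matter how many handlers are built.
        return ExpensiveThirdPartyClient()

    class RequestHandler:
        def __init__(self):
            # The bug was the equivalent of constructing a new client here on
            # every instantiation; the cached accessor shares one instance.
            self.client = get_client()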
|
|
| ▲ | ralegh 5 days ago | parent | prev | next [-] |
This is fine assuming the popular request types don’t change, but arguably, if both new versions of matching are sufficiently fast, then I would prefer Ken’s in the long term, as the other could become slow again if the distribution of request types changes.
| |
| ▲ | sfilmeyer 5 days ago | parent | next [-] | | As a counterpoint, what fraction of the future engineers who will touch the project are likely to be able to competently edit the finite-automaton-based version without introducing bugs, and what fraction will be able to competently edit the if statement that checks the particular policy? | | |
| ▲ | mikepurvis 5 days ago | parent [-] | | A further question mark is whether any of this has sufficient instrumentation to be able to notice and act on a change if and when it occurs. |
| |
| ▲ | andrepd 5 days ago | parent | prev | next [-] | | Nonsense. The pre-check can literally be one line (if common_case {fast_path()} else {slow_path()}), and thus enabling or disabling it is dead simple and obvious if the problem changes in the future. Lines of thinking like that are part of the reason most modern software is so sloooow :) | | |
| ▲ | Rendello 5 days ago | parent | next [-] | | This situation where two paths produce the same output but one is optimized is the easiest case in property-based testing, as the property is just: normal(x) == optimized(x)
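A minimal sketch of that property using the Hypothesis library; the toy `normal`/`optimized` pair here is just for illustration.

    from hypothesis import given, strategies as st

    def normal(xs):
        # Straightforward reference implementation: sum of squares.
        return sum(x * x for x in xs)

    def optimized(xs):
        # Hypothetical "fast" variant that must stay behaviorally identical.
        total = 0
        for x in xs:
            total += x * x
        return total

    @given(st.lists(st.integers()))
    def test_optimized_matches_normal(xs):
        assert normal(xs) == optimized(xs)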
| | |
| ▲ | Stratoscope 5 days ago | parent | next [-] | | I have sometimes done just this. First I write the simplest possible brute force code, something that any competent programmer can look at and say, "Yes, this may be slow, but it is straightforward and obvious that it will handle all cases." Then I write the optimized code and use a test like yours to compare the results between the simple code and the optimized code. One time I needed to write a function to search for a specific pattern in a binary file and change it. So I wrote the brute force code as a first step, the same code that anyone would probably write as a simple solution. It worked the first time, and a couple of people reviewed the code and said "yep, even if it's slow, it is correct." But this code took more than a second to run! Of course I thought about optimizing it with Boyer-Moore or the like. Then I went, "Hold on to your horses. This isn't something like a web page load where one second matters. It's part of a build process that only runs a few times a day and already takes several minutes to run. One extra second is nothing!" In the wise words of Kenny Rogers in The Gambler: You got to know when to hold 'em,
know when to fold 'em
Know when to walk away
and know when to run
| | |
| ▲ | mark-r 4 days ago | parent | next [-] | | Knuth's famous quote about premature optimization captures this perfectly. "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." | |
| ▲ | scott_w 4 days ago | parent | prev | next [-] | | Also true for deciding whether to write code at all! About 15 years ago I was working with a junior who'd spent 3 hours trying to automate some data editing until I said "mate, you can just edit it all by hand in about an hour!" | | |
| ▲ | taeric 4 days ago | parent | next [-] | | I had a similar one where a billing mistake had happened in a system and the other devs were trying to find out how to determine all of the bad values. I asked if they had simply looked at the top N that could fit on their screen sorted by value. The incorrect values were the obvious outliers. Only about 4 of them. They had been blocked on identifying them for about an hour. | |
| ▲ | VBprogrammer 4 days ago | parent | prev | next [-] | | Funny enough, with LLMs this trade off may well have flipped. For simple tasks like given a string like X, format it like Y, they work amazingly well. | | |
| ▲ | scott_w 4 days ago | parent [-] | | Absolutely. Even back then, trying to script it would sometimes be fine. I've had plenty of situations where I'd try to write a script, give it 30 minutes then say "fuck it, I'm hand-fixing it." The same probably applies to LLMs, you just need to set a cut-off point before it stops being worthwhile. |
| |
| ▲ | lettuceconstant 4 days ago | parent | prev [-] | | Sure, but I have to obtain my dopamine somehow. |
| |
| ▲ | throwaway-kg8d5 4 days ago | parent | prev | next [-] | | Very true. I would have gone with Every coder knows
the secret to optimizin’
is knowing what to throw away
and knowing what to keep
| |
| ▲ | 4 days ago | parent | prev [-] | | [deleted] |
| |
| ▲ | andrepd 4 days ago | parent | prev [-] | | Absolutely! Proptesting is a great but underutilised feature. I use it a lot wherever I'm in a language which makes it easy (and with a mature library for that), like Haskell, Ocaml, or Rust. | | |
| ▲ | Rendello 4 days ago | parent [-] | | The simple cases are great demos, and then the complex cases melt your brain! But still worth it a lot of the time. |
|
| |
| ▲ | vlovich123 5 days ago | parent | prev | next [-] | | Hyperoptimizing for the fast path today and ignoring that hardware and usage patterns change is the reason modern software is so slooow :) A more robust strategy would at least be to check if the rule is the same as the previous one (or use a small hash table) so that the system is self-healing. Ken’s solution is at least robust, and for that property I would prefer it, since it’s just as fast but doesn’t have any weird tail latencies: requests outside your cached distribution are as fast as the ones in it. | |
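A sketch of the self-healing variant suggested above: a one-entry cache of the last rule that matched, rather than a hard-coded predicate. The `rules` and `matches` names are hypothetical.

    _last_matched = None  # one-entry cache of the most recently matched rule

    def choose_rule(request, rules, matches):
        global _last_matched
        # Try whatever matched last time first; with a skewed distribution this
        # almost always hits, and it adapts if the distribution shifts.
        if _last_matched is not None and matches(_last_matched, request):
            return _last_matched
        # General path (e.g. Ken's matcher) for the misses.
        for rule in rules:
            if matches(rule, request):
                _last_matched = rule
                return rule
        return None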
| ▲ | Jean-Papoulos 4 days ago | parent | next [-] | | You were shown an example of exactly why this thinking is incorrect but you still insist... Also, it's trivial to keep Ken's implementation as the slow path. If request patterns change, dig up the new fast path and put the old one in Ken's slow path code. Most of the performance will still come from the initial `if`. | | |
| ▲ | vlovich123 4 days ago | parent [-] | | It’s ungenerous to assume I would be against the if statement + Ken’s. But Ken’s approach is critically important and the “if” statement should just be a 1 entry cache instead of being hard coded. Getting this stuff right in a future proofed durable way is actually quite hard even when you notice the opportunity. |
| |
| ▲ | necovek 5 days ago | parent | prev [-] | | Nobody is hyperoptimizing the fast path today. Ken's solution was stated to have been slower than the alternative optimization. | | |
| ▲ | akie 5 days ago | parent [-] | | Ken's solution optimized the general case, basically everything that doesn't match the if-statement. |
|
| |
| ▲ | ants_everywhere 5 days ago | parent | prev [-] | | You can even track the request statistics live and disable the fast path if the distribution of requests changes significantly. |
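A rough sketch of that idea: keep live hit counters for the fast path and bypass it if the hit rate drops below a threshold. The window and threshold values are arbitrary placeholders.

    class AdaptiveFastPath:
        def __init__(self, threshold=0.9, window=10_000):
            self.hits = 0
            self.total = 0
            self.threshold = threshold  # minimum hit rate to keep the fast path
            self.window = window        # re-evaluate after this many requests
            self.enabled = True

        def record(self, hit):
            # Call with True/False depending on whether the fast path matched.
            self.total += 1
            self.hits += int(hit)
            if self.total >= self.window:
                self.enabled = (self.hits / self.total) >= self.threshold
                self.hits = self.total = 0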
| |
| ▲ | scott_w 4 days ago | parent | prev [-] | | I think you missed the point. Ken's version wasn't removed, it was simply prepended with something like: if value in most_common_range:
take_fast_path()
else:
use_kens_solution()
|
|
|
| ▲ | sylware 4 days ago | parent | prev | next [-] |
I write assembly, and often you have many code paths/data structures which "fit". Each combination of code paths/data structures will favor some specific usage/data at the expense of the others. And this is not real "optimization", since usually those code paths have roughly "the same" cost. The bottom line: it is what you think the semantics and real-life usage of this code will be that drives those "choices". That said, you can still have generic optimizations which benefit nearly everything (for instance, quicksort).
|
| ▲ | underdeserver 4 days ago | parent | prev | next [-] |
| This is a good place to mention profile-guided optimization. In PGO you profile your application running in the real world (via automatic instrumentation) and the compiler uses the output to make optimization decisions. I'm not sure what exactly they use it for, but I imagine it could be used for loop unrolling, switch case ordering, branch prediction optimization, memory ordering and more. |
| |
| ▲ | menaerus 4 days ago | parent [-] | | What if your "application" is server-side code hosting a multitude of different types of workloads, e.g. databases? I was never really convinced by the PGO benefits for such open-ended and complex workloads. | |
| ▲ | ltbarcly3 4 days ago | parent [-] | | PGO operates at the CPU branch level. The main benefit, if I understand correctly, is to reduce branch mispredictions, which lead to pipeline stalls. There's absolutely no reason to think that, for the specific workload of a system, a random/arbitrary/heuristic guess is generally the best way to pick the most likely branch for a given branching instruction. For example, when scanning data you might check if the current value is equal to a test value, and branch on that to either code that handles the match, or else code that moves to the next entry to test that. This may be extremely likely to match (95%), as in a hash table, or very likely to not match (string search). A compiler can't tell which is true, and your workload literally has no impact on this unless it is pathological. | |
| ▲ | menaerus 3 days ago | parent [-] | | I don't follow. If I have an ingestion-heavy workload then my hash map will certainly be stressed in a completely different way than if I have a read-only workload. The data that PGO collects during the former will be different from the data that it collects during the latter, so how can it be that the workload doesn't impact the PGO optimizations? Perhaps I misunderstood you. | |
| ▲ | ltbarcly3 3 days ago | parent [-] | | That is not how hashtables work. Hash tables effectively randomize keys into buckets. Whether there is a hash collision is independent of the workload, and only has to do with the number of entries in the hash table vs the threshold to resize it. This is what I meant by pathological data. If you had weird data that somehow, over and over, got hash tables right up to their fill limit and just stayed there then you would have slightly more collisions than an average use case. However you would have to construct weird data to do this. It wouldn't be "more ingestion". The thing PGO optimized would be individual branches in the hash table code, so for example the branching instruction to see if the key in a bucket is == to the key you are looking for. | | |
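A quick simulation of that claim, assuming a fixed bucket count: the collision rate tracks the load factor, not what the keys look like. Names and sizes here are arbitrary placeholders.

    import random, string

    def collision_rate(keys, num_buckets):
        # Fraction of inserts that land in an already-occupied bucket.
        occupied = [False] * num_buckets
        collisions = 0
        for key in keys:
            bucket = hash(key) % num_buckets
            collisions += occupied[bucket]
            occupied[bucket] = True
        return collisions / len(keys)

    random.seed(0)
    n, m = 50_000, 100_000  # same load factor for both "workloads"
    sequential_keys = [f"user:{i}" for i in range(n)]
    random_keys = ["".join(random.choices(string.ascii_letters, k=12)) for _ in range(n)]
    # Both print roughly the same rate, despite very different key shapes.
    print(collision_rate(sequential_keys, m), collision_rate(random_keys, m))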
| ▲ | menaerus 2 days ago | parent [-] | | > Whether there is a hash collision is independent of the workload, In a read-only workload, searching through the key-value space, there can't be a collision, no? | |
| ▲ | ltbarcly3 2 days ago | parent [-] | | I think you should read and understand how a hash table works before you have an opinion on how they work. | | |
| ▲ | menaerus 2 days ago | parent [-] | | You're incredibly rude and also clueless, but I'm nonetheless giving you the benefit of the doubt to understand whether there is something I'm missing. But you're right, this is now way over your head. | |
| ▲ | ltbarcly3 2 days ago | parent [-] | | I'm sorry for being rude. To address your question, yes, in fact all hash table collisions happen during the search key matching phase, if you study the collision resolution strategies on the wikipedia page you will see this. https://en.wikipedia.org/wiki/Hash_table#Collision_resolutio... | | |
| ▲ | menaerus a day ago | parent [-] | | Ok, I can see how I didn't manage to fully explain myself - by replying to your "collisions" argument I actually meant something more than that, and while it seemed obvious in my head what I wanted to say, I can agree that my wording was imprecise. Sorry about that. What I pictured in my head is that ingestion workloads, in contrast to read-only workloads, are different in that collisions in the former can trigger hash-map resizing/rehashing, which is why I say that these are two distinct workloads. Moving a bunch of data to and from the buckets. So, my hypothesis is the following: if I PGO the implementation using a dataset (volume, data distribution) such that it triggers a handful of such resizing/rehashing operations, and we observe positive improvements in wall-time performance, what are the chances that the same binary, now run over a completely different dataset, will preserve the same runtime performance improvements? What you're saying, if I understand correctly, is that the improvements will be the same regardless of the dataset (and even the type of workload), and I am saying I am not so sure about that. |
|
|
|
|
|
|
|
|
|
|
| ▲ | n3t 4 days ago | parent | prev [-] |
Anyone have a link to the talk?
| |