lbreakjai 5 hours ago

I consider myself rather smart and good at what I do. It's nice to have a look at problems like these once in a while, to remind myself of how little I know, and how much closer I am to the average than to the top.

TrackerFF 4 hours ago | parent | next [-]

Well it is a specialized problem. If you've never worked on anything similar previously, it is going to take time. Don't even need to interview for selective billion dollar companies like Anthropic to encounter these types of problems - after college I interviewed for various electronics/hardware companies where you'd get asked to optimize low-level code - which would have looked quite foreign, if you had never actually worked on such problems before.

Onavo 3 hours ago | parent [-]

If you ask an EE to debug React state management code without prior exposure, they won't do too well either. But on the other hand, they can easily pick up most of it after a week-long crash course, while training a performance engineer who can optimize code for a specific architecture would take months.

2 hours ago | parent | next [-]
[deleted]
ignoramous 2 hours ago | parent | prev [-]

> EE to debug react state management ... easily pick up most of it after a week long crash course while training a performance engineer ... would take months

Isn't that mostly because, as you go up the abstraction layers, tools and docs to teach yourself the tricks of the trade quickly are in abundance (let alone for a popular layer like React)? Which in turn is likely a function of incentives and opportunities.

fny 30 minutes ago | parent [-]

It's because the higher up the stack you go, the more declarative and literate the tools become. Calling sort is far easier than understanding the sorting algorithm, for example.
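To make that concrete, here's a toy comparison in Python: the declarative call says *what* you want, while the hand-rolled version spells out *how* (the quicksort here is just an illustration, not anything from the take-home):

```python
# High up the stack: declarative. Say *what* you want.
names = sorted(["carol", "alice", "bob"])

# Lower down: imperative. Spell out *how*, e.g. a hand-rolled quicksort.
def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    return (quicksort([x for x in rest if x < pivot])
            + [pivot]
            + quicksort([x for x in rest if x >= pivot]))

assert quicksort(["carol", "alice", "bob"]) == names
```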

fergie 4 hours ago | parent | prev | next [-]

I'm 30 years in, and literally don't understand the question.

WithinReason 30 minutes ago | parent | next [-]

After a quick look, this can be seen as a low-level GPU/TPU optimization problem where you have to consider the throughput and depth of different arithmetic pipelines. If you want to hire people who understand how to do that, you unfortunately have to give them such a convoluted task and emulate the relevant parts of the HW. (In reality this is probably more like a TPU since it has scalar pipelines, but the optimization methods are not that different.)

The task is to parallelize tree traversal, which is embarrassingly unparallel, so it's tricky.
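The difficulty can be sketched in a few lines: each traversal step depends on the node chosen by the previous step, so a single walk is a serial chain; the parallelism has to come from batching many independent walks. (The tree layout and hash step below are invented for illustration, not Anthropic's actual kernel.)

```python
def walk(tree, key, depth):
    # Each iteration needs the node chosen by the previous one:
    # a serial dependency chain that no single traversal can hide.
    node = 0
    for _ in range(depth):
        key = (key * 1103515245 + 12345) & 0x7FFFFFFF  # toy hash step
        node = tree[node][key & 1]                     # branch left/right
    return node

# Parallelism comes from batching *independent* queries, not from
# splitting one walk: each chain is serial, but thousands of chains
# can advance in lockstep across SIMD lanes.
tree = {0: (1, 2), 1: (3, 4), 2: (5, 6),
        3: (3, 3), 4: (4, 4), 5: (5, 5), 6: (6, 6)}
results = [walk(tree, q, depth=3) for q in range(8)]
```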

31 minutes ago | parent | prev | next [-]
[deleted]
mike_hearn 3 hours ago | parent | prev | next [-]

The question isn't clearly written down anywhere, that's why. Presumably actual candidates would have been given more info over the phone or email. Part of the "challenge" is reverse engineering their Python; unclear if that's intentional.

If you look at the top of perf_takehome.py, there is a brief comment saying the challenge is to optimize a kernel. "Kernel" in GPU land means a program that computes on data in parallel; it's not an OS kernel:

    Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
    available time, as measured by test_kernel_cycles on a frozen separate copy
    of the simulator.
However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line:

https://github.com/anthropics/original_performance_takehome/...

This function is described only as:

    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.
The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. And it says this is an assembly version of the python function reference_kernel2 which is found in problem.py.

What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.

At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago and it never took off; since then the concept has been largely dead, although I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA, so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that's done ahead of time by the compiler or (in this case) the humble programmer: you. But this isn't just a VLIW machine, it's also multi-core and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple pieces of data simultaneously.
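A toy model of the VLIW idea (instruction names and bundle format invented for illustration, not the simulator's actual ISA): each cycle executes one bundle, every slot in the bundle reads pre-cycle state, so slots within a bundle must be independent and dependent work has to wait for the next bundle:

```python
# Hypothetical mini-VLIW: a program is a list of bundles; each slot is
# (op, dest, src_a, src_b). All slots in a bundle see the *old* register
# values, i.e. true in-cycle parallelism.
def run(bundles, regs):
    for bundle in bundles:
        reads = dict(regs)               # snapshot: pre-cycle values
        for op, dst, a, b in bundle:
            if op == "add":
                regs[dst] = reads[a] + reads[b]
            elif op == "mul":
                regs[dst] = reads[a] * reads[b]
    return regs

# Two independent ops share a bundle (1 cycle); the dependent add
# must go in the following bundle. The scheduling is entirely static.
regs = run(
    [
        [("add", "r2", "r0", "r1"), ("mul", "r3", "r0", "r1")],  # parallel
        [("add", "r4", "r2", "r3")],                             # dependent
    ],
    {"r0": 2, "r1": 3, "r2": 0, "r3": 0, "r4": 0},
)
assert regs["r4"] == 11  # (2+3) + (2*3)
```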

This machine doesn't have registers or caches, but it does have "scratch space", so you can use the vector instructions to load data into a series of 32-bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch-space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
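In lane terms, broadcast is just replication of one scalar into every slot so later vector ops can apply it elementwise. A minimal sketch (lane count and masking are assumptions for illustration):

```python
def broadcast(scalar, lanes=8):
    # Fill every 32-bit scratch slot ("lane") with the same value,
    # truncating to the machine's word size.
    return [scalar & 0xFFFFFFFF] * lanes

lanes = broadcast(0xFF)
assert lanes == [0xFF] * 8
```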

And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.

So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.

Does that help? Sounds like a fun exercise :)

Edit: I just checked and Google TPUs are much more VLIW-like, so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.

forgotpwd16 2 hours ago | parent | next [-]

This is a nice writeup. Thanks. Another commenter said it would've taken them 2h just to sketch out ideas; sans LLMs it would've taken me more than 2h just to collect all this info, let alone start optimizing.

mike_hearn 2 hours ago | parent [-]

It took me about 10 minutes to generate that writeup the old-fashioned, 100% organic way, because one of the things that's unspecified is whether you're allowed to use AI to help solve it! So I assumed, as it's a job interview question, that you're not allowed, but now I see other comments saying it was allowed. That would let you get much further.

I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.

carschno 2 hours ago | parent | prev | next [-]

On the one hand, this exercise probably reflects a realistic task. Daily engineering work comprises a lot of reverse engineering and debugging of messy code. On the other hand, this does not seem very suitable as an isolated assignment. The lack of code base-specific context has a lot of potential for frustration. I wonder what they really tested on the candidates, and whether this was what they wanted to filter for.

dist-epoch 20 minutes ago | parent | prev | next [-]

> but which requires all parallelism to be statically declared ahead of time

this is what all specialized chips like TPU/Cerebras require today, and it allows for better optimization than a generic CPU, since you can "waste" 30 min figuring out the perfect routing/sequencing of operations instead of doing it in the CPU in nanoseconds/cycles

another benefit is you can throw away all the CPU's out-of-order/branch-prediction logic and put useful matrix multipliers in its place

fergie an hour ago | parent | prev [-]

Wow! Thanks for the explanation :)

4 hours ago | parent | prev | next [-]
[deleted]
PeterStuer 3 hours ago | parent | prev | next [-]

Which part exactly are you having trouble with?

- Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator

bsder an hour ago | parent | prev | next [-]

Since it's a CPU, you start with the idea that there is an ALU and spiral outward from that. That gives you something concrete to wrap your head around while you climb up the abstraction levels.

However, when I hit "scratch_write" and it wasn't in the Machine class, wasn't coming from some decorator, and was getting defined and deleted by a member function ... I stopped. That pays only lip service to the type annotations scattered around, and it actively hampers even basic IDE usage. Probably the typing was added by an AI/LLM after the fact, and it missed that unusual usage. The Python convention used to be that those kinds of attributes got declared as "_scratch_write", with a leading underscore to flag that they were "private/internal".

That was the gigantic red "We write shitty code" signal or worse "We don't care about wasting your time" signal. Human review should have flagged that.

Shame. I was kinda looking forward to the technical problem, but I'm not going to spend a bunch of time using grep to untangle garbage code to get at it.

I suspect everything would actually be much clearer if you wrote it in SystemVerilog and tested with Cocotb. Let's see if their LLMs can handle that porting job. HAH!

measurablefunc 3 hours ago | parent | prev [-]

Generate instructions for their simulator to compute some numbers (hashes) in whatever is considered the memory of their "machine"¹. I didn't see any place where they actually disallow cheating, b/c it says they only check the final state of the memory², so it seems like if you know the final state you could just "load" the final state into memory. The cycle count is supposedly the LLM figuring out the fewest instructions needed to compute the final state, but again, it's not clear what they're actually measuring, b/c if you know the final state you can cheat, and there's no way to tell how they're prompting the LLM to avoid the answers leaking into the prompt.
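The worry can be illustrated with a toy harness (entirely invented, not Anthropic's checker): if the test only compares final memory, an honest kernel and one that bulk-loads precomputed answers are indistinguishable:

```python
# Toy version of the hazard: a checker that only inspects final memory
# can't tell an honest kernel from one that just loads known answers.
def honest_kernel(mem):
    for i in range(len(mem)):
        mem[i] = (mem[i] * 31 + 7) & 0xFFFF   # stand-in for real work

def cheating_kernel(mem, precomputed):
    mem[:] = precomputed                      # one bulk "load", few cycles

inputs = [1, 2, 3]
expected = [(x * 31 + 7) & 0xFFFF for x in inputs]

a = list(inputs); honest_kernel(a)
b = list(inputs); cheating_kernel(b, expected)
assert a == b == expected  # final-state check passes either way
```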

¹https://github.com/anthropics/original_performance_takehome/...

²https://github.com/anthropics/original_performance_takehome/...

saagarjha 3 hours ago | parent [-]

Well, they read your code in the actual hiring loop.

measurablefunc 3 hours ago | parent [-]

My point still stands. I don't know what the LLM is doing so my guess is it's cheating unless there is evidence to the contrary.

red75prime 2 hours ago | parent | next [-]

I guess your answer to "Try to run Claude Code on your own 'ill-defined' problem" would be "I'm not interested." Correct? I think we can stop here then.

KeplerBoy 40 minutes ago | parent | prev | next [-]

Well that's certainly a challenge when you use LLMs for this test driven style of programming.

saagarjha 3 hours ago | parent | prev [-]

Why do you assume it’s cheating?

measurablefunc an hour ago | parent [-]

Because it's a well-known failure mode of neural networks & scalar-valued optimization problems in general: https://www.nature.com/articles/s42256-020-00257-z

saagarjha 12 minutes ago | parent | next [-]

Again, you can just read the code

red75prime 21 minutes ago | parent | prev [-]

And? Anthropic is not aware of this 2020 paper? The problem is not solvable?

mangatmodi an hour ago | parent | prev | next [-]

Being smart is different from having the knowledge. If you learn about these concepts and work on these problems, then you will be able to solve them.

It's not about you being average, just a different knowledge set.

chistev an hour ago | parent | prev | next [-]

What we know is a drop, what we don't know is an ocean.

xenihn 4 hours ago | parent | prev | next [-]

It comes with test suites, so that gives you a base to start from. You can at the very least do trial-and-error and come up with some heuristics on the fly. You're at a huge disadvantage to someone who has some familiarity but can convincingly play it off as being a newcomer, though.

apsurd 5 hours ago | parent | prev [-]

disagree. nobody has a monopoly on what metric makes someone good. I don't understand all this leet code optimization. actually i do understand it, but it's a game that will attract game optimizers.

the hot take is, there are other games.

tuetuopay 3 hours ago | parent | next [-]

This is the opposite of leet code.

Yes, this applies to some simulated imaginary CPU with an artificial problem. Except that the job asked about here is exactly the core of what a performance engineer will do at Anthropic: optimize kernels for their fleet of GPUs. Is it simplified? Yes! (e.g. the simulator does not restrict memory access patterns)

This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.

saagarjha 3 hours ago | parent | prev | next [-]

This is explicitly not Leetcode, in fact its goal is to attract optimizers

sevenzero 4 hours ago | parent | prev [-]

Also, leetcode does not really provide insight into one's ability to design business solutions, whether that's system design, some small feature implementation, or communication skills within a team. It's just optimizers jerking each other off on cryptic problems 99.999999999% of developers will never see in real life. Maybe it would've been useful like 30 years ago, but all commonly used languages have these fancy algorithms baked into their stdlib, so why would I ever have to implement them myself?

lbreakjai 4 hours ago | parent | next [-]

But this is an interview problem at Anthropic, not at your local CRUD factory. They _are_ looking for the optimizers, because they _are_ working on cryptic problems the 99.9999% of us will never encounter.

thorncorona 4 hours ago | parent | prev [-]

Or more likely, the commonality is how you're applying your software skills?

In every other field it's helpful to understand the basics. I don't think software is the exception here.

sevenzero 4 hours ago | parent [-]

Understanding the basics is very different from being able to memorize algorithms. I really don't see why I'd ever have to implement stuff like quicksort myself somewhere. Yes, I know what recursion is; yes, I know what quicksort is; so if I ever need it, I know what to look for. Which was good enough throughout my career.