| ▲ | nyrikki 18 hours ago |
Nit: ARM processors primarily use a modified Harvard architecture, including the Raspberry Pi Pico.
|
| ▲ | NooneAtAll3 18 hours ago | parent | next [-] |
This isn't about the Harvard/von Neumann split (or no split) between i-cache and d-cache. I think this post is more about... compute in memory? If I got it right?
| ▲ | nyrikki 15 hours ago | parent | next [-] |
Here is John Backus' original paper[0], which is an easy read, but note that what he calls "functional programming" has nothing to do with lambda calculus, Haskell, etc... it is the APL family. He is absolutely one of IBM's historical rockstars. IMHO they are invoking him to sell their NorthPole chips, which have on-die memory distributed among the processing components and probably have value.

> In its simplest form a von Neumann computer has three parts: a central processing unit (or CPU), a store, and a connecting tube that can transmit a single word between the CPU and the store (and send an address to the store). I propose to call this tube the von Neumann bottleneck. The task of a program is to change the contents of the store in some major way; when one considers that this task must be accomplished entirely by pumping single words back and forth through the von Neumann bottleneck, the reason for its name becomes clear.

IMHO IBM is invoking John Backus' work to sell what may be an absolutely great product, but these chips are really just ASICs and don't relate to the machine or programming-language limits he was describing.

[0] https://dl.acm.org/doi/pdf/10.1145/359576.359579
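To make the word-at-a-time contrast concrete, here is a rough sketch using NumPy as a stand-in for the APL family (this is just an illustration, not Backus' actual FP notation):

    import numpy as np

    a = np.random.rand(100_000)
    b = np.random.rand(100_000)

    # "von Neumann style": pump one word at a time through the bottleneck,
    # updating a named store (acc) on every round trip.
    acc = 0.0
    for i in range(len(a)):
        acc += a[i] * b[i]

    # "APL/FP style": one whole-array expression; no word-at-a-time loop
    # appears in the program text.
    dot = (a * b).sum()

    assert np.isclose(acc, dot)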
| ▲ | danudey 17 hours ago | parent | prev [-] |
Sort of? It's about locality of data; this has often been a bottleneck, which is why we have CPU caches to keep data extremely close to the CPU cores, with practically zero latency and throughput limitations compared to fetching from main memory. Unfortunately, now we're shuffling terabytes of data through our algorithms and the CPU spends a huge amount of its time waiting for the next batch of data to come in through the pipe.

This is, IIRC, part of why Apple's M-series chips are as performant as they are: they not only have a unified memory architecture, which eliminates the need to copy data from CPU main memory to GPU or NPU main memory to operate on it (and then copy the result back), but the RAM being on the package means that it's slightly "more local" and the memory channels can be optimized for the system they're going to be connected to.
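For a feel of how much that locality is worth, here's a toy NumPy sketch. Both sums do the same arithmetic, but the strided view touches a new cache line per element, so it is limited by memory traffic rather than compute (timings will vary by machine):

    import time
    import numpy as np

    N = 1 << 22                  # elements summed in each case
    buf = np.random.rand(N * 8)  # ~256 MB backing buffer

    contiguous = buf[:N]         # dense: one 64-byte cache line serves 8 doubles
    strided = buf[::8]           # same element count, but a new cache line per element

    for name, view in [("contiguous", contiguous), ("strided", strided)]:
        start = time.perf_counter()
        total = view.sum()       # identical FLOP count either way
        elapsed = time.perf_counter() - start
        print(f"{name:10s} sum={total:.1f}  {elapsed * 1e3:.1f} ms")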
|
|
| ▲ | bobmcnamara 18 hours ago | parent | prev | next [-] |
Nit: the RP2040 is a von Neumann design. There's only one AHB port on the M0. Edit: see also ARM7TDMI, Cortex-M0/M0+/M1, and probably a few others. All the big stuff is modified Harvard or, very rarely, pure Harvard.
| ▲ | nyrikki 16 hours ago | parent [-] |
You are correct, I should have specified the Pico 2.

That said, AHB-Lite is called "Lite" because it is a simplified form of ARM's normal AHB bus. The RP2350 can issue one fetch and one load/store per cycle, and the point is that almost everything called a CPU rather than an MCU will have AHB5 or better.

The "von Neumann bottleneck" was (when I went to school) that the CPU cannot simultaneously fetch an instruction and read/write data from or to memory. That doesn't apply to smartphones, PCs or servers, even in the Intel world, due to instruction caches etc... It is just "old man yells at cloud".
|
|
| ▲ | ajross 16 hours ago | parent | prev [-] |
That's valid jargon, but from the wrong layer of the stack. A Harvard bus is about the separation of the "instruction" memory from "data" memory so that (pipelined) instructions can fetch from both in parallel. And in practice it's implemented in the L1 (and sometimes L2) cache, where you have separate icache/dcache blocks in front of a conceptually unified[1] memory space.

The "von Neumann architecture" is the more basic idea that all the computation state outside the processor exists as a linear range of memory addresses which can be accessed randomly. And the (largely correct) argument in the linked article is that ML computation is a poor fit for von Neumann machines, as all the work needed to present that unified picture of memory to all the individual devices is largely wasted, since (1) very little computation is actually done on individual fetches and (2) the connections between all the neurons are highly structured in practice (specific tensor rows and columns always go to the same places), so a simpler architecture might be a better use of die space.

[1] Not actually unified, because there's page translation, IO-MMUs, fabric mappings and security boundaries all over the place that prevent different pieces of hardware from actually seeing the same memory. But that's the idea anyway.
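Point (1) is easy to put rough numbers on: in a matrix-vector multiply each weight is fetched once and used for a single multiply-add, so memory traffic rather than arithmetic sets the speed limit. A back-of-the-envelope sketch with hypothetical throughput figures (assumptions for illustration, not measurements of any particular chip):

    # Rough arithmetic-intensity estimate for one fp16 matrix-vector multiply.
    rows, cols = 8192, 8192
    bytes_per_weight = 2                     # fp16
    flops = 2 * rows * cols                  # one multiply + one add per weight
    bytes_moved = rows * cols * bytes_per_weight

    intensity = flops / bytes_moved          # FLOPs per byte fetched
    print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")  # 1.0 for fp16 GEMV

    peak_flops = 100e12                      # hypothetical 100 TFLOP/s of compute
    peak_bw = 1e12                           # hypothetical 1 TB/s of DRAM bandwidth
    t_compute = flops / peak_flops
    t_memory = bytes_moved / peak_bw
    print(f"compute-bound: {t_compute * 1e6:.1f} us, memory-bound: {t_memory * 1e6:.1f} us")

With those assumed figures the memory-bound time is roughly 100x the compute-bound time, which is the sense in which the ALUs sit idle waiting on the fabric.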