thehk 5 hours ago

> ESP32-S31 is particularly well suited for edge AI and machine learning workloads, including neural network inference

Any way to know what kind of performance one could expect running e.g. a Depth Anything model on there?

kcb 7 minutes ago | parent | next [-]

A real example: https://github.com/OHF-Voice/micro-wake-word

mattalex 2 hours ago | parent | prev | next [-]

Regarding Depth Anything specifically: you're not running this on a microcontroller. In general, CNNs still reign supreme on microcontrollers because their peak memory demand is much lower, and peak memory is what usually kills you. In this case you have a couple of _kilobytes_ of SRAM, potentially extendable to a couple of megabytes of PSRAM.

Even for small CNNs you often need some quite complex interleaving of layers (i.e. running parts of layer 1 and layer 2 interleaved, to take advantage of the downsampling in CNNs) to keep performance and memory impact reasonable (see e.g. https://openreview.net/pdf?id=2O8qbyxH6X).
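A toy sketch of the interleaving idea, not the method from the linked paper: two stride-2 1D convolutions are run either layer by layer (the whole intermediate feature map is materialised) or fused (only a rolling window of intermediate values is kept). The kernels, the signal, and all function names here are made up for illustration, and the sketch assumes the input can be streamed.

```python
from collections import deque

def conv1d(xs, kernel, stride):
    """Plain 'valid' 1D convolution (no padding)."""
    last_start = len(xs) - len(kernel)
    return [sum(w * xs[i + j] for j, w in enumerate(kernel))
            for i in range(0, last_start + 1, stride)]

def layerwise(xs, k1, k2, stride=2):
    """Run layer 1 fully, then layer 2. Returns (output, intermediate size)."""
    hidden = conv1d(xs, k1, stride)
    return conv1d(hidden, k2, stride), len(hidden)

def fused(xs, k1, k2, stride=2):
    """Interleave both layers. Returns (output, peak intermediate size)."""
    window = deque(maxlen=len(k2))  # only the last len(k2) layer-1 outputs
    out, produced, peak = [], 0, 0
    for i in range(0, len(xs) - len(k1) + 1, stride):
        window.append(sum(w * xs[i + j] for j, w in enumerate(k1)))
        produced += 1
        peak = max(peak, len(window))
        # Layer-2 output m needs layer-1 outputs [m*stride, m*stride + len(k2)).
        if produced >= len(k2) and (produced - len(k2)) % stride == 0:
            out.append(sum(w * v for w, v in zip(k2, window)))
    return out, peak

signal = list(range(64))           # hypothetical input
k1, k2 = [1, 0, -1], [1, 2, 1]     # hypothetical kernels
ref_out, hidden_len = layerwise(signal, k1, k2)
fused_out, fused_peak = fused(signal, k1, k2)
print(hidden_len, fused_peak, ref_out == fused_out)
```

Both paths produce the same output, but the fused one never holds more than three intermediate values instead of the full intermediate layer. Real schedulers do this in 2D across many layers, which is where the complexity comes from.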

Think more "image classifier", less "run an image-to-image transformer". For Depth Anything, a single layer's activation is probably significantly larger than the available SRAM (I think it is (224/16)^2 patches, each with activations [48, 96, 192, 384], for Depth Anything Small). You aren't running this.
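A back-of-the-envelope version of that estimate, using only the guessed figures from this comment (224 input, patch size 16, widths 48 to 384). The byte widths (1 for int8, 4 for float32) and the function name are assumptions for illustration, and this counts a single activation tensor only, ignoring weights, attention maps, and MLP expansion.

```python
def activation_bytes(image_size, patch_size, channels, bytes_per_value):
    """Size of one (tokens x channels) activation tensor in bytes."""
    patches_per_side = image_size // patch_size
    tokens = patches_per_side ** 2
    return tokens * channels * bytes_per_value

# Widths quoted in the comment above; not checked against the actual model.
for channels in (48, 96, 192, 384):
    int8 = activation_bytes(224, 16, channels, 1)
    fp32 = activation_bytes(224, 16, channels, 4)
    print(f"{channels:>3} channels: {int8:>7} B int8, {fp32:>7} B float32")
```

At 384 channels that is 196 tokens x 384 = 75,264 bytes in int8 (about 73.5 KiB) and 301,056 bytes in float32 (about 294 KiB) for a single tensor, before anything else is counted.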

otterdude 5 hours ago | parent | prev | next [-]

I was wondering this as well. What exactly makes this a good AI chip vs. others?

Unless they're not listing a major feature in their spec, a dual-core 320 MHz microcontroller is not bad, but you're not going to be running any kind of vision model on it, at least not very fast.

porridgeraisin 3 hours ago | parent | prev | next [-]

Memory is the main constraint. You have what, 8 MB of PSRAM?

Compute-wise you can manage. You can do quantisation and run a small 10-15 layer CNN, perhaps. Image classification is possible. Keep in mind the channel count and input resolution cannot be high, since memory will be a problem. You can maybe do face _detection_, or "is my cat on my keyboard" classification as well.
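For anyone unfamiliar with the quantisation step, here is a minimal sketch of symmetric per-tensor int8 quantisation, one common scheme and not necessarily what any particular ESP32 toolchain uses. The weights and function names are invented for illustration.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.0, 1.0]   # hypothetical float32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

Each weight now fits in 1 byte instead of 4, at the cost of a rounding error of at most half the scale per value. That 4x saving applies to activations as well, which is what matters when memory is the constraint.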

For audio, you can do a lot more. Wake word detection happens on _much_ smaller accelerators inside iPhones. On this one you can do slightly heavier classification. Maybe speaker identification ("which member of the family") or maybe "which dog is barking".

asadm 2 hours ago | parent | prev [-]

nope. not happening. at most YOLO or mayyybe FastDepth