| ▲ | thehk 5 hours ago | |
> "ESP32-S31 is particularly well suited for edge AI and machine learning workloads, including neural network inference." Is there any way to know what kind of performance one could expect when running, e.g., a Depth Anything model on it? | ||
| ▲ | kcb 7 minutes ago | parent | next [-] | |
A real example: https://github.com/OHF-Voice/micro-wake-word | ||
| ▲ | mattalex 2 hours ago | parent | prev | next [-] | |
Regarding Depth Anything specifically: you're not running this on a microcontroller. In general, CNNs still reign supreme on microcontrollers because they have a much lower peak memory demand, which is what usually kills you. In this case you have a couple of _kilobytes_ of SRAM, potentially extendable to a couple of megabytes of PSRAM. Even for small CNNs you often need some quite complex interleaving of layers (i.e. running parts of layer 1 and layer 2 interleaved, to take advantage of the downsampling in CNNs) to keep performance and memory impact reasonable (see e.g. https://openreview.net/pdf?id=2O8qbyxH6X). Think more "image classifier", less "run an image-to-image transformer". For Depth Anything, a single layer's activation is probably significantly larger than the available SRAM (I think it is (224/16)^2 patches, each with activations [48, 96, 192, 384] for Depth Anything Small). You aren't running this. | ||
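To put rough numbers on the activation estimate above, here is a minimal back-of-the-envelope sketch. It takes the guessed figures from the comment at face value (224-pixel input, 16-pixel patches, channel widths [48, 96, 192, 384]); these are not verified model dimensions. The byte widths per value and the helper names are assumptions made for illustration. It sizes one activation tensor only and ignores weights, attention score matrices, and any other buffers.

```python
# Back-of-the-envelope size of ONE activation tensor, using the figures
# guessed in the comment above (not verified against the real model).
IMAGE_SIDE = 224                      # assumed square input, in pixels
PATCH_SIDE = 16                       # assumed square patch, in pixels
CHANNELS = [48, 96, 192, 384]         # channel widths quoted in the comment
BYTES_PER_VALUE = {"float32": 4, "int8": 1}  # assumed storage formats


def num_patches(image_side: int, patch_side: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_side // patch_side
    return per_side * per_side


def activation_bytes(tokens: int, channels: int, bytes_per_value: int) -> int:
    """Bytes needed to hold a single (tokens x channels) activation tensor."""
    return tokens * channels * bytes_per_value


tokens = num_patches(IMAGE_SIDE, PATCH_SIDE)

# Size of one tensor for every (channel width, storage format) pair.
sizes = {
    (channels, dtype): activation_bytes(tokens, channels, width)
    for channels in CHANNELS
    for dtype, width in BYTES_PER_VALUE.items()
}

for (channels, dtype), size in sizes.items():
    print(f"{channels:>3} channels, {dtype:>7}: {size / 1024:7.1f} KiB")
```

Under these assumptions the widest float32 tensor alone is 294 KiB, and a forward pass typically needs more than one such tensor alive at a time.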
| ▲ | otterdude 5 hours ago | parent | prev | next [-] | |
I was wondering this as well. What exactly makes this a good AI chip vs. others? Unless they're not listing a major feature in their spec, a dual-core 320 MHz microcontroller is not bad, but you're not going to be running any kind of vision model on it, at least not very fast. | ||
| ▲ | porridgeraisin 3 hours ago | parent | prev | next [-] | |
Memory is the main constraint. You have what, 8 MB of PSRAM? Compute-wise you can manage. You can do quantisation and perhaps run a small 10-15 layer CNN. Image classification is possible. Keep in mind that the channel count and input resolution cannot be high, since memory will be a problem. You can maybe do face _detection_, and maybe "is my cat on my keyboard" classification as well. With audio, you can do a lot more. Wake word detection happens on _much_ smaller accelerators inside iPhones. On this one you can do slightly heavier classification, maybe speaker identification ("which member of the family") or "which dog is barking". | ||
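The point about channel count and input resolution can be illustrated with a rough peak-memory estimate. This is a sketch under loud simplifications: the layer list is made up and is not a real model, values are int8 (one byte each), convolutions use "same" padding, and only the input and output buffer of the layer currently running are counted (no weights, scratch space, or skip connections). The function names are hypothetical.

```python
# Rough peak-activation estimate for a small sequential int8 CNN.
# Every shape below is a made-up illustration, not a real model.


def conv_output_side(side: int, stride: int) -> int:
    """Output side length for a 'same'-padded convolution: ceil(side / stride)."""
    return -(-side // stride)


def peak_activation_bytes(input_side, input_channels, layers, bytes_per_value=1):
    """Largest input-plus-output buffer size over all layers.

    layers is a list of (out_channels, stride) tuples applied in order.
    """
    side, channels = input_side, input_channels
    peak = 0
    for out_channels, stride in layers:
        out_side = conv_output_side(side, stride)
        in_bytes = side * side * channels * bytes_per_value
        out_bytes = out_side * out_side * out_channels * bytes_per_value
        peak = max(peak, in_bytes + out_bytes)
        side, channels = out_side, out_channels
    return peak


# Hypothetical 10-layer network: (out_channels, stride) per layer.
LAYERS = [
    (8, 2), (16, 1), (16, 2), (32, 1), (32, 2),
    (64, 1), (64, 2), (128, 1), (128, 2), (128, 1),
]

small = peak_activation_bytes(96, 3, LAYERS)    # 96x96 RGB input
large = peak_activation_bytes(224, 3, LAYERS)   # 224x224 RGB input

print(f"96x96 input:   {small / 1024:.0f} KiB peak")
print(f"224x224 input: {large / 1024:.0f} KiB peak")
```

In this simplified model the peak grows with the square of the input side length and linearly with the channel count of the early layers, which is why both have to stay small.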
| ▲ | asadm 2 hours ago | parent | prev [-] | |
Nope. Not happening. At most YOLO, or maaaybe FastDepth. | ||