Batching is definitely the right answer for some offline / throughput-only cases, but it was not the right tradeoff here.

This pipeline processes live camera frames and displays/streams the annotated output, so latency and frame freshness matter. A larger batch adds queueing latency and makes the displayed output staler, especially when the sensor is producing frames continuously.
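To put a rough number on the queueing cost, here is a minimal sketch. It assumes a single camera delivering frames at a steady rate and a batch that is dispatched as soon as its last frame arrives. The function name and the 45 FPS figure are illustrative only, not measurements from this pipeline.

```python
def added_batch_wait_ms(batch_size: int, camera_fps: float) -> float:
    """Extra time the oldest frame waits for its batch to fill, in ms.

    Assumption: one camera, steady frame interval, batch dispatched the
    moment its last frame arrives. Inference time is not included.
    """
    if batch_size < 1 or camera_fps <= 0:
        raise ValueError("batch_size must be >= 1 and camera_fps > 0")
    frame_interval_ms = 1000.0 / camera_fps
    # The first frame in the batch waits for (batch_size - 1) more frames.
    return (batch_size - 1) * frame_interval_ms


if __name__ == "__main__":
    for batch in (1, 2, 4):
        wait = added_batch_wait_ms(batch, camera_fps=45.0)
        print(f"batch={batch}: oldest frame waits {wait:.1f} ms before inference")
```

Under those assumptions a batch of 2 at 45 FPS costs roughly 22 ms of extra wait for the older frame, and the cost grows linearly with batch size. The model does not cover batches assembled from several cameras that deliver frames at the same instant.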

The “multithreading” here is not treating the NPU like a CPU in the usual sense. The RK3588S NPU is exposed as 3 cores, and RKNN supports using separate contexts with `rknn_dup_context` and assigning them with `rknn_set_core_mask`. The point was to keep the 3 NPU cores fed while capture, RGA preprocessing, inference, and display are pipelined.
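The shape of that pipeline, as a simplified sketch rather than the actual implementation: one worker thread per NPU core, all pulling frames from a shared bounded queue. `fake_infer` and `run_pipeline` are hypothetical stand-ins. In the real code each worker would own a context created with `rknn_dup_context` and pinned with `rknn_set_core_mask`, and the sleep below would be the inference call.

```python
import queue
import threading
import time

NUM_CORES = 3  # the RK3588S NPU is exposed as 3 cores


def fake_infer(core_id: int, frame_id: int) -> tuple[int, int]:
    """Hypothetical stand-in for inference on one per-core context."""
    time.sleep(0.005)  # simulated inference time, not a measured value
    return (frame_id, core_id)


def run_pipeline(num_frames: int) -> list[tuple[int, int]]:
    # Bounded queue: the producer blocks instead of building a long backlog.
    frames: queue.Queue = queue.Queue(maxsize=NUM_CORES * 2)
    results: dict[int, tuple[int, int]] = {}
    lock = threading.Lock()

    def worker(core_id: int) -> None:
        while True:
            frame_id = frames.get()
            if frame_id is None:  # sentinel: no more frames
                break
            out = fake_infer(core_id, frame_id)
            with lock:
                results[frame_id] = out

    workers = [threading.Thread(target=worker, args=(c,)) for c in range(NUM_CORES)]
    for w in workers:
        w.start()
    for frame_id in range(num_frames):  # stands in for capture + preprocessing
        frames.put(frame_id)
    for _ in workers:
        frames.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    # Workers may finish out of order, so restore frame order for display.
    return [results[i] for i in range(num_frames)]


if __name__ == "__main__":
    for frame_id, core_id in run_pipeline(9):
        print(f"frame {frame_id} ran on worker {core_id}")
```

Which worker picks up which frame depends on thread scheduling, so the sketch reorders results by frame id before handing them on.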

In the single-context loop I was seeing ~31 FPS. With one context per NPU core and pipelined frame handling, it reaches the camera ceiling, around 42–46 FPS depending on the mode. So in this particular real-time streaming setup, parallel contexts/core masks were the practical way to saturate the hardware without adding batch latency.

stefan_ 3 hours ago

Again with it. You have two cameras, so you can batch 2 already with no latency hit. In fact less latency because the fake multithreading is gone.

(You are not even measuring latency correctly)

alebal123bal 3 hours ago

[flagged]