| ▲ | docheinestages 2 hours ago | |||||||
We really need a "memory arena" to serve two important purposes: 1. List all the known agent memory projects (of which there are hundreds). 2. Objectively compare and score them, both against each other and against vanilla harnesses like Claude Code. Only then can I have the cognitive capacity to decide which one makes sense for me. | ||||||||
| ▲ | oleksiibond 37 minutes ago | parent | next [-] | |||||||
Agreed, and point number two is the tricky one. Creating a list of tasks is easy; evaluating them is not. You need a consistent task set, a "clean slate" control (i.e., Claude Code without memory is your proper control), and evaluation criteria that distinguish "uses fewer tokens" from "produces better results"; otherwise you end up with vendors evaluating their own work. I'm currently building a repeatable test harness for PMB: a fixed task, run with and without memory, repeated N times, reporting tokens, turns, and pass/fail, plus a subjective quality score. I'd be happy to share the task set and evaluation criteria for testing anyone else's memory server or clean-slate control, not just mine. | ||||||||
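For anyone who wants to try the same comparison, here is a rough sketch of the shape described above (fixed task, with and without memory, N repeats, cost and outcome reported separately). This is not the PMB harness itself: the `Agent` callable, the `RunResult` fields, and the `compare` helper are all hypothetical names invented for illustration, and how a real agent reports tokens or turns is left as an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class RunResult:
    tokens: int      # total tokens consumed by one run
    turns: int       # number of agent turns in that run
    passed: bool     # whether the fixed task's check passed
    quality: float   # subjective score (e.g. 1-5) assigned by a reviewer


# Hypothetical agent interface: takes a task prompt, returns one RunResult.
Agent = Callable[[str], RunResult]


def run_trials(agent: Agent, task: str, n: int) -> List[RunResult]:
    """Run the same fixed task n times against one configuration."""
    return [agent(task) for _ in range(n)]


def summarize(results: List[RunResult]) -> Dict[str, float]:
    """Report cost (tokens, turns) and outcome (pass rate, quality)
    as separate metrics so cheaper is never mistaken for better."""
    return {
        "mean_tokens": mean(r.tokens for r in results),
        "mean_turns": mean(r.turns for r in results),
        "pass_rate": sum(r.passed for r in results) / len(results),
        "mean_quality": mean(r.quality for r in results),
    }


def compare(task: str, with_memory: Agent, without_memory: Agent,
            n: int = 5) -> Dict[str, Dict[str, float]]:
    """Run the memory configuration and the clean-slate control
    on the identical task and return both summaries side by side."""
    return {
        "with_memory": summarize(run_trials(with_memory, task, n)),
        "without_memory": summarize(run_trials(without_memory, task, n)),
    }
```

Keeping `pass_rate` and `mean_tokens` as separate fields, rather than folding them into one score, is the point: a memory layer that halves token use but lowers the pass rate should show up as exactly that.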
| ▲ | cyanydeez 2 hours ago | parent | prev [-] | |||||||
Every time I see these memory agents, all I can think about is context bloat and poisoning. We know from a different realm that humans have trouble with memories: to "remember" something of significance, the human brain reconstructs the entire experience, which is why memories are so easy to influence. That seems to be what most of these systems are doing: amplifying errors and hallucinations more than anything else. | ||||||||