gkanellopoulos 8 hours ago

Your gradual transition from v1 to v3 is a common pattern I've seen elsewhere. Teams usually start with retrieval, hit recall-quality problems, and then start wondering whether to let the LLM and its context window take over. That's a natural and instinctive move, but imho there are two issues with it. First, the LLM decides what matters, and that decision is irreversible. Second, it doesn't scale well over time: a few months later, when the user asks "What did I ask you to remind me about X 3 months ago?", the summary may well have rotated that detail out by then.
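To make the rotation failure mode concrete, here's a minimal sketch (the class name, the item budget of 3, and the sample facts are all hypothetical): a summary kept under a fixed budget silently evicts its oldest entries, so the 3-months-ago reminder is simply gone when the user asks for it.

```python
from collections import deque

# A rolling summary with a fixed item budget: once the budget is hit,
# the oldest detail is silently dropped. The budget of 3 is illustrative;
# a real LLM-maintained summary has a token budget instead, but the
# failure mode is the same.
class RollingSummary:
    def __init__(self, budget=3):
        self.items = deque(maxlen=budget)  # oldest entries evicted first

    def add(self, fact):
        self.items.append(fact)

    def recall(self, keyword):
        return [f for f in self.items if keyword in f]

s = RollingSummary(budget=3)
s.add("month 1: remind me about the X renewal")
s.add("month 2: prefers morning meetings")
s.add("month 3: working on project Y")
s.add("month 4: learning to surf")  # evicts the month-1 reminder

print(s.recall("X"))  # → [] — the 3-month-old detail is gone
```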

I agree there's a fundamental issue with v1-style retrieval, but in my view it's not the scoring formula: it's that similarity search mixes semantically related data with genuinely valuable data. For example, a memory about "surfing last weekend" and a memory about "wanting to surf in Hawaii one day" will both score high for the question "What outdoor activities do I like?". But for "What did I do last weekend?", only one is useful, while both will appear in the injected context. One way to address this is to add more retrieval dimensions, such as keyword matching (BM25), entity-aware scoring, and temporal signals, and use them to determine which memories are truly relevant to the user's question. This of course adds cost during ingestion, but async ingestion is underrated in general: users expect near-instant responses, while ingestion can afford to be slower.
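A rough sketch of that multi-signal idea (everything here is illustrative: the weights, the 30-day half-life, the token-overlap stand-in for a real BM25 index, and the assumed precomputed cosine similarities):

```python
import math
import time

def score_memory(query_tokens, memory, now,
                 w_kw=0.4, w_sem=0.4, w_time=0.2):
    """Blend keyword overlap, semantic similarity, and recency.
    Weights are illustrative, not tuned."""
    # Keyword signal: fraction of query tokens present in the memory
    # (a crude stand-in for BM25; a real system would use a proper index).
    mem_tokens = set(memory["text"].lower().split())
    kw = sum(1 for t in query_tokens if t in mem_tokens) / max(len(query_tokens), 1)

    # Semantic signal: assumed precomputed cosine similarity in [0, 1].
    sem = memory["cosine_sim"]

    # Temporal signal: exponential decay with a 30-day half-life, so
    # "last weekend" questions favor recent episodic memories.
    age_days = (now - memory["timestamp"]) / 86400
    recency = 0.5 ** (age_days / 30)

    return w_kw * kw + w_sem * sem + w_time * recency

now = time.time()
memories = [
    {"text": "went surfing last weekend", "cosine_sim": 0.82,
     "timestamp": now - 5 * 86400},
    {"text": "wants to surf in Hawaii one day", "cosine_sim": 0.85,
     "timestamp": now - 200 * 86400},
]
q = "what did i do last weekend".split()
ranked = sorted(memories, key=lambda m: score_memory(q, m, now), reverse=True)
print(ranked[0]["text"])  # → went surfing last weekend
```

On pure cosine similarity the Hawaii aspiration would actually rank first (0.85 vs 0.82); the keyword and temporal terms are what flip the ranking toward the episodic memory the question is really about.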

If I may ask, have you done any benchmarking of the v3 approach? It would be interesting to see how a v3-style solution handles factual vs. general-preference questions; that's usually a tricky one for memory systems.

Egeozin 2 hours ago

Thanks! "Similarity search mixing semantically related data with genuinely valuable data" and the cost "adding up during ingestion" are exactly why we moved from v1 to v2 for companion and conversational use cases. In that domain, scratchpad-like systems work well, and there's usually no need to over-engineer retrieval.

I think v3 is categorically different. First, the LLM decides what matters, and we believe that scales better than having the engineer impose too much structure upfront and fail to create the right environment for the model, which was part of v1's limitation. Second, the decision doesn't need to be irreversible if you support it with a simple harness; in our case, git and worktrees. V3 also suits companions and agents that need stronger problem-solving capabilities, such as coding.
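The commenter's harness is git and worktrees; as a minimal stand-in for that idea (class and method names are hypothetical), the sketch below journals every LLM-driven rewrite of the memory so any edit can be rolled back, which is what makes the LLM's "decision about what matters" reversible.

```python
# A minimal stand-in for the git/worktree harness described above:
# every LLM-proposed rewrite of the memory is journaled, so any edit
# can be undone. In the actual system each rewrite would be a git
# commit; this sketch just keeps prior versions in a list.
class VersionedMemory:
    def __init__(self, text=""):
        self.history = [text]  # index 0 is the initial state

    @property
    def current(self):
        return self.history[-1]

    def rewrite(self, new_text):
        """Record an LLM-proposed rewrite as a new version (like a commit)."""
        self.history.append(new_text)

    def revert(self, steps=1):
        """Roll back the last `steps` rewrites, never past the initial state."""
        for _ in range(min(steps, len(self.history) - 1)):
            self.history.pop()

mem = VersionedMemory("user likes surfing")
mem.rewrite("user likes surfing; condensed older notes away")
mem.revert()  # the lossy condensation is undone
print(mem.current)  # → user likes surfing
```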

We plan to publish our benchmarking results soon, so others can evaluate the approach for themselves.