▲ mattmanser 4 days ago | parent | next

Everyone complains that when you compact the context, Claude tends to get stupid, which, as far as I understand it, means summarizing the context with a smaller model. Am I misunderstanding you? The practical experience of most people seems to contradict your results.
▲ NitpickLawyer 4 days ago | parent

One key insight I have from working on this since the early stages of LLMs (before ChatGPT came out) is that the current crop of LLM clients, or "agentic clients", don't log/write/keep track of success over time. It's more of a "shoot and forget" environment right now, and that's why a lot of people are getting vastly different results. Hell, even week to week on the same tasks you get different results (see the recent "Claude getting dumber" drama).

Once we start to see that kind of self-feedback going into the next iterations (with possible training runs between sessions, a "dreaming" stage from og RL, distilling a session, grabbing key insights, storing them, surfacing them at the next inference, etc.), then we'll see true progress in this space.

The problem is that a lot of people work on these things in silos. The industry is much more geared towards quick returns now, having to show something now, rather than building strong foundations based on real data. Kind of an analogy to early Linux dev. We need our own Linus, it would seem :)
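To be concrete, the loop I'm imagining is dead simple; a rough sketch (the file layout, prompt wording and the complete() stub are all made up, not taken from any existing client):

    import json
    from pathlib import Path

    INSIGHTS = Path("insights.jsonl")

    def complete(prompt: str) -> str:
        """Stand-in for whatever model call the client already makes."""
        raise NotImplementedError

    def distill_session(transcript: str, outcome: str) -> None:
        """After a session ends, store a few durable lessons plus the outcome signal."""
        lessons = complete(
            "Summarize 3-5 reusable lessons from this coding session "
            "as short imperative bullet points:\n\n" + transcript
        )
        with INSIGHTS.open("a") as f:
            f.write(json.dumps({"outcome": outcome, "lessons": lessons}) + "\n")

    def surface_insights(limit: int = 20) -> str:
        """Before the next session, pull recent lessons back into the system prompt."""
        if not INSIGHTS.exists():
            return ""
        rows = [json.loads(line) for line in INSIGHTS.read_text().splitlines() if line]
        return "Lessons from previous sessions:\n" + "\n".join(r["lessons"] for r in rows[-limit:])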
▲ ako 4 days ago | parent | next

I've experimented with feature chats: start a new chat for every change, just like a feature branch. At the end of a chat I have it summarize the feature chat and save it as a markdown document in the project, so the knowledge is still available for the next chats. Seems to work well. You can also ask the LLM at the end of a feature chat to prepare a prompt to start the next feature chat, so it can decide what knowledge is important to pass along. Summarizing a chat also helps get rid of wrong info, as you'll often trial-and-error your way towards the right solution. You don't want those incorrect approaches to leak into the context of the next feature chat; maybe just add the "don't dos" to a guidelines and rules document so it avoids them in the future.
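The end-of-chat prompt is nothing fancy; roughly along these lines (the exact wording and the docs/ path are just a sketch, adapt to your project):

    Summarize this chat as a markdown document at docs/features/<feature-name>.md.
    Include: what was built, key design decisions, files touched, and approaches
    we tried that did NOT work (so future chats avoid them). Then write a short
    kick-off prompt for the next feature chat that references that document.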
▲ rufasterisco 4 days ago | parent | next

I too have found that feature chats are a clear winner (as well as a prerequisite for parallelization). In a similar vein, I match GitHub project issues to md files committed to the repo; the GitHub issue content is essentially just a link to the md file in the repo. Epics are folders with links (plus a README that gets updated after each task). I am very happy with it. It's also very fast and handy to reference from Claude using @, i.e. "did you consider what has been done @".

Other major improvements that worked for me:

- DOC_INDEX.md, built around the concept of "read this if you are working on X" (infra, db, frontend, domain, ...)
- COMMON_TASKS.md (if you need to do X, read Y; if you need to add a new frontend component, read HOW_TO_ADD_A_COMPONENT.md)

Common tasks tend to increase in quality when they are expressed in a checklist format.
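For illustration, the index entries are just one-liners like this (the paths are invented, HOW_TO_ADD_A_COMPONENT.md aside):

    # DOC_INDEX.md
    Read this if you are working on...
    - infra:    docs/infra/OVERVIEW.md
    - db:       docs/db/SCHEMA_NOTES.md
    - frontend: docs/frontend/HOW_TO_ADD_A_COMPONENT.md
    - domain:   docs/domain/GLOSSARY.md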
▲ dpkirchner 4 days ago | parent | prev

I ask the bot to come up with a list of "don't dos"/lessons learned based on what went right or required lots of edits. Then I have it merge them into an ongoing list. It works OK.
▲ CuriouslyC 3 days ago | parent | prev | next

The difference between agents and LLMs is that agents are easy to tune online, because unlike LLMs they're 95% systems software: the prompts, the tools, the retrieval system, the information curation/annotation, context injection, etc. I have a project, still in its early stages, that can monitor queries in ClickHouse for agent failures, group/aggregate them into post-mortem classes, then do system parameter optimization on the retrieval/document-annotation system and invoke DSPy on low-efficacy prompts.
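The failure-grouping half is conceptually just an aggregation query; a rough sketch with clickhouse-connect (table and column names here are invented, and real post-mortem classing needs more than a status column):

    import clickhouse_connect

    # Connect to wherever the agent traces land (host/table names are illustrative).
    client = clickhouse_connect.get_client(host="localhost")
    result = client.query("""
        SELECT error_class, count() AS failures
        FROM agent_runs
        WHERE status = 'failure'
          AND run_ts > now() - INTERVAL 7 DAY
        GROUP BY error_class
        ORDER BY failures DESC
        LIMIT 20
    """)
    for error_class, failures in result.result_rows:
        print(error_class, failures)  # feed these classes into the prompt/param optimization step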
▲ troupo 4 days ago | parent | prev

> don't log/write/keep track of success over time.

How do you define success of a model's run?
▲ NitpickLawyer 4 days ago | parent

Lots of ways. You could do a binary thumbs up/down. You could do a feedback session. You could look at signals like acceptance rate (for a PR?) or "how many feedback messages did the user send in this session", and so on. My point was more about tracking these signals over time, and using them to improve the client, not just the model (most model providers probably track this already).
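Even a dumb append-only log written by the client itself would be a start; sketch (field names are invented, the point is recording the same signals over time):

    from dataclasses import dataclass, asdict
    import json, time

    @dataclass
    class SessionSignals:
        session_id: str
        thumbs: int | None        # +1 / -1, or None if the user never rated
        pr_accepted: bool | None  # did the resulting PR get merged?
        feedback_msgs: int        # how many corrections the user had to send
        ended_at: float

    def log_session(sig: SessionSignals, path: str = "session_signals.jsonl") -> None:
        # Append one record per session; trends over time are what matter.
        with open(path, "a") as f:
            f.write(json.dumps(asdict(sig)) + "\n")

    log_session(SessionSignals("abc123", thumbs=1, pr_accepted=True,
                               feedback_msgs=2, ended_at=time.time()))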
▲ troupo 4 days ago | parent

Ah. Yes! My somewhat terse/bitter question was because yesterday Claude kept claiming to have created a "production-ready" solution that was entirely wrong. I would've loved to have the feedback loop you describe.
▲ CuriouslyC 4 days ago | parent | prev | next

There's a large body of research on context pruning/rewriting (I know because I'm knee-deep in benchmarks in release prep for my context compiler). Definitely don't ad-hoc this.
▲ spariev 4 days ago | parent

Care to give some pointers on what to look at? Looks like I will be doing something similar soon, so that would be much appreciated.
▲ CuriouslyC 4 days ago | parent

Just ask ChatGPT about the state of the art in context pruning and other methods of optimizing the context provided to an LLM; it's a good research helper. The right mental model is that it's basically RAG in reverse: instead of trying to select and rank from a data set, you're trying to select and rank from context, given a budget.
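A toy version of that framing, to make it concrete: score what's already in context against the current task and greedily pack under a token budget (score() is a placeholder for whatever relevance signal you use):

    def score(chunk: str, task: str) -> float:
        raise NotImplementedError  # embeddings, recency, a small reranker model...

    def est_tokens(text: str) -> int:
        return len(text) // 4  # crude token estimate

    def pack_context(chunks: list[str], task: str, budget_tokens: int) -> list[str]:
        # Rank existing context by relevance to the task, keep what fits the budget.
        ranked = sorted(chunks, key=lambda c: score(c, task), reverse=True)
        kept, used = [], 0
        for chunk in ranked:
            cost = est_tokens(chunk)
            if used + cost <= budget_tokens:
                kept.append(chunk)
                used += cost
        return kept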
▲ ixsploit 4 days ago | parent | prev

I do something similar, and I get the best results by not keeping a history at all, but setting up the context fresh with every invocation.