simonw 6 hours ago

I really feel this bit:

> With agentic coding, part of what makes the models work today is knowing the mistakes. If you steer it back to an earlier state, you want the tool to remember what went wrong. There is, for lack of a better word, value in failures. As humans we might also benefit from knowing the paths that did not lead us anywhere, but for machines this is critical information. You notice this when you are trying to compress the conversation history. Discarding the paths that led you astray means that the model will try the same mistakes again.

I've been trying to find the best ways to record and publish my coding agent sessions so I can link to them in commit messages, because increasingly the work I do IS those agent sessions.

Claude Code defaults to expiring those records after 30 days! Here's how to turn that off: https://simonwillison.net/2025/Oct/22/claude-code-logs/
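
The short version, per that post: set cleanupPeriodDays in ~/.claude/settings.json to something enormous (the exact value here is arbitrary):

  {
    "cleanupPeriodDays": 999999
  }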

I share most of my coding agent sessions through copying and pasting my terminal session like this: https://gistpreview.github.io/?9b48fd3f8b99a204ba2180af785c8... - via this tool: https://simonwillison.net/2025/Oct/23/claude-code-for-web-vi...

Recently been building new timeline sharing tools that render the session logs directly - here's my Codex CLI one (showing the transcript from when I built it): https://tools.simonwillison.net/codex-timeline?url=https%3A%...

And my similar tool for Claude Code: https://tools.simonwillison.net/claude-code-timeline?url=htt...

What I really want is first-class support for this from the coding agent tools themselves. Give me a "share a link to this session" button!

vunderba 4 hours ago | parent | next [-]

When I find myself hammering an LLM and it keeps veering down unproductive paths - trying poor solutions or applying fixes that make no difference - before we eventually arrive at the correct answer, the result is often a massive 100+ KB running context.

To help mitigate this in the future I'll often prompt:

  “Why did it take so long to arrive at the solution? What did you do wrong?”
Then I follow up with:

  “In a single paragraph, describe the category of problem and a recommended approach for diagnosing and solving it in the future.”
I then add this summary either to the relevant MD file (CHANGING_CSS_LAYOUTS.md, DATA_PERSISTENCE.md, etc.) or, more generally, to the DISCOVERIES.md file, which is linked from my CLAUDE.md under:

  - When resolving challenging directives, refresh yourself with: docs/DISCOVERIES.md - it contains useful lessons learned and discoveries made during development.
I don't think linking to an entire commit full of errors/failures is necessarily a good idea - feels like it would quickly lead to the proverbial poisoning of the well.
itsgrimetime 44 minutes ago | parent | next [-]

Yep - this has worked well for me too. I do it a little differently:

I have a /review-sessions command & a "parse-sessions" skill that tells Claude how to parse the session logs from ~/.claude/projects/, then it classifies the issues and proposes new skills, changes to CLAUDE.md, etc. based on what common issues it saw.
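
The parsing step is less magic than it sounds. A minimal sketch (hypothetical - it assumes each JSONL line is an object with a .type field and a .message.content payload, which is roughly the shape of the logs I've seen):

  # print the plain-text user messages from every session log
  for f in ~/.claude/projects/*/*.jsonl; do
    jq -r 'select(.type == "user")
           | .message.content
           | if type == "string" then . else empty end' "$f"
  done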

I've tried something similar to DISCOVERIES.md (a structured "knowledge base" of assumptions that were proven wrong, things that were tried, etc.) but haven't had luck keeping it from filling up with obvious things (that the code itself describes) or slightly-incorrect things, or from just getting too large in general.

johnsmith1840 4 hours ago | parent | prev [-]

When you get stuck in a loop it's best to roll all the code back to a point where it didn't have problems. If you continue debugging in that hammering failure loop you get TONS of random future bugs.

anamexis 2 hours ago | parent [-]

I've had good luck doing something like this first (but more specific to the issue at hand):

We are getting stuck in an unproductive loop. I am going to discard all of this work and start over from scratch. Write a prompt for a new coding assistant to accomplish this task, noting what pitfalls to avoid.

YesBox 3 hours ago | parent | prev | next [-]

Over time, do you think this process could lock you into an inflexible state?

I'm reminded of the trade-off between automation and manual work. Automation crystallizes process, and thus the system as a whole loses its ability to adapt in a dynamic environment.

simonw 3 hours ago | parent [-]

Nothing about this feels inflexible to me at the moment - I'm evolving the way I use these tools on a daily basis, constantly discovering new tricks that work.

Just this morning I found out that I can tell Claude Code how to use my shot-scraper CLI tool to debug JavaScript and it will start doing exactly that:

  you can run javascript against the page using:
  shot-scraper javascript /tmp/output.html \
  'document.body.innerHTML.slice(0, 100)'
  - try that
Transcript: https://gistpreview.github.io/?1d5f524616bef403cdde4bc92da5b... - background: https://simonwillison.net/2025/Dec/22/claude-chrome-cloudfla...
CuriouslyC 4 hours ago | parent | prev | next [-]

You can export all agent traces to OTel, either directly or via output logging. Then just dump it all into ClickHouse with metadata such as repo, git user, cwd, etc.

You can do evals and give agents long term memory with the exact same infrastructure a lot of people already have to manage ops. No need to retool, just use what's available properly.
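
For Claude Code that's mostly environment variables - something like this, though treat the exact names as from-memory and double-check the telemetry docs before relying on them:

  export CLAUDE_CODE_ENABLE_TELEMETRY=1
  export OTEL_METRICS_EXPORTER=otlp
  export OTEL_LOGS_EXPORTER=otlp
  export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317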

btown 4 hours ago | parent [-]

With great love to your comment, this has the same vibes as the infamous 2007 Dropbox comment: https://news.ycombinator.com/item?id=9224

I'd also argue that the context for an agent message is not the commit/release for the codebase on which it was run, but often a commit/release that is yet to be set up. So there's a bit of apples-to-oranges in terms of release tagging for the log/trace.

It's a really interesting problem to solve, because you could in theory try to retroactively find which LLM session, potentially from days prior, matches a commit that just hit a central repository. You could automatically connect the LLM session to the PR that incorporated the resulting code.

Though, might this discourage developers from openly iterating with their LLM agent, if there's a panopticon around their whole back-and-forth with the agent?

Someone can, and should, create a plug-and-play system here with the right permission model that empowers everyone, including the Programmer-Archaeologists (to borrow shamelessly from Vernor Vinge) who are brought in to "un-vibe the vibe code" and benefit from understanding the context and evolution.

But I don't think that "just dump it in clickhouse" is a viable solution for most folks out there, even if they have the infrastructure and experience with OTel stacks.

CuriouslyC 3 hours ago | parent [-]

I get where you're coming from, having wrestled with Codex/CC to get it to actually emit everything needed to even do proper evals.

From a "correct solution" standpoint, having one source of truth for evals, agent memory, prompt history, etc. is the right path. We already have the infra to do it well, we just need to smooth out the path. The thing that bugs me is people inventing half-solutions that seem rooted in ignorance or in the desire to "capture" users, and seeing those solutions get traction/mindshare.

NeutralForest 6 hours ago | parent | prev | next [-]

I think we already have the tools but not the communication between them? Instead of having actions taken and failures as commit messages, you should have wide-event-style logs with all the context: failures, tools used, steps taken... Those logs could be used as checkpoints to go back to as well, and you could refer back to the specific action ID you walked back to when encountering an error.

In turn, this could all be plain-text and be made accessible, through version control in a repo or in a central logging platform.
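
Something like one JSON object per action, with the shape invented here purely for illustration:

  {"action_id": "a41f", "parent_id": "9c02", "step": "edit src/db.py",
   "tools": ["edit_file", "pytest"], "outcome": "failure",
   "error": "IntegrityError on unique constraint"}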

pigpop 5 hours ago | parent [-]

I'm currently experimenting with doing this through documentation and project planning. Two core practices I use are a docs/roadmap/ directory with an ordered list of milestone documents and a docs/retros/ directory with dated retrospectives for each session. I'm considering adding architectural decision records as a dedicated space for documenting how things evolve. The quote from the article could be handled by the ADRs if they included notes on alternatives that were tried and why they didn't work as part of the justification for the decision that was made.
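
Concretely the layout is something like this (file names illustrative):

  docs/
    roadmap/01-persistence.md     # ordered milestone documents
    retros/2025-12-20.md          # dated per-session retrospectives
    adr/0004-switch-to-jsonl.md   # decision records (still considering these)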

The trouble with this quickly becomes finding the right ones to include in the current working session. For milestones and retros it's simple: include the current milestone and the last X retros that are relevant, but even then you may sometimes want specific information from older retros. With ADR documents you'd have to find the relevant ones somehow, and the same goes for any other additional documentation that gets added.

There is clearly a need for some standardization, and for learning which techniques work best, as well as potential for building a system that makes it easy for both you and the LLM to find the correct information for the current task.

neutronicus 4 hours ago | parent | prev | next [-]

Emacs gptel just produces md or org files.

Of course the agentic capabilities are very much on a roll-your-own-in-elisp basis.

karthink 3 hours ago | parent [-]

> agentic capabilities are very much on a roll-your-own-in-elisp basis

I use gptel-agent[1] when I want agentic capabilities. It includes tools and supports sub-agents, but I haven't added support for Claude skills folders yet. Rolling back the chat is trivial (just move up or modify the chat buffer); rolling back changes to files needs some work.

[1] https://github.com/karthink/gptel-agent

stacktraceyo 6 hours ago | parent | prev | next [-]

I'd like to make something like this but running in the background, so I can better search my history of sessions - basically start creating my own knowledge base of sorts.

simonw 5 hours ago | parent [-]

Running "rg" in your ~/.claude/ directory is a good starting point, but it's pretty inconvenient without a nicer UI for viewing the results.

the_mitsuhiko 5 hours ago | parent | next [-]

Amp represents threads in the UI, and an agent can search and reference its own history - that's how the handoff feature works, for instance. It's an interesting system and I quite like it, but because it's not integrated into either GitHub or git, it is sufficiently awkward that I don't leverage it enough.

simonw 5 hours ago | parent | prev [-]

... this inspired me to try using a "rg --pre" script to help reformat my JSONL sessions for a better experience. This prototype seems to work reasonably well: https://gist.github.com/simonw/b34ab140438d8ffd9a8b0fd1f8b5a...

Use it like this:

  cd ~/.claude/projects
  rg --pre cc_pre.py 'search term here'
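The contract for a --pre script is simple: rg runs it with the file path as its argument and searches whatever it prints to stdout. A stripped-down sketch of the idea (not the actual script from the gist):

  #!/bin/sh
  # flatten each JSONL record to a searchable "timestamp type content" line
  jq -r '[(.timestamp // ""), (.type // ""),
          ((.message.content // "") | if type == "string" then . else tojson end)]
         | @tsv' "$1" 2>/dev/null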
agumonkey 3 hours ago | parent | prev | next [-]

there's some research into context layering so you can split / reuse previous chunks of context

ps: your context log apps are very very fun

ashot 3 hours ago | parent | prev | next [-]

Check out codecast.sh

kgwxd 5 hours ago | parent | prev | next [-]

> There is, for lack of a better word, value in failures

Learning? Isn't that what these things are supposedly doing?

simonw 4 hours ago | parent | next [-]

LLMs notoriously don't learn anything - they reset to a blank slate every time you start a new conversation.

If you want them to learn you have to actively set them up for it. The simplest mechanism is to use a coding agent tool like Claude Code and frequently remind it to make notes for itself, to look at its own commit history, or to search for examples in the codebase available to it.
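
For example, a standing instruction in CLAUDE.md along these lines (wording and file name are just a sketch):

  - When a bug takes more than a couple of attempts to fix, append a short
    note on the root cause and the fix to notes/lessons.md, and read that
    file before debugging anything similar.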

the_mitsuhiko 5 hours ago | parent | prev | next [-]

If by "these things" you mean large language models: they are not learning. Famously so, that's part of the problem.

mock-possum 3 hours ago | parent | prev [-]

No, we’re the ones who are learning.

There’s some utility to instructing them to ‘remember’ via writing to CLAUDE.md or similar, and instructing them to ‘recall’ by reading what they wrote later.

But they'll rarely if ever do it on their own.

0_____0 5 hours ago | parent | prev [-]

"all my losses is lessons"