tlarkworthy 4 days ago

Recently tried out the new GEPA algorithm for prompt evolution with great results. I think using LLMs to write their own prompts and analyze their own trajectories is pretty neat once appropriate guardrails are in place.

https://arxiv.org/abs/2507.19457

https://observablehq.com/@tomlarkworthy/gepa

I guess GEPA is still a preprint and predates this survey, but I recommend taking a look due to its simplicity.
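
The core loop is simple enough to sketch. This is my paraphrase of the paper, not the authors' code; all the helpers (evaluate, run, paretoSample, reflect, sample, mean) are hypothetical stand-ins:

    // Sketch of GEPA's reflective mutation loop (paraphrased from the paper).
    async function gepa(seedPrompt, tasks, budget) {
      const pool = [{prompt: seedPrompt, scores: await evaluate(seedPrompt, tasks)}];
      for (let i = 0; i < budget; i++) {
        const parent = paretoSample(pool);          // favour prompts that win on at least one task
        const batch = sample(tasks, 3);             // small minibatch keeps rollouts cheap
        const traces = await run(parent.prompt, batch);      // keep full traces, not just scores
        const child = await reflect(parent.prompt, traces);  // LLM critiques traces, rewrites prompt
        if (mean(await evaluate(child, batch)) > mean(await evaluate(parent.prompt, batch))) {
          pool.push({prompt: child, scores: await evaluate(child, tasks)});
        }
      }
      return pool; // Pareto frontier of candidate prompts
    }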

LakshyAAAgrawal 3 days ago | parent | next [-]

Dear Tom, Thanks a lot for trying out GEPA and writing about your experience in the blog!

koakuma-chan 4 days ago | parent | prev [-]

Do you mind sharing which tasks you achieved great results on?

tlarkworthy 4 days ago | parent [-]

It's all written up and linked in the notebook, and it's executable in your browser (if you dare to insert your OPEN_AI_KEY; my results are included assuming you won't).

The evals were Observable notebook coding challenges, simple things like creating a dropdown, but to solve them you need to know the Observable standard library and some of its unique syntax, like "viewof".
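
For a flavour of what the prompt has to encode: a dropdown in Observable is a pair of cells using "viewof" and the built-in Inputs library, something like (the second cell just reads the value):

    viewof colour = Inputs.select(["red", "green", "blue"], {label: "Colour"})
    md`You picked **${colour}**.`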

There is a table of the cases here https://observablehq.com/@tomlarkworthy/robocoop-eval#cell-2...

So it's important that the prompt encodes enough of the programming model. The seed prompt did not, but the reflect function managed to figure it all out. At the top of the notebook is the final optimized prompt, which did a fair bit of research via web search to figure out the programming model.

hnuser123456 4 days ago | parent [-]

Thanks for the writeup. I wonder if it would be plausible to run this kind of self-optimization over a wider variety of problem sets, to generate optimized "context pathways" for various tasks, and maybe even learn patterns from multiple prompt optimizations that generalize.
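
Purely as a sketch of what I mean (reusing the hypothetical gepa() from the comment above; suites and reflectAcross are made up):

    // Optimize per task suite, then mine the winners for shared patterns.
    const optimized = await Promise.all(
      suites.map(suite => gepa(seedPrompt, suite.tasks, budget))
    );
    const sharedLessons = await reflectAcross(optimized); // LLM extracts recurring advice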

tlarkworthy 4 days ago | parent [-]

The prompt I would like to optimize is the reflection prompt:

    `You are a prompt-engineer AI. You will be improving the performance of a prompt by considering recent executions of that prompt against a variety of tasks that were asked by a user. You need to look for ways to improve the SCORE by considering recent executions using that prompt and doing web research on the domain.

    Your task is to improve the CURRENT PROMPT. You will be given traces of several TASKS using the CURRENT PROMPT and then respond only with the text of the improved prompt using the improve_prompt tool`;
    const research_msg = `Generate some ideas on how this prompt might be improved, perhaps using web research\nCURRENT PROMPT:\n${prompt}\n${trace}`;

source: https://observablehq.com/@tomlarkworthy/gepa#reflectFn

but I would need quite a few distinct tasks to do that, and task setup is the laborious part (it's getting quicker now that I've optimized the notebook coding agent).
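
For reference, the reflection step boils down to one chat call. A simplified sketch (the real cell is in the notebook; the model name and message shapes here are illustrative, and the real version returns the new prompt via the improve_prompt tool call rather than plain text):

    // Simplified reflection call: SYSTEM is the prompt-engineer prompt above.
    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${OPEN_AI_KEY}`
      },
      body: JSON.stringify({
        model: "gpt-4o",
        messages: [
          {role: "system", content: SYSTEM},
          {role: "user", content: research_msg}
        ]
      })
    });
    const improved = (await res.json()).choices[0].message.content;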