crustycoder 3 days ago

He's also missed a major step, which is to feed the skill back into the LLM and ask it to critique it - after all, it's the LLM that's going to act on it, so asking it to assess the skill first is kinda important. I've done that for his skills; here's the assessment:

==========

  Bottom line
  Against the agentskills.io guidance, they look more like workflow specs than polished agent skills.
  The largest gap is not correctness. It is skill design discipline:

  # stronger descriptions,
  # lighter defaults,
  # less mandatory process,
  # better degraded-mode handling,
  # clearer evidence that the skills were refined through trigger/output evals.

  Skill           Score/10
  write-a-prd          5.4
  prd-to-issues        6.8
  issues-to-tasks      6.0
  code-review          7.6
  final-audit          6.3
==========

LLM metaprogramming is extremely important. I've just finished an LLM-assisted design-doc authoring session where the LLM's own recommendation was: "Don't use an LLM for that part, it won't be reliable enough."

fergonco 3 days ago | parent | next [-]

> "Don't use a LLM for that part, it won't be reliable enough".

You should now ask if the LLM is reliable enough when it says that.

Jokes aside, how is this a major step he is missing? He is using those skills to be more efficient. How important is going against agentskills.io guidance?

crustycoder 3 days ago | parent [-]

Because he's asking the LLM to interpret those instructions to drive his process. If the skills are poorly defined or incomplete then the process will be as well, and the LLM may misinterpret, choose to ignore, or add its own parts.

Skills are just another kind of programming, albeit at a pretty abstract level. A good initial review process for a Skill is to ask the LLM what it thinks the Skill means and where it thinks there are holes. Just writing it and then running it isn't sufficient.

Another tip is to give the Skill the same input in multiple new sessions - to stop state carryover - collect the output from each session and then feed it back into the LLM and ask it to assess where and why the output was different.
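That multi-session consistency check can be sketched in a few lines. This is a minimal illustration, not the commenter's actual tooling: `run_skill` is a hypothetical placeholder you would replace with a real API call that starts a fresh session per run, and the surrounding logic just collects outputs and diffs each run against the first.

```python
import difflib

def run_skill(skill_input: str, session: int) -> str:
    # Hypothetical placeholder: swap in a real call that starts a
    # FRESH session each time, so no state carries over between runs.
    return f"output for: {skill_input}"

def compare_sessions(skill_input: str, runs: int = 5) -> list[str]:
    """Collect one output per fresh session, then report a unified
    diff of each later run against the first run. An empty diff
    string means the two runs produced identical output."""
    outputs = [run_skill(skill_input, session=i) for i in range(runs)]
    diffs = []
    for i, out in enumerate(outputs[1:], start=2):
        delta = "\n".join(difflib.unified_diff(
            outputs[0].splitlines(), out.splitlines(),
            fromfile="session 1", tofile=f"session {i}", lineterm=""))
        diffs.append(delta)
    return diffs
```

The non-empty diffs are what you would feed back to the LLM and ask it to explain: where did the outputs diverge, and which ambiguity in the skill text caused it.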

hansmayer 3 days ago | parent [-]

Oh dear, I thought you were merely being sarcastic in your first comment. But you seem to have been fully converted to the LLM religion - do you actually believe they "think" or "know" anything?

crustycoder 2 days ago | parent [-]

People have applied "think" to the actions of software for decades. Of course LLMs don't "think" in the human sense, but "what the output of the model indicates, in an approximate way, about its current internal state" is a bit long-winded...

hansmayer 2 days ago | parent [-]

Maybe people who don't understand technology did - I can see that; my grandpa also thought the computer was thinking when the Windows hourglass showed up. Today maybe it's the case again with the folks who don't know anything about it - you know that meme: ChatGPT always gives me correct answers in the domains where I'm not an expert!

mpalmer 3 days ago | parent | prev | next [-]

LLMs do not have special or unique insight into how best to prompt them. Not in the slightest.

https://aphyr.com/posts/411-the-future-of-everything-is-lies...

crustycoder 3 days ago | parent [-]

"Not in the slightest" is an overreach. The paper two levels down from that link doesn't really support the conclusion in the blog post - the paper is much more nuanced.

Are they going to fib to you sometimes? Yes of course, but that doesn't mean there's no value in behavioural metaqueries.

Like most new tech, the discussion tends to polarise into "Best thing evah!" and "Utter shite!" The truth is somewhere in between.

mpalmer 3 days ago | parent | next [-]

You're retreating from your position. You started at "major step" and "extremely important", and you've arrived at "there's not no value".

crustycoder 2 days ago | parent [-]

Picking phrases from what I said and deliberately misquoting them out of context does not make you right.

mpalmer 2 days ago | parent [-]

How exactly did I misquote you?

crustycoder 2 days ago | parent [-]

Go figure it out, it will be a useful challenge for you.

hansmayer 3 days ago | parent | prev [-]

> Like most new tech

It's nothing like "most new tech". Most new tech tends to be adopted early by young people and experienced techies. In this case it is mostly the opposite: The teens absolutely hate it, probably because the shitty AI content does not inspire the young mind, and the experienced techies see it for what it is. I've never seen such "new tech" which was cheered on by the proverbial average "boomers" (i.e. old people doing "office jobs", not the literal age bracket) and despised by the young folks and experienced experts of all ages.

alchemism 2 days ago | parent | next [-]

Judging from Claude Code and the sheer number of “Make Your Favorite Anime Crush Into An AI” SaaSes on the market, I’d posit that both the young and experienced are quite enthusiastic about the new tech.

hansmayer 2 days ago | parent [-]

If you had kids, or friends and family with kids, you wouldn't be making false conclusions based on some weird proxy "metric".

crustycoder 2 days ago | parent | prev [-]

You clearly missed the "The truth is somewhere in between" bit.

hansmayer 2 days ago | parent [-]

No mate, this tech is marketed as superintelligence. A nation of PhDs in a datacenter. Yadda, yadda, yadda. No in-betweens, please. Why is it not delivering after so many years and hundreds of billions in investment?

crustycoder 2 days ago | parent [-]

Name me a new bit of tech that hasn't been hyped beyond reasonable bounds. And yes, this is one of the worst examples. But saying it doesn't have its uses isn't reasonable either.

hansmayer 2 days ago | parent [-]

Nothing was ever hyped like this before. What are you talking about? The Mac was about "it just works" (and it f*ing did), the iPhone was "a phone, an iPod and an Internet access device". Need more? Microsoft Excel - actually more powerful, if you know the tool, than the bullshit machine. C#, the programming language: "Java done right". And it bloody was! What they have in common: none of these techs were hyped beyond reason. They were hyped a bit, but not to the level of the LLM bullshit. And none of them claimed to do incredible stuff only to underdeliver. After so much money burnt, yes, I want to see that nation of PhDs. I want to see AI "writing all the code" in six months (Anthropic claimed this in January this year). Enough of the bullshit, and of people being told they're stupid for not knowing how to work the lottery machine. Show me the superintelligence or shut the f. up.

swingboy 3 days ago | parent | prev | next [-]

Do these scores actually mean anything? Isn’t the LLM just making up something? If you ran the exact same prompt through 10 times would you get those same scores every single time?

grey-area 3 days ago | parent | next [-]

Yes, I'd be interested in that answer too - these scores are most likely generated in an arbitrary way, given how LLMs work. Given how they generate text, the model didn't actually keep a running score and add to it each time it found a plus point in the skill, the way a human evaluator might.

At this point I'd discount most advice given by people using LLMs, because most of them don't recognise the inadequacies and failure modes of these machines (like the OP here) and just assume that because output is superficially convincing it is correct and based on something.

Do these skills meaningfully improve performance? Should we even need them when interacting with LLMs?

crustycoder 3 days ago | parent [-]

They aren't arbitrary; as I said earlier, I got the LLM to do a detailed analysis first, then summarise. If I was doing this "properly" for something of my own, I'd go through the LLM's summary point by point, challenge anything I didn't think was right, and fix things in the skill where I thought the critique was correct.

You aren't going to have much success with LLMs if you don't understand that their primary goal is to produce plausible and coherent responses rather than ones that are necessarily correct (although they may be - hopefully).

And yes, Skills *do* make a significant difference to performance, in exactly the same way that well-written prompts do - because that's all they really are. If you just throw something at an LLM and tell it "do something with this", it will, but it probably won't be what you want, and it will probably be different each time you ask.

https://agentskills.io/home

hansmayer 3 days ago | parent | next [-]

> They aren't arbitrary, as I said earlier I got the LLM to de a detailed analysis first, then summarise

I think you still owe us an explanation as to how the score is constructed...

crustycoder 2 days ago | parent | next [-]

I don't owe you anything. If you want to go find out, go do it yourself.

You could even ask an LLM to help you, if you like...

hansmayer 2 days ago | parent [-]

> You could even ask a LLM to help you if you,like...

Attempt at humour?

bdangubic 3 days ago | parent | prev [-]

   random_decimal(0,10);
hansmayer 3 days ago | parent [-]

Yeah, that's what I imagine too :) . But if they used floats, would it score 9.11 higher than 9.9? :)
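The joke references a well-known LLM failure mode: reading "9.11" as if it were a version number rather than a decimal. A two-line sketch of why the orderings disagree (the `as_version` helper is just for illustration):

```python
# As decimal numbers, 9.11 is less than 9.9.
assert 9.11 < 9.9

# Read as version components ("9.11" = major 9, minor 11),
# the order flips - the reading models sometimes slip into.
def as_version(s: str) -> tuple:
    return tuple(int(part) for part in s.split("."))

assert as_version("9.11") > as_version("9.9")
```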

grey-area 3 days ago | parent | prev [-]

It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.

I found the summary above devoid of useful advice, what did you see as useful advice in it?

> if you don't understand that their primary goal is to produce plausible and coherent responses rather than ones that are necessarily correct (although they may be - hopefully).

If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.

crustycoder 3 days ago | parent [-]

> It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.

So go repeat the exercise yourself. I've already said this was a short-enough-to-post rollup of a much longer LLM assessment of the skills and that while most of the points were fair, some were questionable. If you were doing this "for real" you'd need to assess the full response point-by-point and decide which ones were valid.

> If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.

What on earth are you on about? The whole point of the sentence you were replying to was that you can't blindly trust what comes out of them.

grey-area 3 days ago | parent [-]

I'm saying that your agreement that they produce plausible but sometimes false text is contradicted by the trust you seem to have in their output and self-analysis, which is plausible but unlikely to be correct.

crustycoder 2 days ago | parent [-]

Yes, of course there's a risk it may still be incorrect, but querying the LLM through the limited facilities it provides for introspection is more likely to have at least some connection with the facts than the alternative some people use, which is to simply guess at why it produced the output it did.

If you have an alternative approach, please share.

crustycoder 3 days ago | parent | prev [-]

No, of course you wouldn't, because LLMs are nondeterministic. But the scores would likely be in the same ballpark. The scores I posted are the result of a much more detailed analysis by the LLM, which was far too long to post. I eyeballed it; most of the points seemed fair, so I asked it to summarise and convert them into scores.
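The "same ballpark" claim is checkable: re-run the assessment in several fresh sessions and look at the spread of each skill's score. A minimal sketch, with hypothetical illustrative numbers rather than real measurements:

```python
import statistics

def score_stability(runs: dict) -> dict:
    """For each skill, the mean and spread (max - min) of its scores
    across repeated runs. A small spread relative to the 10-point
    scale suggests the ranking is stable even if individual scores
    wobble; a large spread suggests the number is mostly noise."""
    return {skill: (statistics.mean(scores), max(scores) - min(scores))
            for skill, scores in runs.items()}

# Hypothetical re-run scores - illustrative only, not real data.
runs = {
    "write-a-prd": [5.4, 5.0, 5.8, 5.2],
    "code-review": [7.6, 7.2, 7.9, 7.4],
}
stats = score_stability(runs)
```

If the spreads stay under a point or so while the ordering of skills holds, the scores carry some signal; if they swing by several points, the `random_decimal(0,10)` quip above is closer to the truth.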

jddj 3 days ago | parent | prev | next [-]

Is your premise here that LLMs have a unique or enhanced insight into how LLMs work best?

crustycoder 3 days ago | parent | next [-]

I wouldn't go that far, but the only way I've found so far of getting a reasonable insight into why an LLM has chosen to do something is to ask it.

alexwebb2 3 days ago | parent | prev [-]

Not OP but I’d back that assertion.

When the model that’s interpreting it is the same model that’s going to be executing it, they share the same latent space state at the outset.

So this is essentially asking whether models are able to answer questions about context they’re given, and of course the answer is yes.

didgeoridoo 3 days ago | parent [-]

There is no evidence of this. Evals are quite different from "self-evals". The only robust way of determining if LLM instructions are "good" is to run them through the intended model lots of times and see if you consistently get the result you want. Asking the model if the instructions are good shows a very deep misunderstanding of how LLMs work.

alexwebb2 2 days ago | parent | next [-]

You're misunderstanding my assertion.

When you give prompt P to model M, when your goal is for the model to actually execute those instructions, the model will be in state S.

When you give the same prompt to the same model, when your goal is for the model to introspect on those instructions, the model is still in state S. It's the exact same input, and therefore the exact same model state as the starting point.

Introspection-mode state only diverges from execution-mode state at the point at which you subsequently give it an introspection command.

At that point, asking the model to e.g. note any ambiguities about the task at hand is exactly equivalent to asking it to evaluate any input, and there is overwhelming evidence that frontier models do this very well, and have for some time.

Asking the model, while it's in state S, to introspect and surface any points of confusion or ambiguities it's experiencing about what it's being asked to do, is an extremely valuable part of the prompt engineering toolkit.

I didn't, and don't, assert that "asking the model if the instructions are good" is a replacement for evals – that's a strawman argument you seem to be constructing on your own and misattributing to me.

mpalmer 2 days ago | parent | next [-]

    At that point, asking the model to e.g. note any ambiguities about the task at hand is exactly equivalent to asking it to evaluate any input
This point is load-bearing for your position, and it is completely wrong.

Prompt P at state S leads to a new state SP'. The "common jumping off point" you describe is effectively useless, because we instantly diverge from it by using different prompts.

And even if it weren't useless for that reason, LLMs don't "query" their "state" in the way that humans reflect on their state of mind.

The idea that hallucinations are somehow less likely because you're asking meta-questions about LLM output is completely without basis

alexwebb2 2 days ago | parent [-]

> The idea that hallucinations are somehow less likely because you're asking meta-questions about LLM output is completely without basis

Not sure who you're replying to here – this is not a claim I made.

mpalmer 2 days ago | parent [-]

That's fair, but I'm not sure why you chose to address the one part of my comment that isn't responsive to your points.

crustycoder 2 days ago | parent | prev [-]

Nicely put. I haven't seen anyone say that the introspection abilities of LLMs are up to much, but claiming that it's completely impossible to get a glimpse behind the curtain is untrue.

crustycoder 3 days ago | parent | prev [-]

Is that based on your "deep understanding" of how LLMs work or have you actually tried it? If you watch the execution trace of a Skill in action, you can see that it's doing exactly this inspection when the skill runs - how could it possibly work any other way?

Skills are just textual instructions, and LLMs are perfectly capable of spotting inconsistencies, gaps and contradictions in them. Is that sufficient to create a good skill? No, of course not - you need to actually test them. To use an analogy, asking an LLM to critique a skill is like running lint on C code first to pick up egregious problems; running test cases is still vital.

hansmayer 3 days ago | parent [-]

> you can see that it's doing exactly this inspection when the skill runs

I mean, how do you know what it actually does? Because of the text it outputs?

crustycoder 2 days ago | parent [-]

"exactly this inspection" != "what does it exactly do"

hansmayer 2 days ago | parent [-]

Please read your own sentence again, because you literally said the opposite.

crustycoder 2 days ago | parent [-]

I'd tell you to read it again, but you seem to be struggling.

hansmayer 2 days ago | parent [-]

Did I write this: "you can see that it's doing exactly this inspection when the skill runs" ?

So, yeah - read what you wrote again.

hansmayer 3 days ago | parent | prev | next [-]

You gotta love the randomly assigned scores, as if an LLM were actually able to measure anything. But then again, we now call a blob of text a "skill", so I guess it matches the overall bullshit pattern.

grey-area 3 days ago | parent | prev | next [-]

What does this even mean? It looks like typical LLM bloviation to me: 'skill design discipline', 'stronger descriptions' and 'lighter defaults'??!? This is meaningless pablum masquerading as advice.

What specifically would this cause you to actually do to improve the skills in question? How would you measure that improvement in a non hand-wavy way? What do these scores mean and how were they calculated?

Or perhaps you would ask your LLM how it would improve these skills? It will of course come up with some changes, but are they the right changes, and how would you know?

hansmayer 3 days ago | parent | next [-]

Great points, but I imagine it's a bit too heavy on the rigour requirement for the LLM crowd. The folks are high on this stuff, and I'm beginning to notice it's like trying to get a heavy pothead or crackhead off their stuff. Don't you see - if you just wave your hands a lot and tell the LLM to be serious about it, the scores will just appear :) It's true in their own frame of reference.

crustycoder 2 days ago | parent | prev | next [-]

I'm not going to repeat myself, I've already explained the context to you - funny how you seem to have ignored that. If you want to find out, do the experiment yourself.

skydhash 3 days ago | parent | prev [-]

It’s all vibes based, we are not trying to be scientific here. /s

I discard most LLM advice and skills because either a script is better (as the work is routine enough) or it could be expressed better with bullet points (generating tickets).

threecheese 2 days ago | parent | prev [-]

Go even further: add this to the skill-creator skill and let the agent improve the skill regularly. I do this, and for determinism I have my skills try to identify steps which can be scripted.