grey-area 3 days ago

Yes, I'd be interested in that answer too - these scores are most likely generated in an arbitrary way. Given how LLMs generate text, the model didn't actually keep a running score and add to it each time it found a plus point in the skill, the way a human evaluator might.

At this point I'd discount most advice given by people using LLMs, because most of them don't recognise the inadequacies and failure modes of these machines (like the OP here) and just assume that because the output is superficially convincing, it is correct and grounded in something.

Do these skills meaningfully improve performance? Should we even need them when interacting with LLMs?

crustycoder 3 days ago | parent [-]

They aren't arbitrary; as I said earlier, I got the LLM to do a detailed analysis first, then summarise. If I were doing this "properly" for something of my own, I'd go through the LLM summary point by point, challenge anything I didn't think was right, and fix things in the skill where I thought the criticism was correct.

You aren't going to have much success with LLMs if you don't understand that their primary goal is to produce plausible and coherent responses rather than necessarily correct ones (although they may be - hopefully).

And yes, Skills *do* make a significant difference to performance, in exactly the same way that well-written prompts do - because that's all they really are. If you just throw something at an LLM and tell it "do something with this", it will, but the result probably won't be what you want and it will probably be different each time you ask.

https://agentskills.io/home
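For readers unfamiliar with the format: a skill in this ecosystem is essentially a reusable prompt packaged as a markdown file with a small metadata header. A minimal hypothetical example in the SKILL.md style (the name, description, and instructions here are purely illustrative, not from any real skill):

```markdown
---
name: code-review
description: Reviews a diff for correctness, style, and missing tests.
---

When asked to review code:

1. Read the full diff before commenting on any part of it.
2. Flag correctness issues first, style issues second.
3. For each issue, quote the offending line and suggest a concrete fix.
4. Note any changed behaviour that has no corresponding test.
```

The body is just prompt text the model loads when the skill is invoked - which is why the quality bar is the same as for any well-written prompt.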

hansmayer 3 days ago | parent | next [-]

> They aren't arbitrary, as I said earlier I got the LLM to do a detailed analysis first, then summarise

I think you still owe us an explanation as to how the score is constructed...

crustycoder 2 days ago | parent | next [-]

I don't owe you anything. If you want to go find out, go do it yourself.

You could even ask a LLM to help you if you like...

hansmayer 2 days ago | parent [-]

> You could even ask a LLM to help you if you like...

Attempt at humour?

bdangubic 3 days ago | parent | prev [-]

   random_decimal(0,10);
hansmayer 3 days ago | parent [-]

Yeah, I imagine so too :) . But if they used floats, would it score 9.11 higher than 9.9? :)
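The joke has a real kernel: 9.11 beats 9.9 only under version-number semantics, not float semantics, which is one plausible account of the famous LLM failure. A quick Python sketch of the two readings (illustrative only; `version_compare` is a made-up helper, not anything a model actually runs):

```python
def version_compare(a: str, b: str) -> int:
    """Compare dotted strings as version numbers: -1, 0, or 1.

    Each dot-separated component is compared as an integer,
    so "9.11" > "9.9" because 11 > 9.
    """
    pa = [int(x) for x in a.split(".")]
    pb = [int(x) for x in b.split(".")]
    return (pa > pb) - (pa < pb)


# As floats, 9.9 is the larger number.
print(9.11 < 9.9)                     # True

# As versions, 9.11 comes after 9.9.
print(version_compare("9.11", "9.9"))  # 1
```

Which reading the model lands on depends on context it can't reliably disambiguate - exactly the sort of failure mode being discussed upthread.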

grey-area 3 days ago | parent | prev [-]

It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.

I found the summary above devoid of useful advice - what did you see as useful advice in it?

> if you don't understand that their primary goal is to produce plausible and coherent responses rather than ones that are necessarily correct (although they may be - hopefully).

If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.

crustycoder 3 days ago | parent [-]

> It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.

So go repeat the exercise yourself. I've already said this was a short-enough-to-post rollup of a much longer LLM assessment of the skills and that while most of the points were fair, some were questionable. If you were doing this "for real" you'd need to assess the full response point-by-point and decide which ones were valid.

> If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.

What on earth are you on about? The whole point of the sentence you were replying to was that you can't blindly trust what comes out of them.

grey-area 3 days ago | parent [-]

I'm saying that your acknowledgement that they produce plausible but sometimes false text is contradicted by the trust you seem to place in their output and self-analysis, which is plausible but unlikely to be correct.

crustycoder 2 days ago | parent [-]

Yes, of course there's a risk it may still be incorrect, but querying the LLM through the limited facilities it provides for introspection is more likely to have at least some connection with the facts than the alternative some people use, which is to simply guess at why it produced the output it did.

If you have an alternative approach, please share.