I was working on a similar project. I wanted a way to goldfish my decks against many kinds of decks in a pod. It would never be perfect, but it would be enough to get an idea of: (1) how many turns it took on average to hit 2, 3, 4, 5, and 6 mana; (2) how many threats I removed; and (3) how often I didn't have enough card draw to keep my hand full.

I don't think there's a perfect way to do this, but I think trying to play 100 games with a deck and getting basic info like this would be super valuable.

spullara 2 hours ago

Have your LLM write a simulation of the deck instead, so it can play 10,000 games in a second. I think that is a lot better for goldfishing and not nearly as expensive :)

https://github.com/spullara/mtg-reanimator
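To give a feel for what that kind of simulation can look like, here is a minimal Python sketch. It is not taken from the repo above. It only tracks land drops, and it assumes a 99-card deck with 37 lands, a seven-card opening hand, no mulligans, no ramp, and a draw on every turn including the first. All names and numbers are made up for illustration.

```python
import random
from statistics import mean


def turns_to_lands(rng, deck_size=99, lands=37, target=4, max_turns=30):
    """Play one goldfish game and return the first turn with `target` lands in play.

    Returns None if the target is not reached within `max_turns`.
    Assumption: only land drops are modelled (no ramp, no mulligans).
    """
    deck = [True] * lands + [False] * (deck_size - lands)  # True marks a land
    rng.shuffle(deck)
    hand_lands = sum(deck[:7])  # lands in the seven-card opening hand
    next_card = 7
    in_play = 0
    for turn in range(1, max_turns + 1):
        if next_card < deck_size:  # draw for the turn
            hand_lands += deck[next_card]
            next_card += 1
        if hand_lands > 0:  # play at most one land per turn
            hand_lands -= 1
            in_play += 1
        if in_play >= target:
            return turn
    return None


def average_turns(target, games=10_000, seed=None, **deck):
    """Average turn on which `target` lands are reached, over games that get there."""
    rng = random.Random(seed)
    results = [turns_to_lands(rng, target=target, **deck) for _ in range(games)]
    reached = [t for t in results if t is not None]
    return mean(reached) if reached else None


if __name__ == "__main__":
    for target in range(2, 7):
        print(f"{target} lands: turn {average_turns(target, seed=1):.2f} on average")
```

Counting removal or card draw the same way would mean tagging cards with more than a land flag, but the loop structure stays the same.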

I have also tried evaluating LLMs for playing the game and have found them to be really terrible at it, even the SoTA ones. They would probably be a lot better inside an environment where the rules are enforced strictly like MTG Arena rather than them having to understand the rules and play correctly on their own. The 3rd LLM acting as judge helps but even it is wrong a lot of the time.

https://github.com/spullara/mtgeval

GregorStocks 2 hours ago

Yeah, that's why I'm using XMage for my project - it has real rules enforcement.

	spullara an hour ago
		I was really hoping they could play the game like a human does. Sadly they aren't that close :)

GregorStocks 2 hours ago

XMage has non-LLM-based built-in AIs that just use regular old if-then logic. Getting them to play against each other with no human interaction was the first thing I built. https://www.youtube.com/watch?v=a1W5VmbpwmY is an example with two of those guys plus the Sleepy and Potato no-op players. They do a fine job with straightforward decks.

You could clone mage-bench https://github.com/GregorStocks/mage-bench and add a new config like https://github.com/GregorStocks/mage-bench/blob/master/confi... pointing at the deck you want to test, and then do `make run CONFIG=my-config`. The logs will get dumped in ~/.mage-bench/logs and you can do analysis on them after the fact with Python or whatever. https://github.com/GregorStocks/mage-bench/tree/master/scrip... has various examples of varying quality levels.
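As a rough idea of what after-the-fact analysis in Python might look like, here is a format-agnostic sketch that counts lines matching a regex in every file under a log directory. The log format is not described here, so the pattern in the usage example is a hypothetical placeholder, and `count_matches` is an illustrative helper, not part of mage-bench.

```python
import re
from collections import Counter
from pathlib import Path


def count_matches(log_dir, pattern):
    """Count the lines matching `pattern` in each file under `log_dir`.

    Returns a Counter keyed by the file's path relative to `log_dir`.
    Assumption: the logs are text files; undecodable bytes are replaced.
    """
    root = Path(log_dir).expanduser()
    regex = re.compile(pattern)
    counts = Counter()
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            counts[str(path.relative_to(root))] = sum(
                1 for line in handle if regex.search(line)
            )
    return counts


if __name__ == "__main__":
    # "casts" is a placeholder pattern; check real log lines before relying on it.
    for name, hits in count_matches("~/.mage-bench/logs", r"casts").items():
        print(f"{name}: {hits}")
```

Once you know what the real log lines look like, the regex can be swapped for something that pulls out turn numbers or card names.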

You could also use LLMs, just by passing a different `type` in the config file. But then you'd be spending real money for slower gameplay and probably worse results.

	benbayard an hour ago
		This is super helpful, thank you!