> Grok ended up performing the best while DeepSeek came close to second. Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.

I'm not an investor or researcher, but this triggers my spidey sense... it seems to imply they aren't measuring what they think they are.

▲ IgorPartola 19 hours ago | parent | next [-]

Yeah I mean if you generally believe the tech sector is going to do well because it has been doing well you will beat the overall market. The problem is that you don’t know if and when there might be a correction. But since there is this one segment of the overall market that has this steady upwards trend and it hasn’t had a large crash, then yeah any pattern seeking system will identify “hey this line keeps going up!” Would it have the nuance to know when a crash is coming if none of the data you test it on has a crash?

It would almost be more interesting to specifically train the model on half the available market data, then test it on another half. But here it’s like they added a big free loot box to the game and then said “oh wow the player found really good gear that is better than the rest!”

Edit: from what I casually remember a hedge fund can beat the market for 2-4 years but at 10 years and up their chances of beating the market go to very close to zero. Since LLMs have not been around for that long it is going to be difficult to test this without somehow segmenting the data.

▲ stonemetal12 2 hours ago | parent | next [-]

Would that work for LLMs though? They were hypothetically trained on newspapers from the second half of the data, so they have knowledge of "future" events.

▲ tshaddox 19 hours ago | parent | prev | next [-]

> It would almost be more interesting to specifically train the model on half the available market data, then test it on another half.

Yes, ideally you’d have a model trained only on data up to some date, say January 1, 2010, and then start running the agents in a simulation where you give them each day’s new data (news, stock prices, etc.) one day at a time.
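A minimal sketch of that day-by-day replay, assuming a single ticker and daily closes. `momentum_agent` is a hypothetical stand-in for the model (not anything from the article), and `walk_forward` is an illustrative name. The point is only the ordering: the agent decides using data up to yesterday, and today's close is revealed after it acts.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Portfolio:
    cash: float
    shares: float = 0.0

    def value(self, price: float) -> float:
        return self.cash + self.shares * price


def momentum_agent(history):
    """Hypothetical stand-in for the model: hold stock if the last close rose."""
    if len(history) < 2:
        return "hold_cash"
    return "hold_stock" if history[-1][1] > history[-2][1] else "hold_cash"


def walk_forward(prices, agent, cutoff, start_cash=1000.0):
    """Replay (date, close) pairs one day at a time, starting after `cutoff`.

    `prices` must be sorted by date. Everything up to and including `cutoff`
    plays the role of training-era data the agent may already know.
    """
    pf = Portfolio(cash=start_cash)
    history = [(d, p) for d, p in prices if d <= cutoff]
    for d, p in prices:
        if d <= cutoff:
            continue
        action = agent(history)  # decision sees data up to yesterday only
        if action == "hold_stock" and pf.shares == 0:
            pf.shares, pf.cash = pf.cash / p, 0.0
        elif action == "hold_cash" and pf.shares > 0:
            pf.cash, pf.shares = pf.shares * p, 0.0
        history.append((d, p))  # today's close is revealed after acting
    return pf.value(prices[-1][1])
```

This only guards against lookahead inside the simulation loop. It does nothing about a model whose training data already covers the test period, which is the concern raised elsewhere in the thread.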

▲ hxtk 14 hours ago | parent | next [-]

I suspect trading firms have already done this to the maximum extent that it's profitable to do so. I think if you were to integrate LLMs into a trading algorithm, you would need to incorporate more than just signals from the market itself. For example, I hazard a guess you could outperform a model that operates purely on market data with a model that also includes a vector embedding of a selection of key social and news media accounts or other information sources that have historically been difficult to encode until LLMs.

▲ solotronics 2 hours ago | parent | next [-]

The part people are missing here is that if the trading firms are all doing something, that in itself influences the market. If they are all giving the LLMs money to invest and the AIs generally buy the same group of stocks, those stocks will go up. As more people attempt the strategy it infuses fresh capital and, more importantly, signals to the trading firms that there are inflows to these stocks. I think it's probably a reflexive loop at this point.

▲ giantg2 3 hours ago | parent | prev [-]

"includes a vector embedding of a selection of key social and news media accounts or other information sources that have historically been difficult to encode until LLMs." Not really. Sentiment analysis in social networks has been around for years. It's probably cheaper to buy that analysis and feed it to LLMs than to have LLMs do it.

▲ IgorPartola 18 hours ago | parent | prev [-]

I mean ultimately this is an exercise in frustration because if you do that you will have trained your model on market patterns that might not be in place anymore. For example after the 2008 recession regulations changed. So do market dynamics actually work the same in 2025 as in 2005? I honestly don’t know but intuitively I would say that it is possible that they do not.

I think a potentially better way would be to segment the market up to today but take half or 10% of all the stocks and make only those available to the LLM. Then run the test on the rest. This accounts for rules and external forces changing how markets operate over time. And you can do this over and over picking a different 10% market slice for training data each time.

But then your problem is that if you exclude let’s say Intel from your training data and AMD from your testing data then their ups and downs don’t really make sense since they are direct competitors. If you separate by market segment then training the model on software tech companies might not actually tell you accurately how it would do for commodities or currency trading. Or maybe I am wrong and trading is trading no matter what you are trading.
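The repeated "pick a different 10% slice each time" idea can be sketched as follows. This is a minimal illustration, not a recommended protocol; `ticker_splits` and its parameters are hypothetical names.

```python
import random


def ticker_splits(tickers, train_frac=0.1, rounds=5, seed=0):
    """Yield (train, test) ticker lists, drawing a fresh random slice each round."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_train = max(1, round(len(tickers) * train_frac))
    for _ in range(rounds):
        train = set(rng.sample(tickers, n_train))
        test = [t for t in tickers if t not in train]
        yield sorted(train), test
```

Each round gives disjoint train and test sets, but disjoint tickers are not independent tickers, which is exactly the Intel/AMD problem described above.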

▲ godelski 14 hours ago | parent | next [-]

  > I think a potentially better way would be to segment the market up to today but take half or 10% of all the stocks and make only those available to the LLM.

Autocorrelation is going to bite you in the ass.

Those stocks are going to be coupled. Let's take an easy example. Suppose you include Nvidia in the training data and hold out AMD for test. Is there information leakage? Yes. The problem is that each company isn't independent. You have information leakage in both the setting where companies grow together as well as zero sum games (since x + y = 0, if you know x then you know y). But in this example AMD trends with Nvidia. Maybe not as much, but they go in the same direction. They're coupled.

Not to mention that in the specific setting the LLMs were given news and other information.
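A rough way to see the coupling is to measure the correlation of daily returns between a "training" ticker and a held-out one. The sketch below uses synthetic returns built from a shared sector factor plus independent noise. The numbers are made up for illustration and are not real Nvidia or AMD data.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_returns(n_days, sector_weight, seed):
    """Two synthetic return series sharing one sector factor (an assumption)."""
    rng = random.Random(seed)
    sector = [rng.gauss(0, 0.02) for _ in range(n_days)]
    a = [sector_weight * s + rng.gauss(0, 0.01) for s in sector]
    b = [sector_weight * s + rng.gauss(0, 0.01) for s in sector]
    return a, b
```

With `sector_weight=1.0` the two series come out strongly correlated even though neither is derived from the other. With `sector_weight=0.0` the correlation is near zero. Anything a model learns about the shared factor from the training ticker carries over to the held-out one.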

▲ chris_st 17 hours ago | parent | prev | next [-]

> you will have trained your model on market patterns that might not be in place anymore

My working definition of technical analysis [0]

[0]: https://en.wikipedia.org/wiki/Technical_analysis

▲ IgorPartola 17 hours ago | parent | next [-]

It is always fun (in a broad sense of that word) when I make a comment on an industry I know nothing about and somehow stumble onto a thing that not only has a name but also research. I am sure there is a German word for that feel of discovering something that countless others have already discovered.

▲ biztos 15 hours ago | parent | next [-]

> there is a German word

Zeitgeistüberspannungsfreude

▲ chris_st 17 hours ago | parent | prev | next [-]

XKCD calls it the "Lucky 10,000" [0]

[0]: https://xkcd.com/1053/

▲ mewpmewp2 11 hours ago | parent [-]

That is referring to something else entirely: a common fact that the person didn't figure out by themselves. OP is referring to something they came up with themselves in a field they have no experience with, then realizing it is actually a thing and, in a way, feeling validated and clever.

▲ taneq 16 hours ago | parent | prev [-]

Any time I invent a cool thing, I go and try and find it online. Usually it's already an established product, which totally validates my feeling that the thing I invented is cool and would be a good product. :D

Occasionally it's (as far as I can tell) a legitimately new 'wow that's obvious' style thing and I consider prototyping it. :)

▲ chasing0entropy 7 hours ago | parent [-]

What have you prototyped recently? Anything you have released to market? I'm in the same general area but am teetering on actually launching products. Wouldn't mind connecting with a like-minded engineer.

▲ stouset 16 hours ago | parent | prev [-]

I am frankly astonished at the number of otherwise-intelligent people who actually seem to believe in this stuff.

One of the worst possible things to do in a competitive market is to trade by some publicly-available formulaic strategy. It’s like announcing your rock-paper-scissors move to your opponent in advance.

▲ intalentive 2 hours ago | parent | next [-]

Technical analysis is a basket of heuristics. Support / resistance / breakout (especially around whole numbers) seems to reflect persistent behavior rooted in human psychology. Look at the heavy buying at the $30 mark here, putting a floor under silver: https://finviz.com/futures_charts.ashx?p=d&t=SI This is a common pattern that can be useful to know.

▲ tim333 8 hours ago | parent | prev [-]

A couple of subtleties in that. Rather than rock paper scissors with three options, there are hundreds of technical strategies out there so you may still be doing something unusual. Secondly the mass of the public are kind of following a technical strategy of just buying index funds because the index has gone up in the past. Which is ignoring the fundamental issue of whether stocks are decent value for money at the moment.

▲ noduerme 14 hours ago | parent | prev | next [-]

Just to name a different but related approach, as a hobby project I built a (non LLM) model that trained mainly on data from stocks that didn't move much over the past decade, seeking ways to beat the performance of those particular stocks. I put it into practice for a couple of years, and came out roughly even by constantly rebalancing a basket of stocks that, as a whole, dropped by about 20%. I considered that to be a success, although it would've been nicer to make money.

▲ 0manrho 17 hours ago | parent | prev [-]

> you will have trained your model on market patterns that might not be in place anymore

How is that relevant to what was proposed? If it's trading and training on 2010 data, what relevance do today's market dynamics and regulations have?

Which further begs the question, what's the point of this exercise?

Is it to develop a model that can compete effectively in today's market? If so then yeah, the 2010 trading/training idea probably isn't the best idea for the reasons you've outlined.

Or is it to determine the capacity of an AI to learn and compete effectively within any given arbitrary market/era? If so, then today's dynamics/constraints are irrelevant unless you're explicitly trying to train/trade on today's markets (which isn't what the person you're replying to proposed, but is obviously a valid desire and test case to evaluate in its own right).

Or is it evaluating its ability to identify what those constraints/limitations are and then build strategies based on it? In which case it doesn't matter when you're training/trading so much as your ability to feed it accurate and complete data for that time period be it today, or 15 years ago or whenever, which is no small ask.

▲ ainiriand 13 hours ago | parent | prev | next [-]

As an old investor friend of mine always says: 'It is really easy to make money in the market when everyone is doing it, just try to not lose it when they lose it'.

▲ Eddy_Viscosity2 5 hours ago | parent | prev | next [-]

> a hedge fund can beat the market for 2-4 years but at 10 years and up their chances of beating the market go to very close to zero

In that case the winning strategy would be to switch hedge funds every 3 years.

▲ perlgeek 5 hours ago | parent | next [-]

The problem is that you don't know in advance which will be doing well when.

▲ skeeter2020 4 hours ago | parent | prev [-]

Except you don't know which fund is going to "go on a hot streak" or when the magic will end. The original statement only holds when looking at historical data; it's not predictive.

▲ calmbonsai 17 hours ago | parent | prev | next [-]

For a nice historic perspective on hedge funds and the industry as a whole, read Mallaby's "More Money Than God".

▲ arisAlexis 8 hours ago | parent | prev [-]

You believe in the tech sector because technology always goes well and it's what humans strive to achieve, not because it has done well recently. It has always.

▲ knollimar 4 hours ago | parent [-]

When does the tech sector become the computer sector?

Agriculture would have been considered tech 200 years ago.

▲ arisAlexis 3 hours ago | parent [-]

full throttle until AGI is achieved, then we will see

▲ olliepro 19 hours ago | parent | prev | next [-]

A more sound approach would have been to do a monte carlo simulation where you have 100 portfolios of each model and look at average performance.
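A minimal sketch of that idea, assuming each "model" can be treated as a stochastic strategy that produces one daily return per call. In a real replication each run would re-query the LLM with sampling enabled. Here the strategies are hypothetical callables, and `monte_carlo` reports the mean and standard deviation of total return per strategy.

```python
import random
import statistics


def simulate_run(pick_return, n_days, rng):
    """Compound one portfolio's daily returns and return its total return."""
    value = 1.0
    for _ in range(n_days):
        value *= 1.0 + pick_return(rng)
    return value - 1.0


def monte_carlo(strategies, n_runs=100, n_days=160, seed=0):
    """Run each strategy n_runs times; return {name: (mean, stdev)}."""
    results = {}
    for name, pick_return in strategies.items():
        rng = random.Random(f"{seed}-{name}")  # independent, reproducible stream
        runs = [simulate_run(pick_return, n_days, rng) for _ in range(n_runs)]
        results[name] = (statistics.mean(runs), statistics.stdev(runs))
    return results
```

Comparing the gap between two means against the spread within each strategy gives at least a rough sense of whether a ranking from a single run per model means anything.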

▲ observationist 19 hours ago | parent | next [-]

Grok would likely have an advantage there, as well - it's got better coupling to X/Twitter, a better web search index, fewer safety guardrails in pretraining and system prompt modification that distort reality. It's easy to envision random market realities that would trigger ChatGPT or Claude into adjusting the output to be more politically correct. DeepSeek would be subject to the most pretraining distortion, but have the least distortion in practice if a random neutral host were selected.

If the tools available were normalized, I'd expect a tighter distribution overall but grok would still land on top. Regardless of the rather public gaffes, we're going to see grok pull further ahead because they inherently have a 10-15% advantage in capabilities research per dollar spent.

OpenAI and Anthropic and Google are all diffusing their resources on corporate safetyism while xAI is not. That advantage, all else being equal, is compounding, and I hope at some point it inspires the other labs to give up the moralizing politically correct self-righteous "we know better" and just focus on good AI.

I would love to see a frontier lab swarm approach, though. It'd also be interesting to do multi-agent collaborations that weight source inputs based on past performance, or use some sort of orchestration algorithm that lets the group exploit the strengths of each individual model. Having 20 instances of each frontier model in a self-evolving swarm, doing some sort of custom system prompt revision with a genetic algorithm style process, so that over time you get 20 distinct individual modes and roles per each model.

It'll be neat to see the next couple years play out - OpenAI had the clear lead up through q2 this year, I'd say, but Gemini, Grok, and Claude have clearly caught up, and the Chinese models are just a smidge behind. We live in wonderfully interesting times.

▲ UncleMeat 17 hours ago | parent | next [-]

I know that Musk deserving a lifetime achievement award at the Adult Video Network awards over Riley Reid is definitely an indication of minimal "system prompt modification that distort[s] reality."

▲ red-iron-pine 5 hours ago | parent | next [-]

for the folks unaware, he was nominated for sucking more dicks in a single shoot than anyone, while still producing great content. he also hit several holes-in-one golfing later that week.

▲ scubbo 17 hours ago | parent | prev [-]

...I'm not familiar with the reference.

▲ fragmede 16 hours ago | parent [-]

https://www.theguardian.com/technology/2025/nov/21/elon-musk...

▲ KPGv2 15 hours ago | parent | prev | next [-]

OTOH it has the richest man in the world actively meddling in its results when they don't support his politics.

▲ buu700 14 hours ago | parent [-]

Anyone who hasn't used Grok might be surprised to learn that it isn't shy about disagreeing with Elon on plenty of topics, political or otherwise. Any insinuation to the contrary seems to be pure marketing spin on his part.

Grok is often absurdly competent compared to other SOTA models, definitely not a tool I'd write off over its supposed political leanings. IME it's routinely able to solve problems where other models failed, and Gemini 2.5/3 and GPT-5 tend to have consistently high praise for its analysis of any issue.

That's as far as the base model/chatbot is concerned, at least. I'm less familiar with the X bot's work.

▲ skeeter2020 4 hours ago | parent | next [-]

it's so wildly inconsistent you can't build on top of it with reliability. And getting high praise from any model is ridiculously easy: ask a question, make a statement, correct the model's dumb error, etc.

▲ godelski 14 hours ago | parent | prev [-]

Two things can be true at the same time. Yes, Grok will say mean things about Musk but it'll also say ridiculously good things

  > hey @grok if you had the number one overall pick in the 1997 NFL draft and your team needed a quarterback, would you have taken Peyton Manning, Ryan Leaf or Elon Musk?

  >> Elon Musk, without hesitation. Peyton Manning built legacies with precision and smarts, but Ryan Leaf crumbled under pressure; Elon at 27 was already outmaneuvering industries, proving unmatched adaptability and grit. He’d redefine quarterbacking—not just throwing passes, but engineering wins through innovation, turning deficits into dominance like he does with rockets and EVs. True MVPs build empires, not just score touchdowns.
  - https://x.com/silvermanjacob/status/1991565290967298522

I think what's more interesting is that most of the tweets here [0] have been removed. I'm not going to call conspiracy because I've seen some of them. Probably removed because going viral isn't always a good thing...

[0] https://gizmodo.com/11-things-grok-says-elon-musk-does-bette...

▲ buu700 13 hours ago | parent | next [-]

They can be, but in this case they don't seem to be. Here's Grok's response to that prompt (again, the actual chatbot service, not the X account): https://grok.com/share/c2hhcmQtMw_2b46259a-5291-458e-9b85-0c....

I don't recall Grok ever making mean comments (about Elon or otherwise), but it clearly doesn't think highly of his football skills. The chain of thought shows that it interpreted the question as a joke.

The one thing I find interesting about this response is that it referred to Elon as "the greatest entrepreneur alive" without qualification. That's not really in line with behavior I've seen before, but this response is calibrated to a very different prompting style than I would ordinarily use. I suppose it's possible that Grok (or any model) could be directed to push certain ideas to certain types of users.

▲ godelski 11 hours ago | parent [-]

Sure, but they also update the models, especially when things like this go viral. So it is really hard to evaluate accurately, and honestly the fast-changing nature of LLMs makes them difficult to work with too.

▲ tengbretson an hour ago | parent | prev [-]

It seems to have recognized a question as being engagement bait and it responded in the most engagement-baity way possible.

▲ jessetemp 18 hours ago | parent | prev [-]

> fewer safety guardrails in pretraining and system prompt modification that distort reality.

Really? Isn't Grok's whole schtick that it's Elon's personal altipedia?

▲ nickthegreek 18 hours ago | parent [-]

My understanding is that the grok api is way different than the grok x bot. Which of course doesn't do Grok as a business any favors. Personally, I do not engage with either.

▲ bdangubic 18 hours ago | parent [-]

you gotta be quite a crazy person to use grok :)

▲ AlexCoventry 17 hours ago | parent | next [-]

Grok is good for up-to-the-minute information, and for requests that other chat services refuse to entertain, like requests for instructions on how to physically disable the cellular modem in your car.

▲ doe88 10 hours ago | parent | prev | next [-]

Maybe being crazy is what you need to bet on the stock market - not financial advice, and also not written by Grok - I swear :))

▲ KPGv2 15 hours ago | parent | prev | next [-]

I sat in my kid's extracurricular a couple months ago and had an FBI agent tell me that Grok was the most trustworthy based on "studies," so that's what she had for her office.

▲ bdangubic 25 minutes ago | parent | next [-]

Grok has Elon as a better athlete than LeBron so I would agree with the FBI agent. can’t get that kind of insight anywhere else :)

▲ skeeter2020 4 hours ago | parent | prev [-]

Did she get that info from Grok?

▲ airstrike 17 hours ago | parent | prev [-]

@grok is this true?

▲ bdangubic 17 hours ago | parent [-]

… checking with my creator …

▲ cyberrock 16 hours ago | parent | prev | next [-]

While not strictly stocks, it would be interesting to see them trade on game economies like EVE, WoW, RuneScape, Counter Strike, PoE, etc.

▲ ekianjo 2 hours ago | parent | prev [-]

indeed, and also a "model" does not mean anything per se, you have hundreds of different prompts, you can layer agents on top, you can use temperature that will lead to different outcomes. The number of dimensions to explore is huge.

▲ culi 17 hours ago | parent | prev | next [-]

I'd like to see this study replicated during a bear market

▲ petercooper 2 hours ago | parent | next [-]

Agreed. While I don’t see them outperforming long held funds, it’d be interesting to see if they could pick up on negative signals in the news feed, and also whether there is any advantage in not being emotional about their decisions.

▲ gizajob 7 hours ago | parent | prev [-]

Yeah the timeframe is crucial here. The experiment began as Trump launched his tariff tweets which caused a huge downward correction and then a large uptrend. Buying almost anything tech at the start of this would have made money.

▲ mvkel 4 hours ago | parent | prev | next [-]

S&P 500 is also tech heavy and notoriously difficult to beat over the long run

▲ monksy 19 hours ago | parent | prev | next [-]

They're not measuring performance in the context of when things happen and the period they happen in. I think it's only showing recent performance and popularity. To actually evaluate how these do you need to be able to correct the model and retrain it per different time periods and then measure how it would do. Then you'll get better information from the backtesting.

▲ etchalon 19 hours ago | parent | prev | next [-]

I don't feel like they measured anything. They just confirmed that tech stocks in the US did pretty well.

▲ JoeAltmaier 19 hours ago | parent [-]

They measured the investment facility of all those LLMs. That's pretty much what the title says. And they had dramatically different outcomes. So that tells me something.

▲ skeeter2020 4 hours ago | parent | next [-]

They "proved" that US tech stocks did better than portfolios with fewer US tech stocks over a recent, very short time range. 1. You didn't know that? 2. What are you going to do with this "new information"?

▲ JoeAltmaier 2 minutes ago | parent [-]

As a stock-trading exercise? Nothing, as you note. As an AI investigation it says plenty. Which is the point I was making (and got missed by all those stock-trading self-appointed experts who fastened onto that)

▲ Libidinalecon 6 hours ago | parent | prev | next [-]

It shows nothing. This is a bullshit stunt that should be obvious to anyone who has placed a few trades.

▲ JoeAltmaier 4 minutes ago | parent [-]

Unless you think of it as an AI exercise, not a stock trading exercise. Which point evaded most people.

▲ DennisP 19 hours ago | parent | prev [-]

I mean, what it kinda tells me is that people talk about tech stocks the most, so that's what was most prevalent in the training data, so that's what most of the LLMs said to invest in. That's the kind of strategy that works until it really doesn't.

▲ ghaff 17 hours ago | parent [-]

Cue 2020 or so. I do have investments in tech stocks but I have a lot more conservative investments too.

▲ tclancy 17 hours ago | parent | prev | next [-]

I mean, run the experiment during a different trend in the market and the results would probably be wildly different. This feels like chartists [1] but lazier.

[1] https://www.investopedia.com/terms/c/chartist.asp

▲ refactor_master 17 hours ago | parent | next [-]

If you've ever read a blog on trading when LSTMs came out, you'd have seen all sorts of weird stuff with predicting the price at t+1 on a very bad train/test split, where the author would usually say "it predicts t+1 with 99% accuracy compared to t", and the graph would be an exact copy with a t+1 offset.

So eye-balling the graph looks great, almost perfect even, until you realize that in real-time the model would've predicted yesterday's high on today's market crash and you'd have lost everything.
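The effect is easy to reproduce with no model at all. The sketch below builds a synthetic random walk (unpredictable by construction; the 1% daily volatility and 280.0 start are assumptions) and scores the "prediction" that tomorrow's price equals today's. The function names are illustrative.

```python
import random


def random_walk(n, start=280.0, daily_vol=0.01, seed=0):
    """Synthetic price series with independent Gaussian daily returns."""
    rng = random.Random(seed)
    prices = [start]
    for _ in range(n - 1):
        prices.append(prices[-1] * (1.0 + rng.gauss(0.0, daily_vol)))
    return prices


def persistence_forecast(prices):
    """Predict price[t+1] = price[t]: the 'exact copy with an offset'."""
    return prices[:-1]


def mape(pred, actual):
    """Mean absolute percentage error."""
    return sum(abs(p - a) / a for p, a in zip(pred, actual)) / len(actual)


prices = random_walk(1000)
error = mape(persistence_forecast(prices), prices[1:])
print(f"persistence MAPE: {error:.2%}")
```

The error comes out at roughly the size of a typical daily move, so the forecast looks about "99% accurate" on a price chart while containing no information about direction. Scoring on returns or direction, against this persistence baseline, is what exposes it.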

▲ blitzar 2 hours ago | parent [-]

if you feed in price i.e. 280.1, 281.5, 281.9 ... you are going to get some pretty good looking results when it comes to predicting the next day's price (t+1) with a margin of +/- a percent or so.

▲ throwawayffffas 4 hours ago | parent | prev [-]

To be fair to chartists, they try to identify if they are in a bear market or one is coming and get out early.

▲ micromacrofoot 3 hours ago | parent | prev | next [-]

probably hitching onto sycophancy for the parent company and getting lucky as a result... that Grok September rally aligns somewhat with TSLA for instance

▲ KPGv2 15 hours ago | parent | prev | next [-]

Also studying for eight months is not useful. Loads of traders do this well for eight months and then do shit for the next five years. And tellingly, they didn't beat the S&P 500. They invested in something else that beat the S&P 500. And the one that didn't invest in that something did worse than the S&P 500.

What this tells me is they were lucky to have picked something that would beat the market for now.

▲ seanmcdirmid 17 hours ago | parent | prev [-]

We had this discussion in previous posts about congressional leaders who had the risk appetite to go tech heavy and therefore outperformed normal congress critters.

Going heavy on tech can be rewarding, but you are taking on more risk of losing big in a tech crash. We all know that, and if you don't have the money to play riskier moves, it's not really a move you can take.

Long term it is less of a win if a tech bubble builds and pops before you can exit (and you can't wait it out for it to re-inflate).

▲ directevolve an hour ago | parent | next [-]

This is a wildly disingenuous interpretation of that study.

“Using transaction-level data on US congressional stock trades, we find that lawmakers who later ascend to leadership positions perform similarly to matched peers beforehand but outperform them by 47 percentage points annually after ascension. Leaders’ superior performance arises through two mechanisms. The political influence channel is reflected in higher returns when their party controls the chamber, sales of stocks preceding regulatory actions, and purchase of stocks whose firms receiving more government contracts and favorable party support on bills. The corporate access channel is reflected in stock trades that predict subsequent corporate news and greater returns on donor-owned or home-state firms.”

https://www.nber.org/papers/w34524

▲ hobobaggins 17 hours ago | parent | prev [-]

They didn't just outperform "normal" congress critters.. they also outperformed nearly every hedge fund on the planet. But they (meaning, of course, just one person and their spouse) are obviously geniuses.

▲ Guillaume86 2 hours ago | parent | next [-]

They also outperformed themselves before being in a leader position...

▲ stouset 16 hours ago | parent | prev | next [-]

Hedge funds’ goals are often not to maximize profit, but to provide returns uncorrelated with the rest of some benchmark market. This is useful for the wealthy as it means you can better survive market crashes.

▲ seanmcdirmid 17 hours ago | parent | prev [-]

Hedge funds suck though. They don’t invest in FAANG, they do risky stuff that doesn’t pay off, you are still comparing incomparable things.

I’m obviously a genius because 90% of my stock is in tech, most of us on HN are geniuses in your opinion?

▲ cap11235 16 hours ago | parent [-]

What do you think hedge funds do?

▲ seanmcdirmid 16 hours ago | parent | next [-]

They use crazy investment strategies that allow them to capture high returns in adverse general market conditions, but they tend to underperform the general market in normal and booming conditions. “Hedge” is actually in their name for a reason. Rich people use hedge funds for…hedging.

▲ mvkel 4 hours ago | parent | prev [-]

Downside protection. Hedging. Giving you gains at the lowest beta possible.