Can you walk through why you think the random outer loop would interfere?

If the inner loop's behaviour is predictable no matter the outer loop, then because the branch predictor is keyed by instruction address, it can be predicted. Only the inner loop's history is considered.

Or maybe I'm misunderstanding what code you're imagining?

▲

fc417fc802 3 days ago | parent [-]

It seems likely I've fundamentally misunderstood gselect or some other aspect here.

If you only look at the history of a single address independently then don't shallow nested loops with dependent behavior completely break the entire scheme?

Whereas with global history I think it mostly works. Maybe? But in that case what happens when the inner branch is trivially predictable but shallow and enclosed by ones that aren't?

In the linked article the branch always has the same address. AMD starts gradually degrading at 30k. Doesn't that indicate 16 bits of history? So for a single trivially predictable inner branch wouldn't you expect an initial 14 mispredictions? That seems like a lot.

My (I suspect very flawed) mental model here is global branch history with a 16 bit shift register and a 16 bit table, the latter keyed on hash( register ) ^ hash( address ) to explain the observed behavior.

▲

mcdeltat 2 days ago | parent [-]

Ahh maybe I am the one who is misunderstanding... I'm by no means an expert on this (and it is quite complex)

	▲	fc417fc802 2 days ago \| parent [-]
		Looking at the linked slides again my mental model is what it refers to as "gshare". A tournament predictor between that and "local" addresses the two scenarios I posed but still has obvious failure modes. Local: if( rng() ){ if( true ){ ... }} Global: if( f ){} if( !f ){} Trashed state: if( rng() ){} if( f ){} if( !f ){} But notice that the third happens naturally (ie no need for an RNG) any time the history depth doesn't match up nicely with the looping pattern. Hence my initial question about how real world implementations determine how many layers to pay attention to. You could solve it with a tree structure but do hardware implementers go that far?