tgtweak 2 days ago

In those GL/DX games (built for non-specific hardware), all the textures and shaders are compiled either during the game's build or before you get into the scene. Many console systems, particularly Nintendo's, do that precompilation specifically for the GPU hardware that is inside the console. That isn't known to the emulator in advance (unless someone publishes a shader cache alongside the ROM...), so when a shader is referenced in the scene it needs to be compiled at runtime to work on the emulated graphics system (translated from Nintendo-hardware shader code to DirectX, Vulkan or OpenGL, then further into the vendor-specific shader).

Most modern emulators implement a shader cache which stores those shaders as they are encountered, so that this "compilation stutter" only happens once per shader - but modern titles can have hundreds or thousands of shaders, which means on a playthrough you're pretty much encountering it constantly. Breath of the Wild stands out as a game where you basically had to run it with a precompiled shader cache, as it was borderline unplayable without one.
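
Conceptually, that cache is just a compile-once map keyed by the console shader bytes plus whatever pipeline state selects a variant. A minimal Python sketch, purely illustrative; compile_fn and the on-disk format here are placeholders, not any real emulator's internals:

    import hashlib, os, pickle

    class ShaderCache:
        """Compile-once cache: key = hash of console shader bytecode + pipeline state (both bytes)."""

        def __init__(self, path):
            self.path = path
            self.entries = {}
            if os.path.exists(path):
                with open(path, "rb") as f:
                    self.entries = pickle.load(f)

        def get(self, console_bytecode, pipeline_state, compile_fn):
            key = hashlib.sha1(console_bytecode + pipeline_state).hexdigest()
            if key not in self.entries:
                # First encounter: this is where the compilation stutter happens.
                self.entries[key] = compile_fn(console_bytecode, pipeline_state)
                with open(self.path, "wb") as f:
                    pickle.dump(self.entries, f)
            return self.entries[key]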

Ubershaders act like fallback shaders: using an off-the-shelf precompiled "particle" shader instead of the actual one, while the actual one is compiled for next time. This prevents the stutter at a cost of visual fidelity. If you see an explosion in a game, it will use a generic explosion shader instead of the game's actual one, until that one is available in the shader cache.

zeta0134 2 days ago | parent | next [-]

That's not quite how ubershaders work. They're a "fallback" shader in the sense that they rather inefficiently implement the entire pipeline, but they do implement the entire pipeline. The shader being compiled in another thread will be more efficient, as it uses only the logic needed for whatever configuration the game is calling up, but the visual result is identical in the ubershader case; that's the whole point. If you want, and your host system is powerful enough, you can turn ubershaders on all the time and disable the entire compilation thread and associated cache.
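
The pattern is roughly: check the cache, and on a miss draw with the ubershader while a worker thread compiles the specialized version. A rough Python sketch, illustrative only; ubershader and compile_specialized stand in for the real GPU-side objects:

    from concurrent.futures import ThreadPoolExecutor

    class ShaderDispatcher:
        """Serve a specialized shader if it's ready; otherwise fall back to the
        ubershader (same visual result, less efficient) and compile in the background."""

        def __init__(self, ubershader, compile_specialized):
            self.ubershader = ubershader
            self.compile_specialized = compile_specialized
            self.pending, self.ready = {}, {}
            self.pool = ThreadPoolExecutor(max_workers=2)

        def shader_for(self, config):              # config: hashable pipeline state
            if config in self.ready:
                return self.ready[config]          # fast path: specialized shader
            fut = self.pending.get(config)
            if fut is None:
                # Kick off specialization without blocking the frame.
                self.pending[config] = self.pool.submit(self.compile_specialized, config)
            elif fut.done():
                self.ready[config] = fut.result()
                del self.pending[config]
                return self.ready[config]
            return self.ubershader                 # correct output, just slower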

I believe the term was coined by the Dolphin team, who did a pretty good high-level writeup of the feature here:

https://dolphin-emu.org/blog/2017/07/30/ubershaders/

tom_ 2 days ago | parent [-]

It predates Dolphin's use of it, though don't ask me by how long. Here's a reference to the term from 2008: https://realtimecollisiondetection.net/blog/?p=73

Jasper_ 2 days ago | parent [-]

Ubershader actually has three different, conflicting meanings, unfortunately.

The classic usage is a single source shader which is specialized using #defines and compiled down to hundreds of shaders. This is what Christer uses in that blog post above (and what Aras does as well in his ubershader blog post).

Dolphin used it to mean a single source shader that used runtime branches to cover all the bases as a fallback while a specialized shader was compiled behind the scenes.

The even more modern usage now is a single source shader that only uses runtime branches to cover all the features, without any specialization behind the scenes, and that's what Dario means here.
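
To make the contrast concrete, a small illustrative sketch (Python; compile_shader() and the GLSL-ish source are made up, not any engine's actual code): meaning 1 stamps out specialized variants with #defines, while meanings 2 and 3 compile the branching source once and flip uniforms at runtime.

    # One ubershader source, used the different ways described above
    # (illustrative only; compile_shader() stands in for a real driver call).
    UBER_SOURCE = """
    #ifndef SPECIALIZED
    uniform bool USE_FOG;        // runtime branches, decided per draw call
    uniform bool USE_SKINNING;
    #endif
    // ...body with `if (USE_FOG) {...}` and `if (USE_SKINNING) {...}`...
    """

    def compile_shader(source):
        return source            # placeholder: pretend this hands the source to the driver

    # Meaning 1 (classic): specialize via #defines, producing many small shaders.
    specialized = {
        (fog, skin): compile_shader(
            f"#define SPECIALIZED 1\n#define USE_FOG {fog}\n"
            f"#define USE_SKINNING {skin}\n" + UBER_SOURCE)
        for fog in (0, 1) for skin in (0, 1)
    }

    # Meanings 2 and 3 (Dolphin fallback / always-on): compile the branching
    # version once and set USE_FOG / USE_SKINNING as uniforms at runtime.
    runtime_branching = compile_shader(UBER_SOURCE)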

zeta0134 a day ago | parent | next [-]

Ah, then my correction probably does not stand, and I'll need to look deeper into it. Thanks for the explanation! This jargon really gets out of hand at times. :P But I don't mind being wrong if I learn from it.

Y_Y 2 days ago | parent | prev [-]

Thank you. This is my favorite kind of comment. There are lots of "technical" terms which manage to acquire similar but distinct uses (today I was contending with "agent" and "prompt"). Keeping them straight in your own head, and recognizing when others don't is as valuable as it is unappreciated.

derefr 2 days ago | parent | prev [-]

So how about:

1. A global, networked shader cache — whenever any instance of the emulator encounters a new shader, it compiles it, and then pushes the KV-pair (ROM hash, target platform, console shader object-code hash)=(target-platform shader object-code) into some KV server somewhere (see the rough client-side sketch after item 2); and some async process comes along periodically to pack all so-far-submitted KV entries with a given (ROM hash, target platform) prefix into shader-cache packfiles. On first load of a game, the emulator fetches the packfile if it exists, and loads the KV pairs from it into the emulator's local KV cache. (In theory, the emulator could also offer the option to fetch global-shader-cache-KV-store "WAL segment" files — chunks of arbitrary global-shader-cache KV writes — as they're published, say every 15 minutes. Or KV entries for given (ROM hash, target) prefixes could be put into message-queue topics named after those prefixes, to which running instances of the emulator could subscribe. These optimizations might be helpful when e.g. many people are playing a just-released ROMhack, where no single person has yet run through the whole game to get it into the cache. Though, mind you, the ROMhack's shaders could already be cached into the global store before release, if the ROMhacker used the emulator during development... or if they knew about this, and were considerate enough to use some tool created by the emulator dev to explicitly compile + submit their raw shader project files into the global KV store.)

2. Have the emulator (or some separate tool) "mine out" all the [statically-specified] shaders embedded in the ROM, as a one-time process. (Probably not just a binwalk, because of arbitrary compression. Instead, think: a concolic execution of the ROM that looks for any call to the "load main-memory region into VRAM as shader" GPU instruction — where there is a symbolically-emulated memory with regions that have either concrete or abstract values. If the RAM region referenced in this "load as shader" instruction is statically determinable — and the memory in that region has a statically-determinable value on a given code-path — then capture that RAM region.) Precompile all shaders discovered this way to create a "perfect" KV cachefile for the game. Publish this into a DHT (or just a central database) under the ROM's hash. (Think: OpenSubtitles.org)
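
For approach #1, the client side might look something like this (a hedged Python sketch; the endpoint URLs, packfile format, and key layout are all hypothetical, just to show the fetch/submit flow):

    import hashlib, json, urllib.request

    CACHE_SERVER = "https://shadercache.example"     # hypothetical service

    def pack_key(rom_hash, target_platform):
        return f"{rom_hash}/{target_platform}"

    def fetch_packfile(rom_hash, target_platform):
        """On first load, pull the packed {console-shader-hash: host object code}
        entries published so far for this (ROM, target) prefix, if any."""
        url = f"{CACHE_SERVER}/packs/{pack_key(rom_hash, target_platform)}"
        try:
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
        except Exception:
            return {}

    def submit_entry(rom_hash, target_platform, console_bytecode, host_blob_b64):
        """When this instance compiles a shader it hasn't seen, push the KV pair;
        a server-side job later folds submissions into fresh packfiles."""
        key = hashlib.sha1(console_bytecode).hexdigest()
        body = json.dumps({"key": key, "value": host_blob_b64}).encode()
        req = urllib.request.Request(
            f"{CACHE_SERVER}/kv/{pack_key(rom_hash, target_platform)}",
            data=body, headers={"Content-Type": "application/json"}, method="PUT")
        urllib.request.urlopen(req)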

Mind you, I think the best strategy would actually combine the two approaches — solution #2 can virtually eliminate stutter with a single pre-processing step, but it doesn't allow for caching of dynamically-procedurally-generated shaders. Solution #1 still has stutter for at least one player, one time, for each encountered shader — but it handles the case of dynamic shaders.

tgtweak a day ago | parent [-]

The issue with "mining" the ROM for shaders is that they're not defined in a consistent way across games. Games using the same base engine could have their shaders stored and referenceable in a consistent way - but that would not be a catch-all solution for all games, which means you still need a fallback. Most ROMs are obfuscated and it's not possible, at least not currently, to "know" all the shaders a game uses before the hardware calls to load them come through the game binary while it's running.

The best experience so far is downloading an additional shader cache alongside the ROM - in some ROM formats this can even be bundled with the ROM itself, where it acts like a dictionary and can be loaded by the emulator directly rather than added as a "mod" for that specific game. Adding this to a DHT-type network for "shader sharing" would be great but might open the door to some abuse (shaders run at the hardware level and there are examples of malicious shaders out there) - plus you'd be exposing the games you're playing to the DHT network.

Anyway - Just a succinct example of the level of effort that goes into making an emulator "just work".

derefr a day ago | parent [-]

> The issue with "mining" the rom for shaders is they're not defined in a consistent way across the games.

I don't want to be snippy, but — I don't think you understood the rest of the paragraph you're attempting to rebut here, since this is exactly (part of) what I said myself. (I wouldn't blame you if you didn't understand it; the concept of "concolic execution" is probably familiar to maybe ~50,000 people worldwide, most of them doing capital-S Serious static analysis for work in cryptanalysis, automated code verification, etc.)

To re-explain without the jargon: you wouldn't be "mining" the shaders as data-at-rest; rather, you'd be "running" the ROM under a semi-symbolic (symbolic+concrete — concolic!) interpreter, one that traverses all possible code-paths "just enough" times to see all "categories of states" (think: a loop's base case vs its inductive case vs its breakout case). You'd do this so that, for each "path of states" that reaches an instruction that tells the console's GPU "this here memory, this is a shader now", the interpreter could:

1. look back at the path that reached the instruction;

2. reconstruct a sample concrete execution trace (i.e. with all irrelevant, non-branch-determinant values fixed to placeholders); and then

3. concretely "replay" that execution trace, using the emulator itself (but with no IO peripherals hooked up, and at maximum speed, and with no need for cycle-accurate timed waits since inter-core scheduling is pre-determined in the trace);

4. which would, as a side-effect, "construct" each piece of shader object-code into memory — at a place where the interpreter is expecting it, given the symbolic "formula" node that the interpreter saw passed into the instruction ("formula node": an AST subtree built out of SSA-instruction branch-nodes and static-value leaf-nodes, referenced by a versioned Single-Static-Information cell, aliasable into slices within CPU-register ADTs, or into a layered+sparse memory-cell interval-tree ADT);

5. so that the interpreter can then pause concrete emulation at the same "load this as a shader" instruction; reach into the emulator's memory where the "formula node" said to look; and grab the shader object-code out.

If you know how the AFL fuzzer works, you could think of this as combining "smart fuzzing" (i.e. looking at the binary and using it to efficiently discover the "constraint path" of branch-comparison value-ranges that reaches each possible state); with a graph-path-search query that "directs" the fuzzer down only paths that reach states we're interested in (i.e. states that reach a GPU shader load instruction); and with an observing time-travelling debugger/tracer connected, to then actually execute the discovered "interesting" paths up to the "interesting" point, to snapshot the execution state at that point and extract "interesting" data from it.
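
If it helps, here's what the driver loop for that might reduce to, as a very loose skeleton (Python; the explorer/emulator objects and every method on them are hypothetical stand-ins for a real symbolic-execution engine and a headless emulator core, not any existing tool's API):

    import hashlib

    def mine_shaders(rom, emulator, explorer):
        """Loose skeleton of steps 1-5 above; all interfaces are hypothetical."""
        shaders = {}
        # Steps 1-2: for each category of states that reaches the GPU's
        # "load this memory region as a shader" instruction, take one
        # representative path and concretize it into a replayable trace
        # (don't-care inputs pinned to placeholder values).
        for path in explorer.paths_reaching(rom, event="gpu_shader_upload"):
            trace = path.concretize(placeholders=True)
            # Step 3: replay the trace on the ordinary emulator, headless,
            # no I/O peripherals, no cycle-accurate waits.
            state = emulator.replay(trace, headless=True)
            # Steps 4-5: the replay pauses at the upload instruction, so the
            # shader object-code now sits at the address the path's "formula"
            # pointed at; read it out and key it by content hash.
            addr, size = path.shader_region()
            blob = state.read_memory(addr, size)
            shaders[hashlib.sha1(blob).hexdigest()] = blob
        return shaders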

---

Or, at least, that's how it works in the ideal case.

(In the non-ideal case, it's something you can't resolve because the "formula" contains nodes that reference things the interpreter can't concretely emulate without combinatorial state-space explosion — e.g. "what was loaded from this save file created by an earlier run of the game process"; or maybe "what could possibly be in RAM here" when the game uses multiple threads and IPC, and relies on the console OS to pre-emptively schedule those threads, so that "when a message arrives in a thread's IPC inbox" becomes non-deterministic. So this wouldn't work for every game. But it could work for some. And perhaps more, if you can have your concolic interpreter present a more-stable-than-reality world by e.g. "coercing processors into a fake linear clock that always pulses across the multiple CPU cores in a strict order each cycle"; or "presenting a version of the console's OS/BIOS that does pre-emptive thread scheduling deterministically"; etc.)

jamesgeck0 a day ago | parent [-]

Many games have unique shaders all the way up to the end of the game. Reconstructing the path of states required to get an individual sample from the end of a 100-hour JRPG in this manner seems like it would be hilariously computationally expensive?