abixb 7 hours ago

As someone with a barebones understanding of "world models," how does this differ from sophisticated game engines that generate three-dimensional worlds? Is it simply the use of transformer architectures to generate the 3D world vs. a static/predictable script as in game engines (learned dynamics vs. deterministic simulation mimicking 'generation')? Would love an explanation from SMEs.

whizzter 7 hours ago

Games are still mostly polygon based due to tooling (even Unreal's Nanite is a special variation on handling polygons). Some engines have tried voxels (Teardown; Minecraft generates polygons and so falls into the previous category as far as rendering goes) or even implicit surfaces built by composing SDF-style primitives (Dreams on PlayStation and, more recently, unbound.io).
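
For readers unfamiliar with the SDF approach: the scene is just a function returning signed distance to the nearest surface, and primitives compose with simple min/max. A minimal sketch (the shapes and names are illustrative, not from any engine):

    import math

    def sphere_sdf(p, center, radius):
        # Signed distance: negative inside the sphere, positive outside.
        return math.dist(p, center) - radius

    def box_sdf(p, center, half_extents):
        # Signed distance to an axis-aligned box.
        q = [abs(p[i] - center[i]) - half_extents[i] for i in range(3)]
        outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))
        inside = min(max(q[0], q[1], q[2]), 0.0)
        return outside + inside

    def scene_sdf(p):
        # "Composing primitives": union is min(); smooth blends use a soft-min.
        return min(sphere_sdf(p, (0.0, 1.0, 0.0), 1.0),
                   box_sdf(p, (0.0, -0.5, 0.0), (2.0, 0.5, 2.0)))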

All of these have fairly "exact" representations, and generation techniques are also often fairly "exact", trying to create worlds that won't break physics engines (a big part) or rendering engines. These are often hand-crafted algorithms, but nothing has really stopped neural networks from being used at a higher level.

One important detail in most generation systems in games is that they are built to be controllable, so they can work with game logic (think of how Minecraft generates the world to include biomes, villages, etc.) or be more or less artist-controllable.
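
Concretely, "controllable" usually means generation is a pure function of a seed and coordinates, so game logic can query it deterministically. A toy sketch (the hashing scheme and biome list are made up for illustration, not Minecraft's actual code):

    import hashlib

    def biome_at(seed: int, x: int, z: int) -> str:
        # Same seed + coordinates always yield the same biome, so village
        # placement, loot tables, etc. can be layered on top reliably.
        h = hashlib.sha256(f"{seed}:{x // 64}:{z // 64}".encode()).digest()
        return ["plains", "desert", "forest", "mountains"][h[0] % 4]

    # Game logic can depend on the output without storing the whole world:
    if biome_at(seed=42, x=1000, z=-350) == "desert":
        pass  # e.g. spawn a desert village here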

3D scanning has often relied on point clouds, but these were heavy, full of holes, etc., and have long been infeasible for direct rendering, so many methods were developed to turn them into decent polygon meshes.

NeRFs and Gaussian splatting (GS) started appearing a few years back. These are more "approximate" and ignore polygon generation entirely, instead representing the world as a learned neural field (NeRF) or as a fuzzy point cloud (GS). Visually they have been impressive, since they capture "real" images well.
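
For a sense of what a GS scene actually is: a large list of fuzzy, oriented blobs that get depth-sorted and alpha-blended each frame. A simplified record (real implementations store covariance as scale + rotation and color as spherical harmonics):

    from dataclasses import dataclass, field

    @dataclass
    class Splat:
        position: tuple[float, float, float]          # blob center in world space
        scale: tuple[float, float, float]             # per-axis extent
        rotation: tuple[float, float, float, float]   # orientation quaternion
        opacity: float                                # alpha used when blending
        sh_coeffs: list[float] = field(default_factory=list)  # view-dependent color

    # A scene is millions of these, sorted and blended per frame --
    # no polygons, no connectivity, hence "approximate" compared to a mesh.
    scene: list[Splat] = []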

This system is built on GS, presumably because it meshes fairly well with neural-network token and diffusion techniques for encoding inputs (images, text).

They do mention mesh exports (there has been some research into polygon generation from GS).

If the system scales to huge worlds, this could change game dev, and the control methods suggest some aim in that direction. But it would probably require more control and world/asset management, since you need predictability with existing assets to produce anything long term (same as with code agents).

ehnto an hour ago

Your latter point is what makes me think this doesn't have comprehensive legs, just niche usage.

A typical game has thousands of hand-placed nodes in 3D space that do things like place lights, trigger story beats, account for physics and collisions, etc. That wouldn't change with Gaussian splats, but if you needed to edit the world, then even with deterministic generation the whole world might change, and all your gameplay nodes would now be misplaced.

That doesn't matter for some games, but I think it does matter for most.

AlexisArgyriou an hour ago

You could in theory combine point clouds and Nanite: cull sub-pixel points and generate geometry on the fly by filling the voids between the remaining points with polygons. The main issue is bandwidth: GPUs can barely handle Nanite as it is, and this would be at least an order of magnitude more complex to do at runtime. Nanite also does a lot of offline precomputation, storing intermediate models, etc.
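
Back-of-the-envelope version of the sub-pixel cull under a pinhole camera model (the point attributes here are assumed for illustration, not from any real engine):

    def projected_radius_px(world_radius: float, depth: float,
                            focal_px: float) -> float:
        # Pinhole projection: screen-space radius shrinks linearly with depth.
        return world_radius * focal_px / depth

    def cull_subpixel(points, focal_px: float):
        # Keep only points that still cover at least ~1 pixel; the gaps left
        # by culled points would then be bridged with runtime-generated polygons.
        return [p for p in points
                if projected_radius_px(p.radius, p.depth, focal_px) >= 1.0]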

mountainriver 7 hours ago

The model predicts what the state of the world will look like after a given action.

Beyond entertainment, world models can be used to train robots in simulation, and they allow an agent to imagine potential trajectories before acting.
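
In interface terms, that "imagining trajectories" looks roughly like this (a sketch of the general idea, not Marble's or anyone's actual API):

    def rollout(world_model, state, policy, horizon: int):
        # Imagine a trajectory without touching the real world: repeatedly ask
        # the model "what would the world look like after this action?"
        trajectory = [state]
        for _ in range(horizon):
            action = policy(state)
            state = world_model.predict(state, action)  # learned dynamics
            trajectory.append(state)
        return trajectory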

echelon 7 hours ago

Marble is not that type of world model. It generates static Gaussian Splat assets that you can render using 3D libraries.

ghayes 7 hours ago

Whenever I play with models like this (including the demos on this page), the movement in the world always feels like a dolly zoom [0]. Things in the distance tend to stay in the distance even as the camera moves toward them, and only the local area changes features.

[0] https://en.wikipedia.org/wiki/Dolly_zoom

echelon 7 hours ago

This "world model" is Image to Gaussian Splat. This is a static render that a web-based Gaussian Splat viewer then renders.

Other "world model"s are Image + (keyboard input) to Video or Streaming Images, that effectively function like a game engine / video hybrid.