A file format uncracked for 20 years(landaire.net)
247 points by todsacerdoti 11 days ago | 43 comments
amiga386 3 hours ago | parent | next [-]

So if I understand this right:

* common.lin contains filenames, so that filename-expansion code in the game can work. But the offsets and sizes associated with the files are garbage

* <filename>.lin contains a stream of every byte read from every file, while loading the level <filename>. The stream is then compressed in 16k chunks by zlib.

* There is no indication in that stream of which real file was being read, nor the length of each read, nor what seeking was done (if any). All that metadata is gone.

* The only way to recover this metadata is to run the game code and log the exact sequence of file opens, seeks, reads.

* Alternatively, extract all that Unreal object loader code from the game and reimplement it yourself, so that you can let the contents of the stream drive the correct reading of the stream. The code should be deterministic.
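
To make the "compressed in 16k chunks by zlib" point concrete, here is a minimal Python sketch of inflating such a stream. It assumes the chunks are complete zlib streams stored back to back with no extra framing; the real .lin files may well carry per-chunk headers, in which case those would have to be skipped first. The function names are made up for illustration.

```python
import zlib


def compress_in_chunks(data: bytes, chunk_size: int = 16 * 1024) -> bytes:
    """Compress data as independent zlib streams, one per chunk (test helper)."""
    return b"".join(
        zlib.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    )


def decompress_concatenated(blob: bytes) -> bytes:
    """Inflate zlib streams stored back to back and join their output.

    Assumption: no framing between streams. Each decompressobj stops at the
    end of its own stream and hands back the remainder in unused_data.
    """
    out = bytearray()
    while blob:
        inflater = zlib.decompressobj()
        out += inflater.decompress(blob)
        out += inflater.flush()
        if not inflater.eof:
            raise ValueError("truncated zlib stream")
        blob = inflater.unused_data
    return bytes(out)
```

Note that this only recovers the flat byte stream; it says nothing about which file each byte came from, which is exactly the metadata described above as missing.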

This sounds pretty hellish for the game developers, and I bet the debug versions of their game _ignored_ <filename>.lin and used the real source files, but _wrote_ <filename>.lin immediately after every load... any change to the Unreal objects could alter how they were read, and if the data streamed didn't perfectly match up with what was in the real files, you'd be toast.

It reminds me of the extreme optimisation that Farbrausch did for .kkrieger -- they built a single binary, then ran and played it under instrumentation, and _any_ code path that wasn't taken was deleted from the binary to make it smaller. They forgot to take any damage in that playthrough, so all the code that applies damage to the player was deleted. Oops!

maccard 12 minutes ago | parent | next [-]

Unreal has gotten better since then, but you still need the actual game code to load the asset correctly. It’s a major pain in the ecosystem.

cedws 2 hours ago | parent | prev | next [-]

Could you explain a bit more about that code path optimisation? Why wouldn’t the compiler eliminate dead code? It seems like a very haphazard blunt force optimisation method.

anamexis 2 hours ago | parent [-]

The compiler can’t determine which code paths are never used in practice at runtime.

cedws an hour ago | parent [-]

I see. It sounds like it would be a source of countless headaches, and I don’t think I’d ever want to do something that risks breaking the program like that, but I guess that’s why I’m not a game programmer.

dagmx 16 minutes ago | parent [-]

Kkrieger specifically is a demo scene app with the goal of being as small as possible. It’s not indicative of overall game development practices as a whole.

debugnik 3 hours ago | parent | prev [-]

About .kkrieger's trimming, I had only heard that they forgot to press up on the main menu so it doesn't work, not about the damage thing.

mcdeltat 8 hours ago | parent | prev | next [-]

> Compressing data means you save space on the disc... If you conveniently ignore the fact that common.lin is duplicated in each map's directory and is the same for every map I tested, which kinda negates part of this.

This is an interesting thing I've noticed about game dev, it seems to sometimes live in a weird space of optimisation requirements vs hackiness. Where you'll have stuff like using instruction data as audio to save space, but then forget to compile in release mode or something. Really odd juxtaposition of near-genius-level optimisation with naive inefficiency. I'm assuming it's because, while there may be strict performance requirements, the devs are under the pump and there's so much going on that silly stuff ends up happening?

bargainbin 3 hours ago | parent | next [-]

Exactly that - once it’s shipped it’s shipped. Doesn’t matter if the code is “clean” or “maintainable” or whatever.

The longer it’s not released for sale, the more debt you’re incurring paying the staff.

I’ve worked with a few ex-game devs and they’re always great devs, specifically at optimising. They’re not great at the “forward maintainability” aspect though because they’ve largely never had experience having to do it.

richardfey 7 hours ago | parent | prev | next [-]

This might be an optimisation to avoid disc seeks on wildly far apart distances, which would introduce more latency.

landr0id 7 hours ago | parent | next [-]

For this file in particular I'm unsure.

common.lin is a separate file which I believe is supposed to contain data common to all levels _before_ the level is loaded.

There's a single exported object that all levels of the game have called `MyLevel`. The game attempts to load this and it triggers a load of the level data and all its unique dependencies. The common.lin file is a snapshot of everything read before this export. AFAIK this is deterministic so it should be the exact same across all maps but I've not tested all levels.

When loading a level, the training level for instance contains two distinct parts. Part 1 of the map loads 0_0_2_Training.lin, and the second part loads 0_0_3_Training.lin. These parts are completely independent -- loading the second part does not require loading the first. It does a complete re-launch of the game using the Xbox's XLaunchNewImage API, so all prior memory I think should be evicted but maybe there's some flag I'm unaware of. That is to say, I'm fairly confident they are mutually exclusive.

So basically the game launches, looks in the "Training" map folder for common.lin, opens a HANDLE, then looks for whichever section it's loading, grabs a HANDLE, then starts reading common.lin and <map_part>.lin.

There's multiple parts, but only one common.lin in each map folder. So no matter what it's not going to be laid out in a contiguous disc region for common.lin leading into <map_part>.lin. Part 1 may be right after common.lin, but if you're loading any other part you'll have to make a seek.

I don't know enough about optical media seek times to say if semi-near locality is noticeably better for the worst case than the files being on complete opposite sector ranges of the disc.

richardfey 6 hours ago | parent | next [-]

They were doing this kind of optical media seek times tests/optimisations for PS1 games, like Crash Bandicoot. You certainly have more and better context than me on this console/game, I just mentioned it in case it wasn't considered.

By the way, could the nonsensical offsets be checksums instead?

Nice reverse engineering work and analysis there!

ralferoo 5 hours ago | parent [-]

IIRC the average seek time across optical media is around 120ms, so ideally you want all reads to be linear.

I remember one game I worked on, I spent months optimising loading, especially boot flow, to ensure that every file the game was going to load was the very next file on the disk, or else the next file was an optionally loaded file that could be skipped (as reading and ignoring was quicker than seeking). For the few non-deterministic cases where order couldn't be predicted (e.g. music loaded from a different thread), I preloaded a bunch of assets up front so that the rest of the assets were deterministic.

One fun thing we often did around this era is eschew filenames and instead hash the name. If we were loading a file directly from C code, we'd use the preprocessor to hash the name via some complicated macros, so the final call would be compiled like LoadAsset(0x184e49da), but we'd still retain a run-time hasher for cases where the filename was generated dynamically. This seems like a weird optimisation, but actually avoiding the directory scan and filename comparisons can save a lot of unnecessary seeking / CPU operations, especially for multi-level directories. The "file table" then just became a list of disk offsets and lengths, with a few gaps because the hash table size was a little bigger than the number of files to avoid hash conflicts. Ironically, on one title I worked on we had the same modulo for about 2 years in development, and just before launch we needed to change it twice in a week due to conflicts!
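
A rough Python sketch of the hashed file table idea, for anyone who wants to see the shape of it. The hash function (32-bit FNV-1a), the helper names and the modulo search are all assumptions for illustration; the comment does not say which hash or table layout was actually used, and the original did the hashing at compile time with C macros rather than at run time.

```python
def fnv1a32(name: str) -> int:
    """32-bit FNV-1a hash of a filename (assumed hash, not the original one)."""
    h = 0x811C9DC5
    for byte in name.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def build_file_table(files, modulo):
    """Place (offset, length) at slot hash % modulo; raise on a conflict.

    files is an iterable of (name, offset, length) tuples. Unused slots stay
    None, which is where the "few gaps" in the table come from.
    """
    table = [None] * modulo
    for name, offset, length in files:
        slot = fnv1a32(name) % modulo
        if table[slot] is not None:
            raise ValueError(f"hash conflict in slot {slot} for {name!r}")
        table[slot] = (offset, length)
    return table


def find_modulo(files, limit=1 << 16):
    """Smallest table size, at least the file count, with no conflicts."""
    files = list(files)
    for modulo in range(max(1, len(files)), limit):
        try:
            build_file_table(files, modulo)
            return modulo
        except ValueError:
            continue
    raise ValueError("no conflict-free modulo below limit")


def lookup(table, name_hash):
    """Find a file by its precomputed hash; no filename comparison needed."""
    return table[name_hash % len(table)]
```

Adding a single new file can collide with an existing slot, which forces a new modulo. That is the failure mode described above just before launch.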

rswail 4 hours ago | parent [-]

This reminds me of Mel:

    Mel's job was to re-write
    the blackjack program for the RPC-4000.
    (Port?  What does that mean?)
    The new computer had a one-plus-one
    addressing scheme,
    in which each machine instruction,
    in addition to the operation code
    and the address of the needed operand,
    had a second address that indicated where, on the revolving drum,
    the next instruction was located.

https://users.cs.utah.edu/~elb/folklore/mel.html

oarsinsync 4 hours ago | parent | prev [-]

ISO9660 has support for something that resembles hard links - IE, a file can exist in multiple places in the directory structure, but always point to the same underlying data blocks on disc.

I think XISO is derived from ISO9660, so may have the same properties?

Cthulhu_ 2 hours ago | parent | prev [-]

Definitely could be a factor; I know of a programmer who works at a Dutch company that mainly does ports of AAA games (he may be on here too, hello!), he once wrote a comment or forum post about how he developed an algorithm to put data on a disk in the order that it was needed to minimize disk seeks. Spinny disks benefit greatly from reading data linearly.

rusk 8 hours ago | parent | prev | next [-]

There was a running theme in Mythic Quest about the engineers sweating over the system while monetisation just went and bolted on a casino.

This also happened in GTA5 [0]: there was a ridiculous loading glitch that was quite well documented on here a while ago. Also a monetisation bolt-on.

So you have competing departments, one of which must justify itself by producing a heavily optimised system, and another which is licensed to generate revenue at any cost...

[0] https://news.ycombinator.com/item?id=26296339

ramses0 an hour ago | parent | next [-]

There was one similar issue with DOOM framerate, I'm assuming an intern got tasked with adding the "blink the LED on the fancy mouse" code (due to a marketing partnership) and it absolutely _trashes_ the framerate!

https://www.reddit.com/r/Doom/comments/bnsy4o/psa_deactivate...

avereveard 7 hours ago | parent | prev [-]

There's also relative pain scales

Loading happens once per session and is less painful than frame stuttering all game, for example, so given a tight deadline one would get prioritized over the other

Orygin 6 hours ago | parent | next [-]

I tried playing GTAO when it was free, and oh boy. Loading for 10 minutes, arrive into the game and see you're not with your friends. So 10 more minutes to load into their server. Then you start a mission and 10 more minutes of loading. The server disconnected? 10 minutes load to go back without your friend. Join your friend? guessed it: 10 more minutes of loading. For a billion dollar game, it's insane I spent more time loading than playing the game. Imagine how many more $$ they could have gotten if players could double their play time.

rusk 5 hours ago | parent [-]

Put me right off the game.

GranPC 7 hours ago | parent | prev [-]

Loading in GTA Online absolutely does not happen once per session. It happens before and after every mission and activity. I am not sure whether it's a full load/was also affected by that bug, but I can certainly tell you that around 20% of my GTAO "playtime" consisted of staring at a load screen.

monero-xmr 8 hours ago | parent | prev [-]

And passion to deliver. Engineers will kill themselves for a game release for no extra money and far less salary than their abilities would demand at a bigcorp. But they love it so they do it, and hack as best they can, to get their art into the world.

LunicLynx 12 hours ago | parent | prev | next [-]

The Xbox had strong requirements for loading times. This is probably a linear (lin) record of how the data was loaded unoptimized from the disk. And just written to a file.

So in this file seek doesn’t do anything because seek kills the requirement of 45 sec per loading screen.

Instead the logic is as follows: check if a .lin file exists. Yes: open a handle to it and only read from it with fread, whatever currently is at the current file position. No: while reading any file write the read bytes to a .lin file in the order they are read.

This gives a highly optimized .lin file which can be read from the disk into memory, without creating a better dedicated loading mechanism.

So if you really would like to unpack this: the first file being read is most likely the key, as it dictates what comes next. If it is a level model, then the position of the player in it might affect which other files to load etc.

In short it’s not a file format in the classical sense, it’s a linear stream of game data.
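
The record-or-replay logic described above can be sketched in a few lines of Python. This is a guess at the general shape, not the game's code: the class name and the `read(path, offset, size)` signature are hypothetical, and compression is left out.

```python
import os


class LinearLoader:
    """Record every read into a .lin file, or replay from it if it exists."""

    def __init__(self, lin_path):
        self.lin_path = lin_path
        # If the .lin file is already there, we only ever read from it.
        self.replaying = os.path.exists(lin_path)
        self._lin = open(lin_path, "rb" if self.replaying else "wb")

    def read(self, path, offset, size):
        if self.replaying:
            # path and offset are ignored: the next bytes in the linear
            # stream are assumed to be the ones the caller wants.
            data = self._lin.read(size)
            if len(data) != size:
                raise EOFError("linear stream ended early")
            return data
        # Recording: do the real seek and read, then append to the stream.
        with open(path, "rb") as source:
            source.seek(offset)
            data = source.read(size)
        self._lin.write(data)
        return data

    def close(self):
        self._lin.close()
```

The replay branch is only correct if the loader issues exactly the same sequence of reads as it did while recording, which is why the stream is so hard to unpack without the original loading code.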

landr0id 12 hours ago | parent [-]

>This is probably a linear (lin) record of how the data was loaded unoptimized from the disk.

Yes, it's buried deep in the details but it's basically just every byte read being written in a linear stream to an output file.

I don't know which stage of grief this is, but since I wrote this blog post I've now ported my IDA debugger scripts to a dedicated QEMU plugin which logs all I/O operations and some other metadata. I tried using this technique to statically rewrite files by basically following DataLoad (with unique identifier) -> Seek -> Read patterns.

There's some annoying nuance to deal with (like seeking backwards implying that data was read, tested, then discarded) but I got this working. Unfortunately some object types encode absolute offsets in them that need to be touched up, so a couple of object types fail to load correctly in external tooling and the PC build of the game.

Now I'm just using this data to completely reimplement the game engine's loading logic from scratch using a custom IO stream which checks the incoming IO operation (seek/read) against what I logged from the game engine to ensure a 1:1 match with how the game loads data.
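
For readers wondering what such a checking stream might look like, here is a small Python sketch. It assumes the logged trace is a list of `("seek", offset)` and `("read", size)` tuples; the actual log format from the QEMU plugin is not described in the thread, and the class and exception names are hypothetical.

```python
import io


class TraceMismatch(Exception):
    """Raised when the reimplemented loader diverges from the logged trace."""


class VerifyingStream:
    """Wrap a binary stream and check each operation against a logged trace."""

    def __init__(self, raw, expected_ops):
        self._raw = raw
        self._expected = list(expected_ops)
        self._index = 0

    def _check(self, op):
        if self._index >= len(self._expected):
            raise TraceMismatch(f"extra operation {op} after end of trace")
        want = self._expected[self._index]
        if op != want:
            raise TraceMismatch(
                f"op #{self._index}: loader did {op}, trace has {want}"
            )
        self._index += 1

    def seek(self, offset):
        self._check(("seek", offset))
        return self._raw.seek(offset)

    def read(self, size):
        self._check(("read", size))
        return self._raw.read(size)

    def finished(self):
        """True once every logged operation has been matched."""
        return self._index == len(self._expected)
```

A parser written against this wrapper fails at the first operation that differs from the trace, rather than silently reading garbage later on.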

WatchDog 10 hours ago | parent [-]

Have you done any analysis of what proportion of the lin file is being read in total?

You stated in the blog post, that your goal is to try and find unused content, however if as described, the file is just a record of how the game loads the data, then it won't contain any hidden unused assets, since unused assets would never have been read from the original unoptimised file, and thus never written to this optimized file.

landr0id 10 hours ago | parent [-]

I agree and don't think there's any unused data. For `common.lin` for instance my parser reads it basically to the end and there's some small amount of data that's unused. I never actually quantified the amount but I'm fairly certain it's <100 bytes. Probably a bug in there.

The goal post has shifted so far beyond my original intention at this point. The devs working on the EnhancedSC mod have a strong desire to port some Xbox assets/maps to PC, so I'm mostly doing it at this point as an attempt to help them out.

*On second thought, there's definitely some unused scripting functionality. Script functions which are unused are still included in their parent classes and are loaded if the parent object is loaded, even if never directly called. Whether or not any of this is interesting is another story.

Textures and models though will definitely not be present unless they're used in some non-visible way.

blixt 5 hours ago | parent | prev | next [-]

The quirks of field values not matching expectations reminds me of a rabbit hole when I was reverse engineering the Starbound engine[1] and eventually figured out the game was using a flawed implementation of SHA-256 hashing and had to create a replica of it [2]. Originally I used Python [3] which is a really nice language for reverse engineering data formats thanks to its flexibility.

[1] Starbounded was supposed to become an editor: https://github.com/blixt/starbounded

[2] https://github.com/blixt/starbound-sha256

[3] https://github.com/blixt/py-starbound

tapia 7 hours ago | parent | prev | next [-]

I'm always amazed by people doing reverse engineering of some country formats. There's a binary format that I've been wanting to reverse engineer, but I don't know exactly how to start. It's for the result file of a proprietary finite element program. Could anyone point me to some resources and also what are the basics that I need to learn to achieve this?

LunicLynx 4 hours ago | parent | next [-]

The way I do it is looking for markers. Most files have some kind of magic number in the beginning. So these can be valuable to recognize.

The next part is always looking into the values of 32 bit or 64 bit integers: if their value is higher than 0 but less than the file's size they often are offsets into the file, meaning they address specific parts.
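
As a rough illustration of that offset heuristic, this Python sketch lists every aligned 32-bit little-endian value that is greater than 0 and smaller than the file size. Little-endian byte order and 4-byte alignment are assumptions; a big-endian format would need `">I"` instead, and the function name is made up.

```python
import struct


def find_offset_candidates(data: bytes, align: int = 4):
    """Return (position, value) pairs that could be offsets into the file."""
    size = len(data)
    hits = []
    # Stop early enough that a full 4-byte value can always be unpacked.
    for pos in range(0, size - 3, align):
        (value,) = struct.unpack_from("<I", data, pos)
        if 0 < value < size:
            hits.append((pos, value))
    return hits
```

Expect plenty of false positives (small counts and flags also pass the test), so the hits are only starting points to check in a hex editor.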

Another recommendation is to understand what you are looking for. For games, you are most likely looking for meshes and textures. For meshes in 3D every vertex of a mesh is most likely represented by 3 floats / doubles. If you see clusters of 3 floats with sensible values (e.g. without an +/-E component) it's likely that you're looking at real floats.

When looking for textures it can help to adjust the view on the data to the same resolution as the data you're looking for. For example, if you are looking for an 8-bit alpha map with a resolution of 64 x 64 then try to get 64 bytes in a row in your hex editor, you might be lucky to see the pattern show up.

For save games I can only reiterate what has been mentioned before. Look for unique specific values in the file as integers. For example how much gold you have.
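
Searching for a known value like a gold amount is easy to script. The sketch below tries a few integer widths and both byte orders, since neither is known up front; the function name and the default widths are assumptions for illustration, and it only handles non-negative integers.

```python
def find_value(data: bytes, value: int, widths=(2, 4, 8)):
    """Return (offset, width, byteorder) for every encoding of value found."""
    hits = []
    for width in widths:
        for order in ("little", "big"):
            try:
                needle = value.to_bytes(width, order)
            except OverflowError:
                # The value does not fit in this width; skip it.
                continue
            start = data.find(needle)
            while start != -1:
                hits.append((start, width, order))
                start = data.find(needle, start + 1)
    return hits
```

Picking an unusual amount (say 1234 gold rather than 100) keeps the number of accidental matches down.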

I used these techniques to reverse engineer:

* Diablo 2 save games

* World of Warcraft adt chunks

* .NET Assembly files (I would recommend reading the ECMA specification though)

* jade format of Beyond Good and Evil

Ah yes, invest in a good hex editor of course. For me Hex Workshop has been part of this journey.

tralarpa 6 hours ago | parent | prev | next [-]

There are two approaches (sometimes mixed):

(a) you reverse engineer the application writing or reading the file. Even without fully understanding the application it can give you valuable information about the format (e.g. "The application calls fwrite in a for loop ten times, maybe those are related to the ten elements that I see on the screen").

(b) you reverse engineer only the file. For example, you change one value in the application and compare the resulting output file. Or the opposite way: you change one value in the file and see what happens in the application when you load it.

ashdnazg 5 hours ago | parent | prev | next [-]

The bare basics are working with a hex editor and understanding data types - ints, floats, null-terminated strings, length-prefixed strings etc.

I'd recommend taking a well documented binary file format (Doom WAD file?), go over the documentation, and see that you manage to see the individual values in the hex editor.

Now, after you have a feel for how things might look in hex, look at your own file. Start by saving an empty project from your program and identifying the header, maybe it's compressed?

If it's not, change a tiny thing in the program and save again, compare the files to see what changed. Or alternatively change the file a tiny bit and load it.
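
The "change a tiny thing and compare" step is worth automating once the files get big. A minimal Python sketch, assuming both saves fit in memory; the function name is hypothetical.

```python
def changed_ranges(before: bytes, after: bytes):
    """Return (start, end) byte ranges, end exclusive, where the files differ.

    If the lengths differ, the extra tail is reported as a final range.
    """
    ranges = []
    start = None
    common = min(len(before), len(after))
    for i in range(common):
        if before[i] != after[i]:
            if start is None:
                start = i  # a new run of differing bytes begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(before) != len(after):
        ranges.append((common, max(len(before), len(after))))
    return ranges
```

If a one-field change in the program moves every later byte, the format probably stores variable-length data or is compressed, which is useful to know in itself.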

Write a parser and add things as you learn more. If the file isn't intentionally obfuscated, it should probably be just a matter of persevering until you can parse the entire file.

mrgaro 5 hours ago | parent | prev | next [-]

It helps tremendously if you have a programming background as usually the developers behind the original format didn't have any need to make things harder than they need to be. Because of this, you can often guess how the format works, aka. "If I was the original developer, how would I do this?"

iberator 7 hours ago | parent | prev [-]

> country format

Country ?! What's the meaning

fainpul 6 hours ago | parent [-]

Might be "binary format", autocorrected.

antonvs 20 minutes ago | parent [-]

That's really penguin

vivzkestrel 11 hours ago | parent | prev | next [-]

What about Splinter Cell Conviction? 15 years and nobody has figured out its map file format, .unr, which uses a custom Unreal Engine 2.x. It even has a tool that lets you unpack its UMD files: https://github.com/wcolding/UMDModTemplate The library on GitHub requires this tool, unumd: https://www.gildor.org/smf/index.php/topic,458.msg15196.html... The same tool also works for Blacklist. I would like to change the type of enemy spawned in the map but I cannot find any assistance on it. UEExplorer doesn't work because it is some kind of custom map file.

Dwedit 15 hours ago | parent | prev | next [-]

Lowercase 'x' is always your dead giveaway that it's ZLIB.
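
The 'x' is 0x78, the usual first byte of a zlib header (deflate with a 32K window). Plenty of ordinary text also contains an 'x', so a slightly stricter check uses the second byte too: per RFC 1950 the two header bytes, read as a big-endian 16-bit number, must be a multiple of 31. A small Python sketch (the function name is made up):

```python
def looks_like_zlib_header(data: bytes) -> bool:
    """Check the two-byte zlib header (CMF, FLG) described in RFC 1950."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    if cmf & 0x0F != 8:
        return False  # compression method must be 8 (deflate)
    if cmf >> 4 > 7:
        return False  # window size field above 7 is not allowed
    # FLG carries check bits that make the 16-bit header divisible by 31.
    return (cmf * 256 + flg) % 31 == 0
```

This passes for the familiar pairs 78 01, 78 5E, 78 9C and 78 DA, and rejects most cases where an 'x' just happens to be followed by arbitrary data.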

tomaytotomato 4 hours ago | parent | prev | next [-]

Loved Splinter Cell

I wonder if any of the original devs will stumble upon the author's article and then remember why they did those weird file offsets.

There was a difference in the PC and Xbox versions, so it will be interesting to find out if the author sees any snippets or missing game assets in the Xbox version.

noufalibrahim 9 hours ago | parent | prev | next [-]

This is an interesting post. I've been spending time on a hobby project[1] that requires reading some old archives and game asset files. I didn't have to do any reverse engineering since it's already done by others and documented on the moddingwiki. However, I did implement the algorithms myself to work with the assets.

It's an interesting rabbit hole to go down into and this post makes me appreciate the way in which this kind of forensic analysis is done.

1: https://eye-of-the-gopher.github.io/

rawling 15 hours ago | parent | prev | next [-]

Weird that the one of these with no interaction got bumped rather than https://news.ycombinator.com/item?id=45842851

hcs 12 hours ago | parent [-]

I saw the other when it was first posted so it must have made the front page (or second page)

harrylepotter 12 hours ago | parent | prev [-]

ironically this was the game that enabled the savegame exploit with the bert and ernie fonts if i recall correctly