DannyBee 4 hours ago

As someone who has reverse engineered hundreds of random file formats of all kinds over the years, I think the comment that suggests understanding the code is generally spot on.

You can basically divide the world into read/write/write-only formats and read-only formats.

For read/write/write-only formats, usually the in-memory data structures were written first, and then the serialization/deserialization code. So it is almost always more useful to see how the code works than to try to figure out what random bytes in the file mean. A not insignificant percentage of the time, the serialization/deserialization code is fairly straightforward - read some bytes, maybe decompress them, maybe checksum them and compare to a checksum field, shove them in the right place in a memory structure or create a class from them, move on.
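To make that "read, decompress, checksum, populate" shape concrete, here is a minimal sketch in Python. The format is entirely made up for illustration (the `DEMO` magic, the header layout, and the `Record` fields are all assumptions, not any real program's format); the point is only that the in-memory structure comes first and the reader mirrors the writer step by step.

```python
import struct
import zlib
from dataclasses import dataclass

# Hypothetical layout, invented for this example:
#   4 bytes  magic b"DEMO"
#   4 bytes  little-endian length of the compressed payload
#   4 bytes  little-endian CRC-32 of the compressed payload
#   N bytes  zlib-compressed payload: u16 name length, name (UTF-8), u32 score
HEADER = struct.Struct("<4sII")


@dataclass
class Record:
    """The in-memory structure that exists before any file format does."""
    name: str
    score: int


def serialize(record: Record) -> bytes:
    name = record.name.encode("utf-8")
    payload = struct.pack("<H", len(name)) + name + struct.pack("<I", record.score)
    compressed = zlib.compress(payload)
    return HEADER.pack(b"DEMO", len(compressed), zlib.crc32(compressed)) + compressed


def deserialize(blob: bytes) -> Record:
    # Read some bytes.
    magic, length, expected_crc = HEADER.unpack_from(blob, 0)
    if magic != b"DEMO":
        raise ValueError("bad magic")
    compressed = blob[HEADER.size:HEADER.size + length]
    # Checksum them and compare to the checksum field.
    if zlib.crc32(compressed) != expected_crc:
        raise ValueError("checksum mismatch")
    # Decompress them.
    payload = zlib.decompress(compressed)
    # Shove the fields into the memory structure and move on.
    (name_len,) = struct.unpack_from("<H", payload, 0)
    name = payload[2:2 + name_len].decode("utf-8")
    (score,) = struct.unpack_from("<I", payload, 2 + name_len)
    return Record(name, score)
```

Reading `deserialize` tells you the whole layout in a dozen lines, which is usually much faster than staring at a hex dump of the compressed bytes.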

Different parts of the program may read different parts of a file, but again, usually a given part of the deserialization/serialization code is fairly understandable.

Read-only formats are scattershot. Lots of reasons. I'll just cover a few. First, because the code doesn't usually have both the writing and reading, you have less of a point of reference for what the reading code is doing. Second, they are not uncommonly memory mapped serializations of in-memory structures. But not necessarily even for the current platform. So it may even make perfect sense and be easy to understand on some platform, but on your platform, the code is doing weird conversions and such. This is essentially a variant of "the format was designed before the code". Lots and lots more issues.
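A small Python illustration of the "dumped from memory on some other platform" case. The struct layout here is hypothetical: a `u16 id` followed by a `u32 size`, written by a big-endian machine with natural alignment, so there are two padding bytes after `id`. Read with the right byte order it is trivial; read with the wrong one it looks like nonsense, which is why the reader on your platform ends up full of swaps and offset fixups.

```python
import struct

# Eight bytes as they might appear in the file (hypothetical example data):
# u16 id = 1, two alignment padding bytes, u32 size = 256, all big-endian.
raw = bytes([0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00])

# Interpreted the way the originating platform laid it out.
as_big_endian = struct.unpack(">HxxI", raw)

# Interpreted naively on a little-endian machine.
as_little_endian = struct.unpack("<HxxI", raw)

print(as_big_endian)     # (1, 256)
print(as_little_endian)  # (256, 65536)
```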

I still would start by trying to understand the deserialization code rather than the format directly in this case, but it often is significantly harder to get a handle on.

There are commonalities in some types of programs (e.g. you will find commonalities between games using the same engine), but if you are talking "in general", the above is the best "in general" I can give you.

One other tip - it is common to expect things to be logical and make sense - you can even see an example in this very article. Don't expect this.

For example: data fields that don't make sense or are broken, but the program doesn't use them so it doesn't matter; checksums that don't actually check anything; signed/verified files where the signing key is easily changeable; encryption where the key is hardcoded or stored in the file; you name it.

Most folks verify that their program works; they don't usually go look and verify that everything written/read makes any sense.

tptacek 24 minutes ago

It has been a minute since I routinely did this kind of work, but I have to mention this because it's fun:

You can do something in between reverse-engineering the code and reverse-engineering the format if you can instrument the reader: attach breakpoints on every basic block in the reader, load a file, take a baseline trace of what gets hit, then vary bytes in the file and diff the new trace against the baseline.

It's a pretty fun tool to write, too.
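A rough sketch of the trace-and-diff idea, in Python for brevity. Big assumptions: the "reader" is a toy Python function (`toy_reader` is made up), and `sys.settrace` line events stand in for breakpoints on basic blocks. A real version would drive a debugger or a binary instrumentation framework against the actual program, but the loop is the same: baseline trace, mutate one byte, re-trace, diff.

```python
import sys


def toy_reader(data: bytes) -> str:
    """Hypothetical reader standing in for the program under study."""
    if data[0] == 0x01:
        kind = "text"
    else:
        kind = "binary"
    if data[1] & 0x80:
        kind += "+compressed"
    return kind


def trace_lines(func, data: bytes) -> set:
    """Run func(data); return the line offsets (relative to func) that executed."""
    hit = set()
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames that are not the reader itself
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(data)
    finally:
        sys.settrace(previous)
    return hit


def diff_by_byte(func, sample: bytes) -> dict:
    """Map each byte offset to the lines whose coverage changed when it was flipped."""
    baseline = trace_lines(func, sample)
    result = {}
    for offset in range(len(sample)):
        mutated = bytearray(sample)
        mutated[offset] ^= 0xFF  # vary one byte at a time
        try:
            trace = trace_lines(func, bytes(mutated))
        except Exception:
            continue  # a crash is also a signal; this sketch simply skips it
        changed = trace ^ baseline
        if changed:
            result[offset] = sorted(changed)
    return result


report = diff_by_byte(toy_reader, bytes([0x01, 0x00, 0x55]))
print(sorted(report))  # offsets whose value steers control flow
```

For this sample, offsets 0 and 1 show up in the report and offset 2 does not, which matches what the toy reader actually looks at. Note that this only finds bytes that change which code runs; a byte that is merely copied into a field won't show up unless you also diff data, not just coverage.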