tapia 9 hours ago
I'm always amazed by people doing reverse engineering of some country formats. There's a binary format that I've been wanting to reverse engineer, but I don't know exactly how to start. It's for the result file of a proprietary finite element program. Could anyone point me to some resources and also what are the basics that I need to learn to achieve this?
jcranmer an hour ago
The most important resource you'll need is a hex editor that lets you drop a cursor anywhere and see the value at the cursor for all the basic datatypes (u8/u16/u32/u64, float, double, at minimum). Something like 010 Editor or ImHex.

If it's a really simple format, since you appear to have the ability to generate arbitrary file contents using the program, you can get some mileage by generating a suite of small files with few changes between them. I reverse engineered the DSP sphere blueprint format by generating a blueprint with one node, then the same node located elsewhere, then two nodes, then two nodes and one frame between them, etc. But this process is really only possible for the simplest formats; I'd guess that most reverse-engineered file formats are heavily based on decompilation of the deserialization code.

A lot of binary file formats end up being some form of "container" format: essentially, a file contains some form of directory mapping an item ID to a location in the file, and the contents of each item are in some other binary format. It's worth first checking if this is the case, and matching against known formats like ZIP or HDF5.
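To make the "value at the cursor" idea concrete, here is a minimal sketch of what such a data inspector does, using Python's standard `struct` module. The function name `inspect_at` and the sample bytes are made up for illustration; a real hex editor shows more types and handles signedness as well.

```python
import struct

# struct format codes for the basic datatypes mentioned above.
TYPES = {"u8": "B", "u16": "H", "u32": "I", "u64": "Q", "float": "f", "double": "d"}

def inspect_at(data: bytes, offset: int) -> dict:
    """Interpret the bytes at `offset` as each basic type, in both byte orders."""
    result = {}
    for name, code in TYPES.items():
        for label, prefix in (("le", "<"), ("be", ">")):
            fmt = prefix + code
            # Skip types that would run past the end of the buffer.
            if offset + struct.calcsize(fmt) <= len(data):
                result[f"{name}_{label}"] = struct.unpack_from(fmt, data, offset)[0]
    return result

# Hypothetical sample: a little-endian u32 (42), a float (1.5), then 8 zero bytes.
sample = struct.pack("<If", 42, 1.5) + b"\x00" * 8
print(inspect_at(sample, 0)["u32_le"])    # 42
print(inspect_at(sample, 4)["float_le"])  # 1.5
```

Looking at every interpretation side by side is usually how you spot which one yields a plausible value.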
tralarpa 8 hours ago
There are two approaches (sometimes mixed):

(a) You reverse engineer the application writing or reading the file. Even without fully understanding the application, it can give you valuable information about the format (e.g. "The application calls fwrite in a for loop ten times, maybe those are related to the ten elements that I see on the screen").

(b) You reverse engineer only the file. For example, you change one value in the application and compare the resulting output file with the previous one. Or the opposite way: you change one value in the file and see what happens in the application when you load it.
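Approach (b) mostly comes down to a byte-level diff of two saves. A minimal sketch, assuming the change did not shift the rest of the data; `diff_ranges` is a hypothetical helper and the two buffers stand in for files you would read with `open(path, "rb").read()`:

```python
def diff_ranges(old: bytes, new: bytes) -> list[tuple[int, int]]:
    """Return (start, end) ranges, end exclusive, where the two buffers differ.

    If the lengths differ, the extra tail is reported as one final range.
    """
    ranges = []
    start = None
    common = min(len(old), len(new))
    for i in range(common):
        if old[i] != new[i]:
            if start is None:
                start = i          # a run of differing bytes begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(old) != len(new):
        ranges.append((common, max(len(old), len(new))))
    return ranges

# Stand-ins for two saves that differ by one changed value.
before = bytes([0, 1, 2, 3, 4, 5, 6, 7])
after = bytes([0, 1, 9, 9, 4, 5, 6, 8])
print(diff_ranges(before, after))  # [(2, 4), (7, 8)]
```

If almost everything after the first difference changes, the edit probably shifted the data or the file is compressed, and a plain positional diff stops being useful.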
DannyBee 2 hours ago
As someone who has reverse engineered hundreds of random file formats of all kinds over the years, the comment that suggests understanding the code is generally spot on.

You can basically divide the world into read/write/write-only formats and read-only formats.

For read/write/write-only formats, usually the in-memory data structures were written first, and then the serialization/deserialization code. So it is almost always more useful to see how the code works than to try to figure out what random bytes in the file mean. A not insignificant percent of the time, the serialization/deserialization code is fairly straightforward: read some bytes, maybe decompress them, maybe checksum them and compare to a checksum field, shove them in the right place in a memory structure or create a class using them, move on.

Different parts of the program may read different parts of a file, but again, usually a given part of the deserialization/serialization code is fairly understandable.

Read-only formats are scattershot. Lots of reasons. I'll just cover a few.

First, because the code doesn't usually have both the writing and reading, you have less of a point of reference for what the reading code is doing.

Second, they are not uncommonly memory-mapped serializations of in-memory structures, but not necessarily even for the current platform. So it may make perfect sense and be easy to understand on some platform, but on your platform the code is doing weird conversions and such. This is essentially a variant of "the format was designed before the code".

Lots and lots more issues.

I still would start by trying to understand the deserialization code rather than the format directly in this case, but it often is significantly harder to get a handle on.

There are commonalities in some types of programs (i.e. you will find commonalities between games using the same engine, etc.), but if you are talking "in general", the above is the best "in general" I can give you.
One other tip - it is common to expect things to be logical and make sense - you can even see an example in this very article. Don't expect this. For example: data fields that don't make sense or are broken, but the program doesn't use them so it doesn't matter; checksums that don't actually check anything; signed/verified files where the signing key is easily changeable; encryption where the key is hardcoded or stored in the file; you name it. Most folks verify that their program works; they don't usually go look and verify that everything written/read makes any sense.
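For a feel of the "read some bytes, maybe decompress, maybe checksum" pattern, here is a rough sketch of the kind of code you might find after decompiling. The record layout (u32 compressed length, u32 CRC-32, zlib payload) is entirely hypothetical and not taken from any real program; only the `struct` and `zlib` calls are real.

```python
import struct
import zlib

# Hypothetical record header: compressed length, then CRC-32 of the raw payload.
HEADER = struct.Struct("<II")

def write_record(payload: bytes) -> bytes:
    packed = zlib.compress(payload)
    return HEADER.pack(len(packed), zlib.crc32(payload)) + packed

def read_record(data: bytes, offset: int = 0):
    """Return (payload, checksum_ok, next_offset) for the record at `offset`."""
    length, stored_crc = HEADER.unpack_from(data, offset)
    start = offset + HEADER.size
    payload = zlib.decompress(data[start:start + length])
    # Report a mismatch instead of raising: in the wild, the checksum field
    # may not check anything at all.
    return payload, zlib.crc32(payload) == stored_crc, start + length

blob = write_record(b"node 1") + write_record(b"node 2")
first, ok, nxt = read_record(blob)
second, ok2, end = read_record(blob, nxt)
print(first, ok, second, ok2, end == len(blob))
```

Returning the checksum result as a flag rather than failing hard is a deliberate choice here, so a parser keeps going when a field turns out not to mean what you assumed.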
LunicLynx 6 hours ago
The way I do it is looking for markers. Most files have some kind of magic number at the beginning, so these can be valuable to recognize. The next part is always looking into the values of 32-bit or 64-bit integers: if their value is higher than 0 but less than the file's size, they often are offsets into the file, meaning they address specific parts.

Another recommendation is to understand what you are looking for. For games, you are most likely looking for meshes and textures.

For meshes in 3D, every vertex of a mesh is most likely represented by 3 floats / doubles. If you see clusters of 3 floats with sensible values (e.g. without a +/-E component), it's likely that you're looking at real floats.

When looking for textures, it can help to adjust the view on the data to the same resolution as the data you're looking for. For example, if you are looking for an 8-bit alpha map with a resolution of 64 x 64, then try to get 64 bytes in a row in your hex editor; you might be lucky and see the pattern show up.

For save games I can only reiterate what has been mentioned before. Look for unique specific values in the file as integers, for example how much gold you have.

I used these techniques to reverse engineer:

* Diablo 2 save games
* World of Warcraft adt chunks
* .NET Assembly files (I would recommend reading the ECMA specification though)
* jade format of Beyond Good & Evil

Ah yes, invest in a good hex editor of course. For me Hex Workshop has been part of this journey.
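The offset heuristic can be automated. A small sketch, assuming little-endian integers aligned to their own width (both are guesses you would vary per file); `candidate_offsets` and the fake file contents are invented for the example:

```python
import struct

def candidate_offsets(data: bytes, width: int = 4) -> list[tuple[int, int]]:
    """List (position, value) for aligned little-endian ints that could be file offsets."""
    code = {4: "<I", 8: "<Q"}[width]
    hits = []
    for pos in range(0, len(data) - width + 1, width):
        value = struct.unpack_from(code, data, pos)[0]
        # Higher than 0 but less than the file size: possibly an offset.
        if 0 < value < len(data):
            hits.append((pos, value))
    return hits

# Fake 24-byte file: magic, an offset to the payload, two other fields, payload.
fake = b"DEMO" + struct.pack("<III", 16, 0, 0xFFFFFFFF) + b"payload!"
print(fake[:4])                 # b'DEMO'
print(candidate_offsets(fake))  # [(4, 16)]
```

On real files this produces false positives (small counts and flags also pass the test), so treat the hits as places to look, not as answers.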
ashdnazg 8 hours ago
The bare basics are working with a hex editor and understanding data types: ints, floats, null-terminated strings, length-prefixed strings, etc.

I'd recommend taking a well-documented binary file format (Doom WAD file?), going over the documentation, and checking that you can find the individual values in the hex editor.

Now, after you have a feel for how things might look in hex, look at your own file. Start by saving an empty project from your program and identifying the header. Maybe it's compressed? If it's not, change a tiny thing in the program and save again, then compare the files to see what changed. Or alternatively, change the file a tiny bit and load it.

Write a parser and add things as you learn more. If the file isn't intentionally obfuscated, it should probably be just a matter of persevering until you can parse the entire file.
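As a warm-up along those lines, the WAD layout is small enough to parse in a few lines: a 12-byte header (4-byte "IWAD" or "PWAD" tag, lump count, directory offset) followed somewhere by 16-byte directory entries (file position, size, 8-byte null-padded name). The sketch below follows that documented layout; `parse_wad` and the tiny synthetic file are just for illustration, and you would normally feed it the bytes of a real WAD.

```python
import struct

def parse_wad(data: bytes) -> list[tuple[str, int, int]]:
    """Return (name, file position, size) for each lump in a Doom WAD."""
    magic, num_lumps, dir_offset = struct.unpack_from("<4sii", data, 0)
    if magic not in (b"IWAD", b"PWAD"):
        raise ValueError(f"not a WAD file: magic={magic!r}")
    lumps = []
    for i in range(num_lumps):
        # Each directory entry is 16 bytes: position, size, padded name.
        pos, size, raw_name = struct.unpack_from("<ii8s", data, dir_offset + 16 * i)
        name = raw_name.split(b"\x00", 1)[0].decode("ascii", errors="replace")
        lumps.append((name, pos, size))
    return lumps

# Synthetic one-lump file: header, 5 bytes of lump data at 12, directory at 17.
wad = (struct.pack("<4sii", b"PWAD", 1, 17)
       + b"hello"
       + struct.pack("<ii8s", 12, 5, b"GREET"))
print(parse_wad(wad))  # [('GREET', 12, 5)]
```

This is also a compact example of a directory that maps names to locations in the file, which is a shape worth looking for in an unknown format.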
mrgaro 7 hours ago
It helps tremendously if you have a programming background, as the developers behind the original format usually didn't have any need to make things harder than they need to be. Because of this, you can often guess how the format works by asking: "If I was the original developer, how would I do this?"
iberator 9 hours ago
> country format

Country?! What's the meaning?