Remix.run Logo
jefftk a day ago

The FASTA format looks like:

    > title
    bases with optional newlines
    > title
    bases with optional newlines
    ...
The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file.

It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.

LeifCarrotson a day ago | parent | next [-]

In case "bases with optional newlines" wasn't obvious to anyone else, a specific example (from Wikipedia) is:

    ;LCBO - Prolactin precursor - Bovine
    MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
    EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
    VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
    ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
where "SS...EM", HL..VT", or "ED..AR" may be common subsequences, but the plaintext file arbitrarily wraps at column 65 so it renders on a DEC VT100 terminal from the 70s nicely.

Or, for an even simpler example:

    ; plaintext
    GATTAC
    AGATTA
    CAGATT
    ACCAGA
    TTACAG
    ATTACA
becomes, on disk, something like

    ; plaintext\r\nGATTAC\r\nAGATTA\r\nCAGATT\r\nACCAGA\r\nTTACAG\r\nATTACA\r\n
which is hard to compress, while

    ; plaintext\r\nGATTACAGATTACAGATTACCAGATTACAGATTACA
is just

    "; plaintext\r\n" + "GATTACA" * 7
and then, if you want, you can reflow the text when it's time to render to the screen.
tgtweak 20 hours ago | parent | next [-]

Feels like it could be an extension to the compression lib (and would retain newlines as such) vs requiring external document tailoring. Also feels like a very specific use case but this optimization might have larger applications outside this narrow field/format.

spatoa 5 hours ago | parent | prev | next [-]

When working with data, I definitely prefer the UI to adapt to the data. I never save anything for the display back.

Terr_ 20 hours ago | parent | prev [-]

Huh, so in other words: "If you don't arbitrarily interrupt continuous sequences of data with cosmetic noise, they compress better."

bede a day ago | parent | prev | next [-]

Thank you for clarifying this – yes the non-semantic nature of these particular line breaks is a key detail I omitted.

tialaramex 21 hours ago | parent [-]

It might be worth (in some other context) introducing a pre-processing step which handles this at both ends. I'm thinking like PNG - the PNG compression is "just" zlib but for RGBA that wouldn't do a great job, however there's a (per row) filter step first, so e.g. we can store just the difference from the row above, now big areas of block colour or vertical stripes are mostly zeros and those compress well.

Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.

I can imagine a filter layer which says this textual data comes in 78 character blocks punctuated with \n so we're going to strip those out, then compress and in the opposite direction we decompress then put back the newlines.

For FASTA we can just unconditionally choose to remove the extra newlines but that may not be true for most inputs, so the filters would help there.

dekhn 18 hours ago | parent [-]

For one approach to compressing FASTQ (which is a series 3 distinct lines), we broke the distinct lines each into their own streams and then compressed each stream independently. That way the coder for the header line, sequence line, and error line could learn the model of that specific line type and get slightly better compression (not unlike a columnar format, although in this case, we simply combined "blocks" of streams together with a record separator.

I'm still pretty amazed that periodic newlines hurt compression ratios so much, given the compressor can use both a huffman coding and a lookback dictionary.

The best rule in sequence data storage is to store as little of it as possible.

AndrewOMartin a day ago | parent | prev | next [-]

The compression ratio will likely skyrocket if you sorted the list of bases.

shellfishgene a day ago | parent [-]

You're joking, but a few bioinformatics tools use the Burrows-Wheeler transform to save memory, which is a bit like sorting the bases.

jefftk a day ago | parent [-]

You can also improve compression by reordering the sequences within the FASTA file, as long as you're using it as a dictionary and not a list of title-sequence pairs.

mylons a day ago | parent | prev | next [-]

this is also the insight that the bwa developer had, to use the burrows-wheeler transform which is part of bzip2 due to it's compression properties being particularly good for genomic sequences.

dekhn 18 hours ago | parent [-]

I once had the distinct pleasure of hosting the author of BWA (R. Durbin) at Google, and pointing out "That's Mike Burrows, over there, next to Jeff Dean and Sanjay Ghemawat". That led to an interesting discussion between Durbin and Dean on DNA sequence compression. It's not the first time I've been in a room with a bunch of geniuses and simply kept my mouth shut so nobody would know I'm an idiot.

mylons an hour ago | parent | next [-]

fwiw Heng Li was "the author." He was a postdoc under Durbin's professorship. I was around when bwa was developed and was working in a collaboration with Heng Li (I was working on SOLiD R&D). Any development emails were between Heng and our team, we never spoke with Durbin.

first author on a paper like this indicates the most significant contributions https://academic.oup.com/bioinformatics/article/25/14/1754/2...

an hour ago | parent | prev | next [-]
[deleted]
nerpderp82 4 hours ago | parent | prev [-]

But you're the genius I keep my mouth shut around!

a day ago | parent | prev | next [-]
[deleted]
amelius a day ago | parent | prev [-]

This was a question that I thought was interesting enough to test ChatGPT with.

Surprisingly, it gave an answer along the lines of the parent comment.

However, it seems it didn't figure this out by itself but it says:

> It’s widely known in bioinformatics that “one-line Fasta” files compress much better with LZ-based algorithms, and this is discussed in forums, papers, and practical guides on storing genomic data.