▲ | dekhn 16 hours ago | |
For one approach to compressing FASTQ (which is a series 3 distinct lines), we broke the distinct lines each into their own streams and then compressed each stream independently. That way the coder for the header line, sequence line, and error line could learn the model of that specific line type and get slightly better compression (not unlike a columnar format, although in this case, we simply combined "blocks" of streams together with a record separator. I'm still pretty amazed that periodic newlines hurt compression ratios so much, given the compressor can use both a huffman coding and a lookback dictionary. The best rule in sequence data storage is to store as little of it as possible. |