bede a day ago
Thank you for clarifying this – yes, the non-semantic nature of these particular line breaks is a key detail I omitted.
tialaramex 21 hours ago | parent
It might be worth (in some other context) introducing a pre-processing step which handles this at both ends. I'm thinking of something like PNG: the PNG compression is "just" zlib, which on raw RGBA wouldn't do a great job on its own. However, there's a per-row filter step first, so e.g. we can store just the difference from the row above; now big areas of block colour or vertical stripes are mostly zeros, and those compress well.

Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.

I can imagine a filter layer which says: this textual data comes in 78-character blocks punctuated with \n, so we strip those out, then compress; in the opposite direction we decompress, then put the newlines back. For FASTA we can just unconditionally choose to remove the extra newlines, but that may not be true for most inputs, so the filters would help there. Rough sketches of both ideas below.
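A minimal Python sketch of the per-row difference idea (the "Up" name matches PNG filter type 2; the noisy-gradient data is invented just to show why row deltas help zlib):

    import random
    import zlib

    def up_filter(rows):
        # PNG-style "Up" filter: replace each row with its byte-wise
        # difference (mod 256) from the row above.
        prev = bytes(len(rows[0]))  # the row "above" row 0 is all zeros
        out = []
        for row in rows:
            out.append(bytes((b - p) & 0xFF for b, p in zip(row, prev)))
            prev = row
        return out

    # Noisy vertical gradient: each row is the previous row plus one, so
    # raw zlib finds almost no literal repeats, but after the filter
    # nearly every byte is 0x01.
    random.seed(0)
    base = bytes(random.randrange(256) for _ in range(256))
    rows = [bytes((b + i) & 0xFF for b in base) for i in range(256)]

    raw = b"".join(rows)
    filtered = b"".join(up_filter(rows))
    print(len(zlib.compress(raw)), len(zlib.compress(filtered)))

On this toy input the filtered stream compresses to a tiny fraction of the raw one; real images sit somewhere in between, which is why the filter-guessing heuristics matter.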
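And a sketch of the newline filter layer (the function names are mine; it assumes a uniform 78-column wrap with a trailing newline, and ignores FASTA header lines, which a real filter would have to pass through untouched):

    import zlib

    WIDTH = 78  # assumed fixed wrap width

    def filter_encode(text: bytes) -> bytes:
        # Strip the non-semantic newlines, then compress.
        return zlib.compress(text.replace(b"\n", b""))

    def filter_decode(blob: bytes) -> bytes:
        # Decompress, then put a newline back after every WIDTH bytes.
        flat = zlib.decompress(blob)
        lines = (flat[i:i + WIDTH] for i in range(0, len(flat), WIDTH))
        return b"\n".join(lines) + b"\n"

    # Round trip on a wrapped sequence block.
    seq = b"ACGT" * 50
    wrapped = b"\n".join(
        seq[i:i + WIDTH] for i in range(0, len(seq), WIDTH)) + b"\n"
    assert filter_decode(filter_encode(wrapped)) == wrapped

The decoder only works because the wrap width is known and constant; inputs where that isn't guaranteed are exactly where a PNG-style choice of filter per block would earn its keep.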