FL33TW00D a day ago

Looking forward to the relegation of FASTQ and FASTA to the depths of hell where they belong. Incredibly inefficient and poorly designed formats.

jefftk a day ago | parent [-]

How so? As long as you remove the hard wrapping and use compression aren't they in the same range as other options?

(I currently store a lot of data as FASTQ, and smaller file sizes could save us a bunch of money. But FASTQ + zstd is very good.)
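For readers unfamiliar with what "removing the hard wrapping" means, here is a minimal Python sketch that joins each FASTA record's sequence onto a single line before compression. The function name `unwrap_fasta` and the sample records are hypothetical, made up for illustration; the compression step itself (zstd or otherwise) is left out.

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line."""
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header ends the previous record: flush its sequence.
            if chunks:
                yield "".join(chunks)
                chunks = []
            yield line
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


# Hypothetical wrapped input, six bases or fewer per line.
wrapped = [">seq1 example", "ACGTAC", "GTNNAC", ">seq2", "TTGA", "CC"]
print("\n".join(unwrap_fasta(wrapped)))
```

Without the newlines interrupting the sequence at fixed intervals, a general-purpose compressor has longer uninterrupted runs to match against.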

FL33TW00D a day ago | parent | next [-]

https://www.biorxiv.org/content/10.1101/2025.04.08.647863v1....

optionalsquid 17 hours ago | parent [-]

In my experience, the fact that these formats cannot represent degenerate bases (Ns in particular, but also the remaining IUPAC bases) renders them unusable for many, if not most, use cases, including the storage of FASTQ data.
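To illustrate why degenerate bases are a problem for compact encodings, here is a minimal Python sketch of a fixed 2-bit-per-base packing. This scheme is an assumption chosen for illustration, not necessarily what the linked preprint uses; `TWO_BIT` and `pack_2bit` are hypothetical names.

```python
# Two bits give exactly four codes, so there is no room for 'N'
# or any other IUPAC ambiguity code.
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_2bit(seq: str) -> int:
    """Pack a sequence into an integer, two bits per base, first base in the highest bits."""
    packed = 0
    for base in seq.upper():
        if base not in TWO_BIT:
            raise ValueError(f"cannot encode {base!r} in 2 bits")
        packed = (packed << 2) | TWO_BIT[base]
    return packed


print(pack_2bit("ACGT"))  # 0b00011011
try:
    pack_2bit("ACNT")
except ValueError as err:
    print(err)
```

Any format built on a scheme like this needs a side channel (a mask, an exception list, or a wider code) to carry Ns, or it has to drop or rewrite them.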

dwattttt 16 hours ago | parent [-]

The question of how to represent things not specified in the original format is a tough one.

At the loosest end, a format can leave lots of space for new symbols, and you can just use those to represent something new. But then not everyone agrees on what the new symbol means, and worse, multiple groups can use symbols to mean different things.

On the other end of the spectrum, you can be strict about the format, and not leave space for new symbols. Then to represent new things you need a new standard, and people to agree on it.

It's mostly a question of how well code can be updated and agreed upon, and how strict you can require your tooling to be w.r.t. formats.

optionalsquid 10 hours ago | parent | next [-]

The original FASTA/Pearson format and fasta/tfasta tools have supported 'N' for ambiguous nucleotides since at least 1996 [1], and the FASTQ format has to my knowledge always supported 'N' bases (i.e. since around 2000). IUPAC codes themselves date back to 1970 [2]. You can probably get away with not supporting the full range of IUPAC nucleotide codes, but not supporting 'N' makes your tool unusable for representing what is probably the majority of available FASTA/FASTQ data.

[1] See 'release.v16' in the fasta2 release at https://fasta.bioch.virginia.edu/wrpearson/fasta/fasta_versi...

[2] https://iupac.qmul.ac.uk/misc/naabb.html

melagonster 13 hours ago | parent | prev [-]

The problem is IUPAC just exists.

fwip a day ago | parent | prev [-]

There are a few options out there that have noticeably better compression, with the downside of being less widely compatible with tools. zstd also has the benefit of being very fast (depending on your settings, of course).

CRAM compresses unmapped fastq pretty well, and can do even better with reference-based compression. If your institution is okay with it, you can see additional savings by quantizing quality scores (modern Illumina sequencers already do this for you). If you're aligning your data anyways, probably retaining just the compressed CRAM file with unmapped reads included is your best bet.
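As a rough illustration of what quantizing quality scores means, here is a minimal Python sketch that maps Phred+33 quality characters onto a handful of representative values. The bin boundaries and representative scores below are hypothetical, picked for the example; they are not Illumina's actual binning table or any tool's default.

```python
from bisect import bisect_right

# Hypothetical bins: scores >= BIN_STARTS[i] (and below the next start)
# are replaced by BIN_VALUES[i].
BIN_STARTS = [0, 10, 20, 30, 40]
BIN_VALUES = [5, 15, 25, 35, 40]


def quantize_quality(qual: str, offset: int = 33) -> str:
    """Replace each quality character with the representative score of its bin."""
    out = []
    for ch in qual:
        score = ord(ch) - offset
        if score < 0:
            raise ValueError(f"invalid quality character {ch!r}")
        idx = bisect_right(BIN_STARTS, score) - 1
        out.append(chr(BIN_VALUES[idx] + offset))
    return "".join(out)


print(quantize_quality("IIIHGF@?5+!"))
```

The savings come from the smaller alphabet: with only a few distinct quality values, the quality string becomes far more repetitive and compresses better, at the cost of losing the original per-base scores.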

There are also other fasta/fastq specific tools like fqzcomp or MZPAQ. Last I checked, both of these could about halve the size of our fastq.gz files.