semiinfinitely a day ago

Other file formats that rival FASTA in stupidity include FASTQ, PDB, BED, SAM, CRAM, and VCF. Further reading: [1]

> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"

1. https://madhadron.com/science/farewell_to_bioinformatics.htm...

jakobnissen a day ago | parent [-]

SAM is not a bad file format. What's bad about SAM?

optionalsquid a day ago | parent [-]

I don't dislike the format, and it is much, much better than what it replaced, but SAM and its binary sister format BAM do have some flaws:

- The original index format could not handle large chromosomes, so now there are two index formats: .bai and .csi

- For BAM, the CIGAR (alignment description) operation count is limited to 16 bits, which means that very long alignments cannot be represented. One workaround I've seen (but thankfully not used) is saving the CIGAR as a string in a tag

- SAM cannot unambiguously represent sequences with only a single base (e.g. after trimming), since a '*' in the quality column can be interpreted either as a single Phred score (9) or as a special value meaning "no qualities". BAM can represent such sequences unambiguously, but most tools output SAM
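To make the last point concrete, here is a minimal Python sketch of the ambiguity. It assumes the usual Phred+33 encoding of the SAM QUAL column; the function names `phred_scores` and `interpret_qual` are made up for illustration and do not come from any library. In BAM, as far as I know, missing qualities are stored as 0xFF bytes, which is why the binary format does not have this problem.

```python
def phred_scores(qual: str) -> list[int]:
    """Decode a SAM QUAL string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual]


def interpret_qual(seq: str, qual: str) -> str:
    """Classify the QUAL column of a SAM record (hypothetical helper)."""
    if qual != "*":
        return "qualities present"
    if len(seq) == 1:
        # '*' is ASCII 42, i.e. Phred 9, which is a valid score for one base,
        # but '*' is also the marker for "no qualities stored".
        return "ambiguous: no qualities, or a single Phred 9"
    # For longer sequences a lone '*' cannot be a full quality string.
    return "no qualities"


print(phred_scores("*"))
print(interpret_qual("A", "*"))
print(interpret_qual("ACGT", "*"))
```

Only the single-base case is ambiguous: for any longer read, a one-character QUAL cannot match the sequence length, so '*' can only mean that qualities are absent.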

jakobnissen a day ago | parent [-]

True. I'd consider these minor flaws. W.r.t. the CIGAR, the spec says you do need to store it as a tag.
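For reference, a rough Python sketch of that workaround as I understand the spec: when an alignment has more than 65535 CIGAR operations, the BAM CIGAR field holds a two-operation placeholder (a soft clip of the read length followed by a reference skip of the reference span), and the real CIGAR goes into the `CG` tag as an array of 32-bit integers rather than a string. The helper names below (`parse_cigar`, `encode_ops`, `bam_cigar_fields`) are hypothetical, and this only models the values; it does not write actual BAM.

```python
import re

MAX_BAM_CIGAR_OPS = 0xFFFF           # n_cigar_op is a 16-bit field in BAM
CIGAR_OPS = "MIDNSHP=X"              # BAM op codes 0..8, in this order
CONSUMES_QUERY = set("MIS=X")        # ops that advance along the read
CONSUMES_REF = set("MDN=X")          # ops that advance along the reference


def parse_cigar(cigar: str) -> list:
    """Split a SAM CIGAR string into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def encode_ops(ops: list) -> list:
    """Pack each op as BAM does: length in the upper 28 bits, op code in the lower 4."""
    return [(n << 4) | CIGAR_OPS.index(op) for n, op in ops]


def bam_cigar_fields(cigar: str) -> tuple:
    """Return (ops for the CIGAR field, CG tag values or None)."""
    ops = parse_cigar(cigar)
    if len(ops) <= MAX_BAM_CIGAR_OPS:
        return ops, None
    read_len = sum(n for n, op in ops if op in CONSUMES_QUERY)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    # Placeholder CIGAR: <read_len>S<ref_len>N; the real ops move to the CG tag.
    return [(read_len, "S"), (ref_len, "N")], encode_ops(ops)


short_ops, short_cg = bam_cigar_fields("10M2I5M")
long_ops, long_cg = bam_cigar_fields("1M1I" * 40000)  # 80000 operations
print(short_ops, short_cg)
print(long_ops, len(long_cg))
```

The placeholder keeps the reference span intact, so tools that only look at coordinates (indexing, for example) should still work even if they ignore the tag.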