FASTA is a candidate for the stupidest file format ever invented and a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.

▲

boothby a day ago | parent | next [-]

Spend a few years handling data in arcane, one-off, and proprietary file formats conceived by "brilliant" programmers with strong CS backgrounds and you might reconsider the conclusion you've come to here.

	▲	dwattttt 17 hours ago | parent [-]
		This is a presentation problem, or possibly a lack of tooling problem. A binary format with a tool that renders it to text works the same as a text format; if the rendering is lossless, you could even consume the text format rather than the binary. A "text" format is built to be understandable, but that's not a requirement; you could write a text format that isn't descriptive, and you'd have just as much trouble understanding what 'A' means as you would understanding what 'C0' means for a binary format. Undocumented formats are a pain, whether they're in text or binary.

▲

semiinfinitely a day ago | parent | prev | next [-]

other file formats that rival fasta in stupidity include fastq, pdb, bed, sam, cram, and vcf. further reading [1]

> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"

1. https://madhadron.com/science/farewell_to_bioinformatics.htm...

▲

jakobnissen a day ago | parent [-]

SAM is not a bad file format. What's bad about SAM?

▲

optionalsquid a day ago | parent [-]

I don't dislike the format, and it is much, much better than what it replaced, but SAM, and its binary sister-format BAM, does have some flaws:

- The original index format could not handle large chromosomes, so now there are two index formats: .bai and .csi

- For BAM, the CIGAR (alignment description) operation count is limited to 16 bits, which means that very long alignments cannot be represented. One workaround I've seen (but thankfully not used) is saving the CIGAR as a string in a tag

- SAM cannot unambiguously represent sequences with only a single base (e.g. after trimming), since a '*' in the quality column can be interpreted either as a single Phred score (9) or as a special value meaning "no qualities". BAM can represent such sequences unambiguously, but most tools output SAM
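The single-base ambiguity in the last point comes down to Phred+33 arithmetic. A minimal Python sketch, assuming the usual Phred+33 quality encoding; `decode_qual` is a hypothetical helper for illustration, not a function from any SAM library:

```python
PHRED_OFFSET = 33  # SAM stores each base quality as the character chr(Phred + 33)

def decode_qual(qual: str) -> list[int]:
    """Decode a QUAL string into Phred scores, ignoring the '*' placeholder rule."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]

# For a one-base read, the QUAL string "*" is a valid one-score string (Phred 9)
# and is also the placeholder meaning "qualities not stored".
print(decode_qual("*"))  # [9]
```

A reader of the text format cannot tell the two cases apart from the QUAL column alone when the sequence has length 1.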

	▲	jakobnissen a day ago | parent [-]
		True. I'd consider these minor flaws. W.r.t. the CIGAR, the spec says you do need to store it as a tag.
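To make the 16-bit limit concrete, here is a rough Python sketch. It assumes the BAM record layout in which each CIGAR operation is packed into a 32-bit integer as `op_len << 4 | op_code` and the per-record operation count is an unsigned 16-bit field. The names `pack_cigar` and `fits_in_bam_record` are hypothetical, chosen only for this example:

```python
import re

CIGAR_OPS = "MIDNSHP=X"   # BAM operation codes 0..8, in this order
MAX_N_CIGAR_OP = 0xFFFF   # the operation count is stored as an unsigned 16-bit value

def pack_cigar(cigar: str) -> list[int]:
    """Pack a SAM CIGAR string into BAM-style integers (op_len << 4 | op_code)."""
    packed = []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        packed.append((int(length) << 4) | CIGAR_OPS.index(op))
    return packed

def fits_in_bam_record(cigar: str) -> bool:
    """Return True if the number of operations fits the 16-bit count field."""
    return len(pack_cigar(cigar)) <= MAX_N_CIGAR_OP

print(pack_cigar("10M2I5M"))               # [160, 33, 80]
print(fits_in_bam_record("1M1I" * 40000))  # False: 80000 operations
```

The limit is on the number of operations, not on the alignment length itself, so it mostly affects long reads with many small indels.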

▲

totalperspectiv a day ago | parent | prev | next [-]

> a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.

This is not really a fair statement. Literally all of software bears the weight of some early poor choice that then keeps moving forward via weight of momentum. FASTA and FASTQ formats are exceptionally dumb though.

▲

Fraterkes a day ago | parent | prev | next [-]

I’ll do you the immense favor of taking the bait. What’s so bad about it?

	▲	jszymborski a day ago | parent [-]
		It's a fine format for what it is. A parser to stream FASTA can be written in like 30 lines [0], much easier than say CSV where the edge cases can get hairy. If you need something like fast random reads, use the FAIDX format [1], or even better just store it in an LMDB or SQLite embedded db. People forget FASTA was from 1985, and it sticks around because (1) it's easy to parse and write (2) we have mountains of sequences in that format going back 4 decades. [0] https://gist.github.com/jszym/9860a2671dabb45424f2673a49e4b5... [1] https://seqan.readthedocs.io/en/main/Tutorial/InputOutput/In...
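For a sense of scale, a streaming FASTA reader along these lines might look like the Python sketch below. This is not the code from the linked gist; `read_fasta` is a hypothetical name, and the sketch assumes plain `>`-prefixed headers with the sequence wrapped over any number of lines:

```python
from typing import Iterable, Iterator

def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: list[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence text before any header is ignored
    if header is not None:
        yield header, "".join(chunks)  # emit the final record

example = [">seq1 demo\n", "ACGT\n", "ACGT\n", ">seq2\n", "GG\n"]
for name, seq in read_fasta(example):
    print(name, len(seq))
```

Because it accepts any iterable of lines, the same function works on an open file handle, so only one record is held in memory at a time.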

▲

StillBored a day ago | parent | prev | next [-]

I think the prevalence of the format vs something more widely used should be part of that metric.

On those grounds, the lack of pre-tokenization in html/css/js ranks at this point as a planet killing level of poor choices.

▲

fwip a day ago | parent | prev [-]

It might be the stupidest, but stupid in the sense of "the simplest thing that could possibly work."

When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.

And, despite its simplicity, it has worked for forty years.

	▲	michaelhoffman a day ago | parent | next [-]
		When FASTA was invented in 1985, generally sequencing reads would be about half that. The simplicity of FASTA seems like a dream compared to the GenBank flat file format used before then. And around the year 2000, less computationally-inclined scientists were storing sequence in Microsoft Word binary .doc files. A lot of file formats (including bioinformatics formats!) have come and gone in that time period. I don't think many would design it this way today, but it has a lot of nice features like the ones you point out that led to its longevity.
	▲	attractivechaos a day ago | parent | prev | next [-]
		FASTA was invented in the late 1980s. At that time, unix tools often limited line length. Even in the early 2000s, some unix tools (on AIX as I remember) still had this limit.
	▲	melagonster 13 hours ago | parent | prev [-]
		Yes, if someone wants, they can do many analyses with grep!