Remix.run Logo
ashvardanian a day ago

Nice observation!

Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)

I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(

jltsiren a day ago | parent | next [-]

Basic text formats persist, because everyone supports them. Many tools have better file formats for internal purposes, but they are rarely flexible enough and robust enough for wider use. There are occasional proposals for better general purpose formats, but the people proposing them rarely agree which of the competing proposals should be adopted. And even if they manage to agree, they probably don't have the time and the money to make it actually happen.

vintermann a day ago | parent [-]

Also for historical reasons I think, since Perl used to be the big bioinformatics language, and it is surprisingly hard to compete with in string handling.

lazide a day ago | parent [-]

Perl+strings really is one of those ‘unreasonably effective’ combinations.

It feels like Benzene in some ways. Use it correctly and gdamn. Just don’t huff it - i mean - use it for your enterprise backend - and it’s worth it.

bede a day ago | parent | prev [-]

Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate with Gzip being as slow as it is.

> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)

I even confused myself about this while writing :-)

chrchang523 20 hours ago | parent [-]

Note that BGZF solves gzip’s speed problem (libdeflate + parallel compression/decompression) without breaking compatibility, and usually the hit to compression ratio is tolerable.