▲ | dwattttt 16 hours ago | |
The question of how to represent things not specified in the original format is a tough one. At the loosest end, a format can leave lots of space for new symbols, and you can just use those to represent something new. But then not everyone agrees on what a new symbol means, and worse, multiple groups can use the same symbols to mean different things. On the other end of the spectrum, you can be strict about the format and leave no space for new symbols. Then to represent new things you need a new standard, and people to agree on it. It's mostly a question of how readily code can be updated and agreed upon, and how strict you can require your tooling to be w.r.t. formats. | ||
▲ | optionalsquid 10 hours ago | parent | next [-] | |
The original FASTA/Pearson format and the fasta/tfasta tools have supported 'N' for ambiguous nucleotides since at least 1996 [1], and the FASTQ format has to my knowledge always supported 'N' bases (i.e. since around 2000). The IUPAC codes themselves date back to 1970 [2]. You can probably get away with not supporting the full range of IUPAC nucleotide codes, but not supporting 'N' makes your tool unusable for what is probably the majority of available FASTA/FASTQ data.
[1] See 'release.v16' in the fasta2 release at https://fasta.bioch.virginia.edu/wrpearson/fasta/fasta_versi... | ||
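To make the three levels of strictness concrete, here is a minimal sketch of a symbol check against a bare ACGT alphabet, ACGT plus 'N', and the full set of IUPAC nucleotide codes. The names `STRICT`, `WITH_N`, `IUPAC` and `unsupported_symbols` are made up for illustration and are not taken from any existing tool; gap characters and lowercase soft-masking conventions are deliberately ignored.

```python
# Hypothetical alphabets for illustration; not from any particular tool.
STRICT = frozenset("ACGT")                # unambiguous DNA bases only
WITH_N = STRICT | {"N"}                   # plus 'N' for "any base"
IUPAC = frozenset("ACGTURYSWKMBDHVN")     # full IUPAC nucleotide codes


def unsupported_symbols(sequence, alphabet):
    """Return the sorted list of symbols in `sequence` missing from `alphabet`.

    The comparison is case-insensitive: the sequence is upper-cased first.
    """
    return sorted(set(sequence.upper()) - alphabet)


if __name__ == "__main__":
    read = "ACGTNNRYacgt"
    for name, alphabet in (("strict", STRICT), ("with N", WITH_N), ("IUPAC", IUPAC)):
        print(name, unsupported_symbols(read, alphabet))
```

A tool that only accepts the strict alphabet would reject this read because of 'N', 'R' and 'Y'; accepting 'N' alone still leaves 'R' and 'Y' unhandled, which is the "can probably get away with it" case.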
▲ | melagonster 13 hours ago | parent | prev [-] | |
The problem is that the IUPAC codes already exist. |