Let me start with an example: some time ago I worked on a VAE that encoded and decoded SMILES strings. The idea is that you should be able to encode a SMILES string into an embedding space, do all the normal things you would do in that space, and then convert the resulting embedding vector back to a valid molecule.
The VAE is trained with a very large number of valid SMILES strings, typically tokenized at the character level (so "C" is a token, and "Br" is "B" then "r"). I and others have observed that VAEs trained like this produce a large number of embedding vectors that do not decode to valid SMILES strings: they have syntax errors, or they perform chemical alchemy. Personally, I saw that the training set had Br (bromine) and Ca (calcium), and the output molecules sometimes contained Ba (barium), even though barium wasn't in the original dataset at all.
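To make that failure mode concrete, here's a minimal sketch of what character-level tokenization does to multi-character element symbols (the function name and examples are just for illustration):

```python
# Character-level tokenization splits multi-character element symbols apart,
# so the decoder is free to recombine characters into elements that were
# never in the training data (e.g. "B" + "a" -> barium).

def char_tokenize(smiles: str) -> list[str]:
    """Naive character-level tokenization of a SMILES string."""
    return list(smiles)

print(char_tokenize("BrCC(=O)O"))  # ['B', 'r', 'C', 'C', '(', '=', 'O', ')', 'O']
print(char_tokenize("[Ca+2]"))     # ['[', 'C', 'a', '+', '2', ']']
```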
Character-level tokenization isn't the only reason decoding goes wrong, but the net effect is that only about 1-10% of sampled vectors decode to valid molecules. Invalid SMILES are mostly useless: they don't correspond to actual structures.
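If you want to measure that for yourself, here's a rough sketch using RDKit; `decoded_smiles` is a stand-in for whatever list of strings your decoder produces:

```python
# RDKit's MolFromSmiles returns None when a string doesn't parse into a real
# structure, which makes it easy to estimate the validity rate of decoded output.

from rdkit import Chem

def validity_rate(decoded_smiles: list[str]) -> float:
    """Fraction of decoded strings RDKit accepts as valid molecules."""
    valid = sum(1 for s in decoded_smiles if Chem.MolFromSmiles(s) is not None)
    return valid / len(decoded_smiles)

# e.g. decoded_smiles = [decode(z) for z in sample_latents(1000)],
# where decode() and sample_latents() are whatever your model provides.
print(f"{validity_rate(['CCO', 'C1=CC=CC=C1', 'C((C']):.0%} valid")  # 67% valid
```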
In response to this, the SELFIES format changes the representation so that it is effectively impossible to produce an invalid SELFIES string when decoding from a VAE. Among other things, tokens correspond to actual element symbols, so the model can only ever output valid elements.
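A small sketch with the selfies package (`pip install selfies`), just to show the round trip; the exact encoded token string below is illustrative and may differ by package version:

```python
import selfies as sf

# Round trip: SMILES -> SELFIES -> SMILES
smiles = "BrCC(=O)O"              # bromoacetic acid
s = sf.encoder(smiles)            # something like '[Br][C][C][=Branch1][C][=O][O]'
print(s)
print(sf.decoder(s))              # decodes back to an equivalent SMILES string

# The robustness property: a sequence of SELFIES tokens decodes to some valid
# molecule instead of raising a syntax error, even a sequence you'd never
# write by hand.
print(sf.decoder("[C][O][=C][Ring1][Branch1]"))
```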
I believe this is the SMILES VAE paper that my own experiments were based on: https://arxiv.org/pdf/1610.02415 (see https://github.com/maxhodak/keras-molecules for an open-source attempt at an implementation).
And this is the paper introducing SELFIES: https://arxiv.org/abs/1905.13741 (the open-source package for working with SELFIES, along with some example training scripts, is at https://github.com/aspuru-guzik-group/selfies; see "Validity of Latent Space in VAE SMILES vs. SELFIES" for more detail on the robustness).
As a side note: even though we put a bunch of effort into reproducing the original SMILES VAE, it was extremely slow to train and not very useful. Now you can just ask Gemini to write a full SELFIES VAE and train it in less than a day on a conventional GPU (thanks, PyTorch transformers!) to get a decent basic set of embeddings for exploring chemical space.
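For what it's worth, the kind of model I mean is nothing exotic. Here's a very rough architecture sketch, not a faithful reproduction of any particular paper, leaving out vocabulary handling, padding masks, KL annealing, and the training loop:

```python
import torch
import torch.nn as nn

class SelfiesVAE(nn.Module):
    """Rough sketch: transformer encoder -> latent z -> transformer decoder."""

    def __init__(self, vocab_size: int, d_model: int = 256, latent_dim: int = 64,
                 nhead: int = 8, num_layers: int = 4, max_len: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.from_z = nn.Linear(latent_dim, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor, bos_id: int = 0):
        # tokens: (batch, seq_len) integer SELFIES token ids
        x = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        h = self.encoder(x).mean(dim=1)                       # mean-pool to one vector
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        memory = self.from_z(z).unsqueeze(1)                  # latent vector as decoder memory

        # Teacher forcing: decoder input is the sequence shifted right by one
        # (bos_id is an assumed start-of-sequence token id), with a causal mask
        # so each position only attends to earlier positions.
        bos = torch.full_like(tokens[:, :1], bos_id)
        shifted = torch.cat([bos, tokens[:, :-1]], dim=1)
        y = self.embed(shifted) + self.pos[:, : shifted.size(1)]
        L = shifted.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=tokens.device), diagonal=1)
        dec = self.decoder(y, memory, tgt_mask=causal)
        logits = self.out(dec)                                # (batch, seq_len, vocab_size)
        return logits, mu, logvar
```

Training is then the usual VAE objective: token-level cross-entropy on the reconstruction plus a (typically annealed) KL term computed from mu and logvar.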