whitten 21 hours ago

Does the SMILE (or Simplified Molecular Input Line Entry System) code have an EBNF definition? https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Lin... claims there is a context-free grammar.

dalke 14 hours ago

That's "SMILES".

Yes. Here is the yacc grammar for the SMILES parser in the RDKit. https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Smi...

There's also one from OpenSMILES at http://opensmiles.org/opensmiles.html#_grammar . It has a shift/reduce error (as I recall) that I was not competent enough to fix.

I prefer to parse almost completely in the lexer, with a small amount of lexer state to handle balanced parens, bracket atoms, and matching ring closures. See https://hg.sr.ht/~dalke/opensmiles-ragel and more specifically https://hg.sr.ht/~dalke/opensmiles-ragel/browse/opensmiles.r... .
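To make the "mostly in the lexer" idea concrete, here is a minimal sketch of my own. It is not taken from opensmiles-ragel or the RDKit; the names `TOKEN`, `ALLOWED`, and `check_smiles` are made up for illustration. It assumes a simplified subset of SMILES: organic-subset atoms, bracket atoms treated as opaque tokens with no validation of their contents, and no chemistry checks. The only state beyond "what kind of token came last" is a paren depth counter and a set of open ring-closure labels.

```python
import re

# One regex alternative per token kind. Bracket atoms are matched as
# opaque tokens; their contents are not validated in this sketch.
TOKEN = re.compile("|".join([
    r"(?P<atom>\[[^\[\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops])",
    r"(?P<bond>[-=#$:/\\])",
    r"(?P<ring>%\d\d|\d)",
    r"(?P<open>\()",
    r"(?P<close>\))",
    r"(?P<dot>\.)",
]))

# Lexer state transitions: which token kinds may follow the previous one.
ALLOWED = {
    "start": {"atom"},
    "atom": {"atom", "bond", "ring", "open", "close", "dot"},
    "bond": {"atom", "ring"},
    "ring": {"atom", "bond", "ring", "open", "close", "dot"},
    "open": {"atom", "bond", "dot"},
    "close": {"atom", "bond", "open", "close", "dot"},
    "dot": {"atom"},
}


def check_smiles(smiles):
    """Return (ok, message) for a simplified syntactic check."""
    state, depth, open_rings, pos = "start", 0, set(), 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            return False, f"unrecognized character at position {pos}"
        kind = m.lastgroup
        if kind not in ALLOWED[state]:
            return False, f"unexpected {kind} at position {pos}"
        if kind == "open":
            depth += 1
        elif kind == "close":
            if depth == 0:
                return False, f"unbalanced ')' at position {pos}"
            depth -= 1
        elif kind == "ring":
            # A label opens a ring the first time and closes it the second.
            open_rings ^= {int(m.group().lstrip("%"))}
        state, pos = kind, m.end()
    if state not in {"atom", "ring", "close"}:
        return False, "input ends where an atom was expected"
    if depth:
        return False, "unclosed branch"
    if open_rings:
        return False, "unclosed ring closure"
    return True, "ok"
```

For example, `check_smiles("C1=CC=CC=C1")` and `check_smiles("CC(=O)O")` pass, while `check_smiles("C1CC")` fails on the unclosed ring. A real parser would also record the atoms and bonds as it goes; this sketch only accepts or rejects.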

dalke 4 hours ago

Oh, I should have pointed out my Python lexer-driven parser at https://hg.sr.ht/~dalke/smiview/browse/smiview.py

The lexer: https://hg.sr.ht/~dalke/smiview/browse/smiview.py?rev=tip#L3...

The lexer state transitions: https://hg.sr.ht/~dalke/smiview/browse/smiview.py?rev=tip#L3...

dekhn 13 hours ago

I wrote a very simple SMILES parser using pyparsing: https://github.com/dakoner/smilesparser/tree/master . I wouldn't say it's intended for production work, but it has been useful in situations where I didn't want to pull in rdkit.

dalke 5 hours ago

I see you include the dot disconnect "." as part of the Bond definition.

You also define Chain as:

  Chain <<= pp.Group(pp.Optional(Bond) + pp.Or([Atom, RingClosure]))

I believe this means your grammar allows the invalid SMILES C=.N

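
One way to avoid that class of problem is to keep the dot out of the bond alternatives, so that a connector between two atoms is either a single bond symbol or a single dot, never both. The sketch below is an assumption-laden illustration using only the standard `re` module. It is not the smilesparser grammar, it handles only unbranched chains without ring closures, and the names `ATOM`, `BOND`, `LINEAR`, and `is_linear_smiles` are hypothetical.

```python
import re

# Simplified atom: an opaque bracket atom or an organic-subset symbol.
ATOM = r"(?:\[[^\[\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops])"
# Bond symbols only; the dot is deliberately NOT part of this class.
BOND = r"[-=#$:/\\]"

# atom ( (bond | dot)? atom )*  -- at most one connector between atoms.
LINEAR = re.compile(rf"{ATOM}(?:(?:{BOND}|\.)?{ATOM})*")


def is_linear_smiles(text):
    """True if text is an unbranched, ring-free chain in this subset."""
    return LINEAR.fullmatch(text) is not None
```

With this pattern, `is_linear_smiles("C=N")` and `is_linear_smiles("C.N")` are accepted and `is_linear_smiles("C=.N")` is rejected, because a bond symbol must be followed directly by an atom.
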