GistNoesis 6 hours ago

Intellectually, I don't like this approach.

Predicting the end result directly from the protein sequence is prone to missing any new phenomenon, and would just regurgitate/interpolate the training datasets.

I would much prefer an approach based on first principles.

In theory folding is easy: it's just running a simulation of your protein surrounded by some water molecules for the same number of nanoseconds nature does.

The problem is that this usually takes a long time, because evolving a system requires computing the energy of the system as a function of the positions of the atoms, which is a complex problem involving Quantum Mechanics. It's mostly due to the behavior of the electrons, but because they are much lighter than the nuclei, they operate on a faster timescale. You typically don't care about them directly, only about the effect they have on your atoms.
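To make "evolving a system" concrete, here is a minimal sketch of a velocity Verlet integrator, a standard MD time-stepping scheme. It assumes a single particle in 1D, and the harmonic spring is only a toy stand-in for the expensive energy model; the names `velocity_verlet`, `spring_force` and `total_energy` are made up for illustration.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Advance one particle in 1D; force(x) is minus the gradient of the energy."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # the costly part in a real simulation
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
    return x, v


# Toy energy model (an assumption for illustration): E(x) = 0.5 * K * x^2.
K = 1.0
MASS = 1.0


def spring_force(x):
    return -K * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * K * x * x


x_end, v_end = velocity_verlet(1.0, 0.0, spring_force, MASS, dt=0.01, n_steps=1000)
```

Each step needs one force evaluation, so the cost of the energy model multiplies directly by the number of steps.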

In the past, you would use various Lennard-Jones potentials for pairs of atoms that are not bonded, and other potentials for pairs that are bonded, and it would get very complex very quickly. But now there are deep-learning-based approaches that compute the energy of the system using a neural network. (See (Gromacs) Neural Network Potentials https://rowansci.com/publications/introduction-to-nnps ). You train these networks to learn the local interactions between atoms from trajectories generated with ab-initio theories. This gives you a faster simulator that approximates the more complex physics. In a sense, it's just using a neural network to tabulate the effect the electrons would have in a specific arrangement of atoms, according to the theory you have chosen.
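For reference, the classic 12-6 Lennard-Jones pair term looks like the sketch below. It is in reduced units with a single `epsilon`/`sigma` for every pair, which is a simplification: real force fields use per-atom-type parameters and add electrostatics and bonded terms. The function names are illustrative, not from any particular package.

```python
import itertools
import math


def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones energy for two non-bonded atoms at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def nonbonded_energy(positions, epsilon=1.0, sigma=1.0):
    """Sum the pair term over all unique pairs of atoms (no cutoff, O(N^2))."""
    total = 0.0
    for a, b in itertools.combinations(positions, 2):
        total += lennard_jones(math.dist(a, b), epsilon, sigma)
    return total


# Three atoms on an equilateral triangle whose side is the LJ minimum distance.
r_min = 2.0 ** (1.0 / 6.0)
trimer = [
    (0.0, 0.0, 0.0),
    (r_min, 0.0, 0.0),
    (r_min / 2.0, r_min * math.sqrt(3.0) / 2.0, 0.0),
]
trimer_energy = nonbonded_energy(trimer)
```

The pair energy is zero at `r = sigma` and reaches its minimum of `-epsilon` at `r = 2^(1/6) * sigma`, so the trimer above sits at roughly `-3 * epsilon`.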

At any time, if you have some doubt, you can always run the slower simulator on a small local neighborhood to check that the effective-field neural network approximation holds.
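That kind of spot check could be sketched like this. Everything here is hypothetical: `fast_energy` stands in for the neural network potential and `reference_energy` for the slower ab-initio calculation, with a toy 1D function and its quadratic approximation playing those roles.

```python
import math


def spot_check(fast_energy, reference_energy, configurations, tolerance):
    """Return (configuration, error) pairs where the fast model is off by more than tolerance."""
    suspicious = []
    for config in configurations:
        error = abs(fast_energy(config) - reference_energy(config))
        if error > tolerance:
            suspicious.append((config, error))
    return suspicious


# Toy stand-ins (assumptions for illustration only).
def reference_energy(x):
    return 1.0 - math.cos(x)   # the "slow, trusted" model


def fast_energy(x):
    return 0.5 * x * x         # cheap approximation, only good near x = 0


flagged = spot_check(fast_energy, reference_energy, [0.1, 0.5, 2.0], tolerance=0.01)
```

Here only the configuration far from the region where the approximation was fitted gets flagged, which is the case where you would fall back to the slower method.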

Only then, once you have a simulator that is able to fold, can you generate a dataset of ("protein sequence", "end of trajectory") pairs to learn the shortcut like AlphaFold/SimpleFold do. And when in doubt, you can go back to the slower, more precise method.

If you had enough data and could perfectly train a model with sufficient representational power, you could theoretically infer the correct physics just from the correspondence between initial and final arrangements. But if you don't have enough data, it will just learn some shortcut, and you have to accept that it will be wrong sometimes.

slashdave 5 hours ago | parent [-]

> it's just running a simulation of your protein surrounded by some water molecules for the same number of nanoseconds nature does.

No, the environment is important. Also, some proteins fold while being synthesized.

Folding can also take minutes in some cases, which is the real problem.
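A back-of-the-envelope sketch of why minutes is a problem, assuming a 2 fs integration timestep (a common choice in classical MD, but an assumption here, as is the one-minute folding time):

```python
TIMESTEP_S = 2e-15  # assumed integration timestep: 2 femtoseconds


def steps_needed(simulated_seconds, timestep_s=TIMESTEP_S):
    """Number of integration steps required to cover the given simulated time."""
    return simulated_seconds / timestep_s


steps_for_one_microsecond = steps_needed(1e-6)
steps_for_one_minute = steps_needed(60.0)
```

Under those assumptions a microsecond is about 5e8 steps, while a minute is about 3e16 steps, each of which needs a full force evaluation.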

> which is a complex problem involving Quantum Mechanics

Most MD simulations use classical approximations, and I don't see why folding is any different.