molecular modeling framework
qchem is a molecular modeling framework that brings together PyTorch Geometric (pyg) and RDKit and is built on the cosmosis framework.
pyg is a library built on pytorch for writing and training graph neural networks. Graph structure generalizes well to the modeling of molecules, with atoms as nodes and bonds as edges. The concepts of symmetry, invariance and equivariance that arise in molecular modeling are fundamental to all of machine learning. You also get to work from "first principles" and model nature, with the potential to unlock greater things. For these reasons modeling molecules is pretty interesting. rdkit is a toolkit for cheminformatics and cosmosis is a machine learning framework.
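To make the atoms-as-nodes, bonds-as-edges idea concrete, here is a minimal sketch in plain Python, so it runs without pyg or rdkit installed. It describes the heavy atoms of ethanol and builds a 2 x E edge list with each bond stored in both directions, which is the layout pyg conventionally uses for `edge_index`. The function name `to_edge_index` and the bond type strings are assumptions made for illustration; they are not part of qchem.

```python
# Heavy-atom graph of ethanol (C-C-O): nodes are atoms, edges are bonds.
atomic_numbers = [6, 6, 8]  # node features: C, C, O
bonds = [(0, 1, "SINGLE"), (1, 2, "SINGLE")]  # (atom_i, atom_j, bond type)


def to_edge_index(bonds):
    """Return a 2 x E edge list with every bond stored in both directions."""
    sources, targets, edge_types = [], [], []
    for i, j, bond_type in bonds:
        sources += [i, j]
        targets += [j, i]
        edge_types += [bond_type, bond_type]
    return [sources, targets], edge_types


edge_index, edge_types = to_edge_index(bonds)
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

Storing both directions lets a message passing layer treat the undirected bond as two directed edges, so information flows both ways along it.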
This quote from the paper Machine Learning for Molecular Simulation sums it up nicely.
First we need some data with which to train our models. The data is obtained from computationally intensive simulations that scale as O(N^3) or worse with molecule size and can take days of high-performance computing per molecule!
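As a rough feel for what cubic scaling means, the sketch below computes the relative cost of two system sizes under an assumed O(N^exponent) law. The helper `relative_cost` is hypothetical, and N stands loosely for system size; real simulation cost also depends on the method and its prefactors.

```python
def relative_cost(n_small, n_large, exponent=3):
    """Cost ratio between two system sizes under O(N**exponent) scaling."""
    return (n_large / n_small) ** exponent


print(relative_cost(10, 20))   # 8.0
print(relative_cost(10, 100))  # 1000.0
```

Doubling the system size costs about eight times as much, which is why a learned model that predicts the same properties is attractive.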
qchem contains several molecule datasets created from open sources.
These datasets are flexible so they can be optimized for the model, hardware and experiments being run. A dataset can be created with all possible features or a minimal set, in advance or at runtime, allowing for greed or thrift with memory and compute and enabling the exploration of different combinations of features and transformations.
Each feature can be transformed and routed independently. There is support for tokenization, encoding and embedding of string/text and categorical data.
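The encode-then-embed path for a categorical feature can be sketched as below. This is plain Python rather than qchem's own transforms, whose API is not shown here: the vocabulary, the randomly initialised embedding table and the bond length values are all illustrative assumptions. In a real model the table would be a trainable layer.

```python
import random

bond_types = ["SINGLE", "DOUBLE", "SINGLE", "AROMATIC"]

# Encode: give each distinct category an integer id in first-seen order.
vocab = {}
for bond_type in bond_types:
    vocab.setdefault(bond_type, len(vocab))
encoded = [vocab[b] for b in bond_types]
print(encoded)  # [0, 1, 0, 2]

# Embed: look up one vector per id (randomly initialised for this sketch).
random.seed(0)
embedding_dim = 4
table = [[random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]
embedded = [table[i] for i in encoded]

# Concatenate with a continuous feature (illustrative bond lengths in angstroms).
bond_lengths = [1.54, 1.34, 1.54, 1.40]
features = [vector + [length] for vector, length in zip(embedded, bond_lengths)]
print(len(features), len(features[0]))  # 4 5
```

Identical categories share the same embedding vector, so the two SINGLE bonds differ only in whatever continuous features are appended.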
Now we need a model. We could use an off-the-shelf model from pyg, pytorch, sklearn, huggingface, etc., or create a custom model from scratch.
GModel inherits from CModel and adds support for graph neural networks.
A custom model can be created by simply implementing the build() method of the GModel class.
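The general pattern of a base class that calls a subclass-supplied build() can be sketched as follows. `BaseModel` and `TinyModel` are stand-ins written for this illustration and are not qchem's GModel or its real signature; the `scale` and `depth` parameters are likewise hypothetical.

```python
class BaseModel:
    """Stand-in base class (NOT qchem's GModel): it calls build() on init."""

    def __init__(self, model_params):
        self.layers = self.build(**model_params)

    def build(self, **kwargs):
        raise NotImplementedError("subclasses define the architecture here")

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class TinyModel(BaseModel):
    def build(self, scale=1.0, depth=1):
        # Hypothetical parameters: each "layer" just multiplies by scale.
        return [lambda x, s=scale: x * s for _ in range(depth)]


model = TinyModel({"scale": 2.0, "depth": 3})
print(model.forward(1.0))  # 8.0
```

Because the architecture is determined entirely by the dictionary passed in, trying a different model is a matter of changing parameters rather than code.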
This graph model can easily be modified via the build parameters, enabling a search of the model space.
The environment can be created with just a few commands. In the environment's jupyterlab notebook, import the classes.
Input the parameters and create the learner.
In this toy but fully functional example the QM9 dataset is used. Its 133,885 molecules reside as compressed text on the hard drive. First the dataset is decompressed and traversed, filtering and recovering the molecules and features selected by the dataset parameters and creating Molecule object instances. Each instance calls upon rdkit (the cheminformatics library) and other utilities to augment the raw data. This enriched dataset can be pickled or created at runtime.
At runtime the tensors are collected, transformed and routed to the model on the gpu. Categorical features such as bond_type are encoded, embedded and concatenated with the continuous features. The GraphNet model allows the graph convolution to be selected as a model parameter. In this case NetConv, an edge-conditioned message passing convolution, is chosen.
The molecules being modeled differ in size; they have different numbers of atoms. After the message passing iterations, in which the node/atom level data is propagated along the edges/bonds, the resulting embeddings have to be padded, pooled or sampled to make them uniform in dimension. In this case they are pooled and then passed through a feedforward network, which reduces the output to a single continuous value that is compared to the target value. The mean squared error is then backpropagated through the model and the Optimizer makes a small correction to the weights. The Metrics instance in the learner tracks, records and displays this iterative training process. The Learn instance also handles the saving and restoring of models.
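The message passing, pooling and loss steps described above can be sketched in plain Python. This is a simplified stand-in: it uses plain sum aggregation over neighbours rather than the edge-conditioned convolution, a fixed linear readout instead of a feedforward network, and made-up feature, weight and target values. All function names here are assumptions for illustration, not qchem or pyg calls.

```python
def message_pass(node_feats, edge_index):
    """One round: each node adds the sum of its neighbours' features to its own."""
    sources, targets = edge_index
    updated = [list(feats) for feats in node_feats]
    for s, t in zip(sources, targets):
        for k, value in enumerate(node_feats[s]):
            updated[t][k] += value
    return updated


def mean_pool(node_feats):
    """Collapse a variable number of node vectors into one fixed-size vector."""
    n = len(node_feats)
    return [sum(column) / n for column in zip(*node_feats)]


def mse(predictions, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Two molecules with different atom counts but the same feature width (2).
mol_a = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[0, 1, 1, 2], [1, 0, 2, 1]])
mol_b = ([[2.0, 0.0], [0.0, 2.0]], [[0, 1], [1, 0]])

# After pooling, both molecules are described by a vector of length 2.
pooled = [mean_pool(message_pass(x, ei)) for x, ei in (mol_a, mol_b)]

weights = [0.5, 0.5]    # hypothetical readout weights
targets = [1.0, 2.0]    # made-up target values for illustration
predictions = [sum(w * v for w, v in zip(weights, vec)) for vec in pooled]
loss = mse(predictions, targets)
print(pooled, predictions, loss)
```

Pooling is what lets a three-atom and a two-atom molecule share the same readout; in training, the gradient of the loss with respect to the weights would then drive the optimizer step.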