This is a blog for our paper Molecule Generation by Principal Subgraph Mining and Assembling which is accepted by NeurIPS 2022. Molecule generation is critical for various applications in domains like drug discovery and material science. Current attention has been paid to generating molecular graphs based on subgraphs. However, there are two critical problems in existing subgraph-level generation methods. First, these methods usually construct the vocabulary of subgraphs with external chemical libraries (Yang et al., 2021) or with hand-crafted rules (Jin et al., 2020). From the perspective of data science, these subgraphs are not designed to capture the combination patterns of atoms and bonds in molecules. The second problem is that they assemble the subgraphs in a sequential manner, which focuses mostly on the local arrangement. To tackle the above problems, we propose a novel notion, principal subgraph, along with an theoretically efficient algorithm to extract them to construct the vocabulary. We further propose a two-step generation framework which first generate a sequence of subgraphs and then globally assemble them into a molecule.
Principal Subgraph
There are some patterns in molecules. For example, below are five subgraphs that frequently occurs in the ZINC250K dataset, each of which is labeled with the ratio of molecules containing it at the bottom:
Intuitively, using these frequent subgraphs for generation helps the model better capture the complicated distribution of molecular graphs. Moreover, some frequent subgraphs are closely related to molecular properties (the most common example will be functional groups). Therefore, using them for generation may also improve the ability of the model to optimize molecular properties. And here comes the question: how to discover them from a given set of molecules? We know that frequent subgraph mining is an NP-hard problem so that we cannot just simply enumerate all possible subgraphs and sort them by frequencies. Fortunately, we propose an approximate solution to avoid the unaffordable efficiency problem while still ensuring the quality of the extracted subgraphs. To introduce our solution, we need to first define a novel and powerful notion: Principal Subgraph (PS).
What is Principal Subgraph
A molecule can be represented as a graph
With the aforementioned notations, we can give the definition of
Principal Subgraph. We call subgraph
The definition might be a little tricky, but in naturally language the definition means that amongst all subgraphs of the larger frequency, a principal subgraph basically represents the largest repetitive pattern in size within the data. It is desirable to leverage patterns of this kind as the building blocks for molecule generation since those subgraphs with a larger size than them are less frequent/reusable. Next, we will propose an algorithm to extract principal subgraphs for generation.
How to Extract Principal Subgraphs
We have defined the concept of spatially intersects
beforehand. Similarly, we can define sptially union as follows:
if two subgraphs
- Initialization. We first decide the size of the
vocabulary as
. The vocabulary is initialized with all unique atoms (subgraph with one node). Namely, all atoms are included in the vocabulary. - Merge. For every two neighboring fragments
and in the current vocabulary, we merge them by deriving the spatial union . Here, the neighboring fragments of a given fragment in a molecule is defined as the ones that contain at least one first-order neighbor nodes of a certain node in . - Update. We count the frequency of each identical
merged subgraph in the last stage. We choose the most frequent one as a
new fragment in the vocabulary
. Then, we go back to the merge stage until the vocabulary size reaches the predefined number .
Each subgraph can be seen as a small molecule, therefore we can translate subgraphs into SMILES to transform the graph matching problem into string matching problem.
Here we present an example of implementing the algorithm on a toy
dataset of three molecules:
Let's get back to the construction of the vocabulary. Now we have
Below is the pseudo code for the above algorithm:
The proposed algorithm enjoys the following properties, which ensure its efficacy:
- Monotonicity: The frequency of the non-single-atom
fragments in
decreases monotonically, namely , if . - Significance: Each fragment
in is a principal subgraph. - Completeness: For any principal subgraph
arising in the dataset, there always exists a fragment in satisfying , when has collected all fragments with frequency no less than .
These conclusions are interesting and valuable. Monotonicity ensures that the subgraphs with higher frequencies are always extracted before those with lower frequencies. This is important because subgraphs with higher frequencies are more likely to reflect the frequent patterns and should be included into the vocabulary earlier. Significance indicates that each extracted subgraph is a principal subgraph that basically represents the “largest” repetitive pattern in size within the data. Completeness means our algorithm is expressive enough to represent (at least contain) any potential principal subgraph. For proof of these conclusions, please refer to our paper.
Let's take a look back at the example of toy dataset. Even if the
previously mentioned ambiguity happened and the 3rd and 4th carbons are
merged in the third molecule, the final vocabulary will include a
fragment which equals to or contains
We provide some PS from the vocabulary constructed from ZINC250K and visualize them below. We found the constructed vocabulary really captures patterns in the dataset.
Subgraph-level Decomposition
Now we have defined principal subgraph and proposed an efficient
algorithm to extract them. The only remaining question is that how do we
represent a molecule with the constructed vocabulary. By saying
"represent a molecule" we mean decompose the molecule into the subgraphs
in the given vocabulary. A decomposition of a molecule
Two-step Generation
With the vocabulary of subgraphs, we generate molecules in two steps: first predicting which subgraphs should be selected and than assembling them globally. We use VAE-based generation framework and the overview of our model is depicted in the figure below:
The encoding of molecular graphs into latent variables can be obtained by an arbitrary graph neural network. We use GIN (Xu et al. 2019) in our paper. We want to emphasize that the subgraph-level information is integrated into the model by adding a fragment embedding to the atom nodes according to which subgraph they are in. Since the vocabulary consists of principal subgraphs in our paper, we name this generation framework as PS-VAE.
Generation of Subgraphs
Given a latent variable variable
Global Assembling of Subgraphs
The generated fragment set can be seen as a disconnected molecular
graph where bonds between the subgraphs are missing. We formalize bond
completion as a link prediction task which is familiar to the GNN
community. Specifically, we implement message passing on the atom-level
incomplete graph. Then, given node
We visualize some generated molecules below:
Property Optimization
In real scenarios concerning molecule generation, we usually need to generate molecules with optimized properties. We consider the setting where the property score can be given by some black-box scorers (e.g. computational methods, efficient wetlab methods, ...). We first train a predictor on the latent space of our PS-VAE to simulate the given scorers. Then we perform gradient ascending on the latent space to search for an optimized latent variable that gives high predicted property score, which shows promise for decoding into an optimized molecule. We conduct experiments on two widely used properties: Penalized logP and QED:
Please refer to our paper for more experimental results and detailed descriptions.
Analysis
Proper Size of Vocabulary
A larger
Correlations Between Principal Subgraphs and Properties
We may also wonder whether there truely exists correlations between the extracted PS and molecular properties, and whether PS-VAE can discover and utilize them. To analyze this, we present the normalized distribution of generated fragments and Pearson correlation coefficient between the fragments and Penalized logP (PlogP) in the figure below:
By saying "normalized distribution" we mean the frequencies of each fragment is divided by their frequencies in the dataset. Therefore, in non-optimization settings, it is expected that each fragment has a normalized frequency of 1 because our model is supposed to fit the distribution of the dataset. This is indeed observed in the figure, where the blue bins are approximately of the same height indicating a frequency of 1. Compared with the flat distribution under the non-optimization setting, the generated distribution shifts towards the fragments positively correlated with PlogP under the PlogP-optimization setting. The generation of fragments negatively correlated with PlogP is also suppressed. Therefore, we can draw the conclusion that correlations exist between fragments and PlogP, and our model can accurately discover and utilize these correlations.
Discussion
Though we have conducted extensive experiments to validate the efficacy of principal subgraphs, they are still preliminary attempts. We think there are a lot more domains that can utilize principal subgraphs for enhancement, as well as more efforts to improve the extraction algorithm. For example, currently the subgraph-level decomposition of molecules are merely implemented on the nodes. If we can also upgrade the edges to subgraph-level, it is possible to upgrade all atom-level models to their subgraph-level counterparts with only replacement of the vocabulary. Further, domains like pretraining or property prediction on the molecules may also extract abundant information from the subgraph-level representations of molecules. To conclude, we think our work provides insights into the selection of subgraphs on molecular representations and can inspire further search in this direction.
Contact
For further discussion, please contact Xiangzhe Kong (jackie_kxz@outlook.com)
v1.5.2