Molecule fingerprints can be used for fast substructure searching because, if a particular substructure appears in a molecule, then all bits set in the substructure's fingerprint will also be set in the molecule's fingerprint.
Prepare a dataset of fingerprints
First import one hundred thousand molecules randomly selected from the ChEMBL database.
It is important to end the expression with a semicolon here: this suppresses the output, so the system does not try to format each molecule.
Now precompute the fingerprints for these molecules, using the "BitVector" output format:
In[140]:=
fprints = PatternFingerprint[mols, "BitVector"];
Search the dataset for a query molecule
Now that the pattern fingerprints are precomputed, we can search through them for a query substructure. This works because every bit set in the query fingerprint will also be set in the fingerprint of any molecule containing the query.
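The bit-containment test behind this prescreen can be illustrated independently of any chemistry toolkit. The following is a minimal sketch in Python, not the code used in this notebook: fingerprints are modeled as plain integers, the 8-bit values are made up for illustration, and the helper names `contains_all_bits` and `prescreen` are hypothetical.

```python
def contains_all_bits(query_fp: int, mol_fp: int) -> bool:
    """Return True if every bit set in query_fp is also set in mol_fp."""
    return (query_fp & mol_fp) == query_fp


def prescreen(query_fp: int, mol_fps: list[int]) -> list[int]:
    """Return the indices of fingerprints that pass the bit-containment test."""
    return [i for i, fp in enumerate(mol_fps) if contains_all_bits(query_fp, fp)]


# Toy 8-bit fingerprints (illustrative values, not real molecule data).
mol_fps = [0b10110110, 0b00100100, 0b11111111, 0b10010010]
query_fp = 0b00100110

# Only these candidates need the slower, exact substructure match.
candidates = prescreen(query_fp, mol_fps)
print(candidates)
```

A molecule that passes the test is only a candidate: different substructures can set the same bits, so each hit still has to be confirmed with a full substructure match. A molecule that fails the test can be discarded without further work, and skipping those exact matches is where the time is saved.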
The speedup from using fingerprints to prescreen is substantial:
In[53]:=
%〚1〛/%%%〚1〛
Out[53]=
68.2037
Of course, computing the fingerprints all at once is an expensive operation, and the time savings are fully realized only when searching for many substructures in a large set of molecules.