WolframChemistry/MoleculeFingerprints

Substructure Screening

Molecule fingerprints can be used for fast substructure searching because, if a particular substructure appears in a molecule, then all bits set in the substructure's fingerprint will also be set in the molecule's fingerprint.
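The subset property can be illustrated with plain integers standing in for fingerprints. This is a minimal sketch, not the paclet's implementation: the 8-bit values and the helper name passesScreenQ are invented for illustration, and only the built-in BitAnd is used.

```wolfram
(* Toy fingerprints written as base-2 integers; illustrative values only *)
queryBits = 2^^100101;
hitBits   = 2^^10110101;  (* has every bit of queryBits set, plus others *)
missBits  = 2^^10110001;  (* lacks one of the query's bits *)

(* Hypothetical helper: a candidate passes the screen when ANDing it
   with the query gives back the query unchanged *)
passesScreenQ[q_Integer, m_Integer] := BitAnd[q, m] == q

passesScreenQ[queryBits, hitBits]   (* True *)
passesScreenQ[queryBits, missBits]  (* False *)
```

Passing the screen is a necessary condition for containing the substructure, not a sufficient one, which is why a full substructure test is still needed on the survivors.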

Prepare a dataset of fingerprints

First import one hundred thousand molecules randomly selected from the ChEMBL database.

In[22]:=

mols=Import[PacletObject["WolframChemistry/MoleculeFingerprints"]["AssetLocation","SMILES strings for 100K molecules from ChEMBL"]];

It is important to end the expression with a semicolon here, suppressing the output so that the system does not try to format each molecule.

Now precompute the fingerprints for these molecules, using the "BitVector" output format:

In[140]:=

fprints=PatternFingerprint[mols,"BitVector"];

Search the dataset for a query molecule

With the pattern fingerprints precomputed, we can search through them for a query substructure. A molecule can contain the query only if every bit set in the query fingerprint is also set in the molecule's fingerprint.

In[47]:=

query=Molecule["caffeine"];MoleculePlot[query]

Out[48]=

Compute the query fingerprint:

In[49]:=

queryFP=PatternFingerprint[query,"BitVector"]

Out[49]=

DataStructure[Type: BitVector, Capacity: 2048]



Use Pick to find the molecules whose fingerprints contain the query:

In[50]:=

prescreened=Pick[mols,queryFP["Copy"]["BitAnd",#]===queryFP&/@fprints];//AbsoluteTiming

Out[50]=

{0.309261,Null}

Molecules that do not contain the query can still have all of the query's bits set, because of bit collisions, so the prescreened list may include false positives:

In[51]:=

CountsBy[prescreened,MoleculeContainsQ[query,IncludeHydrogens->False]]

Out[51]=

True167,False26

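The fingerprints quickly screen out the vast majority of the molecules, so the full substructure test only has to be run on the 193 prescreened candidates. For comparison, run the full substructure search on every molecule: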

In[52]:=

CountsBy[mols,MoleculeContainsQ[query,IncludeHydrogens->False]]//AbsoluteTiming

Out[52]=

{21.0927,<|False->99833,True->167|>}

Dividing the full-search time by the prescreen time shows that the speedup from prescreening with fingerprints is large:

In[53]:=

%〚1〛/%%%〚1〛

Out[53]=

68.2037

Of course, computing the fingerprints for the whole dataset is itself an expensive operation, so the time savings are fully realized only when searching for many substructures in a large set of molecules.
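A rough break-even estimate makes this trade-off concrete. The sketch below is an assumption-laden illustration rather than a benchmark: it re-times the fingerprint computation (which repeats the expensive step), reuses the two per-query timings reported above, ignores the time needed to confirm prescreened candidates, and introduces the names tFP, tFull, tScreen and breakEven for illustration.

```wolfram
(* Time the one-off fingerprint computation; this repeats the expensive step *)
tFP = First[AbsoluteTiming[PatternFingerprint[mols, "BitVector"];]];

(* Per-query timings reported above, in seconds *)
tFull = 21.0927;     (* full substructure search over all molecules *)
tScreen = 0.309261;  (* fingerprint prescreen only *)

(* Smallest number of queries for which precomputing the fingerprints pays off *)
breakEven = Ceiling[tFP/(tFull - tScreen)]
```

Timings vary by machine and by query, so the result is best read as an order-of-magnitude guide.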