Welcome to the MACAW documentation!
MACAW (Molecular AutoenCoding Auto-Workaround) is a cheminformatic tool for Python that embeds molecules in a low-dimensional, continuous numeric space. The embeddings are molecular features that can be used as inputs in mathematical and machine-learning models.
MACAW embeddings can be used as an alternative to conventional molecular descriptors. They are fast and easy to compute, require no variable selection, and may enable more accurate predictive models than conventional molecular descriptors.
MACAW also provides original algorithms to generate molecular libraries and to evolve molecules in silico to meet a desired specification (inverse molecular design). The design specification can be any property or combination of properties that can be predicted for the molecule, such as its octane number or its binding affinity to a protein. Details about the algorithms can be found in the MACAW publication.
Installation
MACAW requires rdkit 2020.09.4 or later to run, which can be installed using conda:
conda install -c conda-forge rdkit
Alternative methods to install rdkit are given here.
Warning
rdkit has to be installed manually and is not automatically installed by pip as a dependency.
Then run the following command to install MACAW:
pip install macaw_py
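Because rdkit is not pulled in by pip, it can be worth confirming that a suitable version is importable before using MACAW. The following is a minimal sketch of such a check; the helper names parse_version and rdkit_is_recent_enough are illustrative and not part of MACAW or rdkit.

```python
import importlib
import importlib.util
import re

# Minimum rdkit version stated in the installation notes.
MIN_RDKIT = (2020, 9, 4)


def parse_version(text):
    """Turn a version string such as '2020.09.4' into a tuple of ints."""
    return tuple(int(number) for number in re.findall(r"\d+", text)[:3])


def rdkit_is_recent_enough(minimum=MIN_RDKIT):
    """Return True if rdkit can be imported and meets the minimum version."""
    if importlib.util.find_spec("rdkit") is None:
        return False  # rdkit is not installed in this environment
    rdkit = importlib.import_module("rdkit")
    return parse_version(rdkit.__version__) >= minimum


if __name__ == "__main__":
    print("rdkit OK" if rdkit_is_recent_enough() else "install or upgrade rdkit")
```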
Usage
The different MACAW functions can be imported in Python using:
from macaw import *
Molecule embedding
- class MACAW(type_fp='Morgan2', metric='Tanimoto', n_components=15, algorithm='MDS', n_landmarks=50, Yset=10, idx_landmarks=None, random_state=None)
Class providing MACAW numeric embeddings of molecules.
- Parameters
type_fp (str, optional) – Desired type of fingerprint to use to characterize molecules. Options include ‘RDK5’, ‘RDK7’, ‘Morgan2’, ‘Morgan3’, ‘featMorgan2’, ‘featMorgan3’, ‘Avalon’, ‘MACCS’, ‘atompairs’, ‘torsion’, ‘pattern’, ‘secfp6’, and ‘layered’. Combinations can also be specified with the ‘+’ symbol, e.g. ‘RDK5+MACCS’. Defaults to ‘Morgan2’.
metric (str, optional) – Distance metric used to measure similarity between molecular fingerprints. Options include ‘Tanimoto’, ‘dice’, ‘cosine’, ‘Sokal’, ‘Kulczynski’, ‘Mcconnaughey’, ‘Braun-Blanquet’, ‘Rogot-Goldberg’, ‘asymmetric’, ‘Manhattan’, and ‘Blay-Roger’. Defaults to ‘Tanimoto’.
n_components (int, optional) – Number of dimensions for the embedding. Defaults to 15.
algorithm (str, optional) – Algorithm to use for the projection. Options available are ‘MDS’, ‘isomap’, ‘PCA’, ‘ICA’, ‘FA’, and ‘umap’. Defaults to ‘MDS’.
n_landmarks (int, optional) – Desired number of landmark molecules to use. Defaults to 50.
Yset (str or int, optional) – Specifies how to use the input in Y during fitting, if provided. Options include ‘highest’ and ‘lowest’. If an integer is provided, the molecules are split into Yset bins and landmarks are sampled uniformly across the bins. Defaults to 10.
idx_landmarks (list, optional) – List indicating the indices of the molecules to be used as landmarks.
random_state (int, optional) – Seed to have the same choice of landmarks across runs.
Note
If the Y argument is provided during fitting, it will be used in the choice of the landmarks. If Yset is an integer, then the dataset will be split in Yset bins according to Y and landmarks will be sampled from the bins with equal probability. If Yset is set to ‘highest’ or ‘lowest’, then the landmarks will be the n_landmarks molecules with the highest or lowest Y values, respectively.
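The landmark choice described in this note can be sketched in plain Python. This is an illustration only, not MACAW's code: the function name choose_landmarks is hypothetical, and the use of equal-width bins over the range of Y is an assumption, since the note does not say how the bins are built.

```python
import random


def choose_landmarks(y, n_landmarks=50, yset=10, seed=None):
    """Return indices of landmark molecules chosen with the help of y."""
    n_landmarks = min(n_landmarks, len(y))
    order = sorted(range(len(y)), key=lambda i: y[i])  # ascending by y
    if yset == "lowest":
        return order[:n_landmarks]
    if yset == "highest":
        return order[len(y) - n_landmarks:]

    # Integer yset. ASSUMPTION: equal-width bins over the range of y.
    lo, hi = min(y), max(y)
    width = (hi - lo) / yset or 1.0  # avoid a zero width when all y are equal
    bins = [[] for _ in range(yset)]
    for i, value in enumerate(y):
        bins[min(int((value - lo) / width), yset - 1)].append(i)

    rng = random.Random(seed)
    chosen = []
    while len(chosen) < n_landmarks:
        nonempty = [members for members in bins if members]
        members = rng.choice(nonempty)  # each non-empty bin is equally likely
        chosen.append(members.pop(rng.randrange(len(members))))
    return chosen
```

Sampling bins with equal probability, rather than sampling molecules directly, gives sparsely populated regions of Y the same chance of contributing a landmark as crowded ones.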
Warning
No attribute should be modified directly. Instead, setter methods are available for the modifiable attributes: set_type_fp(), set_metric(), set_Y(), set_n_components(), and set_algorithm().
- set_type_fp(type_fp)
Method to change the type_fp used in an existing MACAW object.
- set_metric(metric)
Method to change the metric used in an existing MACAW object.
- set_n_components(n_components)
Method to change the n_components used in an existing MACAW object.
- set_algorithm(algorithm)
Method to change the algorithm used in an existing MACAW object.
- fit(smiles, Y=None)
Method to select the landmarks and initialize the MACAW embedding space.
- Parameters
smiles (list) – List of molecules given in SMILES format.
Y – List of property values of interest, one for each molecule in smiles. If provided, it may help choose a more diverse set of landmark molecules.
- transform(qsmiles)
Method to embed a list of molecules in an existing MACAW space.
- Parameters
qsmiles (list) – List of query molecules to be embedded given in SMILES format.
- Returns
A 2D array such that each row is the embedding of each qsmiles molecule.
- Return type
numpy.ndarray
Note
If any invalid SMILES is encountered in the input, the corresponding row in the output will be filled with NaN values.
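Rows filled with NaN usually need to be set aside before the embeddings are passed to a model. A minimal sketch of one way to locate them follows; the helper split_valid_rows is hypothetical and works on any sequence of rows, including a 2D numpy.ndarray.

```python
import math


def split_valid_rows(embeddings):
    """Return (valid_indices, invalid_indices) for a 2D embedding.

    A row counts as invalid if any of its entries is NaN, which is how
    transform marks molecules whose SMILES could not be parsed.
    """
    valid, invalid = [], []
    for index, row in enumerate(embeddings):
        if any(math.isnan(value) for value in row):
            invalid.append(index)
        else:
            valid.append(index)
    return valid, invalid
```

The valid indices can then be used to subset both the embedding and any property values that accompany the query molecules, so the two stay aligned.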
- fit_transform(qsmiles, Y=None)
Combination of the fit and transform methods.
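As a usage sketch, the methods documented above can be combined as follows. The helper name embed_molecules and the settings shown are illustrative placeholders, not recommendations; macaw is imported inside the function so that defining the helper does not itself require the package.

```python
def embed_molecules(train_smiles, train_y, query_smiles):
    """Fit a MACAW embedder on training data and embed query molecules."""
    # Lazy import: requires macaw_py (and rdkit) to be installed.
    from macaw import MACAW

    # All keyword arguments are documented constructor parameters;
    # the particular values are placeholders for illustration.
    mcw = MACAW(
        type_fp="Morgan2",
        metric="Tanimoto",
        n_components=15,
        n_landmarks=50,
        random_state=0,
    )
    mcw.fit(train_smiles, Y=train_y)        # selects the landmarks
    x_query = mcw.transform(query_smiles)   # one row per query molecule
    return mcw, x_query
```

Returning the fitted object alongside the embedding lets later molecules be embedded in the same space with another call to transform.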
- MACAW_optimus(smiles, y, exhaustiveness=1, C=20.0, problem='auto', verbose=False, random_state=None, **kwargs)
Function that identifies and recommends a MACAW embedding for a given problem. It does so by evaluating the performance of different embeddings as inputs to a support vector machine.
- Parameters
smiles (list) – List of molecules in SMILES format.
y (list or numpy.ndarray) – List containing the property of interest for each molecule in smiles.
exhaustiveness (int, optional) – Controls how many combinations of fingerprint types and distance metrics to explore. If set to 1, it will only explore individual fingerprints. If set to 2, it will explore individual fingerprints and combinations of two fingerprints. If set to 3, it will explore additional metrics and perform a slower cross-validation.
C (float, optional) – Regularization hyperparameter for the SVM. Defaults to 20.
problem (str, optional) – Indicates whether it is a ‘regression’ or ‘classification’ problem. It determines whether the model used is an SVR or an SVC. Defaults to ‘auto’, which will try to guess the problem type.
verbose (bool, optional) – Prints intermediate scores for the different type_fp and metric combinations.
random_state (int, optional) – Seed to have the same downsampling and choice of landmarks across runs.
kwargs (optional) – Additional parameters to pass to the MACAW class constructor (other than type_fp and metric).
- Returns
MACAW object with the best settings identified.
- Return type
MACAW
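To give a sense of how the search space grows with exhaustiveness, the sketch below enumerates single fingerprints and two-fingerprint combinations joined with the ‘+’ symbol. It is an illustration under assumptions: the candidate pool is taken from the type_fp options listed for the MACAW class, and the fingerprints that MACAW_optimus actually evaluates are not stated here. The function name candidate_type_fps is hypothetical.

```python
from itertools import combinations

# ASSUMPTION: candidate pool copied from the documented type_fp options.
FINGERPRINTS = [
    "RDK5", "RDK7", "Morgan2", "Morgan3", "featMorgan2", "featMorgan3",
    "Avalon", "MACCS", "atompairs", "torsion", "pattern", "secfp6", "layered",
]


def candidate_type_fps(exhaustiveness=1, fingerprints=FINGERPRINTS):
    """List type_fp strings: singles, plus pairs when exhaustiveness >= 2."""
    candidates = list(fingerprints)
    if exhaustiveness >= 2:
        candidates += [f"{a}+{b}" for a, b in combinations(fingerprints, 2)]
    return candidates


if __name__ == "__main__":
    for level in (1, 2):
        print(level, len(candidate_type_fps(level)))
```

With 13 fingerprints there are 78 distinct pairs, so moving from level 1 to level 2 in this sketch multiplies the number of embeddings to evaluate by seven.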
Molecule generation
- library_maker(smiles=None, n_gen=20000, max_len=0, p='exp', noise_factor=0.1, algorithm='position', full_alphabet=False, return_selfies=False, random_state=None)
Generates molecules in a probabilistic manner. The molecules generated can be fully random or be biased around the distribution of input molecules.
- Parameters
smiles (list, optional) – List of molecules in SMILES format. If not provided, it will generate random molecules using the alphabet of robust SELFIES tokens.
n_gen (int, optional) – Target number of molecules to be generated. The actual number of molecules returned can be lower. Defaults to 20000.
max_len (int, optional) – Maximum length of the molecules generated in SELFIES format. If 0 (default), the maximum length seen in the input molecules will be used.
p (str, float or numpy.ndarray, optional) – Controls the SELFIES length distribution of molecules being generated. Options include ‘exp’ (exponential distribution, default), ‘empirical’ (observed input distribution), and ‘cumsum’ (cumulative observed input distribution). If p is numeric, then a potential distribution of degree p is used. If p is an array, then each element is considered to be a weight for sampling molecules with length given by the corresponding index (range(1, len(p)+1)).
noise_factor (float, optional) – Controls the level of randomness added to the SELFIES frequency counts. Defaults to 0.1.
algorithm (str, optional) – Select to use ‘position’, ‘transition’ or ‘dual’ algorithm to compute the probability of sampling different SELFIES tokens. Defaults to ‘position’.
full_alphabet (bool, optional) – Enables the use of all robust tokens in the SELFIES package. If False (default), only makes use of the tokens present in the input smiles.
return_selfies (bool, optional) – If True, the output will include both SMILES and SELFIES.
random_state (int, optional) – Seed to have the same subsampling and choice of landmarks across runs.
- Returns
List containing the molecules generated in SMILES format. If return_selfies is set to True, it will return a tuple with two lists containing the SMILES and SELFIES, respectively.
- Return type
list or tuple
Note
Internally, molecules are generated as SELFIES. The molecules generated are filtered to remove synonyms. The molecules returned are canonical SMILES.
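The role of the p argument can be illustrated by building the length weights explicitly. This is a sketch rather than MACAW's implementation: the names length_weights and sample_lengths are hypothetical, the numeric case assumes that a distribution "of degree p" means weights proportional to length raised to the power p, and the ‘exp’ option is left out because its parameters are not given here.

```python
import random
from collections import Counter
from itertools import accumulate


def length_weights(input_lengths, max_len, p="empirical"):
    """Return one sampling weight per SELFIES length 1..max_len."""
    counts = Counter(input_lengths)
    empirical = [counts.get(n, 0) for n in range(1, max_len + 1)]
    if isinstance(p, str):
        if p == "empirical":
            return empirical
        if p == "cumsum":
            return list(accumulate(empirical))
        raise ValueError(f"option {p!r} is not covered by this sketch")
    if isinstance(p, (int, float)):
        # ASSUMPTION: weight grows as length ** p.
        return [n ** p for n in range(1, max_len + 1)]
    weights = list(p)  # array-like: element i is the weight of length i + 1
    if len(weights) != max_len:
        raise ValueError("expected one weight per length from 1 to max_len")
    return weights


def sample_lengths(weights, k, seed=None):
    """Draw k lengths, where weights[i] is the weight of length i + 1."""
    rng = random.Random(seed)
    return rng.choices(range(1, len(weights) + 1), weights=weights, k=k)
```

Compared with ‘empirical’, the cumulative option shifts weight toward longer molecules, because each length inherits the counts of all shorter lengths.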
On-specification molecule evolution
- library_evolver(smiles, model, mcw=None, spec=0.0, k1=2000, k2=100, n_rounds=8, n_hits=10, max_len=0, max_len_inc=2, force_new=False, random_state=None, **kwargs)
Recommends a list of molecules close to a desired specification by evolving increasingly focused libraries.
- Parameters
smiles (list) – List of molecules in SMILES format.
model (function) – Function that takes as input the features produced by the MACAW embedder mcw and returns a scalar (predicted property). The model may also directly take SMILES as its input, in which case no embedder needs to be provided. The model must be able to take a list of multiple inputs and produce the corresponding list of predictions.
mcw (MACAW or function, optional) – Embedder to featurize the smiles input into a representation compatible with model. If not provided, it will be assigned the identity function, and the model will have to take SMILES directly as its input.
spec (float) – Target specification that the recommended molecules should match.
k1 (int, optional) – Target number of molecules to be generated in each intermediate library. Defaults to 2000.
k2 (int, optional) – Number of molecules that should be selected and carried over from an intermediate library to the next round. Defaults to 100.
n_rounds (int, optional) – Number of iterations for the library generation and selection process. Defaults to 8.
n_hits (int, optional) – Number of recommended molecules to return.
max_len (int, optional) – Maximum length of the molecules generated in SELFIES format. If 0 (default), the maximum length seen in the input molecules will be used.
max_len_inc (int, optional) – Maximum increment in SELFIES length from one round to the next. Defaults to 2.
force_new (bool, optional) – If True, only SMILES not present in the smiles input are returned. Defaults to False.
random_state (int, optional) – Seed to have the same subsampling and choice of landmarks across runs.
- Returns
A tuple (list, numpy.ndarray). The first element is the list of molecules recommended in SMILES format. The second element is an array with the predicted property values for each recommended molecule according to the model provided.
- Return type
tuple
See also
This function makes extensive use of the library_maker function. See the library_maker documentation for information on additional parameters.
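The generate-and-select loop behind library_evolver can be sketched conceptually as below. This is not the MACAW implementation: evolve_toward_spec is a hypothetical name, generate stands in for a call to library_maker biased around the current pool, predict stands in for the embedder followed by the model, and carrying the survivors into the next library is an assumption.

```python
def evolve_toward_spec(seeds, generate, predict, spec,
                       k2=100, n_rounds=8, n_hits=10):
    """Iteratively focus a pool of candidates on a target property value.

    generate(pool) returns new candidates derived from the pool, and
    predict(candidates) returns one predicted value per candidate.
    """
    def closest(candidates, limit):
        scores = predict(candidates)
        ranked = sorted(zip(candidates, scores),
                        key=lambda pair: abs(pair[1] - spec))
        return ranked[:limit]

    pool = list(seeds)
    for _ in range(n_rounds):
        # ASSUMPTION: survivors compete with the new library; duplicates
        # are dropped while keeping first-seen order.
        library = list(dict.fromkeys(pool + list(generate(pool))))
        pool = [candidate for candidate, _ in closest(library, k2)]

    hits = closest(pool, n_hits)
    return [candidate for candidate, _ in hits], [score for _, score in hits]
```

In this toy form any hashable object can play the role of a molecule, which makes the loop easy to exercise with integers before plugging in real generators and models.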
Other functions
- hit_finder(X_lib, model, spec, X=[], Y=[], n_hits=10, k1=5, k2=25, p=1, n_rounds=1)
Identifies promising hit molecules from a library according to a property specification.
- Parameters
X_lib (numpy.ndarray) – Array containing the MACAW embeddings of a library of molecules. It can be generated with the MACAW transform method.
model (function) – Function that predicts property values given instances from X_lib.
spec (float) – Desired property value specification.
X (numpy.ndarray, optional) – Array containing the MACAW embedding of known molecules. It can be generated with the MACAW transform method.
Y (list or numpy.ndarray, optional) – Array containing the property values for the known molecules.
n_hits (int, optional) – Desired number of hit molecules to be returned. Defaults to 10.
k1 (int, optional) – Number of initial seed molecules to be carried in the search. Defaults to 5.
k2 (int, optional) – Number of molecules per seed to be retrieved for evaluation. Defaults to 25.
p (int or float, optional) – Minkowski norm to be used in the retrieval of molecules. If 0 < p < 1, then a V-distance is used. Defaults to 1 (Manhattan distance).
n_rounds (int, optional) – Number of times the whole search will be iterated over. Defaults to 1.
- Returns
A tuple (list, numpy.ndarray). The first element is the list of indices of the hit molecules found in X_lib. The second element is an array of property values predicted for the hit molecules using the model supplied.
- Return type
tuple
Note
The function uses a heuristic search to identify molecules close to the desired specification across the library.
If X and Y are provided, it first takes the k1 known molecules closest to the specification to guide the retrieval of the k2 closest molecules in the MACAW embedding space (according to a p-norm). This process is done using a sklearn BallTree structure. The k1 x k2 molecules retrieved are then evaluated using the model provided (model). If n_rounds = 1 (default), the indices of the n_hits molecules closest to the specification are finally returned to the user. If n_rounds > 1, then the k1 molecules closest to the specification are used to initiate another retrieval round.
The actual number of molecules being evaluated can be smaller than k1 x k2 if there is overlap between the list of molecules returned from different seeds.
See also
If a p value is provided such that 0 < p < 1, then V-distance is used. This can be regarded as a weighted version of Manhattan distance; see the MACAW publication for details.
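A single retrieval round of the search described in the note above can be sketched with a brute-force neighbor lookup. This is a simplified illustration, not MACAW's code: the name find_hits_once is hypothetical, a linear scan replaces the sklearn BallTree, only p >= 1 is handled because the V-distance is not defined here, and plain lists are returned instead of a numpy array.

```python
def minkowski(a, b, p=1):
    """Minkowski distance of order p >= 1 between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def find_hits_once(X_lib, predict, spec, X_known, Y_known,
                   n_hits=10, k1=5, k2=25, p=1):
    """One round: seeds from known data, neighbors from the library."""
    if p < 1:
        raise ValueError("this sketch only covers p >= 1")
    # The k1 known molecules whose property is closest to the specification.
    seeds = sorted(range(len(Y_known)),
                   key=lambda i: abs(Y_known[i] - spec))[:k1]
    # For each seed, the k2 nearest library molecules; a set removes overlap,
    # so fewer than k1 * k2 molecules may end up being evaluated.
    candidates = set()
    for seed in seeds:
        nearest = sorted(range(len(X_lib)),
                         key=lambda j: minkowski(X_known[seed], X_lib[j], p))
        candidates.update(nearest[:k2])
    candidates = sorted(candidates)
    predictions = predict([X_lib[j] for j in candidates])
    ranked = sorted(zip(candidates, predictions),
                    key=lambda pair: abs(pair[1] - spec))[:n_hits]
    return [j for j, _ in ranked], [value for _, value in ranked]
```

Only the retrieved neighbors are passed to the model, which is what keeps the search cheap when the library is large and the model is slow.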
- smiles_cleaner(smiles, return_idx=False, deep_clean=False)
Function to remove invalid SMILES from a list.
- Parameters
- Returns
Returns a list containing only the valid SMILES, in the same order as the input. If return_idx is set to True, the return will be a tuple with three lists. The first list contains the valid SMILES, the second list contains the indices of the valid SMILES, and the third list contains the indices of the invalid SMILES in the input.
- Return type
list or tuple
Note
We recommend setting deep_clean=True when preparing an input library for the library_evolver function.
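When a dataset pairs each SMILES with a property value, the indices returned with return_idx=True can be used to keep the two aligned after cleaning. The helper below is a usage sketch with a hypothetical name, clean_dataset; macaw is imported inside the function so that defining the helper does not itself require the package.

```python
def clean_dataset(smiles, y):
    """Drop invalid SMILES and the property values that belong to them."""
    # Lazy import: requires macaw_py (and rdkit) to be installed.
    from macaw import smiles_cleaner

    # With return_idx=True, three lists are returned: the valid SMILES,
    # their indices in the input, and the indices of the invalid entries.
    clean_smiles, idx_valid, idx_invalid = smiles_cleaner(smiles, return_idx=True)
    clean_y = [y[i] for i in idx_valid]
    return clean_smiles, clean_y, idx_invalid
```

Keeping the list of invalid indices makes it possible to report which input entries were discarded.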
How to cite?
@article{doi:10.26434/chemrxiv-2022-x647j,
author = {Blay, Vincent and Radivojevic, Tijana and Allen, Jonathan E. and Hudson, Corey M. and Garcia-Martin, Hector},
title = {MACAW: an accessible tool for molecular embedding and inverse molecular design},
journal = {ChemRxiv},
volume = {0},
number = {ja},
pages = {null},
year = {2022},
URL = {https://doi.org/10.26434/chemrxiv-2022-x647j},
eprint = {https://doi.org/10.26434/chemrxiv-2022-x647j}
}