API reference: crem.utils¶
Iterative enumeration and the built-in filter_func / sample_func helpers.
utils ¶
enumerate_compounds ¶
enumerate_compounds(mol, db_fname, mode='scaffold', n_iterations=1, radius=3, max_replacements=None, protected_ids=None, replace_ids=None, min_freq=0, protect_added_frag=False, return_smi=False, ncpu=None, **kwargs)
Convenience function to perform scaffold decoration or enumeration of analog series. This performs in multiple iterations by modification of compounds enumerated on the previous iteration. May result in combinatorial explosion. The function returns the list of distinct molecules generated over all iterations.
Scaffold decoration uses grow procedure. Hydrogens will be added to the supplied molecule and replaced with fragments from the database. A user can protect particular positions from expansion by setting protect_ids argument. Note: The value of protect_added_frag parameter is ignored. New fragments cannot be attached to previously added fragments.
Enumeration of analog series uses mutate procedure. The molecule should be supplied with explicit hydrogens if one wants to replace them as well, otherwise only heavy atoms will be considered for replacements. It is recommended to set an altered part of a molecule not too large to obtain reasonable suggestions.
| Parameters: |
|
|---|
Source code in crem/utils.py
def enumerate_compounds(mol, db_fname, mode='scaffold', n_iterations=1, radius=3, max_replacements=None,
protected_ids=None, replace_ids=None, min_freq=0, protect_added_frag=False, return_smi=False,
ncpu=None, **kwargs):
'''
Convenience function to perform scaffold decoration or enumeration of analog series. This performs in multiple
iterations by modification of compounds enumerated on the previous iteration. May result in combinatorial explosion.
The function returns the list of distinct molecules generated over all iterations.
Scaffold decoration uses grow procedure. Hydrogens will be added to the supplied molecule and replaced with
fragments from the database. A user can protect particular positions from expansion by setting protect_ids argument.
Note: The value of protect_added_frag parameter is ignored. New fragments cannot be attached to previously
added fragments.
Enumeration of analog series uses mutate procedure. The molecule should be supplied with explicit hydrogens if one
wants to replace them as well, otherwise only heavy atoms will be considered for replacements. It is recommended to
set an altered part of a molecule not too large to obtain reasonable suggestions.
:param mol:
:param db_fname: path to DB file with fragment replacements.
:param mode: 'scaffold' decoration or 'analogs' enumeration. In 'scaffold' mode the supplied molecule will be
substituted with fragments from a database. In 'analogs' mode the supplied molecule wil be mutated.
Default: scaffold.
:param n_iterations: the number of rounds of generation. Molecules generated on the previous round are supplied
to the next iteration. Be careful setting this parameter as it may cause combinatorial
explosion. Default: 1.
:param radius: radius of context which will be considered for replacement. Default: 3.
:param max_replacements: maximum number of replacements to make for each molecule on each iteration. None will
result in all possible replacements or it is possible to set a randomly chosen number of
replacements. Default: None.
:param protected_ids: iterable with atom ids which will not be altered. In 'scaffold' mode this can be ids of
hydrogens or heavy atoms whose hydrogens should be protected from expansion. In 'analogs' mode
these are ids of hydrogens and heavy atoms. Please note, a molecule should be supplied with
explicit hydrogens in 'analogs' mode if one wants to replace them. Default: None.
:param replace_ids: iterable with atom ids to replace, it has lower priority then `protected_ids`. Default: None.
:param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
:param protect_added_frag: True or False. If set True new fragments cannot be attached/replace fragments added on
previous iterations. Applicable only in 'analogs' mode. In 'scaffold' mode user input is
ignored and the argument internally set to True. Default: False
:param return_smi: if True will return the list of SMILES instead of Mol objects. Default: False.
:param ncpu: number of cores. None means all cpus.
:param kwargs: additional keyword arguments forwarded to the underlying generator -
``grow_mol`` in 'scaffold' mode (e.g. ``min_atoms``, ``max_atoms``) and
``mutate_mol`` in 'analogs' mode (e.g. ``min_size``, ``max_size``, ``min_inc``,
``max_inc``, ``replace_cycles``). See those functions for the meaning of each argument.
'''
if mode not in ['scaffold', 'analogs']:
raise ValueError('Wrong mode. Please choose one from the list - "analogs","scaffold"')
if ncpu is None:
ncpu = cpu_count()
else:
ncpu = max(1, min(int(ncpu), cpu_count()))
pool = joblib.Parallel(n_jobs=ncpu)
# to check if the statical arguments are in the kwargs dict
for kw in ['return_mol', 'return_rxn', 'return_rxn_freq', 'ncores']:
if kw in kwargs:
kwargs.pop(kw)
if mode == 'scaffold':
protect_added_frag = True
if protected_ids is None and replace_ids is not None:
protected_ids = list(set(a.GetIdx() for a in mol.GetAtoms()).difference(replace_ids))
if protected_ids is None:
protected_ids = []
start_mols = {mol: protected_ids}
# to get results ordered by iterations
generated_mols = OrderedDict()
n = 0
for n in range(n_iterations):
new_mols = ()
if mode == 'scaffold':
new_mols = pool(joblib.delayed(grow_mol2)(m, db_name=db_fname, protected_ids=prot_ids,
min_freq=min_freq, radius=radius,
max_replacements=max_replacements,
return_mol=True, return_rxn=False, return_rxn_freq=False,
ncores=1 if len(start_mols) > 1 else ncpu, **kwargs)
for m, prot_ids in start_mols.items())
if mode == 'analogs':
new_mols = pool(joblib.delayed(mutate_mol2)(m, db_name=db_fname, protected_ids=prot_ids,
min_freq=0, radius=radius, max_replacements=max_replacements,
return_mol=True, return_rxn=False, return_rxn_freq=False,
ncores=1 if len(start_mols) > 1 else ncpu, **kwargs)
for m, prot_ids in start_mols.items())
parent_protected_ids_list = start_mols.values()
start_mols = OrderedDict()
for childs, parent_protected_ids in zip(new_mols, parent_protected_ids_list):
for items in childs:
if items[0] not in generated_mols:
protected_ids = __get_child_protected_atom_ids(items[1], parent_protected_ids)
if protect_added_frag:
protect_added_ids = __get_child_added_atom_ids(items[1])
protected_ids = set(protected_ids + protect_added_ids)
generated_mols[items[0]] = items[1]
start_mols[items[1]] = protected_ids
if not start_mols:
break
if n + 1 < n_iterations:
sys.stderr.write(f'INFO. Procedure is finished after {n + 1} iterations instead of {n_iterations}\n')
if not return_smi:
return list(generated_mols.values())
else:
return list(generated_mols.keys())
sample_csp3 ¶
sample_csp3(row_ids, cur, radius, n)
Performs random selection of fragments proportionally to a squared fraction of sp3 carbon atoms.
| Parameters: |
|
|---|
| Returns: |
|
|---|
Source code in crem/utils.py
def sample_csp3(row_ids, cur, radius, n):
"""
Performs random selection of fragments proportionally to a squared fraction of sp3 carbon atoms.
:param row_ids: the list of row ids of fragments to consider
:param cur: cursor to the fragment database
:param radius: context radius
:param n: the number of fragments to select
:return: the list of row ids of selected fragments
"""
if n >= len(row_ids):
return row_ids
d = defaultdict(list)
for rowid, core_smi, _, _ in _get_replacements(cur, radius, row_ids):
d[core_smi].append(rowid)
smis = list(d.keys())
values = [rdMolDescriptors.CalcFractionCSP3(Chem.MolFromSmiles(smi)) ** 2 for smi in smis]
values = [v + 0.01 for v in values]
values = np.array(values) / sum(values)
selected_smiles = np.random.choice(smis, n, replace=False, p=values).tolist()
ids = []
for smi in selected_smiles:
ids.extend(d[smi])
if len(ids) < n:
ids = random.sample(ids, n)
return ids
filter_max_ring_size ¶
filter_max_ring_size(row_ids, cur, radius, max_size=6)
Remove fragments having a ring size greater than a maximum threshold value
| Parameters: |
|
|---|
| Returns: |
|
|---|
Source code in crem/utils.py
def filter_max_ring_size(row_ids, cur, radius, max_size=6):
"""
Remove fragments having a ring size greater than a maximum threshold value
:param row_ids: the list of row ids of fragments to consider
:param cur: cursor to the fragment database
:param radius: context radius
:param max_size: maximum allowed ring size
:return: the list of row ids of selected fragments
"""
d = defaultdict(list)
for rowid, core_smi, _, _ in _get_replacements(cur, radius, row_ids):
d[core_smi].append(rowid)
smis = list(d.keys())
for smi in smis:
mol = Chem.MolFromSmiles(smi)
w = mol.GetRingInfo()
rings = w.AtomRings()
if rings and max(len(atom_ids) for atom_ids in rings) > max_size:
del d[smi]
ids = []
for v in d.values():
ids.extend(v)
return ids
filter_acyclic_attachment_points ¶
filter_acyclic_attachment_points(row_ids, cur, radius)
Keep only fragments where each attachment point [*:n] is attached to an acyclic atom.
| Parameters: |
|
|---|
| Returns: |
|
|---|
Source code in crem/utils.py
def filter_acyclic_attachment_points(row_ids, cur, radius):
"""
Keep only fragments where each attachment point [*:n] is attached to an acyclic atom.
:param row_ids: the list of row ids of fragments to consider
:param cur: cursor to the fragment database
:param radius: context radius
:return: the list of row ids of selected fragments
"""
d = defaultdict(list)
for rowid, core_smi, _, _ in _get_replacements(cur, radius, row_ids):
d[core_smi].append(rowid)
for smi in list(d.keys()):
mol = Chem.MolFromSmiles(smi)
if mol is None:
del d[smi]
continue
keep = True
for atom in mol.GetAtoms():
if atom.GetAtomicNum() != 0:
continue
# Attachment dummy atoms in CReM cores should have a single neighbor.
if not atom.GetNeighbors():
keep = False
break
neighbor = atom.GetNeighbors()[0]
if neighbor.IsInRing():
keep = False
break
if not keep:
del d[smi]
ids = []
for v in d.values():
ids.extend(v)
return ids