API reference: crem.crem
The structure-generation API. See Operations for task-oriented guides.
crem
mutate_mol
mutate_mol(mol, db_name, radius=3, min_size=0, max_size=10, min_rel_size=0, max_rel_size=1, min_inc=-2, max_inc=2, max_replacements=None, replace_cycles='no', replace_ids=None, protected_ids=None, symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs)
Generator of new molecules by replacement of fragments in the supplied molecule with fragments from DB.
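The shape of each yielded item depends on the `return_rxn`, `return_rxn_freq` and `return_mol` flags: a bare SMILES string when none is set, otherwise a list in the order SMILES, rxn, frequency, Mol. The sketch below shows one way to normalise both shapes into dictionaries. The helper name `unpack_rows` and its dictionary keys are illustrative assumptions, not part of crem, and the sample rows are made-up placeholders rather than real output.

```python
def unpack_rows(rows, return_rxn=False, return_rxn_freq=False, return_mol=False):
    """Turn generator output into dicts keyed by field name.

    Hypothetical helper: pass the same flags that were given to the
    generating function so the positional fields can be named.
    """
    keys = ["smi"]
    if return_rxn:
        keys.append("rxn")
    if return_rxn_freq:
        keys.append("freq")
    if return_mol:
        keys.append("mol")
    for row in rows:
        # With no extra flags each item is a bare SMILES string, not a list.
        values = [row] if isinstance(row, str) else list(row)
        yield dict(zip(keys, values))


# Placeholder rows standing in for generator output.
plain = list(unpack_rows(["CCO", "CCN"]))
detailed = list(unpack_rows([["CCO", "rxn-string", 5]],
                            return_rxn=True, return_rxn_freq=True))
```

With crem installed, `rows` would be the generator returned by a call such as `mutate_mol(mol, db_name, return_rxn=True, return_rxn_freq=True)`.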
Source code in crem/crem.py
def mutate_mol(mol, db_name, radius=3, min_size=0, max_size=10, min_rel_size=0, max_rel_size=1, min_inc=-2, max_inc=2,
max_replacements=None, replace_cycles="no", replace_ids=None, protected_ids=None,
symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1,
filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs):
"""
Generator of new molecules by replacement of fragments in the supplied molecule with fragments from DB.
:param mol: RDKit Mol object. If hydrogens are explicit they will be replaced as well, otherwise not.
:param db_name: path to DB file with fragment replacements.
:param radius: radius of context which will be considered for replacement. Default: 3.
:param min_size: minimum number of heavy atoms in a fragment to replace. If 0 - hydrogens will be replaced
(if they are explicit). Default: 0.
:param max_size: maximum number of heavy atoms in a fragment to replace. Default: 10.
:param min_rel_size: minimum relative size of a replaced fragment to the whole molecule
(in terms of a number of heavy atoms)
:param max_rel_size: maximum relative size of a replaced fragment to the whole molecule
(in terms of a number of heavy atoms)
:param min_inc: minimum change of a number of heavy atoms in replacing fragments to a number of heavy atoms in
replaced one. Negative value means that the replacing fragments would be smaller than the replaced
one on a specified number of heavy atoms. Default: -2.
:param max_inc: maximum change of a number of heavy atoms in replacing fragments to a number of heavy atoms in
replaced one. Default: 2.
:param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
greater than the specified value the specified number of randomly chosen replacements
will be applied. Default: None.
:param replace_cycles: controls replacement of cyclic source fragments.
``"no"``/False uses ordinary acyclic-cut mutation
only. ``"forced"``/True allows cyclic cores from
ordinary fragmentation to be replaced ignoring the size filters.
``"partial_all"`` additionally replaces partial
ring arcs using exhaustive side cuts.
``"partial_exo"`` additionally replaces partial
ring arcs using only exo side cuts adjacent to the
selected ring arc. Default: ``"no"``.
:param replace_ids: iterable with atom ids to replace, it has lower priority over `protected_ids` (replace_ids
which are present in protected_ids would be protected).
Ids of hydrogen atoms (if any) connected to the specified heavy atoms will be automatically
labeled as replaceable. Default: None.
:param protected_ids: iterable with atom ids which will not be mutated. If the molecule was supplied with explicit
hydrogen the ids of protected hydrogens should be supplied as well, otherwise they will be
replaced.
Ids of all equivalent atoms should be supplied (e.g. to protect meta-position in toluene
ids of both carbons in meta-positions should be supplied)
This argument has a higher priority over `replace_ids`. Default: None.
:param symmetry_fixes: if set True duplicated fragments with equivalent atoms having different ids will be
enumerated. This makes sense if one wants to replace particular atom(s) which have
equivalent ones. By default, among equivalent atoms only those with the lowest ids
are replaced. This will result in generation of duplicated molecules if several equivalent
atoms are selected which will be filtered later nevertheless. So, it is not very reasonable
to use this argument and select several equivalent atoms to replace.
This solves the issue of rdkit MMPA fragmenter which removes duplicates internally.
:param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
:param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
with min_freq in v1 databases. A fragment is included if at least one of the named sets
satisfies the min_freq threshold (OR logic). If None (default), all available set columns
are used. If a column name is not found, a ValueError is raised listing available set names.
Ignored for v0 databases. Default: None.
:param return_rxn: whether to additionally return rxn of a transformation. Default: False.
:param return_rxn_freq: whether to additionally return the frequency of a transformation in the DB. Default: False.
:param return_mol: whether to additionally return RDKit Mol object of a generated molecule. Default: False.
:param ncores: number of cores. Default: 1.
:param filter_func: a function which will filter selected fragments by additional rules
(in this way one may add arbitrary selection constraints). The function takes three required
leading arguments: row_ids (list or set of row_ids from the fragment database supplied to
the mutate_mol function), cursor of that fragment database and radius (int). These are required to
access the selected fragments. Other arguments are custom and user-defined.
It is most convenient to define a filtering function, implement specific logic inside and
pass it to mutate_mol using functools.partial. The filtering function should return a list/set
of row ids which are a subset of the input row ids.
:param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted
uniform sampling will be used. The function takes four required leading arguments: row_ids
(list or set of row_ids from the fragment database), cursor of that fragment database,
radius (int) and the number of returned items (int). These are required to access the selected
fragments. Other arguments can be custom and user-defined. The function should return
a list/set of selected row ids.
:param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
:param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
and upper bound of the corresponding parameter of a fragment. This can be useful to annotate
fragments with additional custom properties (e.g. number of particular pharmacophore features,
lipophilicity, etc) and use these parameters to additionally restrict selected fragments.
:return: generator over new molecules. If no additional return values were requested, this is a generator over
SMILES of new molecules. If any additional return values were requested, each yielded item is a list
whose first item is SMILES, followed by the rxn string of a transformation (optional), the frequency of
fragment occurrence in the DB (optional) and the RDKit Mol object (optional).
Only entries with distinct SMILES will be returned.
Note: supply RDKit Mol object with explicit hydrogens if H replacement is required
"""
    replace_cycles = _normalize_replace_cycles(replace_cycles)
    __check_db_existence(db_name)
    products = {Chem.MolToSmiles(Chem.RemoveHs(mol))}
    mol = __backup_atom_properties(mol, __atom_properties_to_backup)
    protected_ids = set(protected_ids) if protected_ids else set()
    if replace_ids:
        ids = set()
        for i in replace_ids:
            ids.update(a.GetIdx() for a in mol.GetAtomWithIdx(i).GetNeighbors() if a.GetAtomicNum() == 1)
        ids = set(a.GetIdx() for a in mol.GetAtoms()).difference(ids).difference(replace_ids)  # ids which should be protected
        protected_ids.update(ids)  # since protected_ids has a higher priority add them anyway
    # protected_ids = sorted(protected_ids)  # why we made sorted?
    if ncores == 1:
        for frag_sma, core_sma, freq, context_mol in __gen_replacements(mol1=mol, mol2=None, db_name=db_name,
                                                                        radius=radius, min_size=min_size,
                                                                        max_size=max_size,
                                                                        min_rel_size=min_rel_size,
                                                                        max_rel_size=max_rel_size,
                                                                        min_inc=min_inc, max_inc=max_inc,
                                                                        max_replacements=max_replacements,
                                                                        replace_cycles=replace_cycles,
                                                                        protected_ids_1=protected_ids,
                                                                        protected_ids_2=None, min_freq=min_freq,
                                                                        set_names=set_names,
                                                                        symmetry_fixes=symmetry_fixes,
                                                                        filter_func=filter_func,
                                                                        sample_func=sample_func,
                                                                        return_frag_smi_only=False,
                                                                        operation="mutate",
                                                                        seed=seed, **kwargs):
            for smi, m, rxn in __frag_replace(mol, None, frag_sma, core_sma, radius, context_mol):
                if max_replacements is None or len(products) < (max_replacements + 1):  # +1 because we added source mol to output smiles
                    if smi not in products:
                        products.add(smi)
                        res = [smi]
                        if return_rxn:
                            res.append(rxn)
                        if return_rxn_freq:
                            res.append(freq)
                        if return_mol:
                            res.append(m)
                        if len(res) == 1:
                            yield res[0]
                        else:
                            yield res
    else:
        p = Pool(min(ncores, cpu_count()))
        try:
            for items in p.imap(__frag_replace_mp, __get_data(mol, db_name, radius, min_size, max_size, min_rel_size,
                                                              max_rel_size, min_inc, max_inc, replace_cycles,
                                                              protected_ids, min_freq, set_names, max_replacements,
                                                              symmetry_fixes, filter_func=filter_func,
                                                              sample_func=sample_func,
                                                              seed=seed, **kwargs),
                                chunksize=100):
                for smi, m, rxn, freq in items:
                    if max_replacements is None or len(products) < (max_replacements + 1):  # +1 because we added source mol to output smiles
                        if smi not in products:
                            products.add(smi)
                            res = [smi]
                            if return_rxn:
                                res.append(rxn)
                            if return_rxn_freq:
                                res.append(freq)
                            if return_mol:
                                res.append(m)
                            if len(res) == 1:
                                yield res[0]
                            else:
                                yield res
        finally:
            p.close()
            p.join()
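The `filter_func` contract described in the docstring above (leading arguments `row_ids`, cursor and `radius`, extra arguments bound with `functools.partial`, a subset of the row ids returned) can be sketched against a toy in-memory SQLite database. The table name `radius3`, the column `core_num_atoms`, the inserted rows and the function name `keep_small_cores` are assumptions made for this illustration only; check the schema of your own fragment database before reusing the query.

```python
import sqlite3
from functools import partial


def keep_small_cores(row_ids, cur, radius, max_heavy_atoms=3):
    """Return the subset of row_ids whose fragment has at most max_heavy_atoms."""
    row_ids = list(row_ids)
    if not row_ids:
        return set()
    placeholders = ",".join("?" * len(row_ids))
    # Assumed schema: one table per radius with a core_num_atoms column.
    cur.execute(
        f"SELECT rowid FROM radius{radius} "
        f"WHERE rowid IN ({placeholders}) AND core_num_atoms <= ?",
        row_ids + [max_heavy_atoms],
    )
    return {r[0] for r in cur.fetchall()}


# Toy stand-in for a fragment database (hypothetical schema and rows).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE radius3 (core_smi TEXT, core_num_atoms INTEGER)")
cur.executemany(
    "INSERT INTO radius3 VALUES (?, ?)",
    [("[*:1]C", 1), ("[*:1]CCO", 3), ("[*:1]c1ccccc1", 6)],
)

# Bind the custom argument; the three leading arguments stay positional.
filter_func = partial(keep_small_cores, max_heavy_atoms=3)
kept = filter_func({1, 2, 3}, cur, 3)
```

The bound `filter_func` object is what would be passed as `filter_func=` to `mutate_mol`, `grow_mol`, `link_mols` or `make_cycle`.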
grow_mol
grow_mol(mol, db_name, radius=3, min_atoms=1, max_atoms=2, max_replacements=None, replace_ids=None, protected_ids=None, symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs)
Replace hydrogens with fragments from the database.
Source code in crem/crem.py
def grow_mol(mol, db_name, radius=3, min_atoms=1, max_atoms=2, max_replacements=None, replace_ids=None,
protected_ids=None, symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False,
return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs):
"""
Replace hydrogens with fragments from the database.
:param mol: RDKit Mol object.
:param db_name: path to DB file with fragment replacements.
:param radius: radius of context which will be considered for replacement. Default: 3.
:param min_atoms: minimum number of atoms in the fragment which will replace H
:param max_atoms: maximum number of atoms in the fragment which will replace H
:param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
greater than the specified value the specified number of randomly chosen replacements
will be applied. Default: None.
:param replace_ids: iterable with ids of heavy atom with replaceable Hs or/and ids of H atoms to replace,
it has lower priority over `protected_ids` (replace_ids
which are present in protected_ids would be protected). Default: None.
:param protected_ids: iterable with hydrogen atom ids or ids of heavy atoms at which hydrogens will not be replaced.
Ids of all equivalent atoms should be supplied (e.g. to protect meta-position in toluene
ids of both carbons in meta-positions should be supplied).
This argument has a higher priority over `replace_ids`. Default: None.
:param symmetry_fixes: if set to True, duplicated fragments with equivalent atoms having different ids will be
enumerated. This makes sense if one wants to replace particular atom(s) which have
equivalent ones. By default, among equivalent atoms only those with the lowest ids
are replaced. This will result in generation of duplicated molecules if several equivalent
atoms are selected which will be filtered later nevertheless. So, it is not very reasonable
to use this argument and select several equivalent atoms to replace.
This solves the issue of rdkit MMPA fragmenter which removes duplicates internally.
:param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
:param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
with min_freq in v1 databases. A fragment is included if at least one of the named sets
satisfies the min_freq threshold (OR logic). If None (default), all available set columns
are used. If a column name is not found, a ValueError is raised listing available set names.
Ignored for v0 databases. Default: None.
:param return_rxn: whether to additionally return rxn of a transformation. Default: False.
:param return_rxn_freq: whether to additionally return the frequency of a transformation in the DB. Default: False.
:param return_mol: whether to additionally return RDKit Mol object of a generated molecule. Default: False.
:param ncores: number of cores. Default: 1.
:param filter_func: a function which will filter selected fragments by additional rules
(in this way one may add arbitrary selection constraints). The function takes three required
leading arguments: row_ids (list or set of row_ids from the fragment database supplied to
the grow_mol function), cursor of that fragment database and radius (int). These are required to
access the selected fragments. Other arguments are custom and user-defined.
It is most convenient to define a filtering function, implement specific logic inside and
pass it to grow_mol using functools.partial. The filtering function should return a list/set
of row ids which are a subset of the input row ids.
:param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted
uniform sampling will be used. The function takes four required leading arguments: row_ids
(list or set of row_ids from the fragment database), cursor of that fragment database,
radius (int) and the number of returned items (int). These are required to access the selected
fragments. Other arguments can be custom and user-defined. The function should return
a list/set of selected row ids.
:param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
:param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
and upper bound of the corresponding parameter of a fragment. This can be useful to annotate
fragments with additional custom properties (e.g. number of particular pharmacophore features,
lipophilicity, etc) and use these parameters to additionally restrict selected fragments.
:return: generator over new molecules. If no additional return values were requested, this is a generator over
SMILES of new molecules. If any additional return values were requested, each yielded item is a list
whose first item is SMILES, followed by the rxn string of a transformation (optional), the frequency of
fragment occurrence in the DB (optional) and the RDKit Mol object (optional).
Only entries with distinct SMILES will be returned.
"""
    __check_db_existence(db_name)
    m = Chem.AddHs(mol)
    # create the list of ids of protected Hs only would be enough, however in the first case (replace_ids) the full list
    # of protected atom ids is created
    if protected_ids:
        ids = []
        for i in protected_ids:
            if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                ids.append(i)
            else:
                for a in m.GetAtomWithIdx(i).GetNeighbors():
                    if a.GetAtomicNum() == 1:
                        ids.append(a.GetIdx())
        protected_ids = set(ids)  # ids of protected Hs
    else:
        protected_ids = set()
    if replace_ids:
        ids = set()  # ids of replaceable Hs
        for i in replace_ids:
            if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                ids.add(i)
            else:
                ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors() if a.GetAtomicNum() == 1)
        ids = set(a.GetIdx() for a in m.GetAtoms() if a.GetAtomicNum() == 1).difference(ids)  # ids of Hs to protect
        protected_ids.update(ids)  # since protected_ids has a higher priority add them anyway
    return mutate_mol(m, db_name, radius, min_size=0, max_size=0, min_inc=min_atoms, max_inc=max_atoms,
                      max_replacements=max_replacements, replace_ids=None, protected_ids=protected_ids,
                      min_freq=min_freq, set_names=set_names, return_rxn=return_rxn, return_rxn_freq=return_rxn_freq,
                      return_mol=return_mol, ncores=ncores, symmetry_fixes=symmetry_fixes, filter_func=filter_func,
                      sample_func=sample_func, seed=seed, **kwargs)
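The way `grow_mol` combines `replace_ids` and `protected_ids` into a single set of protected hydrogens can be hard to follow in the listing above. The sketch below mirrors that logic without RDKit: the molecule is represented by a list of atomic numbers and a list of neighbour lists. The function name `protected_hydrogens` and the toy methanol encoding are assumptions introduced for this illustration.

```python
def protected_hydrogens(atomic_nums, neighbors, replace_ids=None, protected_ids=None):
    """Return ids of hydrogens that must not be replaced.

    atomic_nums[i] is the atomic number of atom i; neighbors[i] lists the
    ids of atoms bonded to atom i (hydrogens are explicit).
    """
    def hydrogens_of(i):
        # An H id stands for itself; a heavy-atom id stands for its attached Hs.
        if atomic_nums[i] == 1:
            return {i}
        return {j for j in neighbors[i] if atomic_nums[j] == 1}

    protected = set()
    for i in protected_ids or ():
        protected |= hydrogens_of(i)

    if replace_ids:
        replaceable = set()
        for i in replace_ids:
            replaceable |= hydrogens_of(i)
        all_h = {i for i, z in enumerate(atomic_nums) if z == 1}
        # Every H not listed as replaceable becomes protected; ids already
        # in `protected` stay there, so protected_ids wins over replace_ids.
        protected |= all_h - replaceable
    return protected


# Methanol with explicit hydrogens: C0, O1, H2-H4 on carbon, H5 on oxygen.
atomic_nums = [6, 8, 1, 1, 1, 1]
neighbors = [[1, 2, 3, 4], [0, 5], [0], [0], [0], [1]]

only_carbon = protected_hydrogens(atomic_nums, neighbors, replace_ids=[0])
with_conflict = protected_hydrogens(atomic_nums, neighbors,
                                    replace_ids=[0], protected_ids=[2])
```

In the first call only the hydroxyl hydrogen ends up protected; in the second, hydrogen 2 is protected as well even though its parent carbon is listed in `replace_ids`.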
link_mols
link_mols(mol1, mol2, db_name, radius=3, dist=None, min_atoms=1, max_atoms=2, max_replacements=None, replace_ids_1=None, replace_ids_2=None, protected_ids_1=None, protected_ids_2=None, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs)
Link two molecules by a linker from the database.
Source code in crem/crem.py
def link_mols(mol1, mol2, db_name, radius=3, dist=None, min_atoms=1, max_atoms=2, max_replacements=None,
replace_ids_1=None, replace_ids_2=None, protected_ids_1=None, protected_ids_2=None,
min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None,
sample_func=None, set_names=None, seed=None, **kwargs):
"""
Link two molecules by a linker from the database.
:param mol1: the first RDKit Mol object
:param mol2: the second RDKit Mol object
:param db_name: path to DB file with fragment replacements.
:param radius: radius of context which will be considered for replacement. Default: 3.
:param dist: topological distance between two attachment points in the fragment which will link molecules.
Can be a single integer or a tuple of lower and upper bound values.
:param min_atoms: minimum number of heavy atoms in the fragment which will link molecules
:param max_atoms: maximum number of heavy atoms in the fragment which will link molecules
:param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
greater than the specified value the specified number of randomly chosen replacements
will be applied. Default: None.
:param replace_ids_1: iterable with ids of heavy atom of the first molecule with replaceable Hs or/and ids of H
atoms to replace,
it has lower priority over `protected_ids_1` (replace_ids
which are present in protected_ids would be protected). Default: None.
:param replace_ids_2: iterable with ids of heavy atom of the second molecule with replaceable Hs or/and ids of H
atoms to replace,
it has lower priority over `protected_ids_2` (replace_ids
which are present in protected_ids would be protected). Default: None.
:param protected_ids_1: iterable with ids of heavy atoms of the first molecule at which no H replacement should
be made and/or ids of protected hydrogens.
This argument has a higher priority over `replace_ids_1`. Default: None.
:param protected_ids_2: iterable with ids of heavy atoms of the second molecule at which no H replacement should
be made and/or ids of protected hydrogens.
This argument has a higher priority over `replace_ids_2`. Default: None.
:param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
:param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
with min_freq in v1 databases. A fragment is included if at least one of the named sets
satisfies the min_freq threshold (OR logic). If None (default), all available set columns
are used. If a column name is not found, a ValueError is raised listing available set names.
Ignored for v0 databases. Default: None.
:param return_rxn: whether to additionally return rxn of a transformation. Default: False.
:param return_rxn_freq: whether to additionally return the frequency of a transformation in the DB. Default: False.
:param return_mol: whether to additionally return RDKit Mol object of a generated molecule. Default: False.
:param ncores: number of cores. Default: 1.
:param filter_func: a function which will filter selected fragments by additional rules
(in this way one may add arbitrary selection constraints). The function takes three required
leading arguments: row_ids (list or set of row_ids from the fragment database supplied to
the link_mols function), cursor of that fragment database and radius (int). These are required to
access the selected fragments. Other arguments are custom and user-defined.
It is most convenient to define a filtering function, implement specific logic inside and
pass it to link_mols using functools.partial. The filtering function should return a list/set
of row ids which are a subset of the input row ids.
:param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted
uniform sampling will be used. The function takes four required leading arguments: row_ids
(list or set of row_ids from the fragment database), cursor of that fragment database,
radius (int) and the number of returned items (int). These are required to access the selected
fragments. Other arguments can be custom and user-defined. The function should return
a list/set of selected row ids.
:param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
:param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
and upper bound of the corresponding parameter of a fragment. This can be useful to annotate
fragments with additional custom properties (e.g. number of particular pharmacophore features,
lipophilicity, etc) and use these parameters to additionally restrict selected fragments.
:return: generator over new molecules. If no additional return values were requested, this is a generator over
SMILES of new molecules. If any additional return values were requested, each yielded item is a list
whose first item is SMILES, followed by the rxn string of a transformation (optional), the frequency of
fragment occurrence in the DB (optional) and the RDKit Mol object (optional).
Only entries with distinct SMILES will be returned.
"""
    def __get_protected_ids(m, replace_ids, protected_ids):
        # the list of ids of heavy atom with protected hydrogens should be returned
        if protected_ids:
            ids = set()
            for i in protected_ids:
                if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                    ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors())
                else:
                    ids.add(i)
            protected_ids = ids
        else:
            protected_ids = set()
        if replace_ids:
            ids = set()
            for i in replace_ids:
                if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                    ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors())
                else:
                    ids.add(i)
            heavy_atom_ids = set(a.GetIdx() for a in m.GetAtoms() if a.GetAtomicNum() > 1)
            ids = heavy_atom_ids.difference(ids)  # ids of heavy atoms which should be protected
            protected_ids.update(ids)  # since protected_ids has a higher priority add them anyway
        return protected_ids
    __check_db_existence(db_name)
    products = set()
    mol1 = __backup_atom_properties(Chem.AddHs(mol1), __atom_properties_to_backup)
    mol2 = __backup_atom_properties(Chem.AddHs(mol2), __atom_properties_to_backup)
    protected_ids_1 = __get_protected_ids(mol1, replace_ids_1, protected_ids_1)
    protected_ids_2 = __get_protected_ids(mol2, replace_ids_2, protected_ids_2)
    if ncores == 1:
        for frag_sma, core_sma, freq, context_mol in __gen_replacements(mol1=mol1, mol2=mol2,
                                                                        db_name=db_name, radius=radius,
                                                                        dist=dist, min_size=0,
                                                                        max_size=0, min_rel_size=0,
                                                                        max_rel_size=1,
                                                                        min_inc=min_atoms,
                                                                        max_inc=max_atoms,
                                                                        replace_cycles=False,
                                                                        max_replacements=max_replacements,
                                                                        protected_ids_1=protected_ids_1,
                                                                        protected_ids_2=protected_ids_2,
                                                                        min_freq=min_freq,
                                                                        set_names=set_names,
                                                                        filter_func=filter_func,
                                                                        sample_func=sample_func,
                                                                        return_frag_smi_only=False,
                                                                        operation="link",
                                                                        seed=seed, **kwargs):
            for smi, m, rxn in __frag_replace(mol1, mol2, frag_sma, core_sma, radius, context_mol):
                if max_replacements is None or (max_replacements is not None and len(products) < max_replacements):
                    if smi not in products:
                        products.add(smi)
                        res = [smi]
                        if return_rxn:
                            res.append(rxn)
                        if return_rxn_freq:
                            res.append(freq)
                        if return_mol:
                            res.append(m)
                        if len(res) == 1:
                            yield res[0]
                        else:
                            yield res
    else:
        p = Pool(min(ncores, cpu_count()))
        try:
            for items in p.imap(__frag_replace_mp, __get_data_link(mol1, mol2, db_name, radius, dist, min_atoms, max_atoms,
                                                                   protected_ids_1, protected_ids_2, min_freq,
                                                                   set_names, max_replacements, filter_func=filter_func,
                                                                   sample_func=sample_func, seed=seed, **kwargs),
                                chunksize=100):
                for smi, m, rxn, freq in items:
                    if max_replacements is None or (max_replacements is not None and len(products) < max_replacements):
                        if smi not in products:
                            products.add(smi)
                            res = [smi]
                            if return_rxn:
                                res.append(rxn)
                            if return_rxn_freq:
                                res.append(freq)
                            if return_mol:
                                res.append(m)
                            if len(res) == 1:
                                yield res[0]
                            else:
                                yield res
        finally:
            p.close()
            p.join()
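The `sample_func` hook documented above receives `row_ids`, the database cursor, `radius` and the number of items to return. As a rough sketch, a frequency-weighted sampler could look like the following. It uses weighted sampling without replacement (the Efraimidis-Spirakis key method), which is a technique chosen for this example, not something crem is known to use. The table `radius3`, its `freq` column, the rows, and the names `sample_by_freq` and `rng_seed` are likewise assumptions for the toy database.

```python
import random
import sqlite3
from functools import partial


def sample_by_freq(row_ids, cur, radius, n, rng_seed=None):
    """Pick up to n row ids, favouring rows with a higher freq value."""
    row_ids = sorted(row_ids)
    if len(row_ids) <= n:
        return row_ids
    placeholders = ",".join("?" * len(row_ids))
    # Assumed schema: radius table with an integer freq column.
    cur.execute(
        f"SELECT rowid, freq FROM radius{radius} "
        f"WHERE rowid IN ({placeholders}) ORDER BY rowid",
        row_ids,
    )
    rng = random.Random(rng_seed)
    # Key u ** (1 / weight) with u uniform in [0, 1); the n largest keys win.
    keyed = [(rng.random() ** (1.0 / max(freq, 1)), rowid)
             for rowid, freq in cur.fetchall()]
    keyed.sort(reverse=True)
    return [rowid for _, rowid in keyed[:n]]


# Toy stand-in for a fragment database (hypothetical schema and rows).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE radius3 (core_smi TEXT, freq INTEGER)")
cur.executemany(
    "INSERT INTO radius3 VALUES (?, ?)",
    [("[*:1]C", 50), ("[*:1]CC", 20), ("[*:1]CCC", 5), ("[*:1]CCCC", 1)],
)

sample_func = partial(sample_by_freq, rng_seed=42)
picked = sample_func({1, 2, 3, 4}, cur, 3, 2)
picked_again = sample_func({1, 2, 3, 4}, cur, 3, 2)
```

Binding `rng_seed` with `functools.partial` keeps the four leading arguments positional, as the docstring requires, and makes repeated calls reproducible.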
make_cycle
make_cycle(mol, db_name, radius=3, ring_size=None, ring_closures=True, min_atoms=1, max_atoms=10, max_replacements=None, replace_ids=None, protected_ids=None, symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs)
Generate new rings (macrocycles or smaller native cycles) by linking two atoms in the same molecule with a 2-attachment-point fragment from the DB.
Two complementary modes:
- `ring_closures=False` (broad): query any linker fragment. Internally both fragmenters are run on the input molecule (the connected-env arc-cut fragmenter and the disconnected-env macrocycle fragmenter) and the `is_ring_closure` provenance column is not filtered, so DB rows of either provenance can match.
- `ring_closures=True` (strict): only the connected-env arc-cut fragmenter runs and the query is restricted to `is_ring_closure=1` rows (populated by `--frag-mode ring`/`both` or the corresponding `*_optimal` modes at DB build time). Useful for closing native (typically aliphatic) rings.
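The `ring_size` parameter of `make_cycle` is documented as being converted, for each pair of anchor atoms, into a `dist2` filter equal to `ring_size − d_in`, where `d_in` is the topological distance between the two anchors in the input molecule. The arithmetic can be sketched as below. The function name `dist2_window` is hypothetical, and how crem treats windows that become zero or negative is not covered by the documentation, so no clamping is attempted here.

```python
def dist2_window(ring_size, d_in):
    """Return the (lower, upper) dist2 bounds for one anchor pair, or None.

    ring_size: int, (min, max) tuple, or None for no constraint.
    d_in: topological distance (in bonds) between the two anchor atoms.
    """
    if ring_size is None:
        return None
    if isinstance(ring_size, int):
        lo = hi = ring_size
    else:
        lo, hi = ring_size
    # Documented rule: dist2 = ring_size - d_in, applied to both bounds.
    return (lo - d_in, hi - d_in)


six_ring = dist2_window(6, 2)
macrocycle = dist2_window((12, 16), 5)
unconstrained = dist2_window(None, 5)
```

For anchors two bonds apart, a six-membered ring therefore needs a linker whose attachment points are four bonds apart.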
Source code in crem/crem.py
def make_cycle(mol, db_name, radius=3, ring_size=None, ring_closures=True,
min_atoms=1, max_atoms=10, max_replacements=None,
replace_ids=None, protected_ids=None, symmetry_fixes=False, min_freq=0,
return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None,
sample_func=None, set_names=None, seed=None, **kwargs):
"""
Generate new rings (macrocycles or smaller native cycles) by linking two
atoms in the same molecule with a 2-attachment-point fragment from the DB.
Two complementary modes:
* ``ring_closures=False`` (broad): query **any** linker fragment.
Internally both fragmenters are run on the input molecule (the
connected-env arc-cut fragmenter and the disconnected-env macrocycle
fragmenter) and the ``is_ring_closure`` provenance column is **not**
filtered, so DB rows of either provenance can match.
* ``ring_closures=True`` (strict): only the connected-env arc-cut
fragmenter runs and the query is restricted to ``is_ring_closure=1``
rows (populated by ``--frag-mode ring`` / ``both`` or the corresponding
``*_optimal`` modes at DB build time).
Useful for closing native (typically aliphatic) rings.
:param mol: RDKit Mol object.
:param db_name: path to DB file with fragment replacements.
:param radius: radius of context which will be considered for replacement. Default: 3.
:param ring_size: size of the *new* ring being formed (in atoms = bonds).
``int`` for a single size, ``(min, max)`` tuple for a
window. ``None`` imposes no ring-size constraint. The
per-anchor-pair ``dist2`` filter is derived as
``ring_size − d_in`` where ``d_in`` is the topological
distance between the two anchor heavy atoms in the
input molecule.
:param ring_closures: if True (default), query only ring-closure (arc) fragments in DB
(rows with ``is_ring_closure = 1``). If False,
query any linker fragment regardless of ``is_ring_closure``.
:param min_atoms: minimum number of heavy atoms in the linker fragment. Default: 1.
:param max_atoms: maximum number of heavy atoms in the linker fragment. Default: 10.
:param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
greater than the specified value the specified number of randomly chosen replacements
will be applied. Default: None.
:param replace_ids: iterable with ids of heavy atom with replaceable Hs or/and ids of H atoms to replace,
it has lower priority over `protected_ids` (replace_ids
which are present in protected_ids would be protected). Default: None.
:param protected_ids: iterable with ids of heavy atoms at which no H replacement should be made and/or ids of
protected hydrogens. This argument has a higher priority over `replace_ids`. Default: None.
:param symmetry_fixes: accepted for API compatibility with mutate/grow functions but not used here.
:param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
:param return_rxn: whether to additionally return rxn of a transformation. Default: False.
:param return_rxn_freq: whether to additionally return the frequency of a transformation in the DB. Default: False.
:param return_mol: whether to additionally return RDKit Mol object of a generated molecule. Default: False.
:param ncores: number of cores. Default: 1.
:param filter_func: a function which will filter selected fragments by additional rules
(in this way one may add arbitrary selection constraints). The function takes three required
leading arguments: row_ids (list or set of row_ids from the fragment database supplied to
make_cycle), cursor of that fragment database and radius (int). These are required to
access the selected fragments. Other arguments are custom and user-defined.
It is most convenient to define a filtering function, implement specific logic inside and
pass it using functools.partial. The filtering function should return a list/set
of row ids which are a subset of the input row ids.
:param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted
uniform sampling will be used. The function takes four required leading arguments: row_ids
(list or set of row_ids from the fragment database), cursor of that fragment database,
radius (int) and the number of returned items (int). These are required to access the selected
fragments. Other arguments can be custom and user-defined. The function should return
a list/set of selected row ids.
:param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
with min_freq in v1 databases. A fragment is included if at least one of the named sets
satisfies the min_freq threshold (OR logic). If None (default), all available set columns
are used. If a column name is not found, a ValueError is raised listing available set names.
Ignored for v0 databases. Default: None.
:param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
:param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
and upper bound of the corresponding parameter of a fragment.
:return: generator over new molecules. If no additional return arguments were requested this is a generator over
SMILES of new molecules. If additional return values were requested, the function yields a list where
the first item is SMILES, then rxn string (optional), frequency (optional), RDKit Mol object (optional).
Only entries with distinct SMILES will be returned.
"""
    def __get_protected_ids(m, replace_ids, protected_ids):
        # the list of ids of heavy atoms with protected hydrogens should be returned
        if protected_ids:
            ids = set()
            for i in protected_ids:
                if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                    ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors())
                else:
                    ids.add(i)
            protected_ids = ids
        else:
            protected_ids = set()
        if replace_ids:
            ids = set()
            for i in replace_ids:
                if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                    ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors())
                else:
                    ids.add(i)
            heavy_atom_ids = set(a.GetIdx() for a in m.GetAtoms() if a.GetAtomicNum() > 1)
            ids = heavy_atom_ids.difference(ids)  # ids of heavy atoms which should be protected
            protected_ids.update(ids)  # since protected_ids has a higher priority add them anyway
        return protected_ids

    __check_db_existence(db_name)
    products = set()
    mol = Chem.AddHs(mol)
    source_smi = Chem.MolToSmiles(Chem.RemoveHs(mol), isomericSmiles=True)
    protected_ids = __get_protected_ids(mol, replace_ids, protected_ids)
    mol = __backup_atom_properties(mol, __atom_properties_to_backup)
    if ncores == 1:
        for frag_sma, core_sma, freq, context_mol in __gen_replacements(mol1=mol, mol2=None, db_name=db_name,
                                                                        radius=radius,
                                                                        min_size=0, max_size=0,
                                                                        min_rel_size=0, max_rel_size=1,
                                                                        min_inc=min_atoms, max_inc=max_atoms,
                                                                        max_replacements=max_replacements,
                                                                        replace_cycles=False,
                                                                        protected_ids_1=protected_ids,
                                                                        protected_ids_2=None,
                                                                        min_freq=min_freq, set_names=set_names,
                                                                        filter_func=filter_func,
                                                                        sample_func=sample_func,
                                                                        return_frag_smi_only=False,
                                                                        operation="cycle",
                                                                        ring_closures=ring_closures,
                                                                        ring_size=ring_size,
                                                                        seed=seed, **kwargs):
            for smi, m, rxn in __frag_replace(mol, None, frag_sma, core_sma, radius, context_mol):
                if max_replacements is None or (max_replacements is not None and len(products) < max_replacements):
                    if smi != source_smi and smi not in products:
                        products.add(smi)
                        res = [smi]
                        if return_rxn:
                            res.append(rxn)
                        if return_rxn_freq:
                            res.append(freq)
                        if return_mol:
                            res.append(m)
                        if len(res) == 1:
                            yield res[0]
                        else:
                            yield res
    else:
        p = Pool(min(ncores, cpu_count()))
        try:
            for items in p.imap(__frag_replace_mp, __get_data_cycle(mol, db_name, radius, ring_size,
                                                                    ring_closures, min_atoms, max_atoms,
                                                                    protected_ids, min_freq, set_names,
                                                                    max_replacements,
                                                                    filter_func=filter_func,
                                                                    sample_func=sample_func, seed=seed,
                                                                    **kwargs),
                                chunksize=100):
                for smi, m, rxn, freq in items:
                    if max_replacements is None or (max_replacements is not None and len(products) < max_replacements):
                        if smi != source_smi and smi not in products:
                            products.add(smi)
                            res = [smi]
                            if return_rxn:
                                res.append(rxn)
                            if return_rxn_freq:
                                res.append(freq)
                            if return_mol:
                                res.append(m)
                            if len(res) == 1:
                                yield res[0]
                            else:
                                yield res
        finally:
            p.close()
            p.join()
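The `filter_func` and `sample_func` callbacks described in the docstring can be illustrated without a real fragment database. The sketch below is only an illustration: the in-memory table is a toy stand-in that borrows the `radius3`, `core_smi`, `core_sma` and `freq` names from the v0 query shown elsewhere on this page, and the helper names `build_toy_db`, `exclude_substring` and `most_frequent` are hypothetical, not part of crem.

```python
import sqlite3
from functools import partial


def build_toy_db():
    """Create an in-memory stand-in for a v0 fragment DB (toy schema, an assumption)."""
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("CREATE TABLE radius3 (core_smi TEXT, core_sma TEXT, freq INTEGER)")
    cur.executemany(
        "INSERT INTO radius3 (core_smi, core_sma, freq) VALUES (?, ?, ?)",
        [("[*:1]C", "[*:1]C", 50), ("[*:1]Br", "[*:1]Br", 5), ("[*:1]CC", "[*:1]CC", 20)],
    )
    con.commit()
    return cur


def exclude_substring(row_ids, cur, radius, substring):
    """filter_func-style callback: keep rows whose core_smi lacks `substring`."""
    ids = ",".join(str(int(i)) for i in row_ids)
    cur.execute(f"SELECT rowid, core_smi FROM radius{radius} WHERE rowid IN ({ids})")
    # The result is a subset of the input row ids, as the docstring requires.
    return {rowid for rowid, smi in cur.fetchall() if substring not in smi}


def most_frequent(row_ids, cur, radius, n):
    """sample_func-style callback: pick the n most frequent rows instead of a random sample."""
    ids = ",".join(str(int(i)) for i in row_ids)
    cur.execute(
        f"SELECT rowid FROM radius{radius} WHERE rowid IN ({ids}) "
        "ORDER BY freq DESC, rowid LIMIT ?",
        (n,),
    )
    return [rowid for (rowid,) in cur.fetchall()]


cur = build_toy_db()
# Bind the custom argument so the callback keeps the three leading arguments.
no_bromine = partial(exclude_substring, substring="Br")
kept = no_bromine({1, 2, 3}, cur, 3)
top = most_frequent(kept, cur, 3, 1)
```

With a real database, `no_bromine` would be passed as `filter_func` and `most_frequent` as `sample_func`; the extra `substring` argument is the kind of user-defined parameter that `functools.partial` is meant to bind.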
mutate_mol2 ¶
mutate_mol2(*args, **kwargs)
Convenience function for processing molecules in parallel with the multiprocessing module. It calls mutate_mol, which cannot be used directly with multiprocessing because it is a generator.
Source code in crem/crem.py
def mutate_mol2(*args, **kwargs):
    """
    Convenience function for processing molecules in parallel with the multiprocessing module.
    It calls mutate_mol, which cannot be used directly with multiprocessing because it is a generator.
    :param args: positional arguments, the same as in mutate_mol function
    :param kwargs: keyword arguments, the same as in mutate_mol function
    :return: list with output molecules
    """
    return list(mutate_mol(*args, **kwargs))
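The reason a list-returning wrapper is needed can be shown with the standard library alone: results from pool workers are pickled to be sent back to the parent process, and generator objects cannot be pickled. The sketch below uses hypothetical stand-ins (`numbers` for a generator such as mutate_mol, `numbers2` for its wrapper) and does not call crem.

```python
import pickle


def numbers(n):
    """A generator function, standing in for a generator such as mutate_mol."""
    for i in range(n):
        yield i * i


def numbers2(*args, **kwargs):
    """List-returning wrapper, mirroring the pattern used by mutate_mol2."""
    return list(numbers(*args, **kwargs))


# A generator object cannot be pickled, so it cannot travel between processes.
try:
    pickle.dumps(numbers(3))
    generator_is_picklable = True
except TypeError:
    generator_is_picklable = False

# A plain list survives the pickle round trip unchanged.
roundtrip = pickle.loads(pickle.dumps(numbers2(3)))
```

Under the same assumption, a typical call would map the wrapper over a list of molecules, for example `pool.map(functools.partial(mutate_mol2, db_name=...), mols)`, with the database path supplied by the caller.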
grow_mol2 ¶
grow_mol2(*args, **kwargs)
Convenience function for processing molecules in parallel with the multiprocessing module. It calls grow_mol, which cannot be used directly with multiprocessing because it is a generator.
Source code in crem/crem.py
def grow_mol2(*args, **kwargs):
    """
    Convenience function for processing molecules in parallel with the multiprocessing module.
    It calls grow_mol, which cannot be used directly with multiprocessing because it is a generator.
    :param args: positional arguments, the same as in grow_mol function
    :param kwargs: keyword arguments, the same as in grow_mol function
    :return: list with output molecules
    """
    return list(grow_mol(*args, **kwargs))
link_mols2 ¶
link_mols2(*args, **kwargs)
Convenience function for processing molecules in parallel with the multiprocessing module. It calls link_mols, which cannot be used directly with multiprocessing because it is a generator.
Source code in crem/crem.py
def link_mols2(*args, **kwargs):
    """
    Convenience function for processing molecules in parallel with the multiprocessing module.
    It calls link_mols, which cannot be used directly with multiprocessing because it is a generator.
    :param args: positional arguments, the same as in link_mols function
    :param kwargs: keyword arguments, the same as in link_mols function
    :return: list with output molecules
    """
    return list(link_mols(*args, **kwargs))
make_cycle2 ¶
make_cycle2(*args, **kwargs)
Convenience function for processing molecules in parallel with the multiprocessing module. It calls make_cycle, which cannot be used directly with multiprocessing because it is a generator.
Source code in crem/crem.py
def make_cycle2(*args, **kwargs):
    """
    Convenience function for processing molecules in parallel with the multiprocessing module.
    It calls make_cycle, which cannot be used directly with multiprocessing because it is a generator.
    :param args: positional arguments, the same as in make_cycle function
    :param kwargs: keyword arguments, the same as in make_cycle function
    :return: list with output molecules
    """
    return list(make_cycle(*args, **kwargs))
get_replacements ¶
get_replacements(mol1, db_name, radius, mol2=None, dist=None, min_size=0, max_size=8, min_rel_size=0, max_rel_size=1, min_inc=-2, max_inc=2, max_replacements=None, replace_cycles='no', protected_ids_1=None, protected_ids_2=None, replace_ids_1=None, replace_ids_2=None, min_freq=0, symmetry_fixes=False, filter_func=None, sample_func=None, return_frag_smi_only=True, set_names=None, seed=None, **kwargs)
An auxiliary function which returns SMILES of fragments in a DB that satisfy the given criteria.
Source code in crem/crem.py
def get_replacements(mol1, db_name, radius, mol2=None, dist=None, min_size=0, max_size=8, min_rel_size=0,
max_rel_size=1, min_inc=-2, max_inc=2, max_replacements=None, replace_cycles="no",
protected_ids_1=None, protected_ids_2=None, replace_ids_1=None,
replace_ids_2=None, min_freq=0, symmetry_fixes=False, filter_func=None, sample_func=None,
return_frag_smi_only=True,
set_names=None, seed=None, **kwargs):
"""
An auxiliary function which returns SMILES of fragments in a DB that satisfy the given criteria.
:param mol1: RDKit Mol object
:param db_name: path to DB file with fragment replacements.
:param radius: radius of context which will be considered for replacement.
:param mol2: a second RDKit Mol object if searching for linking fragments
:param dist: topological distance between two attachment points in the fragment which will link molecules.
Can be a single integer or a tuple of lower and upper bound values.
:param min_size: minimum number of heavy atoms in a fragment to replace. If 0 - hydrogens will be replaced
(if they are explicit).
:param max_size: maximum number of heavy atoms in a fragment to replace.
:param min_rel_size: minimum relative size of a replaced fragment to the whole molecule
(in terms of a number of heavy atoms)
:param max_rel_size: maximum relative size of a replaced fragment to the whole molecule
(in terms of a number of heavy atoms)
:param min_inc: minimum change in the number of heavy atoms of a replacing fragment relative to the
replaced one. A negative value means that replacing fragments may be smaller than the replaced
one by up to the specified number of heavy atoms.
:param max_inc: maximum change in the number of heavy atoms of a replacing fragment relative to the
replaced one.
:param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
greater than the specified value the specified number of randomly chosen replacements
will be applied.
:param replace_cycles: controls replacement of cyclic source fragments for
single-molecule searches. ``"no"``/False uses
ordinary acyclic-cut mutation only.
``"forced"``/True allows cyclic cores from
ordinary fragmentation to be replaced ignoring the size filters.
``"partial_all"`` additionally searches partial
ring arcs with exhaustive side cuts.
``"partial_exo"`` additionally searches partial
ring arcs with only exo side cuts. Ignored for
link searches. Default: ``"no"``.
:param protected_ids_1: iterable with atom ids which will not be mutated in mol1. If the molecule was supplied with
explicit hydrogens, the ids of protected hydrogens should be supplied as well, otherwise they
will be replaced.
Ids of all equivalent atoms should be supplied (e.g. to protect the meta-position in toluene,
ids of both carbons in meta-positions should be supplied).
This argument has a higher priority than `replace_ids_1`.
:param protected_ids_2: iterable with atom ids which will not be mutated in mol2. If the molecule was supplied with
explicit hydrogens, the ids of protected hydrogens should be supplied as well, otherwise they
will be replaced.
Ids of all equivalent atoms should be supplied (e.g. to protect the meta-position in toluene,
ids of both carbons in meta-positions should be supplied).
This argument has a higher priority than `replace_ids_2`.
:param replace_ids_1: iterable with atom ids to replace in mol1. It has a lower priority than `protected_ids_1`
(ids present in both arguments will be protected).
:param replace_ids_2: iterable with atom ids to replace in mol2. It has a lower priority than `protected_ids_2`
(ids present in both arguments will be protected).
:param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
:param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
with min_freq in v1 databases. A fragment is included if at least one of the named sets
satisfies the min_freq threshold (OR logic). If None (default), all available set columns
are used. If a column name is not found, a ValueError is raised listing available set names.
Ignored for v0 databases. Default: None.
:param symmetry_fixes: if True, duplicated fragments whose equivalent atoms have different ids will also be
enumerated. This makes sense if one wants to replace particular atom(s) which have
equivalent ones, because by default only the equivalent atoms with the lowest ids
are replaced. If several equivalent atoms are selected, duplicated molecules will be generated
and filtered out later, so it is not very reasonable
to use this argument and select several equivalent atoms to replace.
This works around the RDKit MMPA fragmenter, which removes duplicates internally.
:param filter_func: a function which will filter selected fragments by additional rules
(in this way one may add arbitrary selection constraints). The function must accept three
leading arguments: row_ids (list or set of row_ids from the fragment database supplied to
this function), cursor of that fragment database and radius (int). These are required to
access the selected fragments. Any further arguments are custom and user-defined.
The most convenient approach is to define a filtering function, implement the specific logic inside and
bind its extra arguments using functools.partial. The filtering function should return a list/set
of row ids which is a subset of the input row ids.
:param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted,
uniform sampling will be used. The function must accept four leading arguments: row_ids
(list or set of row_ids from the fragment database), cursor of that fragment database,
radius (int) and the number of returned items (int). These are required to access the selected
fragments. Any further arguments can be custom and user-defined. The function should return
a list/set of selected row ids.
:param return_frag_smi_only: controls whether to return only SMILES of fragments selected from a database (True)
or a tuple `(source_core_smi, replacement_core_smi, freq, context_mol)` (False) which can be
further passed to `get_mols_from_replacements`. Default: True.
:param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
:param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
and upper bound of the corresponding parameter of a fragment. This can be useful to annotate
fragments with additional custom properties (e.g. number of particular pharmacophore features,
lipophilicity, etc) and use these parameters to additionally restrict selected fragments.
:return: generator over SMILES of fragments in a DB which satisfy the given criteria, or over 4-item tuples if
return_frag_smi_only is False
"""
    replace_cycles = _normalize_replace_cycles(replace_cycles)
    protected_ids_1 = set(protected_ids_1) if protected_ids_1 else set()
    if replace_ids_1:
        replace_ids_1 = set(replace_ids_1) if replace_ids_1 else set()
        protected_ids_1 = set(protected_ids_1) | set(range(mol1.GetNumAtoms())).difference(replace_ids_1)
    if isinstance(mol2, Chem.Mol):
        protected_ids_2 = set(protected_ids_2) if protected_ids_2 else set()
        if replace_ids_2:
            replace_ids_2 = set(replace_ids_2) if replace_ids_2 else set()
            protected_ids_2 = set(protected_ids_2) | set(range(mol2.GetNumAtoms())).difference(replace_ids_2)
    else:
        protected_ids_2 = None
    mol1 = __backup_atom_properties(mol1, __atom_properties_to_backup)
    if isinstance(mol2, Chem.Mol):
        mol2 = __backup_atom_properties(mol2, __atom_properties_to_backup)
    for res in __gen_replacements(mol1=mol1, mol2=mol2, db_name=db_name, radius=radius, dist=dist,
                                  min_size=min_size, max_size=max_size, min_rel_size=min_rel_size,
                                  max_rel_size=max_rel_size, min_inc=min_inc, max_inc=max_inc,
                                  max_replacements=max_replacements, replace_cycles=replace_cycles,
                                  protected_ids_1=protected_ids_1, protected_ids_2=protected_ids_2,
                                  min_freq=min_freq, set_names=set_names, symmetry_fixes=symmetry_fixes,
                                  filter_func=filter_func, sample_func=sample_func,
                                  return_frag_smi_only=return_frag_smi_only,
                                  operation=("link" if isinstance(mol2, Chem.Mol) else "mutate"),
                                  seed=seed, **kwargs):
        if return_frag_smi_only:
            yield res
        else:
            src_core, repl_core, freq, context_mol = res
            yield src_core, repl_core, freq, __prepare_context_mol_for_output(context_mol)
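The precedence of `protected_ids_*` over `replace_ids_*` follows from how the two arguments are combined at the start of the function body: every atom that is not listed for replacement is added to the protected set. A minimal sketch of that set arithmetic, using plain integers instead of an RDKit molecule (the helper name `effective_protected_ids` and the `num_atoms` argument are hypothetical):

```python
def effective_protected_ids(num_atoms, protected_ids=None, replace_ids=None):
    """Combine the two arguments the way the function body above does for mol1."""
    protected = set(protected_ids) if protected_ids else set()
    if replace_ids:
        # Every atom not explicitly marked for replacement becomes protected.
        protected |= set(range(num_atoms)) - set(replace_ids)
    return protected


# Only atoms 4 and 5 may be replaced, so atoms 0-3 end up protected.
only_replace = effective_protected_ids(6, replace_ids=[4, 5])

# Atom 5 is listed in both arguments; it stays protected.
both = effective_protected_ids(6, protected_ids=[5], replace_ids=[4, 5])
```

In the second call only atom 4 remains replaceable, which is what the docstring means by `protected_ids_1` having the higher priority.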
get_mols_from_replacements ¶
get_mols_from_replacements(mol1, radius, replacements, mol2=None, return_rxn=False, return_rxn_freq=False, return_mol=False)
Source code in crem/crem.py
def get_mols_from_replacements(mol1, radius, replacements, mol2=None, return_rxn=False, return_rxn_freq=False,
                               return_mol=False):
    if isinstance(mol2, Chem.Mol):
        products = set()
    else:
        products = {Chem.MolToSmiles(Chem.RemoveHs(mol1))}
    for items in replacements:
        if len(items) == 4:
            frag_sma, core_sma, freq, context_mol = items
        else:
            raise ValueError('Each replacement tuple should have 4 items: '
                             '(source_core_smi, replacement_core_smi, freq, context_mol)\n')
        for smi, m, rxn in __frag_replace(mol1, mol2, frag_sma, core_sma, radius, context_mol):
            if smi not in products:
                products.add(smi)
                res = [smi]
                if return_rxn:
                    res.append(rxn)
                if return_rxn_freq:
                    res.append(freq)
                if return_mol:
                    res.append(m)
                if len(res) == 1:
                    yield res[0]
                else:
                    yield res
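The shape of the yielded items depends on the `return_*` flags: a bare SMILES string when no flag is set, otherwise a list that starts with the SMILES and appends the requested extras in a fixed order. The sketch below reproduces only that shaping and de-duplication logic on placeholder strings; `shape_results` and its input tuples are hypothetical and no molecules are built.

```python
def shape_results(candidates, source_smi=None, return_rxn=False,
                  return_rxn_freq=False, return_mol=False):
    """Yield distinct products, shaped according to the return_* flags."""
    # Seeding the set with the source SMILES suppresses the unchanged molecule.
    seen = {source_smi} if source_smi is not None else set()
    for smi, mol, rxn, freq in candidates:
        if smi in seen:
            continue
        seen.add(smi)
        res = [smi]
        if return_rxn:
            res.append(rxn)
        if return_rxn_freq:
            res.append(freq)
        if return_mol:
            res.append(mol)
        yield res[0] if len(res) == 1 else res


# Placeholder candidates: (smiles, mol, rxn, freq); the strings are not real objects.
candidates = [
    ("CCO", "mol-A", "rxn-A", 12),
    ("CCO", "mol-A2", "rxn-A2", 3),   # duplicate SMILES, skipped
    ("CC", "mol-S", "rxn-S", 1),      # same as the source, skipped
    ("CCN", "mol-B", "rxn-B", 7),
]
plain = list(shape_results(candidates, source_smi="CC"))
detailed = list(shape_results(candidates, source_smi="CC", return_rxn=True, return_mol=True))
```

Callers that toggle the flags should therefore expect either strings or lists, never a mix within one call.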
_get_replacements ¶
_get_replacements(db_cur, radius, row_ids, schema_meta=None)
Source code in crem/crem.py
def _get_replacements(db_cur, radius, row_ids, schema_meta=None):
    if schema_meta is None:
        schema_meta = _load_schema_meta(db_cur, radius)
    user_version = schema_meta['user_version']
    if user_version == 0:
        sql = f"""SELECT rowid, core_smi, core_sma, freq
                  FROM radius{radius}
                  WHERE rowid IN ({','.join(map(str, row_ids))})"""
    elif user_version == 1:
        # Note: freq was removed from DB, therefore 0 is returned (maybe None is better)
        sql = f"""SELECT r.rowid, f.core_smi
                  FROM radius{radius} r
                  JOIN frags f ON r.core_smi_id = f.core_smi_id
                  WHERE r.rowid IN ({','.join(map(str, row_ids))})"""
    else:
        raise NotImplementedError('Not implemented for database version other than 0 and 1')
    db_cur.execute(sql)
    if user_version == 1:
        # Keep tuple shape identical to user_version 0 for compatibility.
        return [(row_id, core_smi, core_smi, 0) for row_id, core_smi in db_cur.fetchall()]
    return db_cur.fetchall()
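For version 1 databases the function pads each row to the four-item shape used by version 0, repeating `core_smi` in place of `core_sma` and returning 0 for the frequency. The sketch below reproduces that branch against a toy in-memory database; the two-table layout is reduced to the columns named in the query above, and the helper `fetch_v1_rows` as well as the added `ORDER BY` (used only to make the output deterministic) are assumptions of this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Toy v1-like layout: radius rows point to fragments through core_smi_id.
cur.execute("CREATE TABLE frags (core_smi_id INTEGER PRIMARY KEY, core_smi TEXT)")
cur.execute("CREATE TABLE radius3 (core_smi_id INTEGER)")
cur.executemany("INSERT INTO frags VALUES (?, ?)", [(10, "[*:1]C"), (11, "[*:1]N")])
cur.executemany("INSERT INTO radius3 (core_smi_id) VALUES (?)", [(11,), (10,)])
con.commit()


def fetch_v1_rows(cur, radius, row_ids):
    """Return (rowid, core_smi, core_smi, 0) tuples, matching the v0 tuple shape."""
    ids = ",".join(str(int(i)) for i in row_ids)
    cur.execute(
        f"""SELECT r.rowid, f.core_smi
            FROM radius{radius} r
            JOIN frags f ON r.core_smi_id = f.core_smi_id
            WHERE r.rowid IN ({ids})
            ORDER BY r.rowid"""
    )
    # SMILES is repeated where v0 has SMARTS; the frequency is a constant 0.
    return [(row_id, smi, smi, 0) for row_id, smi in cur.fetchall()]


rows = fetch_v1_rows(cur, 3, [1, 2])
```

Code that consumes these tuples should not rely on the fourth item as a real frequency when working with a version 1 database.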