API reference: crem.utils

Iterative enumeration and the built-in filter_func / sample_func helpers.

utils

enumerate_compounds

enumerate_compounds(mol, db_fname, mode='scaffold', n_iterations=1, radius=3, max_replacements=None, protected_ids=None, replace_ids=None, min_freq=0, protect_added_frag=False, return_smi=False, ncpu=None, **kwargs)

Convenience function to perform scaffold decoration or enumeration of analog series. This performs in multiple iterations by modification of compounds enumerated on the previous iteration. May result in combinatorial explosion. The function returns the list of distinct molecules generated over all iterations.

Scaffold decoration uses grow procedure. Hydrogens will be added to the supplied molecule and replaced with fragments from the database. A user can protect particular positions from expansion by setting protect_ids argument. Note: The value of protect_added_frag parameter is ignored. New fragments cannot be attached to previously added fragments.

Enumeration of analog series uses mutate procedure. The molecule should be supplied with explicit hydrogens if one wants to replace them as well, otherwise only heavy atoms will be considered for replacements. It is recommended to set an altered part of a molecule not too large to obtain reasonable suggestions.

Parameters:
  • mol
  • db_fname

    path to DB file with fragment replacements.

  • mode

    'scaffold' decoration or 'analogs' enumeration. In 'scaffold' mode the supplied molecule will be substituted with fragments from a database. In 'analogs' mode the supplied molecule wil be mutated. Default: scaffold.

  • n_iterations

    the number of rounds of generation. Molecules generated on the previous round are supplied to the next iteration. Be careful setting this parameter as it may cause combinatorial explosion. Default: 1.

  • radius

    radius of context which will be considered for replacement. Default: 3.

  • max_replacements

    maximum number of replacements to make for each molecule on each iteration. None will result in all possible replacements or it is possible to set a randomly chosen number of replacements. Default: None.

  • protected_ids

    iterable with atom ids which will not be altered. In 'scaffold' mode this can be ids of hydrogens or heavy atoms whose hydrogens should be protected from expansion. In 'analogs' mode these are ids of hydrogens and heavy atoms. Please note, a molecule should be supplied with explicit hydrogens in 'analogs' mode if one wants to replace them. Default: None.

  • replace_ids

    iterable with atom ids to replace, it has lower priority then protected_ids. Default: None.

  • min_freq

    minimum occurrence of fragments in DB for replacement. Default: 0.

  • protect_added_frag

    True or False. If set True new fragments cannot be attached/replace fragments added on previous iterations. Applicable only in 'analogs' mode. In 'scaffold' mode user input is ignored and the argument internally set to True. Default: False

  • return_smi

    if True will return the list of SMILES instead of Mol objects. Default: False.

  • ncpu

    number of cores. None means all cpus.

  • kwargs

    additional keyword arguments forwarded to the underlying generator - grow_mol in 'scaffold' mode (e.g. min_atoms, max_atoms) and mutate_mol in 'analogs' mode (e.g. min_size, max_size, min_inc, max_inc, replace_cycles). See those functions for the meaning of each argument.

Source code in crem/utils.py
def enumerate_compounds(mol, db_fname, mode='scaffold', n_iterations=1, radius=3, max_replacements=None,
                        protected_ids=None, replace_ids=None, min_freq=0, protect_added_frag=False, return_smi=False,
                        ncpu=None, **kwargs):
    '''
    Convenience function to perform scaffold decoration or enumeration of analog series. This performs in multiple
    iterations by modification of compounds enumerated on the previous iteration. May result in combinatorial explosion.
    The function returns the list of distinct molecules generated over all iterations.

    Scaffold decoration uses grow procedure. Hydrogens will be added to the supplied molecule and replaced with
    fragments from the database. A user can protect particular positions from expansion by setting protect_ids argument.
    Note: The value of protect_added_frag parameter is ignored. New fragments cannot be attached to previously
    added fragments.

    Enumeration of analog series uses mutate procedure. The molecule should be supplied with explicit hydrogens if one
    wants to replace them as well, otherwise only heavy atoms will be considered for replacements. It is recommended to
    set an altered part of a molecule not too large to obtain reasonable suggestions.

    :param mol:
    :param db_fname: path to DB file with fragment replacements.
    :param mode: 'scaffold' decoration or 'analogs' enumeration. In 'scaffold' mode the supplied molecule will be
                 substituted with fragments from a database. In 'analogs' mode the supplied molecule wil be mutated.
                 Default: scaffold.
    :param n_iterations: the number of rounds of generation. Molecules generated on the previous round are supplied
                         to the next iteration. Be careful setting this parameter as it may cause combinatorial
                         explosion. Default: 1.
    :param radius: radius of context which will be considered for replacement. Default: 3.
    :param max_replacements: maximum number of replacements to make for each molecule on each iteration. None will
                             result in all possible replacements or it is possible to set a randomly chosen number of
                             replacements. Default: None.
    :param protected_ids: iterable with atom ids which will not be altered. In 'scaffold' mode this can be ids of
                          hydrogens or heavy atoms whose hydrogens should be protected from expansion. In 'analogs' mode
                          these are ids of hydrogens and heavy atoms. Please note, a molecule should be supplied with
                          explicit hydrogens in 'analogs' mode if one wants to replace them. Default: None.
    :param replace_ids: iterable with atom ids to replace, it has lower priority then `protected_ids`. Default: None.
    :param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
    :param protect_added_frag: True or False. If set True new fragments cannot be attached/replace fragments added on
                               previous iterations. Applicable only in 'analogs' mode. In 'scaffold' mode user input is
                               ignored and the argument internally set to True. Default: False
    :param return_smi: if True will return the list of SMILES instead of Mol objects. Default: False.
    :param ncpu: number of cores. None means all cpus.

    :param kwargs: additional keyword arguments forwarded to the underlying generator -
                   ``grow_mol`` in 'scaffold' mode (e.g. ``min_atoms``, ``max_atoms``) and
                   ``mutate_mol`` in 'analogs' mode (e.g. ``min_size``, ``max_size``, ``min_inc``,
                   ``max_inc``, ``replace_cycles``). See those functions for the meaning of each argument.

    '''

    if mode not in ['scaffold', 'analogs']:
        raise ValueError('Wrong mode. Please choose one from the list - "analogs","scaffold"')

    if ncpu is None:
        ncpu = cpu_count()
    else:
        ncpu = max(1, min(int(ncpu), cpu_count()))
    pool = joblib.Parallel(n_jobs=ncpu)

    # to check if the statical arguments are in the kwargs dict
    for kw in ['return_mol', 'return_rxn', 'return_rxn_freq', 'ncores']:
        if kw in kwargs:
            kwargs.pop(kw)

    if mode == 'scaffold':
        protect_added_frag = True

    if protected_ids is None and replace_ids is not None:
        protected_ids = list(set(a.GetIdx() for a in mol.GetAtoms()).difference(replace_ids))
    if protected_ids is None:
        protected_ids = []

    start_mols = {mol: protected_ids}
    # to get results ordered by iterations
    generated_mols = OrderedDict()
    n = 0

    for n in range(n_iterations):
        new_mols = ()
        if mode == 'scaffold':
            new_mols = pool(joblib.delayed(grow_mol2)(m, db_name=db_fname, protected_ids=prot_ids,
                                                      min_freq=min_freq, radius=radius,
                                                      max_replacements=max_replacements,
                                                      return_mol=True, return_rxn=False, return_rxn_freq=False,
                                                      ncores=1 if len(start_mols) > 1 else ncpu, **kwargs)
                            for m, prot_ids in start_mols.items())
        if mode == 'analogs':
            new_mols = pool(joblib.delayed(mutate_mol2)(m, db_name=db_fname, protected_ids=prot_ids,
                                                        min_freq=0, radius=radius, max_replacements=max_replacements,
                                                        return_mol=True, return_rxn=False, return_rxn_freq=False,
                                                        ncores=1 if len(start_mols) > 1 else ncpu, **kwargs)
                            for m, prot_ids in start_mols.items())

        parent_protected_ids_list = start_mols.values()
        start_mols = OrderedDict()
        for childs, parent_protected_ids in zip(new_mols, parent_protected_ids_list):
            for items in childs:
                if items[0] not in generated_mols:
                    protected_ids = __get_child_protected_atom_ids(items[1], parent_protected_ids)
                    if protect_added_frag:
                        protect_added_ids = __get_child_added_atom_ids(items[1])
                        protected_ids = set(protected_ids + protect_added_ids)

                    generated_mols[items[0]] = items[1]
                    start_mols[items[1]] = protected_ids

        if not start_mols:
            break

    if n + 1 < n_iterations:
        sys.stderr.write(f'INFO. Procedure is finished after {n + 1} iterations instead of {n_iterations}\n')

    if not return_smi:
        return list(generated_mols.values())
    else:
        return list(generated_mols.keys())

sample_csp3

sample_csp3(row_ids, cur, radius, n)

Performs random selection of fragments proportionally to a squared fraction of sp3 carbon atoms.

Parameters:
  • row_ids

    the list of row ids of fragments to consider

  • cur

    cursor to the fragment database

  • radius

    context radius

  • n

    the number of fragments to select

Returns:
  • the list of row ids of selected fragments

Source code in crem/utils.py
def sample_csp3(row_ids, cur, radius, n):
    """
    Performs random selection of fragments proportionally to a squared fraction of sp3 carbon atoms.
    :param row_ids: the list of row ids of fragments to consider
    :param cur: cursor to the fragment database
    :param radius: context radius
    :param n: the number of fragments to select
    :return: the list of row ids of selected fragments
    """
    if n >= len(row_ids):
        return row_ids
    d = defaultdict(list)
    for rowid, core_smi, _, _ in _get_replacements(cur, radius, row_ids):
        d[core_smi].append(rowid)
    smis = list(d.keys())
    values = [rdMolDescriptors.CalcFractionCSP3(Chem.MolFromSmiles(smi)) ** 2 for smi in smis]
    values = [v + 0.01 for v in values]
    values = np.array(values) / sum(values)
    selected_smiles = np.random.choice(smis, n, replace=False, p=values).tolist()
    ids = []
    for smi in selected_smiles:
        ids.extend(d[smi])
    if len(ids) < n:
        ids = random.sample(ids, n)
    return ids

filter_max_ring_size

filter_max_ring_size(row_ids, cur, radius, max_size=6)

Remove fragments having a ring size greater than a maximum threshold value

Parameters:
  • row_ids

    the list of row ids of fragments to consider

  • cur

    cursor to the fragment database

  • radius

    context radius

  • max_size

    maximum allowed ring size

Returns:
  • the list of row ids of selected fragments

Source code in crem/utils.py
def filter_max_ring_size(row_ids, cur, radius, max_size=6):
    """
    Remove fragments having a ring size greater than a maximum threshold value
    :param row_ids: the list of row ids of fragments to consider
    :param cur: cursor to the fragment database
    :param radius: context radius
    :param max_size: maximum allowed ring size
    :return: the list of row ids of selected fragments
    """
    d = defaultdict(list)
    for rowid, core_smi, _, _ in _get_replacements(cur, radius, row_ids):
        d[core_smi].append(rowid)
    smis = list(d.keys())
    for smi in smis:
        mol = Chem.MolFromSmiles(smi)
        w = mol.GetRingInfo()
        rings = w.AtomRings()
        if rings and max(len(atom_ids) for atom_ids in rings) > max_size:
            del d[smi]
    ids = []
    for v in d.values():
        ids.extend(v)
    return ids

filter_acyclic_attachment_points

filter_acyclic_attachment_points(row_ids, cur, radius)

Keep only fragments where each attachment point [*:n] is attached to an acyclic atom.

Parameters:
  • row_ids

    the list of row ids of fragments to consider

  • cur

    cursor to the fragment database

  • radius

    context radius

Returns:
  • the list of row ids of selected fragments

Source code in crem/utils.py
def filter_acyclic_attachment_points(row_ids, cur, radius):
    """
    Keep only fragments where each attachment point [*:n] is attached to an acyclic atom.

    :param row_ids: the list of row ids of fragments to consider
    :param cur: cursor to the fragment database
    :param radius: context radius
    :return: the list of row ids of selected fragments
    """
    d = defaultdict(list)
    for rowid, core_smi, _, _ in _get_replacements(cur, radius, row_ids):
        d[core_smi].append(rowid)

    for smi in list(d.keys()):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            del d[smi]
            continue

        keep = True
        for atom in mol.GetAtoms():
            if atom.GetAtomicNum() != 0:
                continue
            # Attachment dummy atoms in CReM cores should have a single neighbor.
            if not atom.GetNeighbors():
                keep = False
                break
            neighbor = atom.GetNeighbors()[0]
            if neighbor.IsInRing():
                keep = False
                break

        if not keep:
            del d[smi]

    ids = []
    for v in d.values():
        ids.extend(v)
    return ids