API reference: crem.crem

The structure-generation API. See Operations for task-oriented guides.

crem

mutate_mol

mutate_mol(mol, db_name, radius=3, min_size=0, max_size=10, min_rel_size=0, max_rel_size=1, min_inc=-2, max_inc=2, max_replacements=None, replace_cycles='no', replace_ids=None, protected_ids=None, symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs)

Generator of new molecules by replacement of fragments in the supplied molecule with fragments from DB.

Parameters:
  • mol

    RDKit Mol object. If hydrogens are explicit they will be replaced as well, otherwise not.

  • db_name

    path to DB file with fragment replacements.

  • radius

    radius of context which will be considered for replacement. Default: 3.

  • min_size

    minimum number of heavy atoms in a fragment to replace. If 0, hydrogens will be replaced as well (if they are explicit). Default: 0.

  • max_size

    maximum number of heavy atoms in a fragment to replace. Default: 10.

  • min_rel_size

    minimum size of a replaced fragment relative to the whole molecule (in terms of the number of heavy atoms). Default: 0.

  • max_rel_size

    maximum size of a replaced fragment relative to the whole molecule (in terms of the number of heavy atoms). Default: 1.

  • min_inc

    minimum change in the number of heavy atoms between a replacing fragment and the replaced one. A negative value means that replacing fragments may be smaller than the replaced one by up to the specified number of heavy atoms. Default: -2.

  • max_inc

    maximum change in the number of heavy atoms between a replacing fragment and the replaced one. Default: 2.

  • max_replacements

    maximum number of replacements to make. If the number of replacements available in the DB exceeds this value, the specified number of randomly chosen replacements will be applied. Default: None.

  • replace_cycles

    controls replacement of cyclic source fragments. "no"/False uses ordinary acyclic-cut mutation only. "forced"/True allows cyclic cores from ordinary fragmentation to be replaced ignoring the size filters. "partial_all" additionally replaces partial ring arcs using exhaustive side cuts. "partial_exo" additionally replaces partial ring arcs using only exo side cuts adjacent to the selected ring arc. Default: "no".

  • replace_ids

    iterable with ids of atoms to replace. It has lower priority than protected_ids (ids present in both replace_ids and protected_ids will be protected). Ids of hydrogen atoms (if any) connected to the specified heavy atoms will be automatically labeled as replaceable. Default: None.

  • protected_ids

    iterable with ids of atoms which will not be mutated. If the molecule was supplied with explicit hydrogens, the ids of protected hydrogens should be supplied as well, otherwise they will be replaced. Ids of all equivalent atoms should be supplied (e.g. to protect the meta-position in toluene, the ids of both meta carbons should be supplied). This argument has higher priority than replace_ids. Default: None.

  • symmetry_fixes

    if True, duplicated fragments whose equivalent atoms have different ids will be enumerated. This is useful when one wants to replace particular atom(s) that have equivalent ones; by default, among equivalent atoms only those with the lowest ids are replaced. If several equivalent atoms are selected, duplicated molecules will be generated, although they are filtered out later, so combining this argument with a selection of several equivalent atoms is not very reasonable. This works around the RDKit MMPA fragmenter, which removes duplicates internally. Default: False.

  • min_freq

    minimum occurrence of fragments in DB for replacement. Default: 0.

  • set_names

    column name or list of column names in radius tables defining the set(s) of fragments to use with min_freq in v1 databases. A fragment is included if at least one of the named sets satisfies the min_freq threshold (OR logic). If None (default), all available set columns are used. If a column name is not found, a ValueError is raised listing available set names. Ignored for v0 databases. Default: None.

  • return_rxn

    whether to additionally return rxn of a transformation. Default: False.

  • return_rxn_freq

    whether to additionally return the frequency of a transformation in the DB. Default: False.

  • return_mol

    whether to additionally return RDKit Mol object of a generated molecule. Default: False.

  • ncores

    number of cores. Default: 1.

  • filter_func

    a function which filters selected fragments by additional rules (in this way one may add arbitrary selection constraints). Its first three arguments are required: row_ids (list or set of row ids from the fragment database supplied to the mutate_mol function), a cursor of that fragment database and radius (int). These are needed to access the selected fragments. Other arguments are custom and user-defined. The most convenient approach is to define a filtering function, implement the specific logic inside and pass it to mutate_mol using functools.partial. The filtering function should return a list/set of row ids which is a subset of the input row ids. Default: None.

  • sample_func

    a function which samples selected fragments if max_replacements is supplied. If omitted, uniform sampling is used. Its first four arguments are required: row_ids (list or set of row ids from the fragment database), a cursor of that fragment database, radius (int) and the number of items to return (int). These are needed to access the selected fragments. Other arguments can be custom and user-defined. The function should return a list/set of selected row ids. Default: None.

  • seed

    random seed for reproducible fragment selection when max_replacements is set. Default: None.

  • **kwargs

    named arguments to additionally filter replacing fragments. For a v0 DB use columns from radiusX, for a v1 DB use columns from frags or frags_h. Each value is either a single value or a 2-item tuple with the lower and upper bounds of the corresponding fragment parameter. This can be useful to annotate fragments with additional custom properties (e.g. number of particular pharmacophore features, lipophilicity, etc.) and use these parameters to additionally restrict selected fragments.
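
The filter_func contract described above can be sketched as follows. This is only an illustration under stated assumptions: the function name filter_by_max_atoms, its max_atoms argument, and the toy table layout (a table named radius3 with a core_num_atoms column) are hypothetical stand-ins, so the table and column names should be adapted to the schema of the actual fragment database.

```python
import sqlite3
from functools import partial


def filter_by_max_atoms(row_ids, cur, radius, max_atoms=8):
    """Keep only row ids whose fragment has at most `max_atoms` heavy atoms.

    The table name `radius{N}` and the column `core_num_atoms` are
    assumptions about the DB layout, not a documented schema.
    """
    row_ids = list(row_ids)
    if not row_ids:
        return set()
    placeholders = ",".join("?" * len(row_ids))
    cur.execute(
        f"SELECT rowid FROM radius{radius} "
        f"WHERE rowid IN ({placeholders}) AND core_num_atoms <= ?",
        row_ids + [max_atoms],
    )
    # The result is always a subset of the input row ids.
    return {r[0] for r in cur.fetchall()}


# Toy in-memory table standing in for a real fragment database.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE radius3 (core_smi TEXT, core_num_atoms INTEGER)")
cur.executemany(
    "INSERT INTO radius3 (core_smi, core_num_atoms) VALUES (?, ?)",
    [("[*:1]C", 1), ("[*:1]c1ccccc1", 6), ("[*:1]c1ccc2ccccc2c1", 10)],
)

# Bind the custom argument; the resulting callable takes (row_ids, cursor, radius).
my_filter = partial(filter_by_max_atoms, max_atoms=6)
kept = my_filter({1, 2, 3}, cur, 3)
print(sorted(kept))  # [1, 2]
```

A callable built this way is what would be passed as filter_func=my_filter. Defining the function at module level keeps it usable with functools.partial as recommended above.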

Returns:
  • generator over new molecules. If no additional return values are requested, it yields SMILES strings of new molecules. If any additional return values are requested, it yields lists whose first item is the SMILES, followed by the rxn string of the transformation (optional), the frequency of fragment occurrence in the DB (optional) and the RDKit Mol object (optional). Only entries with distinct SMILES are returned.

    Note: supply an RDKit Mol object with explicit hydrogens if H replacement is required.
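
Because the shape of each yielded item depends on the return_* flags, downstream code may find it convenient to normalize items into a fixed structure. The helper below is a hypothetical sketch (unpack_result is not part of crem). It follows the order given above and one detail of the source listing for this function: the frequency is appended only when return_rxn is also set.

```python
def unpack_result(item, return_rxn=False, return_rxn_freq=False, return_mol=False):
    """Normalize one yielded item into a dict with fixed keys.

    Pass the same flag values that were given to the generating function.
    """
    out = {"smiles": None, "rxn": None, "freq": None, "mol": None}
    if isinstance(item, str):
        # No extra return values were requested: the item is a bare SMILES.
        out["smiles"] = item
        return out
    values = iter(item)
    out["smiles"] = next(values)
    if return_rxn:
        out["rxn"] = next(values)
        if return_rxn_freq:
            # In the source listing, freq is appended only together with rxn.
            out["freq"] = next(values)
    if return_mol:
        out["mol"] = next(values)
    return out


# Made-up items illustrating the possible shapes (the strings are placeholders).
plain = unpack_result("CCO")
with_rxn = unpack_result(["CCO", "rxn-placeholder", 7],
                         return_rxn=True, return_rxn_freq=True)
print(plain["smiles"], with_rxn["freq"])  # CCO 7
```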

Source code in crem/crem.py
def mutate_mol(mol, db_name, radius=3, min_size=0, max_size=10, min_rel_size=0, max_rel_size=1, min_inc=-2, max_inc=2,
               max_replacements=None, replace_cycles="no", replace_ids=None, protected_ids=None,
               symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1,
               filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs):
    """
    Generator of new molecules by replacement of fragments in the supplied molecule with fragments from DB.

    :param mol: RDKit Mol object. If hydrogens are explicit they will be replaced as well, otherwise not.
    :param db_name: path to DB file with fragment replacements.
    :param radius: radius of context which will be considered for replacement. Default: 3.
    :param min_size: minimum number of heavy atoms in a fragment to replace. If 0 - hydrogens will be replaced
                     (if they are explicit). Default: 0.
    :param max_size: maximum number of heavy atoms in a fragment to replace. Default: 10.
    :param min_rel_size: minimum relative size of a replaced fragment to the whole molecule
                         (in terms of a number of heavy atoms)
    :param max_rel_size: maximum relative size of a replaced fragment to the whole molecule
                         (in terms of a number of heavy atoms)
    :param min_inc: minimum change of a number of heavy atoms in replacing fragments to a number of heavy atoms in
                    replaced one. Negative value means that the replacing fragments would be smaller than the replaced
                    one on a specified number of heavy atoms. Default: -2.
    :param max_inc: maximum change of a number of heavy atoms in replacing fragments to a number of heavy atoms in
                    replaced one. Default: 2.
    :param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
                             greater than the specified value the specified number of randomly chosen replacements
                             will be applied. Default: None.
    :param replace_cycles: controls replacement of cyclic source fragments.
                           ``"no"``/False uses ordinary acyclic-cut mutation
                           only. ``"forced"``/True allows cyclic cores from
                           ordinary fragmentation to be replaced ignoring the size filters.
                           ``"partial_all"`` additionally replaces partial
                           ring arcs using exhaustive side cuts.
                           ``"partial_exo"`` additionally replaces partial
                           ring arcs using only exo side cuts adjacent to the
                           selected ring arc. Default: ``"no"``.
    :param replace_ids: iterable with atom ids to replace, it has lower priority over `protected_ids` (replace_ids
                        which are present in protected_ids would be protected).
                        Ids of hydrogen atoms (if any) connected to the specified heavy atoms will be automatically
                        labeled as replaceable. Default: None.
    :param protected_ids: iterable with atom ids which will not be mutated. If the molecule was supplied with explicit
                          hydrogen the ids of protected hydrogens should be supplied as well, otherwise they will be
                          replaced.
                          Ids of all equivalent atoms should be supplied (e.g. to protect meta-position in toluene
                          ids of both carbons in meta-positions should be supplied)
                          This argument has a higher priority over `replace_ids`. Default: None.
    :param symmetry_fixes: if set True duplicated fragments with equivalent atoms having different ids will be
                           enumerated. This makes sense if one wants to replace particular atom(s) which have
                           equivalent ones. By default, among equivalent atoms only those with the lowest ids
                           are replaced. This will result in generation of duplicated molecules if several equivalent
                           atoms are selected which will be filtered later nevertheless. So, it is not very reasonable
                           to use this argument and select several equivalent atoms to replace.
                           This solves the issue of rdkit MMPA fragmenter which removes duplicates internally.
    :param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
    :param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
                      with min_freq in v1 databases. A fragment is included if at least one of the named sets
                      satisfies the min_freq threshold (OR logic). If None (default), all available set columns
                      are used. If a column name is not found, a ValueError is raised listing available set names.
                      Ignored for v0 databases. Default: None.
    :param return_rxn: whether to additionally return rxn of a transformation. Default: False.
    :param return_rxn_freq: whether to additionally return the frequency of a transformation in the DB.  Default: False.
    :param return_mol: whether to additionally return RDKit Mol object of a generated molecule.  Default: False.
    :param ncores: number of cores. Default: 1.
    :param filter_func: a function which will filter selected fragments by additional rules
                        (in this way one may add arbitrary selection constraints). The function takes necessary first
                        three arguments: row_ids (list or set of row_ids from the fragment database supplied to
                        the mutate_mol function), cursor of that fragment database and radius (int). This is required
                        to access the selected fragments. Other arguments are custom and user-defined.
                        It is the most convenient to define a filtering function, implement specific logic inside and
                        pass it to mutate_mol using functools.partial. The filtering function should return a list/set
                        of row ids which are a subset of the input row ids.
    :param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted
                        uniform sampling will be used. The function takes necessary first four arguments: row_ids
                        (list or set of row_ids from the fragment database), cursor of that fragment database,
                        radius (int) and the number of returned items (int). This is required to access the selected
                        fragments. Other arguments can be custom and user-defined. The function should return
                        a list/set of selected row ids.
    :param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
    :param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
                     for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
                     and upper bound of the corresponding parameter of a fragment. This can be useful to annotate
                     fragments with additional custom properties (e.g. number of particular pharmacophore features,
                     lipophilicity, etc) and use these parameters to additionally restrict selected fragments.
    :return: generator over new molecules. If no additional return arguments were called this would be a generator over
             SMILES of new molecules. If any of additional return values were asked the function will return a list
             of list where the first item is SMILES, then rxn string of a transformation (optional), frequency of
             fragment occurrence in the DB (optional), RDKit Mol object (optional).
             Only entries with distinct SMILES will be returned.

    Note: supply RDKit Mol object with explicit hydrogens if H replacement is required

    """

    replace_cycles = _normalize_replace_cycles(replace_cycles)

    __check_db_existence(db_name)
    products = {Chem.MolToSmiles(Chem.RemoveHs(mol))}
    mol = __backup_atom_properties(mol, __atom_properties_to_backup)

    protected_ids = set(protected_ids) if protected_ids else set()

    if replace_ids:
        ids = set()
        for i in replace_ids:
            ids.update(a.GetIdx() for a in mol.GetAtomWithIdx(i).GetNeighbors() if a.GetAtomicNum() == 1)
        ids = set(a.GetIdx() for a in mol.GetAtoms()).difference(ids).difference(replace_ids)  # ids which should be protected
        protected_ids.update(ids)  # since protected_ids has a higher priority add them anyway

    # protected_ids = sorted(protected_ids)  # why we made sorted?

    if ncores == 1:

        for frag_sma, core_sma, freq, context_mol in __gen_replacements(mol1=mol, mol2=None, db_name=db_name,
                                                                        radius=radius, min_size=min_size,
                                                                        max_size=max_size,
                                                                        min_rel_size=min_rel_size,
                                                                        max_rel_size=max_rel_size,
                                                                        min_inc=min_inc, max_inc=max_inc,
                                                                        max_replacements=max_replacements,
                                                                        replace_cycles=replace_cycles,
                                                                        protected_ids_1=protected_ids,
                                                                        protected_ids_2=None, min_freq=min_freq,
                                                                        set_names=set_names,
                                                                        symmetry_fixes=symmetry_fixes,
                                                                        filter_func=filter_func,
                                                                        sample_func=sample_func,
                                                                        return_frag_smi_only=False,
                                                                        operation="mutate",
                                                                        seed=seed, **kwargs):
            for smi, m, rxn in __frag_replace(mol, None, frag_sma, core_sma, radius, context_mol):
                if max_replacements is None or len(products) < (max_replacements + 1):  # +1 because we added source mol to output smiles
                    if smi not in products:
                        products.add(smi)
                        res = [smi]
                        if return_rxn:
                            res.append(rxn)
                            if return_rxn_freq:
                                res.append(freq)
                        if return_mol:
                            res.append(m)
                        if len(res) == 1:
                            yield res[0]
                        else:
                            yield res
    else:

        p = Pool(min(ncores, cpu_count()))
        try:
            for items in p.imap(__frag_replace_mp, __get_data(mol, db_name, radius, min_size, max_size, min_rel_size,
                                                              max_rel_size, min_inc, max_inc, replace_cycles,
                                                              protected_ids, min_freq, set_names, max_replacements,
                                                              symmetry_fixes, filter_func=filter_func,
                                                              sample_func=sample_func,
                                                              seed=seed, **kwargs),
                                chunksize=100):
                for smi, m, rxn, freq in items:
                    if max_replacements is None or len(products) < (max_replacements + 1):  # +1 because we added source mol to output smiles
                        if smi not in products:
                            products.add(smi)
                            res = [smi]
                            if return_rxn:
                                res.append(rxn)
                                if return_rxn_freq:
                                    res.append(freq)
                            if return_mol:
                                res.append(m)
                            if len(res) == 1:
                                yield res[0]
                            else:
                                yield res
        finally:
            p.close()
            p.join()
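
The handling of replace_ids and protected_ids at the top of the function body can be restated without RDKit to make the priority rule explicit. The sketch below is an assumption-laden paraphrase, not library code: resolve_protected and the h_neighbors mapping (atom id to the ids of its attached explicit hydrogens) are hypothetical names introduced for illustration.

```python
def resolve_protected(all_ids, h_neighbors, replace_ids=None, protected_ids=None):
    """Return the final set of protected atom ids.

    all_ids     -- iterable with the ids of all atoms in the molecule
    h_neighbors -- dict: atom id -> set of ids of attached explicit hydrogens
    """
    protected = set(protected_ids) if protected_ids else set()
    if replace_ids:
        replaceable = set(replace_ids)
        for i in replace_ids:
            # Hydrogens attached to a replaceable atom are replaceable too.
            replaceable |= h_neighbors.get(i, set())
        # Everything that was not marked replaceable becomes protected;
        # explicitly protected ids stay protected because this is a union.
        protected |= set(all_ids) - replaceable
    return protected


# Toy methanol-like numbering: C=0, O=1, H 2-4 on C, H 5 on O.
h_neighbors = {0: {2, 3, 4}, 1: {5}}
print(sorted(resolve_protected(range(6), h_neighbors, replace_ids=[1])))
# [0, 2, 3, 4]
print(sorted(resolve_protected(range(6), h_neighbors,
                               replace_ids=[0, 1], protected_ids=[1])))
# [1]
```

In the second call atom 1 is listed in both arguments and ends up protected, which is the documented priority of protected_ids over replace_ids.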

grow_mol

grow_mol(mol, db_name, radius=3, min_atoms=1, max_atoms=2, max_replacements=None, replace_ids=None, protected_ids=None, symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs)

Replace hydrogens with fragments from the database.

Parameters:
  • mol

    RDKit Mol object.

  • db_name

    path to DB file with fragment replacements.

  • radius

    radius of context which will be considered for replacement. Default: 3.

  • min_atoms

    minimum number of atoms in the fragment which will replace H. Default: 1.

  • max_atoms

    maximum number of atoms in the fragment which will replace H. Default: 2.

  • max_replacements

    maximum number of replacements to make. If the number of replacements available in the DB exceeds this value, the specified number of randomly chosen replacements will be applied. Default: None.

  • replace_ids

    iterable with ids of heavy atoms with replaceable Hs and/or ids of H atoms to replace. It has lower priority than protected_ids (ids present in both replace_ids and protected_ids will be protected). Default: None.

  • protected_ids

    iterable with ids of hydrogen atoms, or ids of heavy atoms at which hydrogens will not be replaced. Ids of all equivalent atoms should be supplied (e.g. to protect the meta-position in toluene, the ids of both meta carbons should be supplied). This argument has higher priority than replace_ids. Default: None.

  • symmetry_fixes

    if True, duplicated fragments whose equivalent atoms have different ids will be enumerated. This is useful when one wants to replace particular atom(s) that have equivalent ones; by default, among equivalent atoms only those with the lowest ids are replaced. If several equivalent atoms are selected, duplicated molecules will be generated, although they are filtered out later, so combining this argument with a selection of several equivalent atoms is not very reasonable. This works around the RDKit MMPA fragmenter, which removes duplicates internally. Default: False.

  • min_freq

    minimum occurrence of fragments in DB for replacement. Default: 0.

  • set_names

    column name or list of column names in radius tables defining the set(s) of fragments to use with min_freq in v1 databases. A fragment is included if at least one of the named sets satisfies the min_freq threshold (OR logic). If None (default), all available set columns are used. If a column name is not found, a ValueError is raised listing available set names. Ignored for v0 databases. Default: None.

  • return_rxn

    whether to additionally return rxn of a transformation. Default: False.

  • return_rxn_freq

    whether to additionally return the frequency of a transformation in the DB. Default: False.

  • return_mol

    whether to additionally return RDKit Mol object of a generated molecule. Default: False.

  • ncores

    number of cores. Default: 1.

  • filter_func

    a function which filters selected fragments by additional rules (in this way one may add arbitrary selection constraints). Its first three arguments are required: row_ids (list or set of row ids from the fragment database supplied to the grow_mol function), a cursor of that fragment database and radius (int). These are needed to access the selected fragments. Other arguments are custom and user-defined. The most convenient approach is to define a filtering function, implement the specific logic inside and pass it to grow_mol using functools.partial. The filtering function should return a list/set of row ids which is a subset of the input row ids. Default: None.

  • sample_func

    a function which samples selected fragments if max_replacements is supplied. If omitted, uniform sampling is used. Its first four arguments are required: row_ids (list or set of row ids from the fragment database), a cursor of that fragment database, radius (int) and the number of items to return (int). These are needed to access the selected fragments. Other arguments can be custom and user-defined. The function should return a list/set of selected row ids. Default: None.

  • seed

    random seed for reproducible fragment selection when max_replacements is set. Default: None.

  • **kwargs

    named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX, for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower and upper bound of the corresponding parameter of a fragment. This can be useful to annotate fragments with additional custom properties (e.g. number of particular pharmacophore features, lipophilicity, etc) and use these parameters to additionally restrict selected fragments.
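
A custom sample_func following the four-argument contract above might look like the sketch below, which draws row ids without replacement with probability proportional to an occurrence count. The names sample_by_freq and rng, and the toy table layout (a table named radius3 with a freq column), are assumptions made for illustration; a real database may store frequencies under different table and column names.

```python
import random
import sqlite3
from functools import partial


def sample_by_freq(row_ids, cur, radius, n, rng=None):
    """Pick up to `n` row ids, favouring frequent fragments.

    Table `radius{N}` and column `freq` are assumptions about the DB layout.
    """
    rng = rng or random.Random()
    row_ids = list(row_ids)
    if len(row_ids) <= n:
        return row_ids
    placeholders = ",".join("?" * len(row_ids))
    cur.execute(
        f"SELECT rowid, freq FROM radius{radius} WHERE rowid IN ({placeholders})",
        row_ids,
    )
    pool = dict(cur.fetchall())  # rowid -> freq
    chosen = []
    while len(chosen) < n and pool:
        ids = list(pool)
        # Clamp weights to at least 1 so an all-zero pool cannot raise.
        weights = [max(pool[i], 1) for i in ids]
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sampling without replacement
    return chosen


# Toy in-memory table standing in for a real fragment database.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE radius3 (core_smi TEXT, freq INTEGER)")
cur.executemany(
    "INSERT INTO radius3 (core_smi, freq) VALUES (?, ?)",
    [("[*:1]C", 500), ("[*:1]O", 300), ("[*:1]N", 20), ("[*:1]S", 1)],
)

picked = sample_by_freq([1, 2, 3, 4], cur, 3, 2, rng=random.Random(42))
print(len(picked))  # 2

# Binding the extra argument yields a callable with the required signature.
my_sampler = partial(sample_by_freq, rng=random.Random(42))
```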

Returns:
  • generator over new molecules. If no additional return values are requested, it yields SMILES strings of new molecules. If any additional return values are requested, it yields lists whose first item is the SMILES, followed by the rxn string of the transformation (optional), the frequency of fragment occurrence in the DB (optional) and the RDKit Mol object (optional). Only entries with distinct SMILES are returned.

Source code in crem/crem.py
def grow_mol(mol, db_name, radius=3, min_atoms=1, max_atoms=2, max_replacements=None, replace_ids=None,
             protected_ids=None, symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False,
             return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs):
    """
    Replace hydrogens with fragments from the database.

    :param mol: RDKit Mol object.
    :param db_name: path to DB file with fragment replacements.
    :param radius: radius of context which will be considered for replacement. Default: 3.
    :param min_atoms: minimum number of atoms in the fragment which will replace H
    :param max_atoms: maximum number of atoms in the fragment which will replace H
    :param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
                             greater than the specified value the specified number of randomly chosen replacements
                             will be applied. Default: None.
    :param replace_ids: iterable with ids of heavy atom with replaceable Hs or/and ids of H atoms to replace,
                        it has lower priority over `protected_ids` (replace_ids
                        which are present in protected_ids would be protected). Default: None.
    :param protected_ids: iterable with hydrogen atom ids or ids of heavy atoms at which hydrogens will not be replaced.
                          Ids of all equivalent atoms should be supplied (e.g. to protect meta-position in toluene
                          ids of both carbons in meta-positions should be supplied).
                          This argument has a higher priority over `replace_ids`. Default: None.
    :param symmetry_fixes: if set True duplicated fragments with equivalent atoms having different ids will be
                           enumerated. This makes sense if one wants to replace particular atom(s) which have
                           equivalent ones. By default, among equivalent atoms only those with the lowest ids
                           are replaced. This will result in generation of duplicated molecules if several equivalent
                           atoms are selected which will be filtered later nevertheless. So, it is not very reasonable
                           to use this argument and select several equivalent atoms to replace.
                           This solves the issue of rdkit MMPA fragmenter which removes duplicates internally.
    :param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
    :param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
                      with min_freq in v1 databases. A fragment is included if at least one of the named sets
                      satisfies the min_freq threshold (OR logic). If None (default), all available set columns
                      are used. If a column name is not found, a ValueError is raised listing available set names.
                      Ignored for v0 databases. Default: None.
    :param return_rxn: whether to additionally return rxn of a transformation. Default: False.
    :param return_rxn_freq: whether to additionally return the frequency of a transformation in the DB.  Default: False.
    :param return_mol: whether to additionally return RDKit Mol object of a generated molecule.  Default: False.
    :param ncores: number of cores. Default: 1.
    :param filter_func: a function which will filter selected fragments by additional rules
                        (in this way one may add arbitrary selection constraints). The function takes necessary first
                        three arguments: row_ids (list or set of row_ids from the fragment database supplied to
                        the grow_mol function), cursor of that fragment database and radius (int). This is required
                        to access the selected fragments. Other arguments are custom and user-defined.
                        It is the most convenient to define a filtering function, implement specific logic inside and
                        pass it to grow_mol using functools.partial. The filtering function should return a list/set
                        of row ids which are a subset of the input row ids.
    :param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted
                        uniform sampling will be used. The function takes necessary first four arguments: row_ids
                        (list or set of row_ids from the fragment database), cursor of that fragment database,
                        radius (int) and the number of returned items (int). This is required to access the selected
                        fragments. Other arguments can be custom and user-defined. The function should return
                        a list/set of selected row ids.
    :param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
    :param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
                     for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
                     and upper bound of the corresponding parameter of a fragment. This can be useful to annotate
                     fragments with additional custom properties (e.g. number of particular pharmacophore features,
                     lipophilicity, etc) and use these parameters to additionally restrict selected fragments.
    :return: generator over new molecules. If no additional return arguments were called this would be a generator over
             SMILES of new molecules. If any of additional return values were asked the function will return a list
             of list where the first item is SMILES, then rxn string of a transformation (optional), frequency of
             fragment occurrence in the DB (optional), RDKit Mol object (optional).
             Only entries with distinct SMILES will be returned.

    """

    __check_db_existence(db_name)
    m = Chem.AddHs(mol)

    # create the list of ids of protected Hs only would be enough, however in the first case (replace_ids) the full list
    # of protected atom ids is created
    if protected_ids:

        ids = []
        for i in protected_ids:
            if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                ids.append(i)
            else:
                for a in m.GetAtomWithIdx(i).GetNeighbors():
                    if a.GetAtomicNum() == 1:
                        ids.append(a.GetIdx())
        protected_ids = set(ids)  # ids of protected Hs

    else:
        protected_ids = set()

    if replace_ids:

        ids = set()  # ids of replaceable Hs
        for i in replace_ids:
            if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                ids.add(i)
            else:
                ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors() if a.GetAtomicNum() == 1)
        ids = set(a.GetIdx() for a in m.GetAtoms() if a.GetAtomicNum() == 1).difference(ids)  # ids of Hs to protect
        protected_ids.update(ids)  # since protected_ids has a higher priority add them anyway

    return mutate_mol(m, db_name, radius, min_size=0, max_size=0, min_inc=min_atoms, max_inc=max_atoms,
                      max_replacements=max_replacements, replace_ids=None, protected_ids=protected_ids,
                      min_freq=min_freq, set_names=set_names, return_rxn=return_rxn, return_rxn_freq=return_rxn_freq,
                      return_mol=return_mol, ncores=ncores, symmetry_fixes=symmetry_fixes, filter_func=filter_func,
                      sample_func=sample_func, seed=seed, **kwargs)
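
The hydrogen bookkeeping in the listing above can be traced without RDKit. The following is a minimal sketch, not the library's code: `protected_hydrogens` and the dictionary-based toy molecule are hypothetical stand-ins for the RDKit Mol. It only mirrors the rule that every hydrogen not covered by replace_ids becomes protected, while protected_ids keeps its higher priority.

```python
def protected_hydrogens(atomic_nums, neighbors, replace_ids=None, protected_ids=None):
    """Return the set of hydrogen ids that must not be replaced.

    atomic_nums: dict atom id -> atomic number (toy stand-in for an RDKit Mol)
    neighbors:   dict atom id -> list of bonded atom ids
    """
    def hs_of(i):
        # a hydrogen id maps to itself, a heavy atom id maps to its attached Hs
        if atomic_nums[i] == 1:
            return {i}
        return {j for j in neighbors[i] if atomic_nums[j] == 1}

    protected = set()
    for i in protected_ids or ():
        protected |= hs_of(i)

    if replace_ids:
        replaceable = set()
        for i in replace_ids:
            replaceable |= hs_of(i)
        all_hs = {i for i, z in atomic_nums.items() if z == 1}
        # every H that was not listed as replaceable becomes protected
        protected |= all_hs - replaceable

    return protected


# Methanol with explicit hydrogens: C0, O1, H2-H4 on carbon, H5 on oxygen.
atomic_nums = {0: 6, 1: 8, 2: 1, 3: 1, 4: 1, 5: 1}
neighbors = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0], 3: [0], 4: [0], 5: [1]}

only_carbon = protected_hydrogens(atomic_nums, neighbors, replace_ids=[0])
conflict = protected_hydrogens(atomic_nums, neighbors, replace_ids=[0], protected_ids=[2])
print(sorted(only_carbon))  # [5]
print(sorted(conflict))     # [2, 5]
```

In the second call atom 2 appears in both arguments (directly in protected_ids, and through its parent carbon in replace_ids), and it stays protected, which matches the documented priority of protected_ids.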
link_mols

link_mols(mol1, mol2, db_name, radius=3, dist=None, min_atoms=1, max_atoms=2, max_replacements=None, replace_ids_1=None, replace_ids_2=None, protected_ids_1=None, protected_ids_2=None, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs)

Link two molecules by a linker from the database.

Parameters:
  • mol1

    the first RDKit Mol object

  • mol2

    the second RDKit Mol object

  • db_name

    path to DB file with fragment replacements.

  • radius

    radius of context which will be considered for replacement. Default: 3.

  • dist

    topological distance between two attachment points in the fragment which will link molecules. Can be a single integer or a tuple of lower and upper bound values. Default: None.

  • min_atoms

    minimum number of heavy atoms in the fragment which will link molecules. Default: 1.

  • max_atoms

    maximum number of heavy atoms in the fragment which will link molecules. Default: 2.

  • max_replacements

    maximum number of replacements to make. If the number of replacements available in DB is greater than the specified value the specified number of randomly chosen replacements will be applied. Default: None.

  • replace_ids_1

    iterable with ids of heavy atoms of the first molecule with replaceable Hs and/or ids of H atoms to replace. It has lower priority than protected_ids_1 (ids listed in both replace_ids_1 and protected_ids_1 are protected). Default: None.

  • replace_ids_2

    iterable with ids of heavy atoms of the second molecule with replaceable Hs and/or ids of H atoms to replace. It has lower priority than protected_ids_2 (ids listed in both replace_ids_2 and protected_ids_2 are protected). Default: None.

  • protected_ids_1

    iterable with ids of heavy atoms of the first molecule at which no H replacement should be made and/or ids of protected hydrogens. This argument has a higher priority over replace_ids_1. Default: None.

  • protected_ids_2

    iterable with ids of heavy atoms of the second molecule at which no H replacement should be made and/or ids of protected hydrogens. This argument has a higher priority over replace_ids_2. Default: None.

  • min_freq

    minimum occurrence of fragments in DB for replacement. Default: 0.

  • set_names

    column name or list of column names in radius tables defining the set(s) of fragments to use with min_freq in v1 databases. A fragment is included if at least one of the named sets satisfies the min_freq threshold (OR logic). If None (default), all available set columns are used. If a column name is not found, a ValueError is raised listing available set names. Ignored for v0 databases. Default: None.

  • return_rxn

    whether to additionally return rxn of a transformation. Default: False.

  • return_rxn_freq

    whether to additionally return the frequency of a transformation in the DB. Default: False.

  • return_mol

    whether to additionally return RDKit Mol object of a generated molecule. Default: False.

  • ncores

    number of cores. Default: 1.

  • filter_func

    a function which will filter selected fragments by additional rules (in this way one may add arbitrary selection constraints). The function takes necessary first three arguments: row_ids (list or set of row_ids from the fragment database supplied to the link_mols function), cursor of that fragment database and radius (int). This is required to access the selected fragments. Other arguments are custom and user-defined. It is most convenient to define a filtering function, implement specific logic inside and pass it to link_mols using functools.partial. The filtering function should return a list/set of row ids which are a subset of the input row ids.

  • sample_func

    a function which will sample selected fragments if max_replacements is supplied. If omitted uniform sampling will be used. The function takes necessary first four arguments: row_ids (list or set of row_ids from the fragment database), cursor of that fragment database, radius (int) and the number of returned items (int). This is required to access the selected fragments. Other arguments can be custom and user-defined. The function should return a list/set of selected row ids.

  • seed

    random seed for reproducible fragment selection when max_replacements is set. Default: None.

  • **kwargs

    named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX, for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower and upper bound of the corresponding parameter of a fragment. This can be useful to annotate fragments with additional custom properties (e.g. number of particular pharmacophore features, lipophilicity, etc) and use these parameters to additionally restrict selected fragments.

Returns:
  • generator over new molecules. If no additional return values are requested, this is a generator over SMILES of new molecules. If any additional return values are requested, each item is a list where the first item is SMILES, followed by the rxn string of the transformation (optional), the frequency of fragment occurrence in the DB (optional) and the RDKit Mol object (optional). Only entries with distinct SMILES will be returned.
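
The filter_func contract described above (first three positional arguments row_ids, cursor and radius, extra arguments bound with functools.partial, a subset of row ids returned) can be sketched against a toy in-memory SQLite database. This is a hedged illustration only: the function name `filter_by_max_mw`, the `mw` column and the exact layout of the `radius3` table are assumptions made for this example and are not taken from a real CReM database.

```python
import sqlite3
from functools import partial


def filter_by_max_mw(row_ids, cur, radius, max_mw):
    """Keep only the row ids whose (hypothetical) mw column is <= max_mw."""
    row_ids = list(row_ids)
    if not row_ids:
        return set()
    placeholders = ",".join("?" * len(row_ids))
    # Table name radius{radius} and column mw are assumptions for this sketch.
    sql = f"SELECT rowid FROM radius{radius} WHERE rowid IN ({placeholders}) AND mw <= ?"
    cur.execute(sql, row_ids + [max_mw])
    return {row[0] for row in cur.fetchall()}


# Toy in-memory database standing in for a real fragment DB.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE radius3 (core_smi TEXT, mw REAL)")
cur.executemany(
    "INSERT INTO radius3 VALUES (?, ?)",
    [("[*:1]C[*:2]", 14.0), ("[*:1]CCO[*:2]", 58.1), ("[*:1]c1ccc([*:2])cc1", 76.1)],
)

# Bind the custom argument; the remaining signature is (row_ids, cur, radius).
light_only = partial(filter_by_max_mw, max_mw=60.0)
kept = light_only({1, 2, 3}, cur, 3)
print(sorted(kept))  # [1, 2]
con.close()
```

A callable built this way would be passed as filter_func=light_only; the row ids, cursor and radius are supplied by the library when it calls the function.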

Source code in crem/crem.py
def link_mols(mol1, mol2, db_name, radius=3, dist=None, min_atoms=1, max_atoms=2, max_replacements=None,
              replace_ids_1=None, replace_ids_2=None, protected_ids_1=None, protected_ids_2=None,
              min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None,
              sample_func=None, set_names=None, seed=None, **kwargs):
    """
    Link two molecules by a linker from the database.

    :param mol1: the first RDKit Mol object
    :param mol2: the second RDKit Mol object
    :param db_name: path to DB file with fragment replacements.
    :param radius: radius of context which will be considered for replacement. Default: 3.
    :param dist: topological distance between two attachment points in the fragment which will link molecules.
                 Can be a single integer or a tuple of lower and upper bound values. Default: None.
    :param min_atoms: minimum number of heavy atoms in the fragment which will link molecules. Default: 1.
    :param max_atoms: maximum number of heavy atoms in the fragment which will link molecules. Default: 2.
    :param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
                             greater than the specified value the specified number of randomly chosen replacements
                             will be applied. Default: None.
    :param replace_ids_1: iterable with ids of heavy atoms of the first molecule with replaceable Hs and/or ids of H
                          atoms to replace. It has lower priority than `protected_ids_1` (ids listed in both
                          `replace_ids_1` and `protected_ids_1` are protected). Default: None.
    :param replace_ids_2: iterable with ids of heavy atoms of the second molecule with replaceable Hs and/or ids of H
                          atoms to replace. It has lower priority than `protected_ids_2` (ids listed in both
                          `replace_ids_2` and `protected_ids_2` are protected). Default: None.
    :param protected_ids_1: iterable with ids of heavy atoms of the first molecule at which no H replacement should
                            be made and/or ids of protected hydrogens.
                            This argument has a higher priority over `replace_ids_1`. Default: None.
    :param protected_ids_2: iterable with ids of heavy atoms of the second molecule at which no H replacement should
                            be made and/or ids of protected hydrogens.
                            This argument has a higher priority over `replace_ids_2`. Default: None.
    :param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
    :param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
                      with min_freq in v1 databases. A fragment is included if at least one of the named sets
                      satisfies the min_freq threshold (OR logic). If None (default), all available set columns
                      are used. If a column name is not found, a ValueError is raised listing available set names.
                      Ignored for v0 databases. Default: None.
    :param return_rxn: whether to additionally return rxn of a transformation. Default: False.
    :param return_rxn_freq: whether to additionally return the frequency of a transformation in the DB. Default: False.
    :param return_mol: whether to additionally return RDKit Mol object of a generated molecule. Default: False.
    :param ncores: number of cores. Default: 1.
    :param filter_func: a function which will filter selected fragments by additional rules
                        (in this way one may add arbitrary selection constraints). The function takes necessary first
                        three arguments: row_ids (list or set of row_ids from the fragment database supplied to
                        the link_mols function), cursor of that fragment database and radius (int). This is required
                        to access the selected fragments. Other arguments are custom and user-defined.
                        It is the most convenient to define a filtering function, implement specific logic inside and
                        pass it to link_mols using functools.partial. The filtering function should return a list/set
                        of row ids which are a subset of the input row ids.
    :param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted
                        uniform sampling will be used. The function takes necessary first four arguments: row_ids
                        (list or set of row_ids from the fragment database), cursor of that fragment database,
                        radius (int) and the number of returned items (int). This is required to access the selected
                        fragments. Other arguments can be custom and user-defined. The function should return
                        a list/set of selected row ids.
    :param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
    :param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
                     for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
                     and upper bound of the corresponding parameter of a fragment. This can be useful to annotate
                     fragments with additional custom properties (e.g. number of particular pharmacophore features,
                     lipophilicity, etc) and use these parameters to additionally restrict selected fragments.
    :return: generator over new molecules. If no additional return values are requested, this is a generator over
             SMILES of new molecules. If any additional return values are requested, each item is a list where the
             first item is SMILES, followed by the rxn string of the transformation (optional), the frequency of
             fragment occurrence in the DB (optional) and the RDKit Mol object (optional).
             Only entries with distinct SMILES will be returned.

    """

    def __get_protected_ids(m, replace_ids, protected_ids):
        # the list of ids of heavy atoms with protected hydrogens should be returned

        if protected_ids:

            ids = set()
            for i in protected_ids:
                if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                    ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors())
                else:
                    ids.add(i)
            protected_ids = ids

        else:
            protected_ids = set()

        if replace_ids:

            ids = set()
            for i in replace_ids:
                if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                    ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors())
                else:
                    ids.add(i)
            heavy_atom_ids = set(a.GetIdx() for a in m.GetAtoms() if a.GetAtomicNum() > 1)
            ids = heavy_atom_ids.difference(ids)  # ids of heavy atoms which should be protected
            protected_ids.update(ids)  # since protected_ids has a higher priority add them anyway

        return protected_ids

    __check_db_existence(db_name)
    products = set()

    mol1 = __backup_atom_properties(Chem.AddHs(mol1), __atom_properties_to_backup)
    mol2 = __backup_atom_properties(Chem.AddHs(mol2), __atom_properties_to_backup)

    protected_ids_1 = __get_protected_ids(mol1, replace_ids_1, protected_ids_1)
    protected_ids_2 = __get_protected_ids(mol2, replace_ids_2, protected_ids_2)

    if ncores == 1:

        for frag_sma, core_sma, freq, context_mol in __gen_replacements(mol1=mol1, mol2=mol2,
                                                                         db_name=db_name, radius=radius,
                                                                         dist=dist, min_size=0,
                                                                         max_size=0, min_rel_size=0,
                                                                         max_rel_size=1,
                                                                         min_inc=min_atoms,
                                                                         max_inc=max_atoms,
                                                                         replace_cycles=False,
                                                                         max_replacements=max_replacements,
                                                                         protected_ids_1=protected_ids_1,
                                                                         protected_ids_2=protected_ids_2,
                                                                         min_freq=min_freq,
                                                                         set_names=set_names,
                                                                         filter_func=filter_func,
                                                                         sample_func=sample_func,
                                                                         return_frag_smi_only=False,
                                                                         operation="link",
                                                                         seed=seed, **kwargs):
            for smi, m, rxn in __frag_replace(mol1, mol2, frag_sma, core_sma, radius, context_mol):
                if max_replacements is None or (max_replacements is not None and len(products) < max_replacements):
                    if smi not in products:
                        products.add(smi)
                        res = [smi]
                        if return_rxn:
                            res.append(rxn)
                            if return_rxn_freq:
                                res.append(freq)
                        if return_mol:
                            res.append(m)
                        if len(res) == 1:
                            yield res[0]
                        else:
                            yield res

    else:

        p = Pool(min(ncores, cpu_count()))
        try:
            for items in p.imap(__frag_replace_mp, __get_data_link(mol1, mol2, db_name, radius, dist, min_atoms, max_atoms,
                                                                   protected_ids_1, protected_ids_2, min_freq,
                                                                   set_names, max_replacements, filter_func=filter_func,
                                                                   sample_func=sample_func, seed=seed, **kwargs),
                                chunksize=100):
                for smi, m, rxn, freq in items:
                    if max_replacements is None or (max_replacements is not None and len(products) < max_replacements):
                        if smi not in products:
                            products.add(smi)
                            res = [smi]
                            if return_rxn:
                                res.append(rxn)
                                if return_rxn_freq:
                                    res.append(freq)
                            if return_mol:
                                res.append(m)
                            if len(res) == 1:
                                yield res[0]
                            else:
                                yield res
        finally:
            p.close()
            p.join()
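
The two result loops in the listing above share the same de-duplication, capping and result-shaping logic. The sketch below restates that logic on toy data under stated assumptions: `emit_unique` and the candidate tuples are hypothetical, and the loop stops early with `break` once the cap is reached, whereas the listing keeps iterating without yielding. The yielded items are the same either way.

```python
def emit_unique(candidates, max_replacements=None, return_rxn=False,
                return_rxn_freq=False, return_mol=False):
    """Yield de-duplicated results shaped like the loops in the listing.

    candidates: iterable of (smiles, mol, rxn, freq) tuples (toy data here).
    """
    seen = set()
    for smi, mol, rxn, freq in candidates:
        if max_replacements is not None and len(seen) >= max_replacements:
            break
        if smi in seen:
            continue
        seen.add(smi)
        res = [smi]
        if return_rxn:
            res.append(rxn)
            # as the listing is written, freq is appended only together with rxn
            if return_rxn_freq:
                res.append(freq)
        if return_mol:
            res.append(mol)
        # a bare SMILES string when nothing extra was requested, else a list
        yield res[0] if len(res) == 1 else res


cands = [
    ("CCO", "mol1", "rxn1", 5),
    ("CCO", "mol1b", "rxn1b", 2),  # duplicate SMILES, dropped
    ("CCN", "mol2", "rxn2", 7),
    ("CCC", "mol3", "rxn3", 1),
]

plain = list(emit_unique(cands))
capped = list(emit_unique(cands, max_replacements=2, return_rxn=True, return_rxn_freq=True))
freq_only = list(emit_unique(cands, return_rxn_freq=True))
print(plain)      # ['CCO', 'CCN', 'CCC']
print(capped)     # [['CCO', 'rxn1', 5], ['CCN', 'rxn2', 7]]
print(freq_only)  # ['CCO', 'CCN', 'CCC']
```

The last call shows a consequence of the nesting in the listing: requesting return_rxn_freq without return_rxn leaves the output as plain SMILES strings.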

make_cycle

make_cycle(mol, db_name, radius=3, ring_size=None, ring_closures=True, min_atoms=1, max_atoms=10, max_replacements=None, replace_ids=None, protected_ids=None, symmetry_fixes=False, min_freq=0, return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None, sample_func=None, set_names=None, seed=None, **kwargs)

Generate new rings (macrocycles or smaller native cycles) by linking two atoms in the same molecule with a 2-attachment-point fragment from the DB.

Two complementary modes:

  • ring_closures=False (broad): query any linker fragment. Internally both fragmenters are run on the input molecule (the connected-env arc-cut fragmenter and the disconnected-env macrocycle fragmenter) and the is_ring_closure provenance column is not filtered, so DB rows of either provenance can match.
  • ring_closures=True (strict): only the connected-env arc-cut fragmenter runs and the query is restricted to is_ring_closure=1 rows (populated by --frag-mode ring / both or the corresponding *_optimal modes at DB build time). Useful for closing native (typically aliphatic) rings.
Parameters:
  • mol

    RDKit Mol object.

  • db_name

    path to DB file with fragment replacements.

  • radius

    radius of context which will be considered for replacement. Default: 3.

  • ring_size

    size of the new ring being formed (in atoms = bonds). int for a single size, (min, max) tuple for a window. None imposes no ring-size constraint. The per-anchor-pair dist2 filter is derived as ring_size − d_in where d_in is the topological distance between the two anchor heavy atoms in the input molecule.

  • ring_closures

    if True (default), query only ring-closure (arc) fragments in DB (rows with is_ring_closure = 1). If False, query any linker fragment without filtering on is_ring_closure.

  • min_atoms

    minimum number of heavy atoms in the linker fragment. Default: 1.

  • max_atoms

    maximum number of heavy atoms in the linker fragment. Default: 10.

  • max_replacements

    maximum number of replacements to make. If the number of replacements available in DB is greater than the specified value the specified number of randomly chosen replacements will be applied. Default: None.

  • replace_ids

    iterable with ids of heavy atoms with replaceable Hs and/or ids of H atoms to replace. It has lower priority than protected_ids (ids listed in both replace_ids and protected_ids are protected). Default: None.

  • protected_ids

    iterable with ids of heavy atoms at which no H replacement should be made and/or ids of protected hydrogens. This argument has a higher priority over replace_ids. Default: None.

  • symmetry_fixes

    accepted for API compatibility with mutate/grow functions but not used here.

  • min_freq

    minimum occurrence of fragments in DB for replacement. Default: 0.

  • return_rxn

    whether to additionally return rxn of a transformation. Default: False.

  • return_rxn_freq

    whether to additionally return the frequency of a transformation in the DB. Default: False.

  • return_mol

    whether to additionally return RDKit Mol object of a generated molecule. Default: False.

  • ncores

    number of cores. Default: 1.

  • filter_func

    a function which will filter selected fragments by additional rules (in this way one may add arbitrary selection constraints). The function takes necessary first three arguments: row_ids (list or set of row_ids from the fragment database supplied to make_cycle), cursor of that fragment database and radius (int). This is required to access the selected fragments. Other arguments are custom and user-defined. It is most convenient to define a filtering function, implement specific logic inside and pass it using functools.partial. The filtering function should return a list/set of row ids which are a subset of the input row ids.

  • sample_func

    a function which will sample selected fragments if max_replacements is supplied. If omitted uniform sampling will be used. The function takes necessary first four arguments: row_ids (list or set of row_ids from the fragment database), cursor of that fragment database, radius (int) and the number of returned items (int). This is required to access the selected fragments. Other arguments can be custom and user-defined. The function should return a list/set of selected row ids.

  • set_names

    column name or list of column names in radius tables defining the set(s) of fragments to use with min_freq in v1 databases. A fragment is included if at least one of the named sets satisfies the min_freq threshold (OR logic). If None (default), all available set columns are used. If a column name is not found, a ValueError is raised listing available set names. Ignored for v0 databases. Default: None.

  • seed

    random seed for reproducible fragment selection when max_replacements is set. Default: None.

  • **kwargs

    named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX, for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower and upper bound of the corresponding parameter of a fragment.

Returns:
  • generator over new molecules. If no additional return arguments were requested this is a generator over SMILES of new molecules. If additional return values were requested, the function yields a list where the first item is SMILES, then rxn string (optional), frequency (optional), RDKit Mol object (optional). Only entries with distinct SMILES will be returned.
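
The ring_size arithmetic described above (dist2 = ring_size − d_in) can be worked through on a toy graph. This is a minimal sketch of the stated formula only, not the library's implementation: the helper names `topological_distance` and `dist2_for_ring` and the adjacency-dict molecule are hypothetical, and how the library treats out-of-range values is not covered.

```python
from collections import deque


def topological_distance(neighbors, start, end):
    """Number of bonds on the shortest path between two atoms (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if atom == end:
            return dist[atom]
        for nxt in neighbors[atom]:
            if nxt not in dist:
                dist[nxt] = dist[atom] + 1
                queue.append(nxt)
    raise ValueError("atoms are not connected")


def dist2_for_ring(ring_size, d_in):
    """Apply dist2 = ring_size - d_in to an int or a (min, max) window."""
    if ring_size is None:
        return None  # no ring-size constraint
    if isinstance(ring_size, int):
        return ring_size - d_in
    lo, hi = ring_size
    return (lo - d_in, hi - d_in)


# Heavy-atom skeleton of n-hexane: C0-C1-C2-C3-C4-C5
hexane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
d_in = topological_distance(hexane, 0, 5)
print(d_in)                          # 5
print(dist2_for_ring(8, d_in))       # 3
print(dist2_for_ring((7, 9), d_in))  # (2, 4)
```

For the two terminal carbons, five bonds already lie between the anchors, so an 8-membered ring needs a linker whose attachment points are three bonds apart, and the ring then contains 5 + 3 = 8 bonds.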

Source code in crem/crem.py
def make_cycle(mol, db_name, radius=3, ring_size=None, ring_closures=True,
               min_atoms=1, max_atoms=10, max_replacements=None,
               replace_ids=None, protected_ids=None, symmetry_fixes=False, min_freq=0,
               return_rxn=False, return_rxn_freq=False, return_mol=False, ncores=1, filter_func=None,
               sample_func=None, set_names=None, seed=None, **kwargs):
    """
    Generate new rings (macrocycles or smaller native cycles) by linking two
    atoms in the same molecule with a 2-attachment-point fragment from the DB.

    Two complementary modes:

    * ``ring_closures=False`` (broad): query **any** linker fragment.
      Internally both fragmenters are run on the input molecule (the
      connected-env arc-cut fragmenter and the disconnected-env macrocycle
      fragmenter) and the ``is_ring_closure`` provenance column is **not**
      filtered, so DB rows of either provenance can match.
    * ``ring_closures=True`` (strict): only the connected-env arc-cut
      fragmenter runs and the query is restricted to ``is_ring_closure=1``
      rows (populated by ``--frag-mode ring`` / ``both`` or the corresponding
      ``*_optimal`` modes at DB build time).
      Useful for closing native (typically aliphatic) rings.

    :param mol: RDKit Mol object.
    :param db_name: path to DB file with fragment replacements.
    :param radius: radius of context which will be considered for replacement. Default: 3.
    :param ring_size: size of the *new* ring being formed (in atoms = bonds).
                      ``int`` for a single size, ``(min, max)`` tuple for a
                      window. ``None`` imposes no ring-size constraint. The
                      per-anchor-pair ``dist2`` filter is derived as
                      ``ring_size − d_in`` where ``d_in`` is the topological
                      distance between the two anchor heavy atoms in the
                      input molecule.
    :param ring_closures: if True (default), query only ring-closure (arc)
                          fragments in DB (rows with ``is_ring_closure = 1``).
                          If False, query any linker fragment without
                          filtering on ``is_ring_closure``.
    :param min_atoms: minimum number of heavy atoms in the linker fragment. Default: 1.
    :param max_atoms: maximum number of heavy atoms in the linker fragment. Default: 10.
    :param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
                             greater than the specified value the specified number of randomly chosen replacements
                             will be applied. Default: None.
    :param replace_ids: iterable with ids of heavy atoms with replaceable Hs and/or ids of H atoms to replace.
                        It has lower priority than `protected_ids` (ids listed in both
                        `replace_ids` and `protected_ids` are protected). Default: None.
    :param protected_ids: iterable with ids of heavy atoms at which no H replacement should be made and/or ids of
                          protected hydrogens. This argument has a higher priority over `replace_ids`. Default: None.
    :param symmetry_fixes: accepted for API compatibility with mutate/grow functions but not used here.
    :param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
    :param return_rxn: whether to additionally return rxn of a transformation. Default: False.
    :param return_rxn_freq: whether to additionally return the frequency of a transformation in the DB. Default: False.
    :param return_mol: whether to additionally return RDKit Mol object of a generated molecule. Default: False.
    :param ncores: number of cores. Default: 1.
    :param filter_func: a function which will filter selected fragments by additional rules
                        (in this way one may add arbitrary selection constraints). The function takes necessary first
                        three arguments: row_ids (list or set of row_ids from the fragment database supplied to
                        make_cycle), cursor of that fragment database and radius (int). This is required to
                        access the selected fragments. Other arguments are custom and user-defined.
                        It is the most convenient to define a filtering function, implement specific logic inside and
                        pass it using functools.partial. The filtering function should return a list/set
                        of row ids which are a subset of the input row ids.
    :param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted
                        uniform sampling will be used. The function takes necessary first four arguments: row_ids
                        (list or set of row_ids from the fragment database), cursor of that fragment database,
                        radius (int) and the number of returned items (int). This is required to access the selected
                        fragments. Other arguments can be custom and user-defined. The function should return
                        a list/set of selected row ids.
    :param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
                      with min_freq in v1 databases. A fragment is included if at least one of the named sets
                      satisfies the min_freq threshold (OR logic). If None (default), all available set columns
                      are used. If a column name is not found, a ValueError is raised listing available set names.
                      Ignored for v0 databases. Default: None.
    :param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
    :param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
                     for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
                     and upper bound of the corresponding parameter of a fragment.
    :return: generator over new molecules. If no additional return arguments were requested this is a generator over
             SMILES of new molecules. If additional return values were requested, the function yields a list where
             the first item is SMILES, then rxn string (optional), frequency (optional), RDKit Mol object (optional).
             Only entries with distinct SMILES will be returned.
    """

    def __get_protected_ids(m, replace_ids, protected_ids):
        # the list of ids of heavy atoms with protected hydrogens should be returned

        if protected_ids:

            ids = set()
            for i in protected_ids:
                if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                    ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors())
                else:
                    ids.add(i)
            protected_ids = ids

        else:
            protected_ids = set()

        if replace_ids:

            ids = set()
            for i in replace_ids:
                if m.GetAtomWithIdx(i).GetAtomicNum() == 1:
                    ids.update(a.GetIdx() for a in m.GetAtomWithIdx(i).GetNeighbors())
                else:
                    ids.add(i)
            heavy_atom_ids = set(a.GetIdx() for a in m.GetAtoms() if a.GetAtomicNum() > 1)
            ids = heavy_atom_ids.difference(ids)  # ids of heavy atoms which should be protected
            protected_ids.update(ids)  # since protected_ids has a higher priority add them anyway

        return protected_ids

    __check_db_existence(db_name)
    products = set()

    mol = Chem.AddHs(mol)
    source_smi = Chem.MolToSmiles(Chem.RemoveHs(mol), isomericSmiles=True)
    protected_ids = __get_protected_ids(mol, replace_ids, protected_ids)
    mol = __backup_atom_properties(mol, __atom_properties_to_backup)

    if ncores == 1:

        for frag_sma, core_sma, freq, context_mol in __gen_replacements(mol1=mol, mol2=None, db_name=db_name,
                                                                        radius=radius,
                                                                        min_size=0, max_size=0,
                                                                        min_rel_size=0, max_rel_size=1,
                                                                        min_inc=min_atoms, max_inc=max_atoms,
                                                                        max_replacements=max_replacements,
                                                                        replace_cycles=False,
                                                                        protected_ids_1=protected_ids,
                                                                        protected_ids_2=None,
                                                                        min_freq=min_freq, set_names=set_names,
                                                                        filter_func=filter_func,
                                                                        sample_func=sample_func,
                                                                        return_frag_smi_only=False,
                                                                        operation="cycle",
                                                                        ring_closures=ring_closures,
                                                                        ring_size=ring_size,
                                                                        seed=seed, **kwargs):
            for smi, m, rxn in __frag_replace(mol, None, frag_sma, core_sma, radius, context_mol):
                if max_replacements is None or (max_replacements is not None and len(products) < max_replacements):
                    if smi != source_smi and smi not in products:
                        products.add(smi)
                        res = [smi]
                        if return_rxn:
                            res.append(rxn)
                            if return_rxn_freq:
                                res.append(freq)
                        if return_mol:
                            res.append(m)
                        if len(res) == 1:
                            yield res[0]
                        else:
                            yield res

    else:

        p = Pool(min(ncores, cpu_count()))
        try:
            for items in p.imap(__frag_replace_mp, __get_data_cycle(mol, db_name, radius, ring_size,
                                                                    ring_closures, min_atoms, max_atoms,
                                                                    protected_ids, min_freq, set_names,
                                                                    max_replacements,
                                                                    filter_func=filter_func,
                                                                    sample_func=sample_func, seed=seed,
                                                                    **kwargs),
                                chunksize=100):
                for smi, m, rxn, freq in items:
                    if max_replacements is None or (max_replacements is not None and len(products) < max_replacements):
                        if smi != source_smi and smi not in products:
                            products.add(smi)
                            res = [smi]
                            if return_rxn:
                                res.append(rxn)
                                if return_rxn_freq:
                                    res.append(freq)
                            if return_mol:
                                res.append(m)
                            if len(res) == 1:
                                yield res[0]
                            else:
                                yield res
        finally:
            p.close()
            p.join()
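
The loops above shape every yielded item from the return_* flags. The sketch below isolates that logic in a hypothetical helper, shape_result, which is not part of crem. It makes one consequence of the nesting visible: the frequency is appended only when return_rxn is also set, so return_rxn_freq=True on its own appears to have no effect.

```python
def shape_result(smi, rxn, freq, mol,
                 return_rxn=False, return_rxn_freq=False, return_mol=False):
    """Hypothetical helper mirroring how the listing assembles one output item."""
    res = [smi]
    if return_rxn:
        res.append(rxn)
        # freq is nested under return_rxn, exactly as in the listing above
        if return_rxn_freq:
            res.append(freq)
    if return_mol:
        res.append(mol)
    # a bare SMILES string when nothing extra was requested, a list otherwise
    return res[0] if len(res) == 1 else res


# Placeholder values stand in for a real reaction string and RDKit Mol.
plain = shape_result("CCO", "rxn", 5, None)
freq_only = shape_result("CCO", "rxn", 5, None, return_rxn_freq=True)
with_rxn = shape_result("CCO", "rxn", 5, None, return_rxn=True, return_rxn_freq=True)
```

Callers that toggle these flags therefore need to handle both a plain string and a list.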

mutate_mol2

mutate_mol2(*args, **kwargs)

Convenience function for processing molecules in parallel with the multiprocessing module. It calls mutate_mol and returns the results as a list, because mutate_mol is a generator and cannot be used directly in multiprocessing.

Parameters:
  • args

    positional arguments, the same as in mutate_mol function

  • kwargs

    keyword arguments, the same as in mutate_mol function

Returns:
  • list with output molecules

Source code in crem/crem.py
def mutate_mol2(*args, **kwargs):
    """
    Convenience function which can be used to process molecules in parallel using multiprocessing module.
    It calls mutate_mol which cannot be used directly in multiprocessing because it is a generator

    :param args: positional arguments, the same as in mutate_mol function
    :param kwargs: keyword arguments, the same as in mutate_mol function
    :return: list with output molecules

    """
    return list(mutate_mol(*args, **kwargs))
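
Why a list-returning wrapper is needed can be shown without crem at all. In the sketch below, fake_mutate and fake_mutate2 are hypothetical stand-ins for mutate_mol and mutate_mol2. A generator object cannot be pickled, and pickling is how multiprocessing moves results between processes, whereas a list of strings can.

```python
import pickle


def fake_mutate(smiles):
    """Hypothetical stand-in for mutate_mol: a generator yielding strings."""
    for suffix in ("C", "N", "O"):
        yield smiles + suffix


def fake_mutate2(*args, **kwargs):
    """Same wrapping pattern as mutate_mol2: exhaust the generator into a list."""
    return list(fake_mutate(*args, **kwargs))


# A generator object cannot be serialised, so a worker could not return it.
try:
    pickle.dumps(fake_mutate("CC"))
    generator_picklable = True
except TypeError:
    generator_picklable = False

# The list produced by the wrapper survives a pickle round trip.
restored = pickle.loads(pickle.dumps(fake_mutate2("CC")))
```

With crem itself the usual pattern would be along the lines of pool.imap(functools.partial(mutate_mol2, db_name=path_to_db), mols), where path_to_db and mols are placeholders for your own database file and molecules.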

grow_mol2

grow_mol2(*args, **kwargs)

Convenience function for processing molecules in parallel with the multiprocessing module. It calls grow_mol and returns the results as a list, because grow_mol is a generator and cannot be used directly in multiprocessing.

Parameters:
  • args

    positional arguments, the same as in grow_mol function

  • kwargs

    keyword arguments, the same as in grow_mol function

Returns:
  • list with output molecules

Source code in crem/crem.py
def grow_mol2(*args, **kwargs):
    """
    Convenience function which can be used to process molecules in parallel using multiprocessing module.
    It calls grow_mol which cannot be used directly in multiprocessing because it is a generator

    :param args: positional arguments, the same as in grow_mol function
    :param kwargs: keyword arguments, the same as in grow_mol function
    :return: list with output molecules

    """
    return list(grow_mol(*args, **kwargs))

link_mols2

link_mols2(*args, **kwargs)

Convenience function for processing molecules in parallel with the multiprocessing module. It calls link_mols and returns the results as a list, because link_mols is a generator and cannot be used directly in multiprocessing.

Parameters:
  • args

    positional arguments, the same as in link_mols function

  • kwargs

    keyword arguments, the same as in link_mols function

Returns:
  • list with output molecules

Source code in crem/crem.py
def link_mols2(*args, **kwargs):
    """
    Convenience function which can be used to process molecules in parallel using multiprocessing module.
    It calls link_mols which cannot be used directly in multiprocessing because it is a generator

    :param args: positional arguments, the same as in link_mols function
    :param kwargs: keyword arguments, the same as in link_mols function
    :return: list with output molecules

    """
    return list(link_mols(*args, **kwargs))

make_cycle2

make_cycle2(*args, **kwargs)

Convenience function for processing molecules in parallel with the multiprocessing module. It calls make_cycle and returns the results as a list, because make_cycle is a generator and cannot be used directly in multiprocessing.

Parameters:
  • args

    positional arguments, the same as in make_cycle function

  • kwargs

    keyword arguments, the same as in make_cycle function

Returns:
  • list with output molecules

Source code in crem/crem.py
def make_cycle2(*args, **kwargs):
    """
    Convenience function which can be used to process molecules in parallel using multiprocessing module.
    It calls make_cycle which cannot be used directly in multiprocessing because it is a generator

    :param args: positional arguments, the same as in make_cycle function
    :param kwargs: keyword arguments, the same as in make_cycle function
    :return: list with output molecules

    """
    return list(make_cycle(*args, **kwargs))

get_replacements

get_replacements(mol1, db_name, radius, mol2=None, dist=None, min_size=0, max_size=8, min_rel_size=0, max_rel_size=1, min_inc=-2, max_inc=2, max_replacements=None, replace_cycles='no', protected_ids_1=None, protected_ids_2=None, replace_ids_1=None, replace_ids_2=None, min_freq=0, symmetry_fixes=False, filter_func=None, sample_func=None, return_frag_smi_only=True, set_names=None, seed=None, **kwargs)

An auxiliary function that returns the SMILES of fragments in a DB which satisfy the given criteria.

Parameters:
  • mol1

    RDKit Mol object

  • db_name

    path to DB file with fragment replacements.

  • radius

    radius of context which will be considered for replacement. Required: the signature of this function defines no default.

  • mol2

    a second RDKit Mol object if searching for linking fragments

  • dist

    topological distance between two attachment points in the fragment which will link molecules. Can be a single integer or a tuple of lower and upper bound values.

  • min_size

    minimum number of heavy atoms in a fragment to replace. If 0 - hydrogens will be replaced (if they are explicit).

  • max_size

    maximum number of heavy atoms in a fragment to replace.

  • min_rel_size

    minimum relative size of a replaced fragment to the whole molecule (in terms of a number of heavy atoms)

  • max_rel_size

    maximum relative size of a replaced fragment to the whole molecule (in terms of a number of heavy atoms)

  • min_inc

    minimum change in the number of heavy atoms of a replacing fragment relative to the replaced one. A negative value means that replacing fragments may be smaller than the replaced one by up to the specified number of heavy atoms. Default: -2.

  • max_inc

    maximum change in the number of heavy atoms of a replacing fragment relative to the replaced one. Default: 2.

  • max_replacements

    maximum number of replacements to make. If the number of replacements available in DB is greater than the specified value the specified number of randomly chosen replacements will be applied.

  • replace_cycles

    controls replacement of cyclic source fragments for single-molecule searches. "no"/False uses ordinary acyclic-cut mutation only. "forced"/True allows cyclic cores from ordinary fragmentation to be replaced ignoring the size filters. "partial_all" additionally searches partial ring arcs with exhaustive side cuts. "partial_exo" additionally searches partial ring arcs with only exo side cuts. Ignored for link searches. Default: "no".

  • protected_ids_1

    iterable with atom ids which will not be mutated in mol1. If the molecule was supplied with explicit hydrogens, the ids of protected hydrogens should be supplied as well, otherwise they will be replaced. Ids of all equivalent atoms should be supplied (e.g. to protect the meta-position in toluene, the ids of both meta carbons should be supplied). This argument has a higher priority than replace_ids_1.

  • protected_ids_2

    iterable with atom ids which will not be mutated in mol2. If the molecule was supplied with explicit hydrogens, the ids of protected hydrogens should be supplied as well, otherwise they will be replaced. Ids of all equivalent atoms should be supplied (e.g. to protect the meta-position in toluene, the ids of both meta carbons should be supplied). This argument has a higher priority than replace_ids_2.

  • replace_ids_1

    iterable with atom ids to replace in mol1. It has a lower priority than protected_ids_1: ids present in both are protected.

  • replace_ids_2

    iterable with atom ids to replace in mol2. It has a lower priority than protected_ids_2: ids present in both are protected.

  • min_freq

    minimum occurrence of fragments in DB for replacement. Default: 0.

  • set_names

    column name or list of column names in radius tables defining the set(s) of fragments to use with min_freq in v1 databases. A fragment is included if at least one of the named sets satisfies the min_freq threshold (OR logic). If None (default), all available set columns are used. If a column name is not found, a ValueError is raised listing available set names. Ignored for v0 databases. Default: None.

  • symmetry_fixes

    if True, duplicate fragments whose equivalent atoms have different ids are enumerated as well. This is useful when a particular atom to replace has symmetry-equivalent atoms, because by default only the equivalent atoms with the lowest ids are replaced. If several equivalent atoms are selected, enabling this option generates duplicate molecules, which are filtered out later anyway, so combining the two is of little use. The option works around the RDKit MMPA fragmenter, which removes duplicates internally. Default: False.

  • filter_func

    a function which filters the selected fragments by additional rules (in this way one may add arbitrary selection constraints). Its first three arguments are required: row_ids (list or set of row ids from the fragment database), the cursor of that fragment database and radius (int). These are needed to access the selected fragments. Any further arguments are custom and user-defined. It is most convenient to define a filtering function, implement the specific logic inside it and pass it in using functools.partial. The filtering function should return a list/set of row ids which is a subset of the input row ids.

  • sample_func

    a function which samples the selected fragments if max_replacements is supplied. If omitted, uniform sampling is used. Its first four arguments are required: row_ids (list or set of row ids from the fragment database), the cursor of that fragment database, radius (int) and the number of items to return (int). These are needed to access the selected fragments. Any further arguments are custom and user-defined. The function should return a list/set of selected row ids.

  • return_frag_smi_only

    controls whether to return only the SMILES of fragments selected from a database (True) or a tuple (source_core_smi, replacement_core_smi, freq, context_mol) which can be passed on to get_mols_from_replacements (False). Default: True.

  • seed

    random seed for reproducible fragment selection when max_replacements is set. Default: None.

  • **kwargs

    named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX, for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower and upper bound of the corresponding parameter of a fragment. This can be useful to annotate fragments with additional custom properties (e.g. number of particular pharmacophore features, lipophilicity, etc) and use these parameters to additionally restrict selected fragments.

Returns:
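  • generator over SMILES of fragments in a DB which satisfy the given criteria or, if return_frag_smi_only is False, over (source_core_smi, replacement_core_smi, freq, context_mol) tuples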
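
The filter_func and sample_func contracts described above can be tried out against a toy SQLite table. The sketch below is an illustration under assumptions, not crem code. The table radius3 with columns core_smi, core_sma and freq imitates a v0 radius table, and the functions drop_rare and most_frequent are hypothetical. The freq column is assumed because the v0 query in this module selects it. A v1 database would need different columns.

```python
import sqlite3
from functools import partial


def drop_rare(row_ids, cur, radius, min_count=2):
    """filter_func sketch: keep only rows whose freq is at least min_count."""
    ids = sorted(row_ids)
    marks = ",".join("?" * len(ids))
    cur.execute(f"SELECT rowid FROM radius{radius} "
                f"WHERE rowid IN ({marks}) AND freq >= ?", ids + [min_count])
    return {row[0] for row in cur.fetchall()}  # subset of the input row ids


def most_frequent(row_ids, cur, radius, n):
    """sample_func sketch: take the n most frequent rows, not a uniform sample."""
    ids = sorted(row_ids)
    marks = ",".join("?" * len(ids))
    cur.execute(f"SELECT rowid FROM radius{radius} WHERE rowid IN ({marks}) "
                f"ORDER BY freq DESC, rowid LIMIT ?", ids + [n])
    return [row[0] for row in cur.fetchall()]


# Toy stand-in for a v0 fragment database (assumed schema, see above).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE radius3 (core_smi TEXT, core_sma TEXT, freq INTEGER)")
cur.executemany("INSERT INTO radius3 VALUES (?, ?, ?)",
                [("[*:1]C", "[*:1]C", 10), ("[*:1]N", "[*:1]N", 1), ("[*:1]O", "[*:1]O", 5)])

kept = drop_rare({1, 2, 3}, cur, 3)       # row 2 has freq 1 and is dropped
top = most_frequent(kept, cur, 3, 1)      # row 1 has the highest freq
strict = partial(drop_rare, min_count=6)  # custom argument bound via partial
```

A callable such as strict could then be passed as filter_func, and most_frequent as sample_func together with max_replacements.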

Source code in crem/crem.py
def get_replacements(mol1, db_name, radius, mol2=None, dist=None, min_size=0, max_size=8, min_rel_size=0,
                     max_rel_size=1, min_inc=-2, max_inc=2, max_replacements=None, replace_cycles="no",
                     protected_ids_1=None, protected_ids_2=None, replace_ids_1=None,
                     replace_ids_2=None, min_freq=0, symmetry_fixes=False, filter_func=None, sample_func=None,
                     return_frag_smi_only=True,
                     set_names=None, seed=None, **kwargs):
    """
    An auxiliary function, which returns smiles of fragments in a DB which satisfy given criteria
    :param mol1: RDKit Mol object
    :param db_name: path to DB file with fragment replacements.
    :param radius: radius of context which will be considered for replacement (required, no default).
    :param mol2: a second RDKit Mol object if searching for linking fragments
    :param dist: topological distance between two attachment points in the fragment which will link molecules.
                 Can be a single integer or a tuple of lower and upper bound values.
    :param min_size: minimum number of heavy atoms in a fragment to replace. If 0 - hydrogens will be replaced
                     (if they are explicit).
    :param max_size: maximum number of heavy atoms in a fragment to replace.
    :param min_rel_size: minimum relative size of a replaced fragment to the whole molecule
                         (in terms of a number of heavy atoms)
    :param max_rel_size: maximum relative size of a replaced fragment to the whole molecule
                         (in terms of a number of heavy atoms)
    :param min_inc: minimum change of a number of heavy atoms in replacing fragments to a number of heavy atoms in
                    replaced one. Negative value means that the replacing fragments would be smaller than the replaced
                    one on a specified number of heavy atoms.
    :param max_inc: maximum change of a number of heavy atoms in replacing fragments to a number of heavy atoms in
                    replaced one.
    :param max_replacements: maximum number of replacements to make. If the number of replacements available in DB is
                             greater than the specified value the specified number of randomly chosen replacements
                             will be applied.
    :param replace_cycles: controls replacement of cyclic source fragments for
                           single-molecule searches. ``"no"``/False uses
                           ordinary acyclic-cut mutation only.
                           ``"forced"``/True allows cyclic cores from
                           ordinary fragmentation to be replaced ignoring the size filters.
                           ``"partial_all"`` additionally searches partial
                           ring arcs with exhaustive side cuts.
                           ``"partial_exo"`` additionally searches partial
                           ring arcs with only exo side cuts. Ignored for
                           link searches. Default: ``"no"``.

    :param protected_ids_1: iterable with atom ids which will not be mutated in mol1. If the molecule was supplied with
                            explicit hydrogen the ids of protected hydrogens should be supplied as well, otherwise they
                            will be replaced.
                            Ids of all equivalent atoms should be supplied (e.g. to protect meta-position in toluene
                            ids of both carbons in meta-positions should be supplied)
                            This argument has a higher priority over `replace_ids_1`.
    :param protected_ids_2: iterable with atom ids which will not be mutated in mol2. If the molecule was supplied with
                            explicit hydrogen the ids of protected hydrogens should be supplied as well, otherwise they
                            will be replaced.
                            Ids of all equivalent atoms should be supplied (e.g. to protect meta-position in toluene
                            ids of both carbons in meta-positions should be supplied)
                            This argument has a higher priority over `replace_ids_2`.
    :param replace_ids_1: iterable with atom ids to replace in mol1, it has lower priority over `protected_ids`
                          (replace_ids which are present in protected_ids would be protected).
    :param replace_ids_2: iterable with atom ids to replace in mol2, it has lower priority over `protected_ids`
                          (replace_ids which are present in protected_ids would be protected).
    :param min_freq: minimum occurrence of fragments in DB for replacement. Default: 0.
    :param set_names: column name or list of column names in radius tables defining the set(s) of fragments to use
                      with min_freq in v1 databases. A fragment is included if at least one of the named sets
                      satisfies the min_freq threshold (OR logic). If None (default), all available set columns
                      are used. If a column name is not found, a ValueError is raised listing available set names.
                      Ignored for v0 databases. Default: None.
    :param symmetry_fixes: if set True duplicated fragments with equivalent atoms having different ids will be
                           enumerated. This makes sense if one wants to replace particular atom(s) which have
                           equivalent ones. By default, among equivalent atoms only those with the lowest ids
                           are replaced. This will result in generation of duplicated molecules if several equivalent
                           atoms are selected which will be filtered later nevertheless. So, it is not very reasonable
                           to use this argument and select several equivalent atoms to replace.
                           This solves the issue of rdkit MMPA fragmenter which removes duplicates internally.
    :param filter_func: a function which will filter selected fragments by additional rules
                        (in this way one may add arbitrary selection constraints). The function takes necessary first
                        three arguments: row_ids (list or set of row_ids from the fragment database supplied to
                        the mutate_mol function), cursor of that fragment database and radius (int). This is required
                        to access the selected fragments. Other arguments are custom and user-defined.
                        It is most convenient to define a filtering function, implement specific logic inside and
                        pass it to mutate_mol using functools.partial. The filtering function should return a list/set
                        of row ids which are a subset of the input row ids.
    :param sample_func: a function which will sample selected fragments if max_replacements is supplied. If omitted
                        uniform sampling will be used. The function takes necessary first four arguments: row_ids
                        (list or set of row_ids from the fragment database), cursor of that fragment database,
                        radius (int) and the number of returned items (int). This is required to access the selected
                        fragments. Other arguments can be custom and user-defined. The function should return
                        a list/set of selected row ids.
    :param return_frag_smi_only: control whether to return only SMILES of fragments selected from a database or return
                                 a tuple `(source_core_smi, replacement_core_smi, freq, context_mol)` which can be
                                 further passed to `get_mols_from_replacements`.
    :param seed: random seed for reproducible fragment selection when max_replacements is set. Default: None.
    :param **kwargs: named arguments to additionally filter replacing fragments. For v0 DB use columns from radiusX,
                     for v1 DB use columns from frags or frags_h. Values are a single value or 2-item tuple with lower
                     and upper bound of the corresponding parameter of a fragment. This can be useful to annotate
                     fragments with additional custom properties (e.g. number of particular pharmacophore features,
                     lipophilicity, etc) and use these parameters to additionally restrict selected fragments.
    :return: generator over smiles of fragments in a DB which satisfy given criteria
    """

    replace_cycles = _normalize_replace_cycles(replace_cycles)

    protected_ids_1 = set(protected_ids_1) if protected_ids_1 else set()
    if replace_ids_1:
        replace_ids_1 = set(replace_ids_1) if replace_ids_1 else set()
        protected_ids_1 = set(protected_ids_1) | set(range(mol1.GetNumAtoms())).difference(replace_ids_1)
    if isinstance(mol2, Chem.Mol):
        protected_ids_2 = set(protected_ids_2) if protected_ids_2 else set()
        if replace_ids_2:
            replace_ids_2 = set(replace_ids_2) if replace_ids_2 else set()
            protected_ids_2 = set(protected_ids_2) | set(range(mol2.GetNumAtoms())).difference(replace_ids_2)
    else:
        protected_ids_2 = None

    mol1 = __backup_atom_properties(mol1, __atom_properties_to_backup)
    if isinstance(mol2, Chem.Mol):
        mol2 = __backup_atom_properties(mol2, __atom_properties_to_backup)

    for res in __gen_replacements(mol1=mol1, mol2=mol2, db_name=db_name, radius=radius, dist=dist,
                                  min_size=min_size, max_size=max_size, min_rel_size=min_rel_size,
                                  max_rel_size=max_rel_size, min_inc=min_inc, max_inc=max_inc,
                                  max_replacements=max_replacements, replace_cycles=replace_cycles,
                                  protected_ids_1=protected_ids_1, protected_ids_2=protected_ids_2,
                                  min_freq=min_freq, set_names=set_names, symmetry_fixes=symmetry_fixes,
                                  filter_func=filter_func, sample_func=sample_func,
                                  return_frag_smi_only=return_frag_smi_only,
                                  operation=("link" if isinstance(mol2, Chem.Mol) else "mutate"),
                                  seed=seed, **kwargs):
        if return_frag_smi_only:
            yield res
        else:
            src_core, repl_core, freq, context_mol = res
            yield src_core, repl_core, freq, __prepare_context_mol_for_output(context_mol)
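
The interplay of protected_ids_1 and replace_ids_1 at the top of this listing is plain set arithmetic. The sketch below restates it in a hypothetical helper, effective_protected_ids, which is not part of crem. Every atom index outside replace_ids becomes protected, and explicitly protected ids stay protected even if they are also listed for replacement. Note that the complement is taken over all atom indices of the molecule, which includes explicit hydrogens if the molecule carries them.

```python
def effective_protected_ids(num_atoms, protected_ids=None, replace_ids=None):
    """Hypothetical helper mirroring the set logic in the listing above."""
    protected = set(protected_ids) if protected_ids else set()
    if replace_ids:
        # everything that was not selected for replacement becomes protected
        protected |= set(range(num_atoms)) - set(replace_ids)
    return protected


# Six atoms, atoms 1 and 2 selected for replacement, atom 1 also protected:
# only atom 2 remains free to change.
result = effective_protected_ids(6, protected_ids=[1], replace_ids=[1, 2])
free = set(range(6)) - result
```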

get_mols_from_replacements

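get_mols_from_replacements(mol1, radius, replacements, mol2=None, return_rxn=False, return_rxn_freq=False, return_mol=False)

Generator of new molecules obtained by applying the supplied replacements to mol1 (and to mol2 when linking). Each item of replacements must be a 4-item tuple (source_core_smi, replacement_core_smi, freq, context_mol), as yielded by get_replacements with return_frag_smi_only=False, otherwise a ValueError is raised. Only entries with distinct SMILES are yielded.

Source code in crem/crem.py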
def get_mols_from_replacements(mol1, radius, replacements, mol2=None, return_rxn=False, return_rxn_freq=False,
                               return_mol=False):

    if isinstance(mol2, Chem.Mol):
        products = set()
    else:
        products = {Chem.MolToSmiles(Chem.RemoveHs(mol1))}

    for items in replacements:

        if len(items) == 4:
            frag_sma, core_sma, freq, context_mol = items
        else:
            raise ValueError('Each replacement tuple should have 4 items: '
                             '(source_core_smi, replacement_core_smi, freq, context_mol)\n')

        for smi, m, rxn in __frag_replace(mol1, mol2, frag_sma, core_sma, radius, context_mol):
            if smi not in products:
                products.add(smi)
                res = [smi]
                if return_rxn:
                    res.append(rxn)
                    if return_rxn_freq:
                        res.append(freq)
                if return_mol:
                    res.append(m)
                if len(res) == 1:
                    yield res[0]
                else:
                    yield res
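
A two-step workflow that combines get_replacements with this function could look like the sketch below. It uses only the signatures shown on this page. The wrapper name enumerate_in_two_steps is hypothetical, and the database path and molecule are whatever the caller supplies. Whether the molecule should carry explicit hydrogens depends on the intended replacements and is left to the caller here.

```python
def enumerate_in_two_steps(mol, db_name, radius=3, max_replacements=None):
    """Hypothetical wrapper: select replacements first, then build molecules."""
    # Imported lazily so that this sketch can be defined without crem installed.
    from crem.crem import get_replacements, get_mols_from_replacements

    # Step 1: request full 4-item tuples instead of fragment SMILES only.
    replacements = get_replacements(mol, db_name, radius,
                                    max_replacements=max_replacements,
                                    return_frag_smi_only=False)
    # Step 2: apply them; with the default flags each item is a SMILES string.
    return list(get_mols_from_replacements(mol, radius, replacements))
```

Splitting the steps allows the replacement tuples to be inspected or filtered between the two calls.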

_get_replacements

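_get_replacements(db_cur, radius, row_ids, schema_meta=None)

Internal helper that returns (rowid, core_smi, core_sma, freq) tuples for the given row ids. For v1 databases core_smi is returned in place of core_sma and freq is always 0, so that the tuple shape matches v0 databases.

Source code in crem/crem.py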
def _get_replacements(db_cur, radius, row_ids, schema_meta=None):
    if schema_meta is None:
        schema_meta = _load_schema_meta(db_cur, radius)
    user_version = schema_meta['user_version']
    if user_version == 0:
        sql = f"""SELECT rowid, core_smi, core_sma, freq
                      FROM radius{radius}
                      WHERE rowid IN ({','.join(map(str, row_ids))})"""
    elif user_version == 1:
        # Note: freq was removed from DB, therefore 0 is returned (maybe None is better)
        sql = f"""SELECT r.rowid, f.core_smi
                  FROM radius{radius} r
                  JOIN frags f ON r.core_smi_id = f.core_smi_id
                  WHERE r.rowid IN ({','.join(map(str, row_ids))})"""
    else:
        raise NotImplementedError('Not implemented for database version other than 0 and 1')
    db_cur.execute(sql)
    if user_version == 1:
        # Keep tuple shape identical to user_version 0 for compatibility.
        return [(row_id, core_smi, core_smi, 0) for row_id, core_smi in db_cur.fetchall()]
    return db_cur.fetchall()
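
The v1 branch can be reproduced on an in-memory SQLite database to see the padded tuple shape. The sketch below is an illustration, not crem code. The function fetch_v1 is hypothetical, and the two tables contain only the columns that the query in the listing refers to. A real v1 database has more.

```python
import sqlite3

# Minimal stand-in for a v1 database: only the columns used by the query.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE frags (core_smi_id INTEGER PRIMARY KEY, core_smi TEXT)")
cur.execute("CREATE TABLE radius3 (core_smi_id INTEGER)")
cur.executemany("INSERT INTO frags VALUES (?, ?)", [(10, "[*:1]C"), (20, "[*:1]N")])
cur.executemany("INSERT INTO radius3 VALUES (?)", [(10,), (20,), (10,)])  # rowids 1, 2, 3


def fetch_v1(cur, radius, row_ids):
    """Hypothetical re-creation of the v1 branch, with a fixed row order."""
    ids = ",".join(str(int(i)) for i in row_ids)  # int() guards the inlined values
    cur.execute(f"""SELECT r.rowid, f.core_smi
                    FROM radius{radius} r
                    JOIN frags f ON r.core_smi_id = f.core_smi_id
                    WHERE r.rowid IN ({ids})
                    ORDER BY r.rowid""")
    # Pad to the v0 shape: core_smi doubles as core_sma, freq is reported as 0.
    return [(row_id, smi, smi, 0) for row_id, smi in cur.fetchall()]


rows = fetch_v1(cur, 3, [1, 3])
```

Code that consumes these tuples should not read the trailing 0 as a real frequency.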