Advanced fragment selection
Beyond radius, size windows, min_freq, and set_names, three mechanisms
give fine-grained control over which fragments are used:
- filter_func: arbitrary on-the-fly filtering of candidate fragments.
- sample_func: biased on-the-fly sampling when max_replacements is set.
- **kwargs: filtering on property columns in the database.
All three are accepted by mutate_mol, grow_mol, link_mols, and
make_cycle.
1. Custom filtering with filter_func
A filter_func receives the candidate database row ids and returns the subset
to keep. Its first three arguments are fixed — row_ids, the database cursor
cur, and the context radius — and any further arguments are your own (bind
them with functools.partial). Use the helper crem.crem._get_replacements to
turn row ids into fragment SMILES.
from collections import defaultdict
from functools import partial

from rdkit import Chem
from crem.crem import mutate_mol, _get_replacements


def filter_by_atom(row_ids, cur, radius, atom_number):
    """Keep only fragments containing an atom of the given atomic number."""
    if not row_ids:
        return []
    # group row ids by fragment SMILES so each distinct fragment is parsed once
    by_smi = defaultdict(list)
    for rowid, core_smi, _, _ in _get_replacements(cur, radius, row_ids):
        by_smi[core_smi].append(rowid)
    out = []
    for smi, ids in by_smi.items():
        mol = Chem.MolFromSmiles(smi)
        if mol and any(a.GetAtomicNum() == atom_number for a in mol.GetAtoms()):
            out.extend(ids)
    return out


# only fluorine-containing fragments will be used
mols = list(mutate_mol(
    Chem.MolFromSmiles("c1ccccc1C"),
    db_name="fragments.db",
    filter_func=partial(filter_by_atom, atom_number=9),
    max_size=1,
    max_inc=3,
))
Built-in filters
crem.utils ships ready-made filters:
- filter_max_ring_size(row_ids, cur, radius, max_size=6): drops fragments with a ring larger than max_size.
- filter_acyclic_attachment_points(row_ids, cur, radius): keeps only fragments whose attachment points sit on acyclic atoms (handy for make_cycle).
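from functools import partial

from rdkit import Chem
from crem.crem import mutate_mol
from crem.utils import filter_max_ring_size

# discard fragments that contain a ring with more than six atoms
mols = list(mutate_mol(
    Chem.MolFromSmiles("c1ccccc1C"),
    db_name="fragments.db",
    filter_func=partial(filter_max_ring_size, max_size=6),
))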
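Because only one filter_func can be passed, combining several filters needs a small wrapper. The sketch below is one possible way to do it and relies only on the calling convention described above (row_ids, cur, radius in, kept row ids out). The name chain_filters and the two toy filters are hypothetical and not part of CReM; the toy filters ignore cur so the example runs without a database.

```python
from functools import partial


def chain_filters(row_ids, cur, radius, filters=()):
    """Apply several filter_func-style callables one after another.

    Hypothetical helper, not part of CReM. Every callable in `filters`
    must accept (row_ids, cur, radius) and return the row ids to keep.
    """
    kept = list(row_ids)
    for func in filters:
        if not kept:
            break  # nothing left, skip the remaining (possibly costly) filters
        kept = list(func(kept, cur, radius))
    return kept


# Toy filters standing in for real ones; they never touch the cursor.
def keep_even(row_ids, cur, radius):
    return [i for i in row_ids if i % 2 == 0]


def keep_below(row_ids, cur, radius, limit):
    return [i for i in row_ids if i < limit]


combined = partial(chain_filters, filters=(keep_even, partial(keep_below, limit=7)))
result = combined([1, 2, 3, 4, 6, 8, 10], None, 3)
print(result)
```

With real filters, the same pattern would presumably read filter_func=partial(chain_filters, filters=(partial(filter_by_atom, atom_number=9), partial(filter_max_ring_size, max_size=6))). Putting the cheapest filter first keeps the more expensive ones from seeing rows that are discarded anyway.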
2. Biased sampling with sample_func
When max_replacements is set, CReM samples that many replacements uniformly at
random. A sample_func replaces the uniform draw with your own selection. Its
first four arguments are fixed — row_ids, cur, radius, and n (the number
to return) — and it returns the selected row ids.
The built-in crem.utils.sample_csp3 biases selection toward fragments with a
higher fraction of sp³ carbons:
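from rdkit import Chem
from crem.crem import mutate_mol
from crem.utils import sample_csp3

# pick 10 replacements per site, favouring sp3-rich fragments
mols = list(mutate_mol(
    Chem.MolFromSmiles("c1ccccc1F"),
    db_name="fragments.db",
    max_replacements=10,
    sample_func=sample_csp3,
))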
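A custom sample_func follows the same pattern. The sketch below favours frequently observed fragments by using weighted sampling without replacement (the Efraimidis-Spirakis key method). It rests on several assumptions: that the cursor points at a table named radius<radius> with a numeric freq column, that CReM calls the function positionally as (row_ids, cur, radius, n), and that an in-memory SQLite table is an acceptable stand-in for a real fragment database. The name sample_by_freq and the extra rng argument are hypothetical.

```python
import math
import random
import sqlite3


def sample_by_freq(row_ids, cur, radius, n, rng=random):
    """Pick n row ids, favouring fragments with a higher `freq` value.

    Hypothetical sampler, not part of CReM. Assumes a table named
    radius<radius> with a numeric `freq` column; adjust to your schema.
    """
    row_ids = list(row_ids)
    if n >= len(row_ids):
        return row_ids
    freqs = {}
    # query in batches to stay below SQLite's limit on bound parameters
    for start in range(0, len(row_ids), 500):
        batch = row_ids[start:start + 500]
        marks = ",".join("?" * len(batch))
        sql = f"SELECT rowid, freq FROM radius{radius} WHERE rowid IN ({marks})"
        freqs.update(cur.execute(sql, batch).fetchall())
    # Efraimidis-Spirakis: key = log(u) / weight, keep the n largest keys
    keyed = []
    for rowid in row_ids:
        weight = max(freqs.get(rowid, 1), 1)
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        keyed.append((math.log(u) / weight, rowid))
    keyed.sort(reverse=True)
    return [rowid for _, rowid in keyed[:n]]


# Toy stand-in for a fragment table; the values are made up.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE radius3 (core_smi TEXT, freq INTEGER)")
cur.executemany(
    "INSERT INTO radius3 VALUES (?, ?)",
    [("[*:1]C", 900), ("[*:1]F", 50), ("[*:1]Cl", 30), ("[*:1]Br", 5)],
)
picked = sample_by_freq([1, 2, 3, 4], cur, 3, 2, rng=random.Random(0))
print(picked)
```

For reproducible runs, a seeded generator can be bound with functools.partial, for example sample_func=partial(sample_by_freq, rng=random.Random(42)).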
Note
Complex sampling functions run for every replacement site and can slow down generation noticeably.
3. Property filters with **kwargs
Once property columns have been added to the database, filter on them by
passing keyword arguments named after the columns. A value is either an exact
number or an inclusive (low, high) tuple:
from rdkit import Chem
from crem.crem import mutate_mol

mols = list(mutate_mol(
    Chem.MolFromSmiles("c1ccccc1N"),
    db_name="fragments.db",
    set_names="chembl",
    min_freq=5,
    mw=(50, 180),      # inclusive range
    logp=(0.0, 3.5),   # inclusive range
))
Standard properties (mw, logp, rtb, tpsa, fcsp3) are added with:
cremdb_add_prop -i fragments.db -p mw logp rtb tpsa fcsp3 -c 8
The keyword names must match existing columns: frags / frags_h columns in a
v1 database, or radiusN columns in a v0 database.
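To make the value semantics concrete, the sketch below shows how an exact number and an inclusive (low, high) tuple could each translate into a SQL condition. It only illustrates the behaviour described above and is not CReM's own implementation. The helper build_property_clause, the table toy_frags, and all property values are hypothetical.

```python
import sqlite3


def build_property_clause(**props):
    """Translate property keyword arguments into a SQL WHERE fragment.

    Hypothetical helper, not CReM's code. A number means equality,
    a (low, high) tuple means an inclusive range.
    """
    clauses, params = [], []
    for name, value in props.items():
        if not name.isidentifier():
            raise ValueError(f"invalid column name: {name!r}")
        if isinstance(value, tuple):
            low, high = value
            clauses.append(f"{name} BETWEEN ? AND ?")  # BETWEEN is inclusive
            params.extend([low, high])
        else:
            clauses.append(f"{name} = ?")
            params.append(value)
    return " AND ".join(clauses), params


# Toy table with made-up property values.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_frags (core_smi TEXT, mw REAL, rtb INTEGER)")
con.executemany(
    "INSERT INTO toy_frags VALUES (?, ?, ?)",
    [
        ("[*:1]C", 15.0, 0),
        ("[*:1]CCO", 45.0, 1),
        ("[*:1]c1ccccc1", 77.0, 0),
        ("[*:1]CCCCCCCCCCCCCC", 197.0, 12),
    ],
)
where, params = build_property_clause(mw=(40, 180), rtb=0)
rows = con.execute(f"SELECT core_smi FROM toy_frags WHERE {where}", params).fetchall()
print(where, params, rows)
```

The column names are interpolated into the SQL text, which is why the sketch rejects anything that is not a plain identifier; the values themselves go through bound parameters. A keyword that does not match an existing column would make the query fail, which mirrors the requirement that keyword names match the database columns.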