Fragment sets¶
A v1 database can hold several fragment sets in one file. Each set is stored
as a separate frequency column on every radiusN table. The deduplicated envs
and frags tables are shared by all sets, and each set's column records how
often a fragment occurs within that set.
This lets one database describe, for example, how common a fragment is in ChEMBL versus in a focused in-house library, and lets you switch between those views at generation time.
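To make the layout concrete, here is a minimal sketch of one radius table with two set columns, built in an in-memory SQLite database. The fixed column names are the ones this page lists for radiusN tables. The column types, the row values, the set names chembl and inhouse, and the helper frequent_in are assumptions for illustration only; a real database is produced by cremdb_create.

```python
import sqlite3

# Toy stand-in for one radiusN table (assumed types and values).
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE radius3 (
        env_id INTEGER,
        core_smi_id INTEGER,
        core_num_atoms INTEGER,
        dist2 INTEGER,
        is_ring_closure INTEGER,
        chembl INTEGER,   -- frequency in the hypothetical 'chembl' set
        inhouse INTEGER   -- frequency in the hypothetical 'inhouse' set
    )
    """
)
# One row per (environment, fragment) pair; each set keeps its own count.
con.executemany(
    "INSERT INTO radius3 VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 10, 6, 0, 0, 120, 2),
        (1, 11, 5, 0, 0, 0, 9),
        (2, 10, 6, 0, 0, 35, 0),
    ],
)


def frequent_in(con, column, min_freq):
    """Return (env_id, core_smi_id) pairs whose count in `column` is >= min_freq."""
    # Column names cannot be bound as parameters, so the identifier is quoted.
    sql = (
        f'SELECT env_id, core_smi_id FROM radius3 WHERE "{column}" >= ? '
        "ORDER BY env_id, core_smi_id"
    )
    return con.execute(sql, (min_freq,)).fetchall()


# The same rows give two different views depending on the set column used.
chembl_view = frequent_in(con, "chembl", 5)
inhouse_view = frequent_in(con, "inhouse", 5)
```

The rows are stored once, and switching sets only changes which frequency column the threshold is applied to.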
v1 only
Fragment sets are a v1 feature. v0 databases have a single freq column, and
the set_names argument is ignored for them.
Build with a single set¶
cremdb_create -i input.smi -o fragments.db -s chembl
Every fragment occurrence is counted in the chembl column.
Build with set-membership files¶
Pass one or more membership files to -s/--set-name. Each argument that names
an existing file is treated as a membership list, and its basename (without
extension) becomes the set column name:
cremdb_create -i input_with_ids.smi -o fragments.db \
    -s set1_ids.txt set2_ids.txt all_set
set1_ids.txt → column set1_ids
set2_ids.txt → column set2_ids
all_set is not an existing file, so it becomes a default set containing all molecules.
A fragment occurrence contributes to a set's count only when the source molecule's ID belongs to that set.
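The counting rule can be sketched in plain Python. This is an illustration of the rule as stated, not CReM's actual implementation: the function count_per_set, the placeholder fragment strings, and the use of None to mean "all molecules" are assumptions made for the example.

```python
from collections import Counter


def count_per_set(occurrences, memberships):
    """Count fragment occurrences per set.

    occurrences: iterable of (fragment, mol_id) pairs
    memberships: dict of set name -> set of molecule IDs,
                 or None for a default set containing all molecules
    """
    counts = {name: Counter() for name in memberships}
    for fragment, mol_id in occurrences:
        for name, ids in memberships.items():
            # An occurrence counts only if its source molecule is in the set.
            if ids is None or mol_id in ids:
                counts[name][fragment] += 1
    return counts


# Placeholder fragments "*C" and "*O" stand in for real fragment SMILES.
occurrences = [("*C", "mol_0001"), ("*C", "mol_0002"), ("*O", "mol_0001")]
memberships = {
    "set1_ids": {"mol_0001"},
    "set2_ids": {"mol_0002"},
    "all_set": None,
}
counts = count_per_set(occurrences, memberships)
```

A molecule that appears in several membership files contributes to each of those sets, and every molecule contributes to the default set.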
Membership file format¶
One molecule ID per line:
mol_0001
mol_0007
mol_0042
For membership to work, the input SMILES file must carry IDs in the second column:
CCO mol_0001
c1ccccc1 mol_0002
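Before building, it can be useful to check that the membership IDs actually occur in the SMILES file. The following is a hypothetical pre-flight check, not part of CReM. It assumes the columns are separated by whitespace, as in the example above, and the helper names read_ids and read_smiles_ids are made up for this sketch.

```python
import io


def read_ids(handle):
    """Read one molecule ID per line, skipping blank lines."""
    return {line.strip() for line in handle if line.strip()}


def read_smiles_ids(handle):
    """Return the IDs from the second whitespace-separated column."""
    ids = []
    for lineno, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 2:
            raise ValueError(f"line {lineno}: no ID column after the SMILES")
        ids.append(fields[1])
    return ids


# In-memory stand-ins for set1_ids.txt and input_with_ids.smi.
membership = read_ids(io.StringIO("mol_0001\nmol_0007\nmol_0042\n"))
smiles_ids = read_smiles_ids(io.StringIO("CCO mol_0001\nc1ccccc1 mol_0002\n"))

# Membership IDs with no matching molecule in the input file.
unmatched = sorted(membership - set(smiles_ids))
```

With real files, replace the io.StringIO objects with open(...) handles. A non-empty unmatched list usually points to an ID mismatch between the two files.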
Inspect the sets in a database¶
cremdb_get_set_names -i fragments.db
This prints the set columns for each radius table, for example:
radius1: ['chembl']
radius2: ['chembl']
radius3: ['chembl']
Equivalently, with SQLite:
PRAGMA table_info(radius3);
Any column other than env_id, core_smi_id, core_num_atoms, dist2, and
is_ring_closure is a fragment-set frequency column.
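The same check can be scripted with Python's sqlite3 module. This is a minimal sketch: the helper set_columns is a hypothetical name, and the demo runs on a throwaway in-memory table rather than a real CReM database.

```python
import sqlite3

# Non-set columns of a radiusN table, as listed above.
FIXED_COLUMNS = {
    "env_id",
    "core_smi_id",
    "core_num_atoms",
    "dist2",
    "is_ring_closure",
}


def set_columns(con, table):
    """Return the fragment-set column names of `table`, in table order."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    rows = con.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [row[1] for row in rows if row[1] not in FIXED_COLUMNS]


# Demo table; for a real database use sqlite3.connect("fragments.db").
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE radius3 (env_id, core_smi_id, core_num_atoms, dist2, "
    "is_ring_closure, set1_ids, set2_ids, all_set)"
)
names = set_columns(con, "radius3")
print(names)
```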
Use a set at generation time¶
Choose the set(s) with set_names and the threshold with min_freq:
from rdkit import Chem
from crem.crem import mutate_mol
m = Chem.MolFromSmiles("c1ccccc1N")
# Fragments seen at least 5 times in set1_ids
res = list(mutate_mol(m, db_name="fragments.db", set_names="set1_ids", min_freq=5))
set_names accepts a single column name or a list. When several sets are
named, a fragment is included if any of them meets min_freq (OR logic).
With set_names=None (default), all set columns are considered. Naming a column
that does not exist raises a ValueError listing the available set names.
# Frequent in either set
res = list(mutate_mol(m, db_name="fragments.db",
                      set_names=["set1_ids", "set2_ids"], min_freq=3))
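The selection rules in this section (a single name or a list, None meaning all sets, OR logic across sets, and a ValueError for unknown names) can be modelled in a few lines of plain Python. This is a sketch of the described behaviour, not the code CReM runs internally. The helpers resolve_set_names and passes are hypothetical, as is the representation of a row as a dict of per-set counts.

```python
def resolve_set_names(set_names, available):
    """Normalise set_names to a list of column names."""
    if set_names is None:
        return list(available)  # default: consider every set column
    if isinstance(set_names, str):
        set_names = [set_names]  # a single name is treated as a one-item list
    unknown = [name for name in set_names if name not in available]
    if unknown:
        raise ValueError(
            f"unknown set(s) {unknown}; available: {list(available)}"
        )
    return list(set_names)


def passes(row, set_names, min_freq, available):
    """True if the count reaches min_freq in any selected set (OR logic)."""
    names = resolve_set_names(set_names, available)
    return any(row[name] >= min_freq for name in names)


available = ["set1_ids", "set2_ids"]
row = {"set1_ids": 1, "set2_ids": 4}  # per-set counts for one fragment

in_either = passes(row, ["set1_ids", "set2_ids"], 3, available)
in_set1 = passes(row, "set1_ids", 3, available)
in_any = passes(row, None, 3, available)
```

Here the fragment is kept when both sets are named, because set2_ids reaches the threshold, but it is dropped when only set1_ids is selected.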