Concepts
This page defines the vocabulary used throughout the documentation and in the
function signatures. Understanding these terms makes the parameters of
mutate_mol, grow_mol, link_mols, and make_cycle straightforward.
Fragment, core, and context
CReM cuts a molecule into two complementary parts:
- The core (also called the fragment): the piece that will be replaced.
- The context (also called the environment, env): the surrounding atoms that stay in place.
The bond(s) that were cut leave attachment points, written as [*:1],
[*:2], … in SMILES. A core and a context fit together at matching attachment
points.
    molecule:                 Cc1ccc(OC)cc1
    one cut gives → context:  Cc1ccc([*:1])cc1
                    core:     [*:1]OC
A replacement is the act of swapping one core for another core that was seen attached to the same context somewhere in the fragment database.
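The lookup behind a replacement can be pictured with a toy table. This is a minimal sketch, not CReM's implementation: the dictionary `toy_db` and the helper `replacements` are hypothetical, and the cores other than `[*:1]OC` are invented for illustration rather than taken from a real fragment database.

```python
# Toy illustration only: keys are context SMILES, values are the cores
# that were (hypothetically) observed attached to that context.
toy_db = {
    "Cc1ccc([*:1])cc1": ["[*:1]OC", "[*:1]N", "[*:1]Cl"],
}


def replacements(context, core, db):
    """Return the other cores recorded for the same context."""
    return [c for c in db.get(context, []) if c != core]


# Cores that could replace [*:1]OC in the example molecule.
candidates = replacements("Cc1ccc([*:1])cc1", "[*:1]OC", toy_db)
```

A context that never appears in the table yields an empty list, which mirrors the idea that a replacement needs precedent in the database.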
Context radius
The full context of a fragment can be large, so CReM only keeps the atoms within a fixed number of bonds from each attachment point. That number is the radius.
- radius=1 keeps only the directly bonded atoms: a permissive context that matches many fragments.
- radius=3 (the default) keeps three bonds of context: a good balance between chemical specificity and database coverage.
A larger radius means a stricter match (fewer, more context-appropriate
replacements); a smaller radius means a looser match (more, but less
context-aware replacements). The radius used at generation time must exist as a
radiusN table in the database.
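The effect of the radius can be sketched with a breadth-first search over a hand-written bond graph for the context `Cc1ccc([*:1])cc1`. This is an illustration under stated assumptions, not CReM's code: the helper `atoms_within_radius` is hypothetical, atom indices follow the SMILES atom order, and distances are counted in bonds starting from the attachment-point atom itself.

```python
from collections import deque

# Bond list for Cc1ccc([*:1])cc1; atom 5 is the attachment point [*:1].
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6), (6, 7), (7, 1)]


def atoms_within_radius(bonds, start, radius):
    """Breadth-first search: atoms at most `radius` bonds from `start`."""
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not expand beyond the radius
        for nxt in neighbours.get(atom, ()):
            if nxt not in distance:
                distance[nxt] = distance[atom] + 1
                queue.append(nxt)
    # Report the kept context atoms, excluding the attachment point itself.
    return sorted(a for a in distance if a != start)


radius1 = atoms_within_radius(bonds, start=5, radius=1)
radius3 = atoms_within_radius(bonds, start=5, radius=3)
```

In this toy graph, radius 1 keeps only the ring atom bearing the attachment point, while radius 3 keeps five ring atoms and still leaves out the methyl carbon and the ring atom it is bonded to.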
Fragment size and size windows
Several parameters constrain the size (number of heavy atoms) of the fragments involved:
- min_size / max_size: size of the core being replaced (mutate_mol). min_size=0 allows replacing a hydrogen.
- min_inc / max_inc: how much the replacing fragment may change the heavy-atom count relative to the replaced one. min_inc=-2, max_inc=2 allows the new fragment to be up to two atoms smaller or larger.
- min_atoms / max_atoms: size of the incoming fragment for grow_mol, link_mols, and make_cycle.
- min_rel_size / max_rel_size: size of the replaced fragment relative to the whole molecule.
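The min_inc / max_inc window reduces to simple arithmetic on heavy-atom counts. The helper below is a hypothetical sketch, not a CReM function; it assumes both bounds are inclusive, which is how "up to two atoms smaller or larger" reads.

```python
def passes_size_window(old_size, new_size, min_inc=-2, max_inc=2):
    """True if the change in heavy-atom count lies in [min_inc, max_inc]."""
    return min_inc <= new_size - old_size <= max_inc


# The core [*:1]OC has 2 heavy atoms.
ok_bigger = passes_size_window(old_size=2, new_size=4)    # change of +2
too_big = passes_size_window(old_size=2, new_size=5)      # change of +3
ok_smaller = passes_size_window(old_size=3, new_size=1)   # change of -2
```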
Fragment frequency and sets
Each context–core relationship in the database carries a frequency: how many times that pair was observed in the molecules used to build the database.
- min_freq keeps only replacements seen at least that many times: a simple way to favour common, well-precedented chemistry.
A database can hold several fragment sets side by side (for example
chembl, natural_products, my_library). Each set is a separate frequency
column, so the same database can describe how common a fragment is in different
collections.
- set_names selects which set column(s) the min_freq threshold applies to. When several sets are named, a fragment passes if any of them meets the threshold (OR logic).
- set_names=None (default) uses all available sets.
See Fragment sets for how to build and use them.
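The OR logic across fragment sets can be sketched as follows. The helper `passes_min_freq` and the frequency values are hypothetical. Treating `set_names=None` as "any available set meets the threshold" is an assumption about how "uses all available sets" combines with min_freq.

```python
def passes_min_freq(freqs, min_freq, set_names=None):
    """True if any selected set column reaches min_freq (OR logic)."""
    # Assumption: None means every set present for this row is checked.
    names = list(freqs) if set_names is None else set_names
    return any(freqs.get(name, 0) >= min_freq for name in names)


# Invented per-set frequencies for one context-core pair.
freqs = {"chembl": 12, "natural_products": 0, "my_library": 3}

hit = passes_min_freq(freqs, 10, ["chembl", "my_library"])
miss = passes_min_freq(freqs, 10, ["natural_products", "my_library"])
all_sets = passes_min_freq(freqs, 10)
```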
Attachment-point count and dist2
- A mutate fragment (mutate_mol) can have one to four attachment points.
- A grow fragment (grow_mol) has one attachment point (it replaces a single hydrogen).
- A linker (link_mols) or a ring-closing fragment (make_cycle) has two attachment points.
- Partial-ring fragments may have two to four attachment points.
For two-attachment-point fragments, dist2 is the topological distance (in
bonds) between the two attachment points, which is stored in the database.
It lets CReM control linker geometry and ring size:
- dist (in link_mols) filters linkers by dist2.
- ring_size (in make_cycle) is translated per anchor pair into a dist2 filter.
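A dist2 filter can be pictured as a comparison against the stored value. The sketch below is hypothetical: the rows and their dist2 values are placeholders, and accepting either a single integer or a (min, max) pair is an assumption about the accepted forms of dist, not something this page states.

```python
# Placeholder rows: (linker name, stored dist2). Values are invented.
toy_linkers = [("linker_a", 2), ("linker_b", 3), ("linker_c", 4)]


def filter_by_dist(rows, dist=None):
    """Keep rows whose dist2 equals `dist` or lies within a (min, max) pair."""
    if dist is None:
        return list(rows)  # no restriction
    if isinstance(dist, int):
        low = high = dist
    else:
        low, high = dist
    return [row for row in rows if low <= row[1] <= high]


exact = filter_by_dist(toy_linkers, 3)
ranged = filter_by_dist(toy_linkers, (3, 4))
```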
Fragmentation modes (frag_mode)
When building a database you choose how molecules are fragmented. This is
the --frag-mode option of cremdb_create (and the frag_mode argument of
crem.db.create_db):
| frag_mode | What it cuts | is_ring_closure |
|---|---|---|
| acyclic | acyclic (non-ring) single bonds (classic MMPA cuts) | 0 |
| ring | pairs of single bonds inside one ring (ring arcs) + exhaustive acyclic side cuts | 1 |
| ring_optimal | ring arcs + only exo side cuts adjacent to the arc | 1 |
| both | acyclic + ring | 0 and 1 |
| both_optimal (default) | acyclic + ring_optimal | 0 and 1 |
The optimal modes (ring_optimal, both_optimal) emit only the exo side
cuts — a subset of the exhaustive cuts emitted by ring/both — so they
build a smaller database. Because the exo rows are nested inside the exhaustive
ones, an exhaustive ring/both database also satisfies the exo queries, but
not the other way round.
Rows produced by ring cutting are flagged with is_ring_closure = 1. They are
queried by make_cycle when ring_closures=True and by mutate_mol when
replace_cycles is "partial_all" or "partial_exo". Acyclic rows
(is_ring_closure = 0) drive ordinary mutate/grow/link and
replace_cycles="forced".
The separate integer mode option (0: all atoms, 1: heavy atoms only,
2: hydrogen atoms only) controls whether hydrogen cuts are produced within the
acyclic fragmenter. Both min_size=0 and grow_mol need hydrogen-cut rows, which
mode 0 produces.
Database format: v1 and v0
CReM reads two database formats:
- v1: the current, deduplicated schema (PRAGMA user_version = 1) built by cremdb_create. It stores fragment sets, ring-closure provenance, and shared environment/fragment tables.
- v0: the legacy single-table-per-radius layout produced by the pipeline (fragmentation → frag_to_env → env_to_db).
Both formats work with every generation function. The schema page describes them, and Convert v0 to v1 shows how to upgrade.
The only place the format is visible at generation time is property filtering
via **kwargs: those filters read columns from radiusN in a v0 database and
from frags / frags_h in a v1 database.
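Because the v1 schema is marked with PRAGMA user_version = 1, the marker can be read with Python's standard sqlite3 module. The sketch below is a hedged illustration run on a throwaway file rather than a real CReM database. The helper `schema_version` is hypothetical, and treating any value other than 1 as v0 would be an assumption (0 is simply SQLite's default when the pragma was never set).

```python
import os
import sqlite3
import tempfile


def schema_version(db_path):
    """Return the integer stored in PRAGMA user_version of an SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute("PRAGMA user_version").fetchone()[0]
    finally:
        con.close()


# Demonstration on a temporary database, not a real CReM file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.db")
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE placeholder (x INTEGER)")
    con.commit()
    con.close()
    version_before = schema_version(path)  # SQLite default

    con = sqlite3.connect(path)
    con.execute("PRAGMA user_version = 1")  # mimic the v1 marker
    con.commit()
    con.close()
    version_after = schema_version(path)
```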