Build a database (v1)

cremdb_create builds a v1 fragment database directly from a SMILES file in one step. It is the recommended way to create new databases. The same functionality is available programmatically through crem.db.create_db.

Basic command

cremdb_create -i input.smi -o fragments.db -s chembl
  • -i/--input — input SMILES file (plain text, or .zst-compressed).
  • -o/--output — output SQLite database.
  • -s/--set-name — name of the fragment set to create. May also be one or more membership files (see Fragment sets).

Input format

One molecule per line: a SMILES, optionally followed by an ID.

CCO         mol_0001
c1ccccc1    mol_0002

The ID (second column) can be omitted in general, but it is required when assigning molecules to sets via membership files. Columns are whitespace-separated by default; use --sep for a custom delimiter.
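
As a rough illustration of the format described above, each line could be split as follows. This is a sketch, not the parser used by cremdb_create: parse_smiles_line is a hypothetical helper, and raising on empty lines is an assumption.

```python
from typing import Optional, Tuple


def parse_smiles_line(line: str, sep: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Split one input line into (smiles, mol_id); mol_id is None when absent.

    Hypothetical helper for illustration only. sep=None splits on any run of
    whitespace (the documented default); pass a string for a custom delimiter.
    """
    fields = line.strip().split(sep)
    if not fields or not fields[0]:
        raise ValueError("empty line: expected a SMILES string")
    smiles = fields[0]
    # The second column is optional, so fall back to None when it is missing.
    mol_id = fields[1] if len(fields) > 1 and fields[1] else None
    return smiles, mol_id


print(parse_smiles_line("CCO         mol_0001"))   # ('CCO', 'mol_0001')
print(parse_smiles_line("c1ccccc1"))               # ('c1ccccc1', None)
print(parse_smiles_line("CCO,mol_0001", sep=","))  # ('CCO', 'mol_0001')
```

A check like this can be useful before a long build to confirm that every line carries an ID when membership files will be used.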

Common options

cremdb_create -i input.smi -o fragments.db -s chembl \
  --radii 1 2 3 4 5 \
  --ncpu 16 \
  --frag-mode both_optimal \
  --max-heavy-atoms 15

| Option | Default | Description |
|---|---|---|
| -r, --radii | 1 2 3 4 5 | Context radii to build |
| -c, --ncpu | 1 | Worker processes (capped at available CPUs) |
| --max-heavy-atoms | 15 | Maximum heavy atoms in a core fragment |
| --mode | 0 | Acyclic cut mode: 0 all atoms, 1 heavy only, 2 H only |
| --frag-mode | both_optimal | Fragmentation source: acyclic, ring, both, ring_optimal, both_optimal (see frag modes) |
| --keep-stereo | off | Retain stereochemistry in env/core |
| --sep | whitespace | Input column delimiter |
| --chunk-size | 100 | Input lines per worker task |
| --flush-every | 100 | Chunks accumulated in memory before each DB flush |
| --prefetch | 4 | In-flight task batches per worker |
| --zstd | off | Force zstd decompression of the input |
| --log-every | off | Print a progress line every N chunks |
| --timings | off | Print per-flush timing breakdown to stderr |
| --fragment-error-log | off | Write fragment validation issues to <output>.errors instead of stderr |
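
Read together, the descriptions of --chunk-size and --flush-every suggest roughly how many input lines are buffered between database writes. The arithmetic below is an estimate inferred from those two descriptions, not a documented guarantee; lines_per_flush is a hypothetical helper.

```python
def lines_per_flush(chunk_size: int = 100, flush_every: int = 100) -> int:
    """Approximate number of input lines accumulated before each DB flush.

    Assumes one chunk holds chunk_size input lines and that flush_every
    chunks are collected in memory before they are written out.
    """
    if chunk_size < 1 or flush_every < 1:
        raise ValueError("chunk_size and flush_every must be positive")
    return chunk_size * flush_every


print(lines_per_flush())         # 10000 with the documented defaults
print(lines_per_flush(500, 20))  # 10000: larger chunks, more frequent flushes
```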

--frag-mode and ring-based generation

Ring-derived rows (is_ring_closure = 1) are needed by make_cycle(..., ring_closures=True) and by mutate_mol(..., replace_cycles="partial_all"/"partial_exo"). The optimal modes (ring_optimal, both_optimal) store only exo side cuts, a subset of the exhaustive ring/both cuts, so the resulting databases are smaller. Therefore:

  • partial_exo and make_cycle work with any ring-capable mode (ring, both, ring_optimal, both_optimal).
  • partial_all needs the exhaustive ring or both (on an optimal DB it under-matches — the non-exo cuts are absent).

Ordinary mutate/grow/link and replace_cycles="forced" use only acyclic rows and work with any mode. The default both_optimal covers ordinary generation, ring closure, and partial_exo.
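
The compatibility rules above can be restated as a small lookup. This is only a summary of this section in code form; supports and the feature labels are hypothetical names and are not part of the crem API.

```python
# Mode groups taken from the rules in this section.
RING_CAPABLE = {"ring", "both", "ring_optimal", "both_optimal"}
EXHAUSTIVE_RING = {"ring", "both"}
ALL_MODES = RING_CAPABLE | {"acyclic"}


def supports(frag_mode: str, feature: str) -> bool:
    """Return True if a DB built with frag_mode can serve the given feature."""
    if frag_mode not in ALL_MODES:
        raise ValueError(f"unknown frag mode: {frag_mode}")
    if feature in {"partial_exo", "make_cycle"}:
        return frag_mode in RING_CAPABLE
    if feature == "partial_all":
        # Optimal modes lack the non-exo cuts, so they would under-match.
        return frag_mode in EXHAUSTIVE_RING
    if feature in {"mutate", "grow", "link", "forced"}:
        # Stated above to work with any mode.
        return True
    raise ValueError(f"unknown feature: {feature}")


print(supports("both_optimal", "partial_exo"))  # True
print(supports("both_optimal", "partial_all"))  # False
print(supports("both", "partial_all"))          # True
```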

Indices are created automatically at the end of the run.

Building on an existing database

Running cremdb_create against an existing database is additive: new radii and new set names are added, and existing data is preserved. This lets you extend a database with more molecules or more sets later.

Large datasets: sharded and parallel builds

For very large inputs, build in shards and merge:

  • --shard-size N — build shard databases sequentially, each from at most N input structures, then merge the shards into the output at the end. Shard 0 is the output file; later shards get a _NNN suffix.
  • --parallel-shards N — build N shards concurrently, each on a stride of the input, splitting --ncpu evenly across them. Intermediate parts live in <output>.parts/ and are merged with a parallel binary-tree reduction.

# 8 concurrent shard builders across 32 CPUs
cremdb_create -i big_input.smi -o fragments.db -s chembl \
  --parallel-shards 8 --ncpu 32

--shard-size cannot be combined with --parallel-shards values greater than 1.
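
To make "a stride of the input" and the even CPU split concrete, here is a minimal sketch. It illustrates the idea only and is not the cremdb_create implementation: stride_partition and cpus_per_shard are hypothetical helpers, and the floor of one worker per shard is an assumption.

```python
from typing import List, Sequence


def stride_partition(lines: Sequence[str], n_shards: int) -> List[List[str]]:
    """Shard k receives lines k, k + n_shards, k + 2 * n_shards, ..."""
    if n_shards < 1:
        raise ValueError("n_shards must be positive")
    return [list(lines[k::n_shards]) for k in range(n_shards)]


def cpus_per_shard(ncpu: int, n_shards: int) -> int:
    """Even split of worker processes across shards (at least one each, assumed)."""
    return max(1, ncpu // n_shards)


lines = [f"mol_{i}" for i in range(10)]
for shard in stride_partition(lines, 3):
    print(shard)
print(cpus_per_shard(32, 8))  # 4 workers for each of the 8 shard builders
```

With a stride, every shard sees a similar sample of the input even when the file is sorted, for example by molecule size.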

Merge shards manually with cremdb_merge

Shards or individual v1 databases can also be merged by hand — for example to combine the per-shard databases from --shard-size, or to merge shards built on different machines. cremdb_merge merges source databases into an existing target; it is idempotent and resumable, so already-absorbed sources are skipped.

cremdb_merge -t base.db -i shard_001.db shard_002.db shard_003.db

| Option | Default | Description |
|---|---|---|
| -t, --target | (required) | Target database; must already exist with schema and data |
| -i, --input | (required) | One or more source shard databases to merge in |
| --no-index | off | Skip index creation after merge (useful when more shards will follow) |
| --parallel N | 1 | Merge with a binary-tree reduction using up to N concurrent pair-merges per round |

The same operation is available in Python as crem.db.merge_dbs.
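
The binary-tree reduction mentioned for --parallel can be pictured as rounds of independent pair-merges. The sketch below only plans the rounds; it does not touch any database, the pairing order is an assumption, and reduction_rounds is a hypothetical helper rather than part of crem.

```python
from typing import List, Tuple


def reduction_rounds(sources: List[str]) -> List[List[Tuple[str, str]]]:
    """Return, for each round, the (target, source) pairs merged in that round.

    Neighbours are paired; the left member absorbs the right one and survives
    into the next round. An unpaired trailing item is carried over unchanged.
    """
    rounds = []
    current = list(sources)
    while len(current) > 1:
        pairs = [(current[i], current[i + 1]) for i in range(0, len(current) - 1, 2)]
        rounds.append(pairs)
        current = current[::2]  # left members plus any unpaired tail
    return rounds


for number, pairs in enumerate(reduction_rounds(["a", "b", "c", "d", "e"]), start=1):
    print(number, pairs)
```

Pairs within one round share no databases, which is why several of them can run concurrently; five inputs need three rounds instead of four sequential merges.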

Resumable runs

cremdb_create -i input.smi -o fragments.db -s chembl \
  --processed-chunks chunks.done --log-every 100

If the run is interrupted, rerun the same command: chunks recorded in chunks.done are skipped. Parallel and sharded builds manage their own resume markers internally, so simply rerunning the command resumes them.
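
The skip-on-rerun behaviour can be sketched with a simple marker file. The on-disk format of chunks.done is not described here, so the sketch assumes one integer chunk index per line; load_done and run_chunks are hypothetical helpers, not crem functions.

```python
import os
import tempfile
from typing import Callable, Iterable, List, Set


def load_done(path: str) -> Set[int]:
    """Read already-processed chunk indices (assumed format: one int per line)."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return {int(line) for line in fh if line.strip()}


def run_chunks(chunk_ids: Iterable[int], done_path: str,
               work: Callable[[int], None]) -> List[int]:
    """Process chunks not yet recorded, appending each index once it finishes."""
    done = load_done(done_path)
    processed = []
    with open(done_path, "a") as fh:
        for cid in chunk_ids:
            if cid in done:
                continue  # recorded by an earlier run
            work(cid)
            fh.write(f"{cid}\n")
            fh.flush()  # keep the marker current in case of interruption
            processed.append(cid)
    return processed


with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "chunks.done")
    first = run_chunks(range(3), marker, lambda cid: None)
    second = run_chunks(range(5), marker, lambda cid: None)
    print(first, second)  # [0, 1, 2] [3, 4]
```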

Naming rules for fragment sets

  • A set name must be a valid SQLite identifier: [A-Za-z_][A-Za-z0-9_]*.
  • The reserved names env_id and core_smi_id are not allowed.
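
A set name can be checked against both rules before starting a build. This is a minimal sketch: is_valid_set_name is a hypothetical helper, and comparing reserved names case-sensitively is an assumption.

```python
import re

# Pattern and reserved names as listed in the rules above.
_SET_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_RESERVED = {"env_id", "core_smi_id"}


def is_valid_set_name(name: str) -> bool:
    """True if name matches the identifier pattern and is not reserved."""
    return _SET_NAME_RE.fullmatch(name) is not None and name not in _RESERVED


print(is_valid_set_name("chembl"))     # True
print(is_valid_set_name("2024_set"))   # False: starts with a digit
print(is_valid_set_name("my-set"))     # False: hyphen is not allowed
print(is_valid_set_name("env_id"))     # False: reserved
```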

Python API

crem.db.create_db exposes the same build process and accepts either a file path or an iterable of "SMILES [ID]" strings:

from crem.db import create_db

create_db(
    "input.smi",
    "fragments.db",
    set_name="chembl",
    radii=(1, 2, 3, 4, 5),
    ncpu=16,
    frag_mode="both_optimal",
    max_heavy_atoms=15,
)
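
When the molecules already live in memory, the iterable form avoids writing a temporary file. The sketch below assumes that a plain list of "SMILES [ID]" strings is an acceptable iterable; to_smiles_lines and build_from_records are hypothetical helpers, and only the create_db call itself comes from crem.

```python
from typing import Iterable, Iterator, Optional, Tuple


def to_smiles_lines(records: Iterable[Tuple[str, Optional[str]]]) -> Iterator[str]:
    """Format (smiles, mol_id) pairs as 'SMILES [ID]' strings."""
    for smiles, mol_id in records:
        yield f"{smiles} {mol_id}" if mol_id else smiles


def build_from_records(records, output="fragments.db", set_name="chembl"):
    """Build a database from in-memory records (requires crem to be installed)."""
    # Imported lazily so to_smiles_lines can be used without crem.
    from crem.db import create_db

    # A list is used here as the simplest iterable (an assumption, see above).
    create_db(list(to_smiles_lines(records)), output, set_name=set_name)


records = [("CCO", "mol_0001"), ("c1ccccc1", None)]
print(list(to_smiles_lines(records)))  # ['CCO mol_0001', 'c1ccccc1']
```

Calling build_from_records(records) would then run the build itself.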

To assign molecules to multiple sets, pass a dict mapping each set name to a set of IDs (or None for "all molecules"):

create_db(
    "input.smi",
    "fragments.db",
    set_name={"chembl": None, "focused": {"mol_0001", "mol_0007"}},
)

See crem.db for the full signature, and Fragment sets for the set model.