Build a database (v0)

The original CReM database builder is a text pipeline of three tools, fragmentation → frag_to_env → env_to_db, with a sort | uniq counting step before the final import. It produces a v0 database.

Use cremdb_create for new databases

The one-step cremdb_create builds the richer v1 format (fragment sets, ring-closure provenance, smaller files). Use the v0 pipeline only when you specifically need it; a v0 database can later be converted to v1.

All-in-one wrapper

The shipped shell script runs the whole pipeline. It takes the input SMILES file, an output directory for intermediate files and the final database, and an optional CPU count (default 1):

crem_create_frag_db.sh input.smi fragdb_dir 32

Step by step

The manual steps below give you control over each stage and over which radii are built.

1. Fragment the input structures

fragmentation -i input.smi -o frags.txt -c 32 -v

Option	Default	Description
`-i`, `--input`	—	Input SMILES, optionally with an ID column
`-o`, `--out`	—	Output fragments file
`-s`, `--sep`	whitespace	Input column delimiter
`-d`, `--sep_out`	`,`	Output delimiter
`-m`, `--mode`	`0`	Fragmentation mode: `0` all atoms, `1` heavy atoms only, `2` hydrogen atoms only
`-c`, `--ncpu`	`1`	Number of CPUs
`-v`, `--verbose`	off	Print progress
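As a rough illustration of the input format described in the table, the sketch below writes a tiny tab-separated file with a SMILES column and an ID column. The file name toy.smi and the IDs mol1 to mol3 are made up for this example; they are not part of CReM.

```shell
# Sketch only: toy.smi and the IDs mol1..mol3 are invented for this example.
# Each line is "<SMILES><TAB><ID>"; the ID column is optional for fragmentation.
printf 'CCO\tmol1\n' > toy.smi
printf 'c1ccccc1C(=O)O\tmol2\n' >> toy.smi
printf 'CC(=O)Nc1ccc(O)cc1\tmol3\n' >> toy.smi

# Sanity check: every line should have exactly two tab-separated fields.
awk -F '\t' 'NF != 2 { bad++ } END { exit (bad > 0) }' toy.smi \
  && echo "toy.smi looks OK"
```

A file like this could then be passed to fragmentation with -i toy.smi in place of input.smi.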

2. Convert fragments to standardized context/core pairs at a given radius

frag_to_env -i frags.txt -o r3.txt -r 3 -c 32 -v

Option	Default	Description
`-i`, `--input`	—	Fragmented molecules from step 1
`-o`, `--out`	—	Output text file
`-d`, `--sep`	`,`	Input delimiter
`-r`, `--radius`	`1`	Context radius (in bonds)
`-a`, `--max_heavy_atoms`	`20`	Maximum heavy atoms in a core; larger fragments are discarded
`-k`, `--keep_mols`	—	File of molecule names to keep (others ignored)
`-s`, `--keep_stereo`	off	Keep stereochemistry in env/core
`-c`, `--ncpu`	`1`	Number of CPUs
`-v`, `--verbose`	off	Print progress

The output may contain duplicate lines by design; they are counted in the next step.

3. Count occurrences

Count identical lines with the standard shell utilities sort and uniq:

sort r3.txt | uniq -c > r3_c.txt

This prepends an occurrence count to each unique line. When env_to_db is run with -c, that count is stored in the freq column.
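To see what the counting step does, here is a minimal sketch on a toy file. The lines envA,coreX and envB,coreY are placeholders standing in for frag_to_env output; they do not reproduce its real column layout.

```shell
# Placeholder lines, NOT the real frag_to_env column layout.
printf '%s\n' 'envA,coreX' 'envB,coreY' 'envA,coreX' > toy.txt

# Same counting command as in step 3, applied to the toy file.
# sort groups identical lines together; uniq -c collapses each group
# and prefixes it with the number of occurrences.
sort toy.txt | uniq -c > toy_c.txt
cat toy_c.txt
```

The result has one line per unique input line: a count of 2 for envA,coreX followed by a count of 1 for envB,coreY. uniq pads the count with leading spaces, and the amount of padding differs between implementations.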

4. Import into the database

env_to_db -i r3_c.txt -o fragments.db -r 3 -c -v

Option	Default	Description
`-i`, `--input`	—	Counted env/core file from step 3
`-o`, `--out`	—	Output SQLite database
`-r`, `--radius`	— (required)	Radius of this table; an existing table for the radius is dropped
`-c`, `--counts`	off	Input has a leading occurrence count → adds a `freq` column
`-n`, `--ncpu`	`1`	Number of CPUs
`-v`, `--verbose`	off	Print progress

Building multiple radii

Repeat steps 2–4 for each radius, writing into the same database file. Each radius becomes its own radiusN table.

for r in 1 2 3 4 5; do
  frag_to_env -i frags.txt -o r${r}.txt -r ${r} -c 32 -v
  sort r${r}.txt | uniq -c > r${r}_c.txt
  env_to_db -i r${r}_c.txt -o fragments.db -r ${r} -c -v
done
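The loop above keeps going if one stage fails. One possible variation is sketched below: it stops at the first failing stage and offers a dry-run mode that only prints the commands. The names FRAGS, DB, NCPU, DRY_RUN, run, count_lines and build_radius are helpers invented for this sketch, not part of CReM; the frag_to_env and env_to_db invocations use only the options documented above.

```shell
# Sketch only: all variable and function names here are invented for this example.
FRAGS=frags.txt
DB=fragments.db
NCPU=32
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 3 as a function so it can go through run() like the other steps.
# Sorting to a temporary file first keeps a sort failure from being hidden by a pipe.
count_lines() {
  sort "$1" > "$2.tmp" && uniq -c "$2.tmp" > "$2" && rm -f "$2.tmp"
}

# Steps 2-4 for one radius; && stops at the first failing stage.
build_radius() {
  radius=$1
  run frag_to_env -i "$FRAGS" -o "r${radius}.txt" -r "$radius" -c "$NCPU" -v &&
  run count_lines "r${radius}.txt" "r${radius}_c.txt" &&
  run env_to_db -i "r${radius}_c.txt" -o "$DB" -r "$radius" -c -v
}

for r in 1 2 3 4 5; do
  build_radius "$r" || { echo "radius $r failed, stopping" >&2; break; }
done
```

With the default DRY_RUN=1 the script prints three commands per radius and touches no files. Once the printed commands look right, setting DRY_RUN=0 runs them for real.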

Next step

To use fragment sets, ring closures, and the smaller v1 layout, convert the result: see Convert v0 to v1.