Data Preparation¶
Overview¶
The data pipeline merges a primary solubility dataset with several auxiliary property sources, then creates group-aware train/validation/test splits.
Main sources used by `src/tgnn_solv/data/sources.py`:
- BigSolDB v2.1 - primary solubility records with temperature and `ln_x2`
- Bradley melting points - broad `T_m` coverage
- curated NIST-like overrides and fusion-property sources - used for `T_m` and `dH_fus` enrichment
- Hansen parameters - `hansen_d`, `hansen_p`, `hansen_h`
- IDAC / infinite-dilution activity coefficients - `ln_gamma_inf`
The merged dataframe is built by `DataBuilder` and then split with the helpers
in `split.py` and `split_registry.py`.
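Conceptually, the merge is a series of left joins keyed on SMILES, with `has_*` mask columns marking which auxiliary labels are present. A minimal pandas sketch (a toy stand-in for the real `DataBuilder`, which handles many more sources and validation):

```python
import pandas as pd

# Toy primary solubility table (stand-in for the BigSolDB records).
primary = pd.DataFrame({
    "solute_smiles": ["CCO", "c1ccccc1"],
    "solvent_smiles": ["O", "O"],
    "temperature": [298.15, 310.0],
    "ln_x2": [-1.2, -4.5],
})

# Toy auxiliary melting-point table; only one solute has a T_m label.
melting = pd.DataFrame({"solute_smiles": ["CCO"], "T_m": [159.0]})

# Left join keeps every solubility row; missing auxiliary labels become NaN,
# and a mask column records which rows actually carry the label.
merged = primary.merge(melting, on="solute_smiles", how="left")
merged["has_T_m"] = merged["T_m"].notna()
```

The same join-then-mask pattern extends to fusion enthalpies, Hansen parameters, and IDAC labels.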
Canonical CLI¶
```
python scripts/data/prepare_data.py \
    --output-dir notebooks/data/processed \
    --split-mode solute_scaffold \
    --seed 42
```
Useful flags:
```
python scripts/data/prepare_data.py \
    --output-dir notebooks/data/processed \
    --split-mode solute_scaffold \
    --seed 42 \
    --train-ratio 0.8 \
    --val-ratio 0.1 \
    --test-ratio 0.1 \
    --skip-download
```
`--skip-download` expects the raw files to already exist under the sibling
`raw/` directory.
What the Script Does¶
`scripts/data/prepare_data.py` performs the same high-level workflow as
`notebooks/01_prepare_data.ipynb`:
- load the primary solubility source
- filter for SLE-compatible records
- load auxiliary property sources
- merge everything with `DataBuilder`
- create all canonical split families
- write split metadata to `split_manifest.json`
- print split sizes and auxiliary-label coverage
Output Files¶
One run writes all supported split families:
- `train.csv`, `val.csv`, `test.csv`
- `train_solute.csv`, `val_solute.csv`, `test_solute.csv`
- `train_solvent.csv`, `val_solvent.csv`, `test_solvent.csv`
- `split_manifest.json`
When those processed splits are later frozen for an article-facing benchmark release, the release builder also records checksums and provenance metadata so that external comparisons can prove they used the same data bundle.
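Recording checksums for a frozen bundle typically means hashing each split file. A minimal standard-library sketch (`file_sha256` is an illustrative helper, not the project's actual API):

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def bundle_checksums(processed_dir: Path) -> dict[str, str]:
    """Map each CSV split file in a processed directory to its checksum."""
    return {p.name: file_sha256(p) for p in sorted(processed_dir.glob("*.csv"))}
```

External comparisons can then verify they loaded the same bundle by recomputing and comparing the digests.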
The canonical training path uses:
`train.csv`, `val.csv`, and `test.csv`.
The `_solute` and `_solvent` variants are comparison splits used by
split-protocol analyses.
Processed CSV Schema¶
Required columns for model training or evaluation:
`solute_smiles`, `solvent_smiles`, `temperature`, `ln_x2`
Common mask and auxiliary columns:
`has_solubility`, `T_m`, `has_T_m`, `dH_fus`, `has_dH_fus`, `hansen_d`, `hansen_p`, `hansen_h`, `has_hansen`, `ln_gamma_inf`, `has_gamma_inf`, `source`
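A processed CSV can be sanity-checked against the required schema before training; the helper name `missing_required_columns` below is illustrative, not part of the project:

```python
# Columns every processed split must carry for training or evaluation.
REQUIRED_COLUMNS = ("solute_smiles", "solvent_smiles", "temperature", "ln_x2")

def missing_required_columns(header: list[str]) -> list[str]:
    """Return the required columns absent from a processed-CSV header row."""
    return [col for col in REQUIRED_COLUMNS if col not in header]
```

Running it on the first row of each split before a long training job catches schema drift early.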
The dataset layer will derive additional non-CSV fields at load time, such as:
- solvent type ids
- pair keys for same-pair temperature batching
- Morgan fingerprints
- RDKit descriptors
- descriptor prior features
- fixed group prior features
- crystal GC priors
Two important transformations do not happen in the CSV layer:
- GC-prior `T_m_gc` calibration: `scripts/training/train.py` fits `gc_prior_tm_scale` and `gc_prior_tm_bias` on the training split only when `use_gc_priors_crystal=True`
- DirectGNN descriptor normalization: `scripts/training/train_directgnn.py` computes descriptor mean/std on the training split only and stores them in the checkpoint
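The leakage-safe pattern behind the descriptor normalization is standard: fit statistics on the training split only, then apply them unchanged to validation and test. A minimal sketch, not the project's actual code:

```python
import statistics

def fit_normalizer(train_values):
    """Fit mean/std on the TRAINING split only (no leakage from val/test)."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard zero variance
    return mean, std

def normalize(values, mean, std):
    """Apply train-split statistics to any split."""
    return [(v - mean) / std for v in values]
```

Storing the fitted `(mean, std)` alongside the model, as the checkpoint does, guarantees evaluation uses exactly the training-time statistics.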
Split Modes¶
The split logic lives in `src/tgnn_solv/data/split.py`.
Supported modes:
- `solute_scaffold` - default and strictest molecular-generalization split
- `solute` - grouped by exact solute SMILES
- `solvent` - prevents solvent overlap between splits
All modes use group-preserving assignment rather than naive row-wise random splits.
The maintained article benchmark uses `solute_scaffold`: it enforces strict molecular generalization and is the default split mode for benchmark-release manifests.
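Group-preserving assignment can be pictured as shuffling the unique group keys (scaffolds, solutes, or solvents) with a fixed seed and slicing them by the requested ratios, so no group ever spans two splits. A simplified stand-in for the logic in `split.py`:

```python
import random

def group_split(groups, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Assign whole groups to train/val/test; no group spans two splits."""
    uniq = sorted(set(groups))        # deterministic base order
    rng = random.Random(seed)         # seeded for reproducible splits
    rng.shuffle(uniq)
    n_train = int(ratios[0] * len(uniq))
    n_val = int(ratios[1] * len(uniq))
    return {
        "train": set(uniq[:n_train]),
        "val": set(uniq[n_train:n_train + n_val]),
        "test": set(uniq[n_train + n_val:]),
    }
```

Rows are then routed to a split by looking up their group key, which is what prevents the same scaffold, solute, or solvent from leaking across splits.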
Auxiliary-Only Rows¶
The builder can append auxiliary-only rows for pretraining, for example compounds with fusion-property labels but no solubility measurement. This is part of why the processed CSVs may contain rows where:
- `has_solubility=False`
- `ln_x2` is present only as a placeholder target value
The training loss masks these cases correctly.
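The masking pattern amounts to excluding rows whose label mask is false from the loss average, so placeholder targets contribute nothing. A minimal illustration, not the project's actual loss code:

```python
def masked_mse(preds, targets, mask):
    """Mean squared error over rows where the label mask is True.

    Auxiliary-only rows (has_solubility=False) carry placeholder ln_x2
    values and are skipped entirely.
    """
    kept = [(p - t) ** 2 for p, t, m in zip(preds, targets, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```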
Fair-Comparison Guidance¶
When comparing against older literature or simpler baselines:
- report `solute_scaffold` as the strict result
- report `solute` only when matching a less strict protocol
- use `scripts/experiments/run_split_comparisons.py` to avoid accidental split drift
- use `scripts/experiments/build_benchmark_release.py` when freezing a benchmark bundle for a paper, external baselines, or adapter-based custom models
Data-Loader Feature Paths¶
`TGNNSolvDataset` computes optional side information lazily and caches it:
- Morgan fingerprints
- full RDKit descriptor vectors for DirectGNN augmentation
- compact descriptor-prior features for Hansen/`V_m`
- fixed group-count prior features for Hansen/`V_m`
- crystal GC priors for `T_m`, `dH_fus`, `dCp_fus`
These are dataset-time features, not precomputed CSV columns.
Notes on the two higher-variance feature paths:
- RDKit descriptor vectors are computed through the shared feature helper and sanitized to finite values before model normalization.
- GC crystal priors are raw per-molecule estimates. The training script may later calibrate `T_m_gc`, but the dataset intentionally exposes the raw prior values.
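The train-only calibration of `T_m_gc` can be pictured as an ordinary least-squares line fit of measured `T_m` against the raw prior; the helper below is an illustrative sketch, not the project's implementation:

```python
def fit_linear_calibration(raw_priors, measured):
    """Least-squares fit of measured ≈ scale * raw + bias.

    In the real pipeline the fit would use TRAINING-split rows only,
    mirroring how gc_prior_tm_scale / gc_prior_tm_bias are learned.
    """
    n = len(raw_priors)
    mean_x = sum(raw_priors) / n
    mean_y = sum(measured) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(raw_priors, measured))
    var = sum((x - mean_x) ** 2 for x in raw_priors)
    scale = cov / var
    bias = mean_y - scale * mean_x
    return scale, bias
```

Keeping the dataset's prior values raw and applying the fitted `(scale, bias)` at training time keeps the calibration reproducible from the checkpoint alone.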
Reproducibility Notes¶
The project now treats processed splits as first-class artifacts rather than just intermediate CSVs. In practice this means:
- the training and evaluation stack expects the canonical processed contract under `notebooks/data/processed/`
- benchmark bundles can record the exact split files and hashes they used
- custom-model adapters are expected to benchmark against the same CSV schema instead of inventing a parallel input format