Data Preparation

Overview

The data pipeline merges a primary solubility dataset with several auxiliary property sources, then creates group-aware train/validation/test splits.

Main sources used by src/tgnn_solv/data/sources.py:

  • BigSolDBv2.1: primary solubility records with temperature and ln_x2
  • Bradley melting points: broad T_m coverage
  • curated NIST-like overrides and fusion-property sources: used for T_m and dH_fus enrichment
  • Hansen parameters: hansen_d, hansen_p, hansen_h
  • IDAC / infinite-dilution activity coefficients: ln_gamma_inf

The merged dataframe is built by DataBuilder and then split with the helpers in split.py and split_registry.py.

Canonical CLI

python scripts/data/prepare_data.py \
    --output-dir notebooks/data/processed \
    --split-mode solute_scaffold \
    --seed 42

Useful flags:

python scripts/data/prepare_data.py \
    --output-dir notebooks/data/processed \
    --split-mode solute_scaffold \
    --seed 42 \
    --train-ratio 0.8 \
    --val-ratio 0.1 \
    --test-ratio 0.1 \
    --skip-download

--skip-download expects the raw files to already exist under the sibling raw/ directory.

What the Script Does

scripts/data/prepare_data.py performs the same high-level workflow as notebooks/01_prepare_data.ipynb:

  1. load the primary solubility source
  2. filter for SLE-compatible records
  3. load auxiliary property sources
  4. merge everything with DataBuilder
  5. create all canonical split families
  6. write split metadata to split_manifest.json
  7. print split sizes and auxiliary-label coverage
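Steps 5 and 6 above can be sketched with a minimal manifest writer. The field names and the shape of `splits` here are illustrative assumptions, not the project's actual manifest schema:

```python
import json
import tempfile
from pathlib import Path

def write_split_manifest(out_dir, splits, seed, split_mode):
    """Record split settings and sizes in split_manifest.json.

    `splits` maps a split name to its row indices (illustrative shape;
    the real manifest may store different fields).
    """
    manifest = {
        "split_mode": split_mode,
        "seed": seed,
        "sizes": {name: len(rows) for name, rows in splits.items()},
    }
    (Path(out_dir) / "split_manifest.json").write_text(
        json.dumps(manifest, indent=2)
    )
    return manifest

# Example: three tiny splits written to a throwaway directory.
out_dir = tempfile.mkdtemp()
manifest = write_split_manifest(
    out_dir,
    {"train": [0, 1, 2], "val": [3], "test": [4]},
    seed=42,
    split_mode="solute_scaffold",
)
```

Keeping the seed and split mode inside the manifest is what lets a later run confirm it is looking at the same split configuration.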

Output Files

One run writes all supported split families:

  • train.csv, val.csv, test.csv
  • train_solute.csv, val_solute.csv, test_solute.csv
  • train_solvent.csv, val_solvent.csv, test_solvent.csv
  • split_manifest.json

When those processed splits are later frozen for an article-facing benchmark release, the release builder also records checksums and provenance metadata so that external comparisons can verify they used the same data bundle.
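Checksum recording of this kind needs nothing beyond stdlib hashing. The helper below is a generic sketch, not the release builder's actual code:

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    """Stream a file through SHA-256 so large CSVs never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_splits(processed_dir):
    """Map each split CSV in a processed directory to its hex digest."""
    return {p.name: file_sha256(p) for p in sorted(Path(processed_dir).glob("*.csv"))}
```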

The canonical training path uses:

  • train.csv
  • val.csv
  • test.csv

The _solute and _solvent variants are comparison splits used by split-protocol analyses.

Processed CSV Schema

Required columns for model training or evaluation:

  • solute_smiles
  • solvent_smiles
  • temperature
  • ln_x2

Common mask and auxiliary columns:

  • has_solubility
  • T_m
  • has_T_m
  • dH_fus
  • has_dH_fus
  • hansen_d
  • hansen_p
  • hansen_h
  • has_hansen
  • ln_gamma_inf
  • has_gamma_inf
  • source
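A quick schema check for a processed split needs only the csv module. The required-column list comes from this page; the function itself is an illustrative helper, not part of the codebase:

```python
import csv

REQUIRED_COLUMNS = ["solute_smiles", "solvent_smiles", "temperature", "ln_x2"]

def check_split_schema(csv_path):
    """Return the header of a processed split, raising if a required column is absent."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"{csv_path} is missing required columns: {missing}")
    return header
```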

The dataset layer will derive additional non-CSV fields at load time, such as:

  • solvent type ids
  • pair keys for same-pair temperature batching
  • Morgan fingerprints
  • RDKit descriptors
  • descriptor prior features
  • fixed group prior features
  • crystal GC priors
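Of these, the pair keys are the simplest to illustrate with plain dictionaries. The key shape below is an assumption based on the column names above, not the dataset's actual implementation:

```python
from collections import defaultdict

def pair_key(row):
    """One key per (solute, solvent) pair, so a batch can hold the same
    pair measured at several temperatures."""
    return (row["solute_smiles"], row["solvent_smiles"])

def group_by_pair(rows):
    """Bucket rows by pair key (illustrative of dataset-time grouping)."""
    groups = defaultdict(list)
    for row in rows:
        groups[pair_key(row)].append(row)
    return dict(groups)
```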

Two important transformations do not happen in the CSV layer:

  • GC-prior T_m_gc calibration: scripts/training/train.py fits gc_prior_tm_scale and gc_prior_tm_bias on the training split only when use_gc_priors_crystal=True
  • DirectGNN descriptor normalization: scripts/training/train_directgnn.py computes descriptor mean/std on the training split only and stores them in the checkpoint
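The calibration idea (fit an affine correction on the training split only, then apply it to all splits) can be sketched with a closed-form least-squares fit. This is a sketch of the concept, not the training script's actual routine:

```python
def fit_scale_bias(prior, target):
    """Least-squares fit of target ~ scale * prior + bias.

    Call this on training-split values only, so no validation/test
    statistics leak into the calibration.
    """
    n = len(prior)
    mean_p = sum(prior) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(prior, target))
    var = sum((p - mean_p) ** 2 for p in prior)
    scale = cov / var
    bias = mean_t - scale * mean_p
    return scale, bias
```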

Split Modes

The split logic lives in src/tgnn_solv/data/split.py.

Supported modes:

  • solute_scaffold: the default and strictest molecular-generalization split
  • solute: grouped by exact solute SMILES
  • solvent: prevents solvent overlap between splits

All modes use group-preserving assignment rather than naive row-wise random splits.
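Group-preserving assignment can be sketched as follows. The grouping key and ratio handling are simplified assumptions, not the implementation in split.py:

```python
import random

def group_split(rows, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Assign whole groups to train/val/test so no group spans two splits."""
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(key(row), []).append(i)
    order = sorted(groups)              # deterministic base order
    random.Random(seed).shuffle(order)  # then a seeded shuffle
    n = len(rows)
    cuts = (ratios[0] * n, (ratios[0] + ratios[1]) * n)
    splits = {"train": [], "val": [], "test": []}
    assigned = 0
    for name in order:
        idx = groups[name]
        if assigned < cuts[0]:
            splits["train"].extend(idx)
        elif assigned < cuts[1]:
            splits["val"].extend(idx)
        else:
            splits["test"].extend(idx)
        assigned += len(idx)
    return splits
```

Because whole groups are assigned at once, the realized ratios only approximate the requested ones; that trade-off is inherent to group-aware splitting.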

The maintained article benchmark uses solute_scaffold: strict molecular generalization, and the default split mode for benchmark-release manifests.

Auxiliary-Only Rows

The builder can append auxiliary-only rows for pretraining, for example compounds with fusion-property labels but no solubility measurement. This is part of why the processed CSVs may contain rows where:

  • has_solubility=False
  • ln_x2 is present only as a placeholder target value

The training loss masks these rows, so placeholder ln_x2 values never contribute to the solubility objective.
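The masking behavior can be illustrated with a minimal masked MSE. This is a sketch of the idea, not the project's loss code; in training the mask would come from has_solubility:

```python
def masked_mse(pred, target, mask):
    """MSE over rows whose mask is True; auxiliary-only rows carrying
    placeholder targets contribute nothing to the loss."""
    terms = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(terms) / len(terms) if terms else 0.0
```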

Fair-Comparison Guidance

When comparing against older literature or simpler baselines:

  • report solute_scaffold as the strict result
  • report solute only when matching a less strict protocol
  • use scripts/experiments/run_split_comparisons.py to avoid accidental split drift
  • use scripts/experiments/build_benchmark_release.py when freezing a benchmark bundle for paper, external baselines, or adapter-based custom models

Data-Loader Feature Paths

TGNNSolvDataset computes optional side information lazily and caches it:

  • Morgan fingerprints
  • full RDKit descriptor vectors for DirectGNN augmentation
  • compact descriptor-prior features for Hansen / V_m
  • fixed group-count prior features for Hansen / V_m
  • crystal GC priors for T_m, dH_fus, dCp_fus

These are dataset-time features, not precomputed CSV columns.

Notes on the two higher-variance feature paths:

  • RDKit descriptor vectors are computed through the shared feature helper and sanitized to finite values before model normalization.
  • GC crystal priors are raw per-molecule estimates. The training script may later calibrate T_m_gc, but the dataset intentionally exposes the raw prior values.
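Finite-value sanitization can be as simple as the sketch below; the shared feature helper's actual policy (for example its fill value) may differ:

```python
import math

def sanitize_descriptors(vec, fill=0.0):
    """Replace NaN/inf descriptor entries with a finite fill value
    before any normalization is applied."""
    return [v if math.isfinite(v) else fill for v in vec]
```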

Reproducibility Notes

The project now treats processed splits as first-class artifacts rather than just intermediate CSVs. In practice this means:

  • the training and evaluation stack expects the canonical processed contract under notebooks/data/processed/
  • benchmark bundles can record the exact split files and hashes they used
  • custom-model adapters are expected to benchmark against the same CSV schema instead of inventing a parallel input format