Evaluation Guide¶
Overview¶
Different maintained evaluation surfaces answer different questions:

- library inference helpers in `src/tgnn_solv/inference.py` - single-point prediction, temperature scans, checkpoint I/O, human-readable reports
- library uncertainty helpers in `src/tgnn_solv/uncertainty.py` - MC-dropout, deep ensembles, calibration summaries
- library applicability-domain helper in `src/tgnn_solv/domain.py` - inference-time OOD screening against the training distribution
- `scripts/evaluation/evaluate_complete.py` - quick checkpoint evaluation with figure-ready arrays
- `scripts/evaluation/benchmark_tgnn_solv.py` - richer `Evaluator`-backed benchmark report
- `scripts/evaluation/validate_physics.py` - TGNN physical-parameter diagnostics
- `scripts/experiments/run_split_comparisons.py` - fair comparison across scaffold, solute, and solvent splits
- `scripts/experiments/run_full_budget_experiment.py` - full-budget TGNN-vs-DirectGNN diagnostic export
- `scripts/experiments/run_medium_budget_comparison.py` - full-split medium-budget architecture comparison
The grouped scripts/evaluation/ and scripts/experiments/ paths are the
preferred navigation surface. Legacy top-level script paths still work for
backward compatibility.
The same maintained inference, uncertainty, calibration, and OOD surfaces are also exposed interactively in Experiment Lab.
External and Custom Benchmarks¶
The repository now maintains the competitor and custom-model benchmark surfaces in the same canonical artifact format used elsewhere in the project.
Main entry points:
- `scripts/run_fastsolv.py` - FastSolv pretrained inference and scratch training/evaluation on repo splits
- `scripts/run_solprop.py` - SolProp zero-shot prediction and train-split calibration on repo splits
- `scripts/experiments/run_external_baseline_benchmark.py` - orchestrates FastSolv and SolProp together and writes one comparison bundle
- `scripts/evaluation/benchmark_custom_model.py` - benchmarks an arbitrary prediction CSV or a custom command-generated CSV against the canonical split
Artifact contract:
- `report.json`
- `predictions.csv`
- `summary.csv`
That shared contract is what lets Results & Plots, the artifact registry, and
the lab compare maintained baselines and custom models without special-case UI.
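As a minimal sketch of working against that contract, the following checks that a benchmark bundle directory contains all three files before a comparison tool consumes it. The helper name `missing_bundle_files` is hypothetical, not part of the repo:

```python
from pathlib import Path

# The canonical bundle file names, as listed in the artifact contract above.
REQUIRED_FILES = ("report.json", "predictions.csv", "summary.csv")

def missing_bundle_files(bundle_dir):
    """Return the contract files absent from a benchmark bundle directory."""
    bundle = Path(bundle_dir)
    return [name for name in REQUIRED_FILES if not (bundle / name).exists()]
```

A comparison surface can refuse a bundle up front when this list is non-empty, instead of failing mid-plot.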
Current SolProp benchmark policy:

- the maintained stable mode is room-temperature SolProp inference at 298.15 K
- `train` mode then fits a lightweight calibrator on the repo train split, with optional true-temperature input
- `train-native` retrains the native SolProp architecture directly on TGNN-Solv `ln(x2)` targets and is the maintained architecture-level comparison mode
- `--temperature-dependent` remains available, but upstream SolProp is unstable on part of our chemistry, so the wrapper may fall back row-wise
Inference API¶
The core inference helpers live in src/tgnn_solv/inference.py.
load_model(...)¶
```python
from tgnn_solv.inference import load_model

model, cfg = load_model("checkpoints/tgnn_solv_trained.pt")
```
Behavior:
- reconstructs `TGNNSolvConfig` from the checkpoint payload
- restores compatible weights even if the checkpoint contains extra keys
- preserves saved metadata when present
predict_solubility(...)¶
```python
from tgnn_solv.inference import predict_solubility

result = predict_solubility(
    model,
    solute_smiles="CC(=O)Nc1ccc(O)cc1",
    solvent_smiles="CCO",
    T=298.15,
)
```
`predict_solubility` returns more than only `ln(x2)`:

- final `x2` and `ln_x2`
- ideal-solubility and activity-coefficient decomposition
- crystal properties `T_m`, `dH_fus`, `dCp_fus`
- NRTL outputs such as `tau_12`, `tau_21`, `alpha_12`
- Hansen predictions and `Ra`
- correction magnitude and gate value
- optional GC crystal priors when the checkpoint uses them
- optional direct-path sigma outputs when the model emits them
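To make the decomposition concrete, here is a sketch of how the pieces conventionally combine. The field names and the exact combination rule below are assumptions for illustration; the real payload may name or gate these terms differently:

```python
import math

# Hypothetical result fields mirroring the decomposition above. The
# conventional relation is ln(x2) = ln(x_ideal) - ln(gamma_2) + correction.
result = {
    "ln_x_ideal": -2.0,   # ideal-solubility term from crystal properties
    "ln_gamma_2": 0.5,    # activity-coefficient penalty (NRTL path)
    "correction": 0.1,    # learned residual correction
}

ln_x2 = result["ln_x_ideal"] - result["ln_gamma_2"] + result["correction"]
x2 = math.exp(ln_x2)
```

Reading the decomposition this way makes it easy to attribute error to the crystal term versus the non-ideality term.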
Inference-time feature handling is config-aware:
- if `use_morgan_features=True`, Morgan fingerprints are generated on the fly
- if `use_descriptor_priors=True`, descriptor-prior features are generated on the fly
- if `use_group_priors=True`, fixed group-prior features are generated on the fly
- if `use_gc_priors_crystal=True`, crystal GC priors are recomputed on the fly with fallback priors when needed
Normal inference never injects oracle T_m or dH_fus. Oracle substitution is
reserved for diagnostics unless explicitly forced by an evaluation script.
temperature_scan(...)¶
Use this for a solubility curve across temperature:
```python
from tgnn_solv.inference import temperature_scan

scan_df = temperature_scan(
    model,
    solute_smiles="CC(=O)Nc1ccc(O)cc1",
    solvent_smiles="CCO",
    T_min=270.0,
    T_max=340.0,
    n_points=15,
)
```
The returned DataFrame includes:
- `T`
- `x2`
- `ln_x2`
- `x_ideal`
- `gamma_2`
- `correction`
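For intuition about why the curve rises with temperature, the `x_ideal` column follows the classic melting-point relation. This is a self-contained sketch of that relation only, not the library's implementation (which also carries the activity-coefficient and correction terms); the solute numbers are made up:

```python
import math

R = 8.314  # gas constant, J/(mol K)

def ln_x_ideal(T, T_m, dH_fus):
    """Ideal solubility from the classic relation
    ln x_ideal = -(dH_fus / R) * (1/T - 1/T_m), ignoring the dCp term."""
    return -(dH_fus / R) * (1.0 / T - 1.0 / T_m)

# Sweep 270-340 K for a hypothetical solid (T_m = 441 K, dH_fus = 27 kJ/mol)
curve = [(T, ln_x_ideal(T, 441.0, 27000.0)) for T in range(270, 341, 5)]
```

Below `T_m` the ideal term increases monotonically with temperature and reaches zero at the melting point, which is the shape the scan's `x_ideal` column should echo.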
interpret_prediction(...)¶
interpret_prediction(result) turns a prediction dict into a readable report
with:
- solubility magnitude
- decomposition into crystal and non-ideal terms
- crystal-property sanity cues
- NRTL parameters
- Hansen-distance interpretation
- direct-path sigma when available
Uncertainty Estimation¶
The uncertainty helpers live in src/tgnn_solv/uncertainty.py.
They now support both maintained model families:
- `TGNN-Solv`
- `DirectGNN`
MCDropoutPredictor¶
This is the low-friction post-hoc option when you have one checkpoint:
```python
from tgnn_solv.uncertainty import MCDropoutPredictor

mc = MCDropoutPredictor(model, n_samples=30)
pred = mc.predict("CC(=O)Nc1ccc(O)cc1", "CCO", T=298.15)
```
For TGNN-Solv, it returns mean, std, and 5th/95th percentiles for:
- `ln_x2`
- `x2`
- `gamma_2`
- `T_m`
- `dH_fus`
- `Phi`
- `ln_gamma_2`
- correction magnitude
For DirectGNN, the maintained uncertainty payload is intentionally smaller and
stays in the direct-prediction space:
- `ln_x2`
- `x2`
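The per-quantity statistics above can be sketched without the model: MC-dropout simply aggregates repeated stochastic forward passes. The draws below are synthetic stand-ins, and the crude order-statistic percentiles are illustrative, not the library's exact estimator:

```python
import math
import random
import statistics

random.seed(0)

# Stand-in for n_samples stochastic forward passes; in practice each draw
# comes from the model with dropout left active at inference time.
draws_ln_x2 = [random.gauss(-4.0, 0.3) for _ in range(30)]

def summarize(samples):
    """Mean, std, and approximate 5th/95th percentiles of the draws."""
    ordered = sorted(samples)
    n = len(ordered)
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.stdev(samples),
        "p5": ordered[max(0, int(0.05 * n) - 1)],
        "p95": ordered[min(n - 1, int(0.95 * n))],
    }

stats_ln = summarize(draws_ln_x2)
# x2 statistics follow by transforming each draw, not the ln-space summary
stats_x2 = summarize([math.exp(v) for v in draws_ln_x2])
```

Note that `x2` statistics are computed from transformed draws rather than by exponentiating the `ln_x2` summary, since the exponential of a mean is not the mean of exponentials.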
EnsemblePredictor¶
This is the higher-fidelity option when you have multiple trained checkpoints:
```python
from tgnn_solv.uncertainty import EnsemblePredictor

ens = EnsemblePredictor([model1, model2, model3])
pred = ens.predict("CC(=O)Nc1ccc(O)cc1", "CCO", T=298.15)
```
The notebook examples focus on MC-dropout because it only needs one model, but the ensemble API is maintained and available for multi-seed work.
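As background on why ensembles are the higher-fidelity option: a common aggregation treats members as a Gaussian mixture, so total variance is the average member variance plus the disagreement between member means. This is a generic sketch with made-up numbers, not necessarily the combination rule `EnsemblePredictor` uses:

```python
import statistics

# Per-member predictive means and stds for one system (hypothetical values);
# each pair would come from one independently trained checkpoint.
member_means = [-4.1, -3.9, -4.3]
member_stds = [0.20, 0.25, 0.22]

# Mixture-of-Gaussians aggregation: total variance = average member variance
# plus the variance of member means (the disagreement / epistemic term).
mean = statistics.fmean(member_means)
avg_var = statistics.fmean(s ** 2 for s in member_stds)
disagreement = statistics.fmean((m - mean) ** 2 for m in member_means)
total_std = (avg_var + disagreement) ** 0.5
```

The disagreement term is what MC-dropout on a single checkpoint cannot capture, which is why multi-seed ensembles are worth the extra training cost.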
calibration_report(...)¶
calibration_report(predictions, true_ln_x2) summarizes whether the predicted
intervals are sensible. The current report fields are:
- `n_samples`
- `PICP_90`
- `MPIW`
- `MAE`
- `RMSE`
- `sharpness`
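These fields are all computable from predicted means, predicted stds, and truths. The sketch below assumes Gaussian 90% intervals of the form mean ± 1.645·std; the real helper may construct its intervals differently:

```python
import math

def calibration_summary(pred_mean, pred_std, true_vals, z90=1.645):
    """Sketch of the calibration-report fields under a Gaussian-interval
    assumption (an illustration, not the library's exact implementation)."""
    n = len(true_vals)
    lo = [m - z90 * s for m, s in zip(pred_mean, pred_std)]
    hi = [m + z90 * s for m, s in zip(pred_mean, pred_std)]
    inside = sum(l <= t <= h for l, t, h in zip(lo, true_vals, hi))
    abs_err = [abs(m - t) for m, t in zip(pred_mean, true_vals)]
    return {
        "n_samples": n,
        "PICP_90": inside / n,                           # interval coverage
        "MPIW": sum(h - l for l, h in zip(lo, hi)) / n,  # mean interval width
        "MAE": sum(abs_err) / n,
        "RMSE": math.sqrt(sum(e * e for e in abs_err) / n),
        "sharpness": sum(pred_std) / n,                  # average predictive std
    }
```

A well-calibrated model has `PICP_90` near 0.90 with the smallest `MPIW`/`sharpness` it can sustain; coverage far below 0.90 means the intervals are overconfident.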
Applicability Domain and OOD Detection¶
The inference-time OOD helper is ApplicabilityDomain in
src/tgnn_solv/domain.py.
What is currently implemented¶
The current AD score combines two signals:
- Mahalanobis distance in the learned pair-representation space
- nearest-neighbor Morgan Tanimoto similarity for solute and solvent
It also reports:
- whether the exact solute was seen in training
- whether the exact solvent was seen in training
- a combined `confidence` score
Important implementation note:
- older high-level descriptions may mention leverage
- the maintained implementation currently does not use leverage in the decision path or output payload
Minimal AD workflow¶
```python
import pandas as pd

from tgnn_solv.data import PROCESSED_DIR, make_loaders
from tgnn_solv.domain import ApplicabilityDomain

# model and cfg come from load_model(...) as shown earlier
train_df = pd.read_csv(PROCESSED_DIR / "train.csv")
val_df = pd.read_csv(PROCESSED_DIR / "val.csv")
test_df = pd.read_csv(PROCESSED_DIR / "test.csv")
train_loader, _, _ = make_loaders(train_df, val_df, test_df, batch_size=cfg.batch_size)

ad = ApplicabilityDomain(model, train_loader)
score = ad.score("CC(=O)Nc1ccc(O)cc1", "CCO", T=298.15)
print(score["in_domain"], score["confidence"])
print(ad.report("CC(=O)Nc1ccc(O)cc1", "CCO", T=298.15))
```
Current scoring behavior:
- `mahalanobis <= mahalanobis_cutoff` - latent-space criterion passes
- both `tanimoto_solute` and `tanimoto_solvent >= tanimoto_threshold` - fingerprint criterion passes
- `in_domain` - true only if both criteria pass
- `confidence` - average of a Mahalanobis-derived confidence term and the smaller of the two Tanimoto similarities
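The decision rule above can be sketched in a few lines. The cutoff values and the linear Mahalanobis-to-confidence mapping below are made-up illustrations, not the library's defaults:

```python
# Hypothetical thresholds for illustration only.
MAHALANOBIS_CUTOFF = 3.0
TANIMOTO_THRESHOLD = 0.3

def ad_decision(mahalanobis, tanimoto_solute, tanimoto_solvent):
    """Combine the latent-space and fingerprint criteria as described above."""
    latent_ok = mahalanobis <= MAHALANOBIS_CUTOFF
    fp_ok = min(tanimoto_solute, tanimoto_solvent) >= TANIMOTO_THRESHOLD
    # Mahalanobis-derived confidence term: 1 at distance 0, 0 at the cutoff.
    maha_conf = max(0.0, 1.0 - mahalanobis / MAHALANOBIS_CUTOFF)
    confidence = 0.5 * (maha_conf + min(tanimoto_solute, tanimoto_solvent))
    return {"in_domain": latent_ok and fp_ok, "confidence": confidence}
```

Taking the smaller of the two Tanimoto similarities makes the score conservative: one familiar partner cannot mask an unfamiliar one.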
This is not wired automatically into predict_solubility. If you want OOD
screening at inference time, call ApplicabilityDomain alongside prediction.
Experiment Lab Evaluation Workbench¶
Experiment Lab exposes the maintained evaluation stack through four
interactive inference modes:
- `Run & inspect` - single-system prediction, RDKit structure view, and temperature scan for both `TGNN-Solv` and `DirectGNN`
- `History & compare` - persistent saved runs under `results/lab_runs/inference_history/`
- `Uncertainty lab` - ensemble vs MC-dropout review with saved comparison sessions
- `Calibration dashboard` - batch `PICP_90`, `MPIW`, `MAE`, `RMSE`, and parity analysis with saved history
These GUI views still delegate to the same underlying APIs:
- `predict_solubility`
- `predict_direct_solubility`
- `temperature_scan`
- `temperature_scan_direct`
- `MCDropoutPredictor`
- `EnsemblePredictor`
- `calibration_report`
- `ApplicabilityDomain`
Related results surface:
- `Results & Plots -> Benchmark Studio` - canonical benchmark-bundle comparison across maintained, external, and custom models
Current scope note:
- `Run & inspect`, `Uncertainty lab`, and `Calibration dashboard` now support both `TGNN-Solv` and `DirectGNN`
- `ApplicabilityDomain` remains `TGNN-Solv`-specific because it relies on the physics model's pair-representation extraction path
Thermodynamic Stress Suite¶
Once you already have a canonical predictions.csv, the maintained stress
suite can slice that bundle into harder evaluation regimes:
```bash
python scripts/evaluation/run_thermo_stress_suite.py \
    --predictions-csv results/custom_benchmarks/my_model/predictions.csv \
    --train-data notebooks/data/processed/train.csv \
    --output results/custom_benchmarks/my_model/stress_suite.json
```
Current slices include:
- temperature extremes
- very low vs moderate/high solubility
- missing `T_m`/`dH_fus` supervision
- water vs organic solvents
- extreme `γ∞`-like regions when those columns are available
- seen vs unseen chemistry relative to the train split
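The slicing idea is simple to sketch: partition the prediction rows by a condition and report per-slice error. The rows, column layout, and temperature bounds below are illustrative assumptions, not the suite's actual definitions:

```python
import math

# Minimal rows as (T, true_ln_x2, pred_ln_x2); a real predictions.csv bundle
# carries more columns, and the suite's exact slice bounds may differ.
rows = [
    (253.0, -6.2, -5.8),
    (298.0, -4.0, -4.1),
    (363.0, -2.5, -3.0),
]

def rmse(pairs):
    """Root-mean-square error over (true, pred) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

# One example slice: temperature extremes vs the ambient band.
extreme = [(t, p) for T, t, p in rows if T < 273.15 or T > 348.15]
ambient = [(t, p) for T, t, p in rows if 273.15 <= T <= 348.15]
slice_rmse = {"extreme_T": rmse(extreme), "ambient_T": rmse(ambient)}
```

Each maintained slice is just a different row predicate over the same canonical `predictions.csv`, which is why the suite needs no model access at all.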
scripts/evaluation/evaluate_complete.py¶
Use this for lightweight checkpoint evaluation:
```bash
python scripts/evaluation/evaluate_complete.py \
    --test-data notebooks/data/processed/test.csv \
    --tgnn-checkpoint checkpoints/tgnn_solv_trained.pt \
    --output results/full_evaluation.json \
    --verbose
```
Outputs include:
- overall regression metrics
- temperature-stratified metrics
- solubility-range metrics
- solvent-type metrics
- auxiliary-label-availability metrics
- `true_ln_x2`/`pred_ln_x2` arrays for plotting
Use this as the default quick report.
scripts/evaluation/benchmark_tgnn_solv.py¶
Use this when you want the richer Evaluator path:
```bash
python scripts/evaluation/benchmark_tgnn_solv.py \
    --checkpoint checkpoints/tgnn_solv_trained.pt \
    --test-data notebooks/data/processed/test.csv \
    --output benchmarks/results.json
```
It shares the same broad report schema as evaluate_complete.py, but is more
convenient for deeper benchmark summaries.
scripts/evaluation/validate_physics.py¶
Use this when you care about TGNN intermediates rather than only final solubility error:
```bash
python scripts/evaluation/validate_physics.py \
    --checkpoint checkpoints/tgnn_solv_trained.pt \
    --test-data notebooks/data/processed/test.csv \
    --output results/physics_validation.json \
    --device cuda
```
This script understands the current optional feature paths, including:
- Morgan augmentation
- descriptor priors
- fixed group priors
- crystal GC priors
scripts/experiments/run_split_comparisons.py¶
Use this for fair comparison across split protocols:
```bash
python scripts/experiments/run_split_comparisons.py \
    --processed-dir notebooks/data/processed \
    --splits "solute_scaffold,solute,solvent" \
    --models "tgnn_solv,direct_gnn,rf_baseline,rf_morgan,rf_hybrid" \
    --config configs/paper_config.yaml \
    --n-seeds 3 \
    --output results/split_comparisons.json
```
It:
- resolves canonical train/val/test triplets for each split family
- runs the requested models on matched data
- stores per-split artifacts under `results/split_comparisons/`
- aggregates metrics across seeds
scripts/experiments/run_full_budget_experiment.py¶
This is the most detailed single diagnostic runner in the repo:
```bash
python scripts/experiments/run_full_budget_experiment.py \
    --config configs/paper_config_tuned.yaml \
    --train-data notebooks/data/processed/train.csv \
    --val-data notebooks/data/processed/val.csv \
    --test-data notebooks/data/processed/test.csv \
    --seeds 42 \
    --output-dir results/full_budget_experiment \
    --device cuda
```
It does all of the following in one command:
- trains TGNN-Solv on the full paper budget
- trains DirectGNN on an equivalent total epoch budget
- evaluates both on the same test split
- exports TGNN intermediate physical parameters for every test sample
- reruns TGNN inference in forced-oracle mode
- writes an interpretation guide alongside the metrics
Per-seed and aggregate artifacts include:
- `metrics.json`
- `diagnostics.json`
- `tgnn_intermediates.csv`
- `README.md`
The intermediate CSV includes, when available:
- `T_m_pred`, `dH_fus_pred`, `dCp_fus_pred`
- `T_m_solver`, `dH_fus_solver`, `dCp_fus_solver`
- `T_m_gc` for GC-prior crystal runs
- `tau_12_pred`, `tau_21_pred`, `alpha_pred`
- `ln_gamma2_pred`
- `Phi_pred`
- `ln_x2_physics`
- `ln_x2_final`
- correction magnitude and gate outputs
- true `T_m`/`dH_fus`
- oracle usage masks
The runner reuses resumable per-seed checkpoints created by the main training CLIs.
scripts/experiments/run_medium_budget_comparison.py¶
Use this for the full-scaffold medium-budget architecture comparison:
```bash
python scripts/experiments/run_medium_budget_comparison.py \
    --train-data notebooks/data/processed/train.csv \
    --val-data notebooks/data/processed/val.csv \
    --test-data notebooks/data/processed/test.csv \
    --output-dir results/medium_budget \
    --device cuda
```
It trains and evaluates:
- `tgnn_tuned`
- `tgnn_gc_priors`
- `tgnn_no_bridge`
- `tgnn_combined_no_oracle`
- `directgnn_tuned`
- `directgnn_descriptors`
- `rf_descriptors`
Top-level outputs:
- `results/medium_budget/summary.json`
- `results/medium_budget/comparison_table.md`
- `results/medium_budget/per_model/<model>/...`
For TGNN models, per-model outputs include:
- `metrics.json`
- `standard_intermediates.csv`
- `oracle_tm_intermediates.csv`
- `config.yaml`
- `resolved_config.json`
- training logs and checkpoints
The combined TGNN comparison derives a no-oracle training config from
paper_config_combined.yaml and still evaluates oracle mode afterward.
FastSolv and Other External Comparisons¶
Preferred FastSolv wrapper:
```bash
python scripts/external/run_fastsolv.py compare \
    --input notebooks/data/processed/test.csv \
    --tgnn-checkpoint checkpoints/tgnn_solv_trained.pt \
    --metrics results/fastsolv_compare.json
```
SolProp remains an optional separate-environment workflow.
Practical Guidance¶
Use:
- `evaluate_complete.py` - for quick checkpoint reports
- `benchmark_tgnn_solv.py` - for richer benchmark summaries
- `validate_physics.py` - for TGNN-only physical diagnostics
- `ApplicabilityDomain` - for inference-time OOD screening against the training split
- `MCDropoutPredictor` - for single-checkpoint uncertainty
- `EnsemblePredictor` - for multi-checkpoint uncertainty
- `run_split_comparisons.py` - for fair split-protocol comparisons
- `run_full_budget_experiment.py` - for full-budget TGNN-vs-DirectGNN diagnosis
- `run_medium_budget_comparison.py` - for the full-split medium-budget architecture study