Experiments and Benchmarks

This page summarizes the maintained experiment surfaces that are most useful for architecture decisions and reproducible comparisons.

At a Glance

Reproduce the maintained article workflow

Use the structured reproduction runner when you want the maintained end-to-end article workflow.

Run the medium-budget comparison

Use the full-scaffold medium-budget runner for architecture triage across TGNN variants, DirectGNN variants, and RF.

Run the full-budget diagnostic study

Use the matched-budget TGNN-vs-DirectGNN study when you need physical intermediates and oracle diagnostics.

Compare split protocols

Use the split-comparison runner when you need scaffold, solute, and solvent generalization results from one consistent code path.

Canonical Reproduction

The maintained reproduction runner is:

python scripts/experiments/reproduce_paper.py --profile article

The legacy shell wrapper still works:

bash reproduce.sh

Use this when you want the closest thing to the repository's current article-comparison workflow. It orchestrates data preparation, tuned TGNN multi-seed training, medium-budget comparison, external FastSolv/SolProp benchmarking, evaluation, split comparisons, supplementary tables, and figure generation.

See Reproducing the Paper for the exact sequence and scope boundary.

Medium-Budget Architecture Comparison

This is the maintained architecture-triage runner on the full scaffold split:

python scripts/experiments/run_medium_budget_comparison.py \
    --train-data notebooks/data/processed/train.csv \
    --val-data notebooks/data/processed/val.csv \
    --test-data notebooks/data/processed/test.csv \
    --output-dir results/medium_budget \
    --device cuda

It evaluates:

  • tuned TGNN
  • TGNN + GC priors
  • TGNN + no bridge
  • TGNN + GC priors + no bridge
  • tuned DirectGNN
  • DirectGNN + descriptors
  • RF on descriptors

Current follow-up lanes built around the same scaffold split and matched-budget logic include:

  • TGNN + descriptor augmentation
  • TGNN + GPS encoder
  • Stage 0 pretrained TGNN variants

Expected outputs:

  • results/medium_budget/summary.json
  • results/medium_budget/comparison_table.md
  • results/medium_budget/per_model/<model>/...

Use this runner when you want a fair comparison between the main maintained architectural choices without paying full paper-scale cost.
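Once the runner finishes, the aggregate summary is easiest to consume programmatically. A minimal sketch, assuming summary.json maps model names to metric dicts with a test_rmse key (the real schema may differ; check the file first):

```python
import json
import tempfile
from pathlib import Path

def rank_models(summary_path, metric="test_rmse"):
    """Sort (model, metrics) pairs from a summary JSON by one metric,
    best (lowest) first."""
    summary = json.loads(Path(summary_path).read_text())
    return sorted(summary.items(), key=lambda kv: kv[1][metric])

# Demo on a synthetic summary; the real file is
# results/medium_budget/summary.json and its schema may differ.
demo = {
    "tgnn_tuned": {"test_rmse": 0.71},
    "direct_gnn_tuned": {"test_rmse": 0.84},
    "rf_descriptors": {"test_rmse": 0.93},
}
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "summary.json"
    path.write_text(json.dumps(demo))
    ranked = rank_models(path)

for name, metrics in ranked:
    print(f"{name}: {metrics['test_rmse']:.2f}")
```

The same pattern works for any runner that emits an aggregate JSON summary.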

Full-Budget TGNN-vs-DirectGNN Diagnostic Study

Use the full-budget experiment when you need rich physical diagnostics in addition to headline metrics:

python scripts/experiments/run_full_budget_experiment.py \
    --config configs/paper_config_tuned.yaml \
    --train-data notebooks/data/processed/train.csv \
    --val-data notebooks/data/processed/val.csv \
    --test-data notebooks/data/processed/test.csv \
    --seeds 42 \
    --output-dir results/full_budget_experiment \
    --device cuda

This experiment exports:

  • TGNN metrics
  • DirectGNN metrics
  • oracle-evaluated TGNN metrics
  • tgnn_intermediates.csv
  • detailed diagnostics JSON

Use it when you need to inspect whether errors are coming from:

  • crystal property prediction
  • NRTL parameterization
  • correction magnitude
  • oracle-vs-predicted solver inputs
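For the correction-magnitude question in particular, tgnn_intermediates.csv can be summarized directly. A hedged sketch using only the standard library; the column names here (correction, logS_pred, logS_true) are illustrative assumptions, so inspect the exported header before relying on them:

```python
import csv
import io
import statistics

def mean_abs(rows, column):
    """Mean absolute value of one intermediate column (e.g. the
    residual correction term) across exported predictions."""
    return statistics.mean(abs(float(r[column])) for r in rows)

# Synthetic stand-in for tgnn_intermediates.csv; real column names
# may differ -- check the exported header first.
csv_text = """smiles,correction,logS_pred,logS_true
CCO,0.12,-1.1,-1.0
c1ccccc1,-0.40,-2.9,-2.5
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))
mean_corr = mean_abs(rows, "correction")
print(f"mean |correction| = {mean_corr:.3f}")
```

A large mean correction relative to the prediction range is one quick signal that errors are entering downstream of the physical stages.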

Split-Wise Comparison

Use the split-comparison runner when you need the same model family evaluated under different generalization protocols:

python scripts/experiments/run_split_comparisons.py \
    --processed-dir notebooks/data/processed \
    --splits "solute_scaffold,solute,solvent" \
    --models "tgnn_solv,direct_gnn,rf_baseline,rf_morgan,rf_hybrid" \
    --config configs/paper_config.yaml \
    --output results/split_comparisons.json

This is the safest way to avoid split drift when comparing:

  • scaffold generalization
  • exact-solute generalization
  • solvent-held-out generalization
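Because all protocols come from one code path, the interesting view is usually one model read across splits. A minimal sketch that regroups a nested results dict; the shape shown for results/split_comparisons.json ({split: {model: metric}}) is an assumption:

```python
import json

def pivot_by_model(results):
    """Regroup {split: {model: metric}} into {model: {split: metric}}
    so one model's generalization gap is easy to read off."""
    out = {}
    for split, models in results.items():
        for model, metric in models.items():
            out.setdefault(model, {})[split] = metric
    return out

# Hypothetical contents of results/split_comparisons.json.
results = {
    "solute_scaffold": {"tgnn_solv": 0.78, "rf_baseline": 0.95},
    "solvent": {"tgnn_solv": 0.66, "rf_baseline": 0.81},
}
by_model = pivot_by_model(results)
print(json.dumps(by_model["tgnn_solv"], indent=2))
```

The gap between the scaffold and solvent entries for one model is then a direct read of its generalization behavior.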

Hyperparameter Tuning

For CLI-driven tuning:

python scripts/experiments/run_optuna.py \
    --models tgnn_solv,tgnn_solv_gps,tgnn_solv_descriptors,direct_gnn,direct_gnn_descriptors \
    --n-trials 20

For interactive tuning and analysis, use 08_optuna_tuning.ipynb.

Current scope note:

  • Stage 0 pretraining is intentionally not part of the default Optuna loop, because rerunning ZINC-scale pretraining inside every trial would dominate the cost of the actual TGNN search.

Visual Orchestration

For the same workflows in an interactive control surface, use Experiment Lab:

  • Pipeline Studio: Airflow-style DAG editing, repo-backed presets, and shell export
  • Planner: kanban board, experiment schedule, and follow-up tasks derived from saved lab history
  • HPO Lab: Optuna launcher and study dashboard

The visual surfaces are not separate research code paths. They wrap the same maintained CLI entry points documented on this page.

Ablations and Targeted Studies

Several maintained but more research-oriented entry points live under scripts/experiments/:

  • run_ablation.py
  • learning_curves.py
  • temperature_extrapolation.py
  • statistical_tests.py
  • generate_supplementary.py
  • build_benchmark_release.py

Use the Script Reference if you need the maturity level and intended role of each script before running it.

Which Runner Should You Use?

  • To reproduce the repository's current article-comparison workflow: scripts/experiments/reproduce_paper.py --profile article
  • To keep the old shell entrypoint for compatibility: reproduce.sh
  • To compare maintained architectures on the full scaffold split: run_medium_budget_comparison.py
  • To inspect TGNN physical intermediates and oracle diagnostics: run_full_budget_experiment.py
  • To compare scaffold, solute, and solvent protocols: run_split_comparisons.py
  • To tune hyperparameters: run_optuna.py or 08_optuna_tuning.ipynb

Common Output Pattern

Most experiment runners write machine-readable artifacts under results/. The common pattern is:

  • aggregate JSON summary
  • per-model or per-seed subdirectories
  • markdown comparison tables for quick review
  • CSV exports when intermediate predictions matter
  • canonical benchmark bundles with sidecars when the output is meant to be comparable across families:
      • summary.csv
      • report.json
      • predictions.csv
      • run_manifest.json
      • benchmark_card.json

That convention is intentional so downstream reporting and figure-generation scripts can consume the outputs consistently.
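Before pointing downstream scripts at a bundle, it is worth checking that all five sidecars are present. A small sketch of such a check; the file list is taken from the convention above, and the demo directory is synthetic:

```python
import tempfile
from pathlib import Path

# Sidecar names from the canonical benchmark bundle convention.
SIDECARS = [
    "summary.csv",
    "report.json",
    "predictions.csv",
    "run_manifest.json",
    "benchmark_card.json",
]

def missing_sidecars(bundle_dir):
    """Return the sidecar files a benchmark bundle still lacks."""
    bundle = Path(bundle_dir)
    return [name for name in SIDECARS if not (bundle / name).exists()]

# Demo against a synthetic bundle with only the first three files.
with tempfile.TemporaryDirectory() as d:
    for name in SIDECARS[:3]:
        (Path(d) / name).touch()
    missing = missing_sidecars(d)

print("missing sidecars:", missing)
```

An empty result means the bundle is structurally complete and safe to hand to reporting or figure-generation scripts.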

When a local benchmark snapshot should become paper-facing instead of just inspectable, freeze it with:

python scripts/experiments/build_benchmark_release.py ...

For post-benchmark robustness slicing on an existing predictions.csv, use:

python scripts/evaluation/run_thermo_stress_suite.py ...