# Repository Audit
This audit summarizes the current repo shape, with emphasis on whether the documented workflows actually match the code.
## Executive Summary
The repository is now coherent around a reproducible CLI path:
- `scripts/data/prepare_data.py`
- `scripts/training/train.py`
- `scripts/experiments/run_seeds.py`
- `scripts/evaluation/evaluate_complete.py`
- `scripts/experiments/run_split_comparisons.py`
- `scripts/experiments/generate_paper_figures.py`
- `reproduce.sh`
The main improvements relative to earlier states are:
- docs now describe the real maintained scripts instead of hypothetical ones
- DirectGNN has a documented standalone CLI and descriptor-augmentation mode
- crystal GC priors, Walden/oracle modes, and the full-budget experiment runner are documented as explicit research features
- the script surface is now grouped by purpose under `scripts/data`, `scripts/training`, `scripts/evaluation`, `scripts/experiments`, and `scripts/external`
- the internal package surface is now grouped by purpose under `tgnn_solv.core`, `chemistry`, `models`, `physics`, `training`, `evaluation`, and `research`
- the main training CLIs now support resumable checkpoints
- the medium-budget full-split architecture runner is documented alongside the full-budget diagnostic path
- split-wise comparison and baseline workflows are described consistently with the live code
- benchmark bundles now carry machine-readable provenance sidecars
- custom models can join the benchmark workflow through a formal adapter API
- DirectGNN now participates in the maintained uncertainty / calibration path
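The adapter API itself lives in the repo and is not reproduced here. As a rough illustration only, a custom model might plug into a benchmark loop through a small protocol like the sketch below; every name in it, including `BenchmarkAdapter` and `Prediction`, is hypothetical and not the repo's actual interface:

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


@dataclass
class Prediction:
    """Hypothetical container for one benchmark prediction."""
    logS: float                # predicted log-solubility
    sigma: Optional[float]    # optional predictive uncertainty


class BenchmarkAdapter(Protocol):
    """Hypothetical adapter surface a custom model would implement."""

    name: str

    def predict(self, solute_smiles: str, solvent_smiles: str,
                temperature_K: float) -> Prediction: ...


class ConstantBaseline:
    """Trivial adapter: always predicts one fixed value."""

    name = "constant-baseline"

    def __init__(self, mean_logS: float):
        self.mean_logS = mean_logS

    def predict(self, solute_smiles: str, solvent_smiles: str,
                temperature_K: float) -> Prediction:
        return Prediction(logS=self.mean_logS, sigma=None)


def run_adapter(adapter: BenchmarkAdapter,
                rows: Sequence[tuple]) -> list:
    """Drive any adapter over (solute, solvent, T) benchmark rows."""
    return [adapter.predict(*row) for row in rows]
```

The point of the protocol shape is that the benchmark driver only sees `name` and `predict`, so any model that satisfies them can join the workflow without inheriting from a repo base class.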
## Current Strengths
- clear separation between model, solver, data, and reporting code
- strong numerical test coverage around the physics core
- canonical split registry for scaffold, solute, and solvent protocols
- one maintained no-physics baseline (`DirectGNN`) and multiple descriptor baselines for fair comparison
- full-budget diagnostic runner that exports solver intermediates rather than only top-line error metrics
## Known Gaps
### 1. Resume Support Is Better, But Not Universal
The main TGNN and DirectGNN training CLIs now support resumable checkpoints, and the two heavy experiment runners reuse those checkpoints automatically.
Implication:
- single-run training is preemption-safe
- some lighter wrapper scripts still lack their own orchestration for partial-progress recovery
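The resume pattern those CLIs rely on can be sketched generically. The following is not the repo's checkpoint format, just a minimal JSON-based illustration of a preemption-safe restart loop (`train_with_resume` and its state keys are hypothetical):

```python
import json
from pathlib import Path


def train_with_resume(ckpt_path: Path, total_epochs: int) -> dict:
    """Generic resume loop: reload the last checkpoint if one exists,
    then continue training and overwrite the checkpoint each epoch."""
    state = {"epoch": 0, "loss_history": []}
    if ckpt_path.exists():  # preemption-safe restart
        state = json.loads(ckpt_path.read_text())

    for epoch in range(state["epoch"], total_epochs):
        loss = 1.0 / (epoch + 1)  # stand-in for a real training step
        state["epoch"] = epoch + 1
        state["loss_history"].append(loss)
        ckpt_path.write_text(json.dumps(state))
    return state
```

Because the checkpoint is rewritten every epoch, a killed run restarts from the last completed epoch instead of from scratch, which is the property the audit calls "preemption-safe".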
### 2. Optional Dependency Stacks Remain Environment-Sensitive
FastSolv and SolProp are still external ecosystems with their own dependency requirements.
Status:
- wrappers are documented honestly as optional
- unrelated script help paths no longer depend on those imports
### 3. Research Scripts Are Unevenly Hardened
Useful research scripts exist for:
- ablations
- learning curves
- temperature extrapolation
- physics validation
- full-budget diagnostics
- medium-budget architecture comparison
They are valuable, but they are not all at the same regression-hardening level as the canonical train/eval path.
### 4. No Fixed Repo-Wide Metric Target
The repository intentionally does not claim one immutable benchmark number that every environment must match exactly.
Implication:
- reproduction should be validated via generated artifacts and consistent schemas, not by a single hardcoded MAE target in the docs
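A schema-first check of that kind might look like the sketch below; the key names (`mae`, `rmse`, `n_test`, `split`, `seed`) are assumptions for illustration, not the repo's actual artifact schema:

```python
REQUIRED_KEYS = {"split", "seed", "mae", "rmse", "n_test"}  # hypothetical schema


def validate_metrics_artifact(artifact: dict) -> list:
    """Check an eval artifact for schema consistency instead of comparing
    against a hardcoded MAE target. Returns a list of problems (empty = ok)."""
    problems = []
    missing = REQUIRED_KEYS - artifact.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in ("mae", "rmse"):
        value = artifact.get(key)
        if isinstance(value, (int, float)) and value < 0:
            problems.append(f"{key} must be non-negative, got {value}")
    if isinstance(artifact.get("n_test"), int) and artifact["n_test"] <= 0:
        problems.append("n_test must be positive")
    return problems
```

A check like this passes in any environment that produces well-formed artifacts, which is exactly the portability a single hardcoded metric target would lack.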
### 5. Benchmark Release Freezing Is Now Possible, But Still Manual
The repo now has a checksum-based release-manifest builder for processed splits and benchmark bundles.
Implication:
- provenance is much better than before
- a truly public frozen release still depends on choosing which bundles and checkpoints to curate and publish
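A checksum-based manifest builder in the spirit of the one described above can be sketched as follows; the function name, output layout, and file names are hypothetical, not the repo's actual tool:

```python
import hashlib
import json
from pathlib import Path


def build_release_manifest(bundle_dir: Path, out_path: Path) -> dict:
    """Walk a bundle directory and record a SHA-256 checksum per file,
    so a downstream user can verify a frozen release byte-for-byte."""
    manifest = {"files": {}}
    for path in sorted(bundle_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest["files"][str(path.relative_to(bundle_dir))] = digest
    out_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

The manual step the audit flags is not the checksumming itself but the curation decision: which bundles and checkpoints get manifests and become the published release.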
## Intentional Overlap
Some overlap is useful and should remain:
- `evaluate_complete.py` vs `benchmark_tgnn_solv.py`: quick plot-ready eval vs richer benchmark path
- notebook vs CLI workflows: exploration vs reproducibility
- `run_split_comparisons.py` vs `run_full_budget_experiment.py`: split-protocol fairness vs one deep budget-matched diagnostic study
## Recommended Usage Today
### Reproducible default path
Use:
- `scripts/data/prepare_data.py`
- `scripts/training/train.py`
- `scripts/experiments/run_seeds.py`
- `scripts/evaluation/evaluate_complete.py`
- `scripts/experiments/run_split_comparisons.py`
- `reproduce.sh`
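One way to drive that sequence with crude skip-if-done resume awareness is sketched below. The step list mirrors the scripts above, but the driver itself (`plan_pipeline`) is hypothetical, and it passes no flags, whereas the real CLIs take configuration arguments:

```python
# Ordered reproducible-path steps; the real CLIs take additional arguments.
PIPELINE = [
    ["python", "scripts/data/prepare_data.py"],
    ["python", "scripts/training/train.py"],
    ["python", "scripts/experiments/run_seeds.py"],
    ["python", "scripts/evaluation/evaluate_complete.py"],
    ["python", "scripts/experiments/run_split_comparisons.py"],
]


def plan_pipeline(done_markers: set) -> list:
    """Return the commands still to run, skipping steps whose
    completion marker is already present (crude resume awareness)."""
    return [cmd for cmd in PIPELINE if cmd[1] not in done_markers]
```

A driver like this would hand each remaining command to `subprocess.run`; the planning step is separated out here so the ordering and skip logic are visible on their own.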
### Best baseline-comparison path
Use:
- `scripts/training/train_directgnn.py`
- `configs/paper_config_directgnn_descriptors.yaml`
- `python -m tgnn_solv.baselines.rf_baseline`
- `scripts/experiments/run_split_comparisons.py`
### Best bottleneck-diagnosis path
Use:
- `scripts/experiments/run_full_budget_experiment.py`
- `scripts/evaluation/validate_physics.py`
## Open Follow-Up Opportunities
These are still good future hardening targets:
- expand resume-aware orchestration to more multi-run wrappers
- consolidate overlapping benchmark/report formatting helpers even further
- expand regression coverage for the research-heavy experiment scripts
- decide whether any current experimental config variants should be promoted to canonical status
- decide which checkpoints and release manifests deserve first-class public publication rather than only local generation