Results¶
This page explains how to read the benchmark artifacts that currently live in the repository, which bundles are canonical, and which ones should still be treated as provisional diagnostics rather than final scientific claims.
Read This First¶
The repo contains several different classes of results:
- structured reproduction summaries
- quick checkpoint evaluations
- canonical benchmark bundles
- proxy-budget diagnostic experiments
- split-comparison outputs
- medium-budget and full-budget benchmark runners
- targeted research probes, such as descriptor-recovery analyses
Do not compare numbers across different sections unless the following are matched:
- split protocol
- training budget
- seed count
- feature path
- model family
For this project, the strict default split is `solute_scaffold`.
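As a sketch, the matching rule above can be encoded as a guard before any cross-section comparison. The key names here are illustrative only, not the repo's actual run-metadata schema:

```python
# Hypothetical comparability check; key names are illustrative and do not
# come from the repo's actual benchmark_card.json schema.
MATCH_AXES = ("split", "budget", "n_seeds", "feature_path", "model_family")

def comparable(run_a: dict, run_b: dict) -> bool:
    """True only when every protocol axis matches between two runs."""
    return all(run_a.get(k) == run_b.get(k) for k in MATCH_AXES)

tgnn = {"split": "solute_scaffold", "budget": "medium", "n_seeds": 3,
        "feature_path": "graph", "model_family": "tgnn"}
print(comparable(tgnn, dict(tgnn, split="random")))  # False: split mismatch
```

Any mismatch on any axis means the two numbers should not appear in the same comparison table.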
Recommended Benchmark Hierarchy¶
When reporting results from TGNN-Solv, prefer this order of evidence:
- full scaffold split over proxy subsets
- matched training budgets across TGNN and DirectGNN
- multi-seed aggregates over single-seed runs
- metric bundles that include both top-line error and TGNN bottleneck diagnostics
That is why the maintained benchmark pages focus on:
- `scripts/experiments/reproduce_paper.py --profile article`
- `scripts/experiments/run_medium_budget_comparison.py`
- `scripts/experiments/run_full_budget_experiment.py`
- `scripts/experiments/run_split_comparisons.py`
- `scripts/experiments/run_external_baseline_benchmark.py`
Canonical Artifact Contract¶
The project now relies on one shared benchmark artifact format for maintained external baselines and custom models:
- `summary.csv`
- `report.json`
- `predictions.csv`
- `run_manifest.json`
- `benchmark_card.json`
This is what powers:
- the Results & Plots -> Benchmark Studio view in the lab
- artifact registry and diff views
- supplementary external-benchmark tables
If a benchmark bundle does not follow that contract, treat it as ad hoc output rather than a first-class comparable result.
The sidecars matter because they carry:
- file-level provenance and checksums
- split / model-family metadata
- capability flags such as uncertainty availability
- machine-readable inputs for Benchmark Studio, lineage, and release freezing
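A minimal contract check, assuming only the five canonical file names listed above (the real registry likely also validates schemas and checksums), could look like:

```python
from pathlib import Path

# The five canonical files of the shared benchmark artifact contract.
CONTRACT_FILES = ("summary.csv", "report.json", "predictions.csv",
                  "run_manifest.json", "benchmark_card.json")

def missing_contract_files(bundle_dir) -> list:
    """Return the canonical files absent from a benchmark bundle directory."""
    root = Path(bundle_dir)
    return [name for name in CONTRACT_FILES if not (root / name).is_file()]
```

An empty return value means the directory at least satisfies the file-level contract; anything else should be treated as ad hoc output.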
Reproduction Summaries¶
Structured paper-reproduction runs now write:
- `results/reproduction/core_summary.json`
- `results/reproduction/article_summary.json`
- `results/reproduction/full_summary.json`
These are orchestration summaries, not benchmark claims by themselves. Their purpose is to record:
- which maintained profile was used
- which steps completed, failed, or were skipped
- where the resulting benchmark and figure artifacts landed
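A hypothetical reading of one of these summaries might look like the following; the key names (`profile`, `steps`, `status`) are assumed for illustration, not taken from the actual files:

```python
import json

# Assumed summary shape; the real core_summary.json keys may differ.
summary = json.loads("""
{
  "profile": "article",
  "steps": [
    {"name": "train_tgnn", "status": "completed"},
    {"name": "external_baselines", "status": "skipped"}
  ],
  "artifact_root": "results/reproduction"
}
""")

skipped = [s["name"] for s in summary["steps"] if s["status"] == "skipped"]
print(summary["profile"], skipped)  # article ['external_baselines']
```

The point of such a summary is orchestration bookkeeping, not metrics: it tells you which downstream bundles exist, not how good they are.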
Committed Diagnostic / Proxy Results¶
The checked-in result bundles under results/ are mostly diagnostic or
research-stage outputs. They are useful for understanding failure modes, but
they should not be treated as the repository's final benchmark claims.
Proxy diagnostic comparison¶
Source artifacts:
- `results/diagnostic_experiments/summary.json`
- `results/diagnostic_experiments/comparison_table.md`
Representative proxy numbers currently committed:
| Experiment | Test MAE | Notes |
|---|---|---|
| TGNN tuned proxy | 2.2275 | reference proxy checkpoint |
| TGNN tuned proxy with oracle `T_m` substitution | 2.2133 | only a small improvement on this proxy slice |
| DirectGNN + descriptors proxy | 2.0110 | strongest committed proxy result in this bundle |
| TGNN combined proxy | 5.9603 | clearly unstable / underperforming in this diagnostic run |
Important caveat:
- this bundle was run on a small proxy setup, not the maintained full-scaffold architecture-comparison budget
Descriptor-prior medium scaffold diagnostic¶
Source artifact:
results/descriptor_prior_medium_scaffold.json
Currently committed numbers:
| Model | MAE | RMSE | R² |
|---|---|---|---|
| `tgnn_medium` | 2.2561 | 3.0619 | -0.1660 |
| `tgnn_priors_medium` | 2.2614 | 3.0674 | -0.1702 |
| `direct_gnn_medium` | 2.0692 | 2.8540 | -0.0130 |
Interpretation:
- these are medium-sized diagnostic runs on a specific scaffold slice
- they are useful for trend inspection, not as the primary benchmark table for the project site
Maintained Architecture Comparison Bundles¶
Medium-budget architecture study¶
The main in-repo architecture-comparison target is the full-scaffold medium-budget run under:
results/medium_budget/
At the moment, the repository contains the in-progress TGNN artifact layout:
- `results/medium_budget/per_model/tgnn_tuned/checkpoint.pt`
- `results/medium_budget/per_model/tgnn_tuned/train.log`
The runner is designed to produce:
- `results/medium_budget/summary.json`
- `results/medium_budget/comparison_table.md`
- `results/medium_budget/per_model/<model>/...`
That is the benchmark bundle the site should treat as the main in-repo architecture comparison once complete.
Recent follow-up lanes derived from the same scaffold split and budget include:
- regularized TGNN variants
- TGNN with descriptor augmentation
- TGNN with the GPS encoder
- Stage 0 pretrained TGNN variants
Those runs should be treated as queued / follow-up comparison tracks until
their own summaries land under results/.
External article-comparison bundle¶
The maintained external competitor bundle now lives under:
results/external_baselines/article_benchmark/
This is where FastSolv and native-retrained SolProp land when they are run through the shared orchestrator. Once present, this bundle should be compared against the in-repo model families through:
- the Results & Plots -> Benchmark Studio view
- supplementary Table S10
- checksum-based benchmark release manifests built from those bundles
Custom benchmark bundles¶
Arbitrary user-provided model predictions should land under:
results/custom_benchmarks/<model_name>/
They use the same canonical bundle contract and are therefore intentionally comparable to maintained model families inside the GUI.
If the custom model is implemented through the adapter API instead of a raw CSV, the bundle also captures that fact in its benchmark card.
Frozen Release Manifests¶
When a benchmark snapshot should become paper-facing rather than just locally inspectable, freeze it with:
```bash
python scripts/experiments/build_benchmark_release.py \
    --release-name article-benchmark \
    --version 0.1.0 \
    --processed-dir notebooks/data/processed \
    --bundle-root results/external_baselines/article_benchmark \
    --bundle-root results/custom_benchmarks \
    --out-dir results/releases/article_benchmark_v0_1_0
```
The produced release_manifest.json records checksums for:
- processed split CSVs
- `split_manifest.json`
- benchmark bundles and their sidecars
- optional checkpoints included in the release
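A sketch of how such checksum records could be verified after the fact, assuming each manifest entry carries a relative `path` and a `sha256` hex digest (the real `release_manifest.json` schema may differ):

```python
import hashlib
from pathlib import Path

# Assumed manifest entry shape: {"path": "<relative path>", "sha256": "<hex>"}.
def stale_entries(release_dir, entries) -> list:
    """Return manifest paths whose on-disk sha256 no longer matches."""
    stale = []
    for entry in entries:
        data = (Path(release_dir) / entry["path"]).read_bytes()
        if hashlib.sha256(data).hexdigest() != entry["sha256"]:
            stale.append(entry["path"])
    return stale
```

An empty result means the frozen release still matches what was recorded; any stale path indicates the snapshot has drifted and should not be cited.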
Experiment Lab Histories¶
The interactive Experiment Lab writes additional repo-local analysis history
under:
- `results/lab_runs/inference_history/`
- `results/lab_runs/uncertainty_history/`
- `results/lab_runs/calibration_history/`
These JSON artifacts are not benchmark claims by themselves. They are review artifacts used for:
- saved single-system case studies
- ensemble vs MC-dropout uncertainty comparison
- batch calibration sessions
- lineage tracing back to checkpoints and datasets
- planner and pipeline follow-up intake inside the GUI
Descriptor-Recovery Probe¶
One recent diagnostic probe asks whether the solute graph embedding already linearly recovers standard RDKit descriptors.
Source artifact:
results/medium_budget/per_model/tgnn_tuned/descriptor_probe/summary.json
Summary from the current probe:
- unique train solutes: 14997
- unique test solutes: 2352
- finite descriptor R²: 208 / 217
- descriptors with R² >= 0.8: 3 / 208
- median descriptor R²: 0.505
Representative descriptor recovery:
| Descriptor | Test R² |
|---|---|
| FractionCSP3 | 0.926 |
| NumHDonors | 0.693 |
| TPSA | 0.650 |
| MolLogP | 0.607 |
| MolWt | 0.450 |
Interpretation:
- the encoder clearly learns some chemically meaningful structure
- but it does not look "descriptor-complete" yet
- this supports treating encoder capacity as at least a partial bottleneck
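The probe technique itself is simple: fit a regularized linear map from the frozen embeddings to each descriptor and score held-out R². A self-contained sketch on synthetic data (the repo's actual probe may use a different solver or regularization):

```python
import numpy as np

def probe_r2(emb_train, y_train, emb_test, y_test, l2=1e-3):
    """Held-out R² of a ridge probe from embeddings to one descriptor."""
    X = np.hstack([emb_train, np.ones((len(emb_train), 1))])  # bias column
    w = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ y_train)
    X_test = np.hstack([emb_test, np.ones((len(emb_test), 1))])
    resid = y_test - X_test @ w
    return 1.0 - (resid ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()

# Synthetic demo: a target that is (almost) linear in the embedding
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))
y = emb @ rng.normal(size=16) + 0.1 * rng.normal(size=200)
print(round(probe_r2(emb[:150], y[:150], emb[150:], y[150:]), 3))
```

For a linearly recoverable target this R² sits near 1.0; descriptors the encoder has not captured score much lower, which is exactly the gap the committed probe summary reports.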
What To Report Publicly¶
For a paper, slide deck, or benchmark summary page, the safest reporting policy is:
- use `solute_scaffold` as the primary split
- use matched budgets for TGNN and DirectGNN
- report seed aggregates when available
- include both:
    - solubility metrics such as MAE / RMSE / R²
    - TGNN bottleneck metrics such as `T_m` parity and oracle sensitivity
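For reference, the top-line solubility metrics can be computed directly from paired true/predicted values. This sketch assumes plain arrays rather than the repo's `predictions.csv` schema:

```python
import numpy as np

def solubility_metrics(y_true, y_pred) -> dict:
    """Top-line error bundle: MAE, RMSE, and R² on paired values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return {
        "mae": float(np.abs(err).mean()),
        "rmse": float(np.sqrt((err ** 2).mean())),
        "r2": float(1.0 - (err ** 2).sum() / ss_tot),
    }

print(solubility_metrics([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```

Reporting all three together matters because MAE and RMSE diverge under heavy-tailed errors, and R² can go negative on hard scaffold splits, as the committed medium-scaffold diagnostics show.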
Which Artifact Should I Trust?¶
| Question | Best artifact |
|---|---|
| "Does this one checkpoint work?" | evaluate_complete.py output |
| "How does TGNN compare with DirectGNN on a serious budget?" | results/medium_budget/* or results/full_budget_experiment/* |
| "How do FastSolv / SolProp compare on the same split?" | results/external_baselines/article_benchmark/* |
| "How does my custom model compare against maintained baselines?" | results/custom_benchmarks/<name>/* |
| "Are the physics intermediates sensible?" | validate_physics.py output or full-budget diagnostics |
| "Did the solute encoder learn chemistry or just fit the task?" | descriptor-recovery probe under results/medium_budget/per_model/.../descriptor_probe/ |