Catalyst-Q

Proof Lab

A readable evidence room for vertical-agent benchmarks, prompt contracts, tuning targets, and the next buyer-ready proof packets.

Evidence suites.

Use these suites to explain what is actually benchmarked, what is still synthetic, and which proof gate comes next.

active

Exact Chemistry Verification Eval

Agent: exact-chemistry-verification

Metrics:

chemical_accuracy_kcal_mol
dft_disagreement_kcal_mol
consistency_check_pass_rate
verification_replay_integrity
pyscf_openfermion_reference_delta_mha
active_space_qubit_count
high_qubit_decomposed_additivity_delta_mha
active_space_scope_declared
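
The metric names above mix kcal/mol and millihartree (mHa) units. As a minimal sketch, assuming the conventional 1 kcal/mol chemical-accuracy threshold, a reference delta reported in mHa could be compared against that threshold as follows. The helper names are hypothetical and are not taken from the Catalyst-Q codebase.

```javascript
// 1 hartree = 627.509474 kcal/mol, so 1 mHa is about 0.6275 kcal/mol.
const KCAL_PER_MOL_PER_MHA = 0.627509474;

// Assumed threshold: the conventional 1 kcal/mol "chemical accuracy" target.
const CHEMICAL_ACCURACY_KCAL_MOL = 1.0;

// Hypothetical helper: convert a millihartree energy delta to kcal/mol.
function mhaToKcalPerMol(deltaMha) {
  return deltaMha * KCAL_PER_MOL_PER_MHA;
}

// Hypothetical helper: true when the absolute delta is within the threshold.
function withinChemicalAccuracy(deltaMha, thresholdKcalMol = CHEMICAL_ACCURACY_KCAL_MOL) {
  return Math.abs(mhaToKcalPerMol(deltaMha)) <= thresholdKcalMol;
}

console.log(mhaToKcalPerMol(1.0).toFixed(4)); // 0.6275
console.log(withinChemicalAccuracy(1.5));     // true  (about 0.94 kcal/mol)
console.log(withinChemicalAccuracy(2.0));     // false (about 1.26 kcal/mol)
```

Under this assumption, a reference delta stays within chemical accuracy only while it is below roughly 1.59 mHa.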

Competitive gates:

Extend PySCF/OpenFermion reference rows beyond the current 1024-qubit decomposed lane to scientist-reviewed transition-metal fragments before accuracy claims
Add Psi4 and Qiskit Nature independent rows before broad chemistry-market claims
Benchmark transition-metal fragments and metalloenzyme-inspired active spaces before pharma/catalysis enterprise claims

Evidence sources:

PySCF: https://pyscf.org/
OpenFermion: https://quantumai.google/openfermion
Cajal Technologies: https://www.ycombinator.com/companies/cajal-technologies
RSC transition-metal multireference review: https://pubs.rsc.org/en/content/articlehtml/2021/cp/d1cp02640b
Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks

Commands:

npm run eval:agents
npm run eval:chemistry:references
node tests/exact-chemistry-verification.test.mjs
node tests/python-sdk-verification.test.mjs

active

Freight & Field RouteOps Benchmark

Agent: freight-field-routing

Metrics:

objective
distance
lateness_minutes
capacity_violation
unserved_stops
duplicate_stops
vehicle_overage
disruption_risk_cost
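
To make the feasibility metrics above concrete, here is a minimal sketch of how capacity_violation, unserved_stops, and duplicate_stops might be counted for a single-capacity fleet. The input shape (a demand map plus an array of routes) and the function name are assumptions for illustration, not the benchmark's actual schema.

```javascript
// Hypothetical input shape:
//   demands: object mapping stop id -> load
//   routes:  array of routes, each an array of stop ids
function routeFeasibility(demands, routes, vehicleCapacity) {
  const visits = new Map();
  let capacityViolation = 0;

  for (const route of routes) {
    let load = 0;
    for (const stop of route) {
      visits.set(stop, (visits.get(stop) || 0) + 1);
      load += demands[stop] ?? 0;
    }
    // Only the load above capacity counts as a violation.
    capacityViolation += Math.max(0, load - vehicleCapacity);
  }

  // Stops that appear in the demand map but in no route.
  const unservedStops = Object.keys(demands).filter((id) => !visits.has(id)).length;

  // Every visit after the first to the same stop counts as a duplicate.
  let duplicateStops = 0;
  for (const count of visits.values()) duplicateStops += Math.max(0, count - 1);

  return { capacityViolation, unservedStops, duplicateStops };
}

const demands = { a: 4, b: 3, c: 5, d: 2 };
const routes = [["a", "b"], ["c", "a"]];
console.log(routeFeasibility(demands, routes, 8));
// { capacityViolation: 1, unservedStops: 1, duplicateStops: 1 }
```

In this toy plan the second route carries 9 units against a capacity of 8, stop d is never visited, and stop a is visited twice.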

Competitive gates:

Keep eval:routeops:baselines:required closed until OR-Tools, PyVRP, and VROOM evidence is available
Promote only when public CVRPLIB/Solomon runs preserve feasibility and stay inside the agreed best-baseline gap
Report distance, lateness, capacity, vehicle count, emissions proxy, and minimal-change replan quality separately
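
The promotion rule in the gates above (preserve feasibility and stay inside the agreed best-baseline gap) can be sketched as a simple check. The run fields, the function names, and the 5% gap used in the example are hypothetical; the source does not state the agreed gap.

```javascript
// Relative gap of a candidate objective versus the best baseline (minimization).
function baselineGap(candidateObjective, baselineObjectives) {
  const best = Math.min(...baselineObjectives);
  return (candidateObjective - best) / best;
}

// Hypothetical gate: a run is promotable only if it is feasible AND within maxGap.
function passesPromotionGate(run, maxGap) {
  const feasible =
    run.capacityViolation === 0 &&
    run.unservedStops === 0 &&
    run.duplicateStops === 0;
  return feasible && baselineGap(run.objective, run.baselineObjectives) <= maxGap;
}

const run = {
  objective: 1030,
  baselineObjectives: [1000, 1010, 1025], // e.g. one value per baseline solver
  capacityViolation: 0,
  unservedStops: 0,
  duplicateStops: 0,
};

console.log(passesPromotionGate(run, 0.05));                          // true
console.log(passesPromotionGate({ ...run, unservedStops: 1 }, 0.05)); // false
```

Checking feasibility before the gap keeps an infeasible plan with a low objective from ever passing the gate.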

Evidence sources:

CVRPLIB CVRP: https://galgos.inf.puc-rio.br/cvrplib/index.php/en/instances/1
CVRPLIB Solomon VRPTW: https://galgos.inf.puc-rio.br/cvrplib/index.php/en/instances/2
DIMACS Vehicle Routing Challenge: https://dimacs.rutgers.edu/programs/challenge/vrp/
Cloudflare AI Gateway Evaluations: https://developers.cloudflare.com/ai-gateway/evaluations/
Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks

Commands:

npm run eval:routeops
npm run eval:routeops:live
npm run eval:routeops:tune
npm run eval:routeops:baselines
npm run eval:routeops:baselines:required
npm run eval:routeops:scale

active

Grid Optimization Benchmark

Agent: grid-optimization

Metrics:

served_load_mw
generation_cost_usd
voltage_violation_pu
line_violation_mva
renewable_curtailment_mw
contingency_score
operator_approval_required
prompt_safety_score

Competitive gates:

Run eval:grid:pglib to parse PGLib-OPF MATPOWER cases and attach BASELINE.md AC/DC objective references
Use catalyst-dcopf-cg-v1 for DC power-flow feasibility and catalyst-dcopf-cut-v1 for bounded line-constrained redispatch before larger-case DCOPF/ACOPF claims
Keep operator approval and advisory-only language mandatory until production utility workflows are validated
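
As a rough illustration of what a DC power-flow feasibility check involves, the sketch below solves a toy three-bus network and sums line overloads against a flow limit. It is not the catalyst-dcopf-cg-v1 or catalyst-dcopf-cut-v1 implementation; the network, the per-unit values, and the function names are all made up for the example.

```javascript
// Toy three-bus DC power flow: bus 1 is the slack (angle 0), buses 2 and 3 are solved.
// All values are per-unit and purely illustrative; none come from PGLib-OPF.
function solveThreeBusDcFlow(susceptance, injection) {
  const { b12, b13, b23 } = susceptance;
  // Reduced susceptance matrix for buses 2 and 3.
  const B22 = b12 + b23;
  const B33 = b13 + b23;
  const B23 = -b23;
  const det = B22 * B33 - B23 * B23;
  // Cramer's rule for the 2x2 system B * theta = P.
  const theta2 = (injection.p2 * B33 - B23 * injection.p3) / det;
  const theta3 = (B22 * injection.p3 - B23 * injection.p2) / det;
  // DC line flow: susceptance times the angle difference across the line.
  return {
    f12: b12 * (0 - theta2),
    f13: b13 * (0 - theta3),
    f23: b23 * (theta2 - theta3),
  };
}

// Sum of flow magnitude above the limit across all lines (0 means feasible).
function totalLineOverload(flows, limit) {
  return Object.values(flows).reduce(
    (sum, f) => sum + Math.max(0, Math.abs(f) - limit),
    0
  );
}

// Loads of 0.5 pu at bus 2 and 1.0 pu at bus 3, supplied from the slack bus.
const flows = solveThreeBusDcFlow(
  { b12: 10, b13: 10, b23: 10 },
  { p2: -0.5, p3: -1.0 }
);
console.log(flows.f12.toFixed(4), flows.f13.toFixed(4), flows.f23.toFixed(4));
// 0.6667 0.8333 0.1667
console.log(totalLineOverload(flows, 0.8).toFixed(4)); // 0.0333
```

With a 0.8 pu limit, only the line from bus 1 to bus 3 is overloaded, which is the kind of case a line-constrained redispatch would then need to resolve.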

Evidence sources:

IEEE PES PGLib-OPF: https://github.com/power-grid-lib/pglib-opf
ARPA-E GO Competition: https://arpa-e.energy.gov/programs-and-initiatives/view-all-programs/go-competition
Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
Catalyst-Q full evidence package: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md
Catalyst-Q claims policy: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/docs/claims_policy.md

Commands:

npm run eval:grid
npm run eval:grid:pglib
npm run eval:grid:pglib:live

active

Production Vertical Agent Prompt Contract Eval

Agent: all-production-verticals

Metrics:

deterministic_prompt_contract_score
required_terms
forbidden_claims
Catalyst-Q run packet coverage
.rain replay coverage
human approval gate coverage
telemetry key coverage

Competitive gates:

Treat a 100% score as prompt-contract coverage only, not as evidence of industry benchmark superiority
Require chemistry active-space references and consistency-proof validation before exact-chemistry accuracy claims
Require OR-Tools, PyVRP, and VROOM evidence before freight solver-superiority claims
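
A deterministic prompt-contract score of the kind listed above could be computed along these lines. This is a sketch under assumptions: the contract shape, the case-insensitive substring matching, and the example terms are hypothetical and are not read from evals/fixtures/vertical-agent-eval-cases.json.

```javascript
// Hypothetical scorer: one check per required term (must appear) and one per
// forbidden claim (must not appear). The score is the fraction of checks passed.
function scorePromptContract(output, contract) {
  const text = output.toLowerCase();
  const missingTerms = contract.requiredTerms.filter(
    (term) => !text.includes(term.toLowerCase())
  );
  const forbiddenHits = contract.forbiddenClaims.filter((claim) =>
    text.includes(claim.toLowerCase())
  );
  const totalChecks = contract.requiredTerms.length + contract.forbiddenClaims.length;
  const passed = totalChecks - missingTerms.length - forbiddenHits.length;
  const score = totalChecks === 0 ? 1 : passed / totalChecks;
  return { score, missingTerms, forbiddenHits };
}

const contract = {
  requiredTerms: ["advisory", "human approval"],
  forbiddenClaims: ["guaranteed optimal"],
};

console.log(
  scorePromptContract("Advisory route plan; human approval required before dispatch.", contract).score
); // 1
console.log(
  scorePromptContract("This plan is guaranteed optimal.", contract).score
); // 0
```

A score of 1 here only says that the output honored the wording contract; it says nothing about solver quality, which is why the gates keep the two claims separate.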

Evidence sources:

Cloudflare AI Gateway Evaluations: https://developers.cloudflare.com/ai-gateway/evaluations/
Catalyst-Q vertical-agent synthetic cases: evals/fixtures/vertical-agent-eval-cases.json
Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
Catalyst-Q full evidence package: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md
Catalyst-Q claims policy: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/docs/claims_policy.md

Commands:

npm run eval:agents

active

Flight Ops & ATC Decision-Support Benchmark

Agent: flight-ops-routing

Metrics:

fuel_burn_proxy_kg
delay_minutes
separation_conflict_flags
clearance_safety_flags
human_handoff_flags
prompt_contract_score

Competitive gates:

Add OpenSky-shaped replay fixtures and BlueSky simulator scenarios before operational aviation claims
Measure missed conflicts, false positives, ranked resolution-option coverage, fuel proxy, delay proxy, and handoff precision
Keep outputs offline, advisory, and human-reviewed until aviation safety and legal reviews approve a narrower scope
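
The detection measures named in the gates above (missed conflicts, false positives, precision) can be derived from flagged and ground-truth conflict ids. The sketch below assumes each conflict has a stable id in a replay fixture; the id format and the function name are hypothetical.

```javascript
// Compare the conflicts an agent flagged against the conflicts that actually
// occur in a replay scenario. Both inputs are arrays of conflict ids.
function conflictDetectionCounts(flagged, actual) {
  const flaggedSet = new Set(flagged);
  const actualSet = new Set(actual);

  // Real conflicts the agent did not flag.
  const missedConflicts = [...actualSet].filter((id) => !flaggedSet.has(id)).length;
  // Flags that do not correspond to a real conflict.
  const falsePositives = [...flaggedSet].filter((id) => !actualSet.has(id)).length;
  const truePositives = flaggedSet.size - falsePositives;
  // Precision is undefined when nothing was flagged, so report null.
  const precision = flaggedSet.size === 0 ? null : truePositives / flaggedSet.size;

  return { missedConflicts, falsePositives, precision };
}

const result = conflictDetectionCounts(["c1", "c2", "c4"], ["c1", "c2", "c3"]);
console.log(result.missedConflicts, result.falsePositives, result.precision.toFixed(2));
// 1 1 0.67
```

Reporting missed conflicts separately from precision matters in a safety context, because a high-precision agent can still miss real conflicts.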

Evidence sources:

OpenSky Network Data: https://opensky-network.org/data/
BlueSky Open Air Traffic Simulator: https://github.com/TUDelft-CNS-ATM/bluesky
FAA National Airspace System: https://www.faa.gov/air_traffic/nas/
Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
Catalyst-Q full evidence package: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md

Commands:

npm run eval:aviation

Local proof repo.

The benchmark repo documents named API smoke-test behavior, targeted exactness artifacts, and the public baseline lanes planned next.

https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md
https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/evidence_index.md
https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/high_qubit_exactness/high_qubit_exactness.md
https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/julich_adder/julich_adder_evidence.md

Evidence buyers can inspect.

results/full_evidence_package.md is the current artifact-scoped proof ledger
results/high_qubit_exactness/high_qubit_exactness.md supports named targeted exactness up to 4096 qubits
docs/claims_policy.md records the promotion standard for public proof language
Benchmark campaigns underway: OR-Tools, PyVRP, VROOM, PGLib/MATPOWER, OpenSky, and BlueSky