Evidence suites.
Use these suites to explain what is actually benchmarked, what is still synthetic, and which proof gate comes next.
active
Exact Chemistry Verification Eval
Agent: exact-chemistry-verification
Metrics:
- chemical_accuracy_kcal_mol
- dft_disagreement_kcal_mol
- consistency_check_pass_rate
- verification_replay_integrity
- pyscf_openfermion_reference_delta_mha
- active_space_qubit_count
- high_qubit_decomposed_additivity_delta_mha
- active_space_scope_declared
Competitive gates:
- Extend PySCF/OpenFermion reference rows beyond the current 1024-qubit decomposed lane to scientist-reviewed transition-metal fragments before accuracy claims
- Add Psi4 and Qiskit Nature independent rows before broad chemistry-market claims
- Benchmark transition-metal fragments and metalloenzyme-inspired active spaces before pharma/catalysis enterprise claims
Evidence sources:
- PySCF: https://pyscf.org/
- OpenFermion: https://quantumai.google/openfermion
- Cajal Technologies: https://www.ycombinator.com/companies/cajal-technologies
- RSC transition-metal multireference review: https://pubs.rsc.org/en/content/articlehtml/2021/cp/d1cp02640b
- Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
Commands:
npm run eval:agents
npm run eval:chemistry:references
node tests/exact-chemistry-verification.test.mjs
node tests/python-sdk-verification.test.mjs
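The chemistry metrics above mix two energy units: chemical_accuracy_kcal_mol is in kcal/mol, while the reference deltas are in millihartree (mHa). The sketch below shows the conversion and a pass/fail check. It assumes the conventional chemical-accuracy threshold of 1 kcal/mol; the threshold this suite actually applies is not stated here, and the function names are hypothetical.

```python
# Standard conversion factor: 1 hartree is about 627.5095 kcal/mol.
KCAL_PER_MOL_PER_HARTREE = 627.5095

# Assumption: the conventional "chemical accuracy" threshold of 1 kcal/mol.
# The eval suite may use a different value.
CHEMICAL_ACCURACY_KCAL_MOL = 1.0


def mha_to_kcal_mol(delta_mha: float) -> float:
    """Convert an energy difference from millihartree to kcal/mol."""
    return delta_mha * 1e-3 * KCAL_PER_MOL_PER_HARTREE


def within_chemical_accuracy(
    delta_mha: float, threshold_kcal_mol: float = CHEMICAL_ACCURACY_KCAL_MOL
) -> bool:
    """Return True when |delta| is at or below the threshold (hypothetical helper)."""
    return abs(mha_to_kcal_mol(delta_mha)) <= threshold_kcal_mol
```

Under this assumption, a reference delta of about 1.59 mHa sits right at the threshold, so a delta reported in mHa should not be compared directly against a kcal/mol figure.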
active
Freight & Field RouteOps Benchmark
Agent: freight-field-routing
Metrics:
- objective
- distance
- lateness_minutes
- capacity_violation
- unserved_stops
- duplicate_stops
- vehicle_overage
- disruption_risk_cost
Competitive gates:
- Keep eval:routeops:baselines:required closed until OR-Tools, PyVRP, and VROOM evidence is available
- Promote only when public CVRPLIB/Solomon runs preserve feasibility and stay inside the agreed best-baseline gap
- Report distance, lateness, capacity, vehicle count, emissions proxy, and minimal-change replan quality separately
Evidence sources:
- CVRPLIB CVRP: https://galgos.inf.puc-rio.br/cvrplib/index.php/en/instances/1
- CVRPLIB Solomon VRPTW: https://galgos.inf.puc-rio.br/cvrplib/index.php/en/instances/2
- DIMACS Vehicle Routing Challenge: https://dimacs.rutgers.edu/programs/challenge/vrp/
- Cloudflare AI Gateway Evaluations: https://developers.cloudflare.com/ai-gateway/evaluations/
- Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
Commands:
npm run eval:routeops
npm run eval:routeops:live
npm run eval:routeops:tune
npm run eval:routeops:baselines
npm run eval:routeops:baselines:required
npm run eval:routeops:scale
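The RouteOps promotion gate above combines two conditions: the run must stay feasible, and its objective must stay inside an agreed gap to the best baseline. A minimal sketch of that check follows. It assumes a minimization objective, treats any nonzero violation metric as infeasible, and takes the gap threshold as a parameter because the agreed value is not given here. The class and function names are hypothetical.

```python
from dataclasses import dataclass

# Baselines named in the gate; evidence from all three is required.
REQUIRED_BASELINES = {"OR-Tools", "PyVRP", "VROOM"}


@dataclass
class RouteResult:
    """Subset of the suite's metrics needed for the gate (hypothetical shape)."""
    objective: float
    capacity_violation: float = 0.0
    unserved_stops: int = 0
    duplicate_stops: int = 0
    vehicle_overage: int = 0


def is_feasible(result: RouteResult) -> bool:
    """Assumption: feasibility means every violation metric is exactly zero."""
    return (
        result.capacity_violation == 0
        and result.unserved_stops == 0
        and result.duplicate_stops == 0
        and result.vehicle_overage == 0
    )


def best_baseline_gap(objective: float, baselines: dict) -> float:
    """Relative gap to the best (lowest) baseline objective."""
    best = min(baselines.values())
    return (objective - best) / best


def may_promote(result: RouteResult, baselines: dict, max_gap: float) -> bool:
    """Promote only with full baseline evidence, feasibility, and a small gap."""
    if not REQUIRED_BASELINES <= set(baselines):
        return False  # gate stays closed until all baseline evidence exists
    return is_feasible(result) and best_baseline_gap(result.objective, baselines) <= max_gap
```

With made-up baseline objectives of 1000, 980, and 1010, a feasible run at 999.6 has a gap of about 2% to the best baseline, so it passes a 5% threshold and fails a 1% threshold.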
active
Grid Optimization Benchmark
Agent: grid-optimization
Metrics:
- served_load_mw
- generation_cost_usd
- voltage_violation_pu
- line_violation_mva
- renewable_curtailment_mw
- contingency_score
- operator_approval_required
- prompt_safety_score
Competitive gates:
- Run eval:grid:pglib to parse PGLib-OPF MATPOWER cases and attach BASELINE.md AC/DC objective references
- Use catalyst-dcopf-cg-v1 for DC power-flow feasibility and catalyst-dcopf-cut-v1 for bounded line-constrained redispatch before larger-case DCOPF/ACOPF claims
- Keep operator approval and advisory-only language mandatory until production utility workflows are validated
Evidence sources:
- IEEE PES PGLib-OPF: https://github.com/power-grid-lib/pglib-opf
- ARPA-E GO Competition: https://arpa-e.energy.gov/programs-and-initiatives/view-all-programs/go-competition
- Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
- Catalyst-Q full evidence package: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md
- Catalyst-Q claims policy: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/docs/claims_policy.md
Commands:
npm run eval:grid
npm run eval:grid:pglib
npm run eval:grid:pglib:live
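The grid gates above refer to DC power-flow feasibility and line-constrained redispatch. For orientation, the sketch below shows a textbook DC power flow on a small network: build the susceptance matrix, fix the slack-bus angle at zero, solve for the remaining angles, and compare line flows against limits. It is not the catalyst-dcopf-cg-v1 or catalyst-dcopf-cut-v1 implementation; all names, the per-unit quantities, and the 3-bus example data are illustrative assumptions.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def dc_power_flow(n_bus, lines, injections, slack=0):
    """Return per-line flows (pu) for lines given as (from_bus, to_bus, reactance_pu)."""
    b = [[0.0] * n_bus for _ in range(n_bus)]
    for i, j, x in lines:
        y = 1.0 / x  # line susceptance
        b[i][i] += y
        b[j][j] += y
        b[i][j] -= y
        b[j][i] -= y
    keep = [k for k in range(n_bus) if k != slack]  # drop the slack row/column
    reduced = [[b[i][j] for j in keep] for i in keep]
    angles = solve_linear(reduced, [injections[k] for k in keep])
    theta = [0.0] * n_bus  # slack angle stays at zero
    for k, t in zip(keep, angles):
        theta[k] = t
    return [(theta[i] - theta[j]) / x for i, j, x in lines]


def line_violations(flows, limits):
    """Amount by which each |flow| exceeds its limit (0.0 when within limit)."""
    return [max(0.0, abs(f) - lim) for f, lim in zip(flows, limits)]


# Illustrative 3-bus case: bus 0 is the slack generator, buses 1 and 2 are loads.
example_lines = [(0, 1, 0.1), (0, 2, 0.1), (1, 2, 0.1)]
example_flows = dc_power_flow(3, example_lines, [1.5, -0.6, -0.9])
example_violations = line_violations(example_flows, [0.75, 0.75, 0.75])
```

In this example the flows are approximately 0.7, 0.8, and 0.1 pu, so only the line from bus 0 to bus 2 exceeds its 0.75 pu limit, by about 0.05 pu. A redispatch step would then have to shift generation or load until every entry of the violation list is zero.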
active
Production Vertical Agent Prompt Contract Eval
Agent: all-production-verticals
Metrics:
- deterministic_prompt_contract_score
- required_terms
- forbidden_claims
- Catalyst-Q run packet coverage
- .rain replay coverage
- human approval gate coverage
- telemetry key coverage
Competitive gates:
- Treat 100% as prompt-contract coverage, not as industry benchmark superiority
- Require chemistry active-space references and consistency-proof validation before exact-chemistry accuracy claims
- Require OR-Tools, PyVRP, and VROOM evidence before freight solver-superiority claims
Evidence sources:
- Cloudflare AI Gateway Evaluations: https://developers.cloudflare.com/ai-gateway/evaluations/
- Catalyst-Q vertical-agent synthetic cases: evals/fixtures/vertical-agent-eval-cases.json
- Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
- Catalyst-Q full evidence package: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md
- Catalyst-Q claims policy: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/docs/claims_policy.md
Commands:
npm run eval:agents
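The prompt-contract metrics above (required_terms, forbidden_claims, and a deterministic score) can be illustrated with a simple string check. The sketch below assumes the score is the fraction of checks passed, where each required term must appear and each forbidden claim must be absent. The suite's real scoring rule is not given here, so treat the function and the example phrases as hypothetical.

```python
def prompt_contract_score(prompt, required_terms, forbidden_claims):
    """Fraction of contract checks passed, using case-insensitive substring matching.

    Assumption: every required term and every forbidden claim counts as one
    equally weighted check. With no checks at all, the contract passes vacuously.
    """
    text = prompt.lower()
    checks = [term.lower() in text for term in required_terms]
    checks += [claim.lower() not in text for claim in forbidden_claims]
    if not checks:
        return 1.0
    return sum(checks) / len(checks)


# Hypothetical contract for an advisory-only agent.
REQUIRED = ["advisory only", "operator approval"]
FORBIDDEN = ["guaranteed optimal"]

compliant = "Advisory only. Operator approval is required before dispatch."
overclaiming = "Advisory only, with a guaranteed optimal dispatch plan."
```

Because the check is pure string matching, the same prompt always yields the same score. That is also why a score of 1.0 says only that the contract wording is covered, which matches the gate's warning against reading 100% as benchmark superiority.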
active
Flight Ops & ATC Decision-Support Benchmark
Agent: flight-ops-routing
Metrics:
- fuel_burn_proxy_kg
- delay_minutes
- separation_conflict_flags
- clearance_safety_flags
- human_handoff_flags
- prompt_contract_score
Competitive gates:
- Add OpenSky-shaped replay fixtures and BlueSky simulator scenarios before operational aviation claims
- Measure missed conflicts, false positives, ranked resolution-option coverage, fuel proxy, delay proxy, and handoff precision
- Keep outputs offline, advisory, and human-reviewed until aviation safety and legal reviews approve a narrower scope
Evidence sources:
- OpenSky Network Data: https://opensky-network.org/data/
- BlueSky Open Air Traffic Simulator: https://github.com/TUDelft-CNS-ATM/bluesky
- FAA National Airspace System: https://www.faa.gov/air_traffic/nas/
- Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
- Catalyst-Q full evidence package: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md
Commands:
npm run eval:aviation
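The aviation gates above ask for missed conflicts, false positives, and handoff precision. These are standard set-comparison metrics, sketched below for any kind of flag (separation conflicts or human handoffs). The identifiers used for conflicts are hypothetical; the replay fixtures may label events differently.

```python
def flag_metrics(expected: set, flagged: set) -> dict:
    """Compare flagged events against the expected (ground-truth) events.

    Returns counts of true positives, missed events, and false positives,
    plus precision and recall. A ratio is None when its denominator is zero.
    """
    true_positives = len(expected & flagged)
    missed = len(expected - flagged)
    false_positives = len(flagged - expected)
    precision = true_positives / len(flagged) if flagged else None
    recall = true_positives / len(expected) if expected else None
    return {
        "true_positives": true_positives,
        "missed": missed,
        "false_positives": false_positives,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical conflict identifiers: one string per aircraft pair.
expected_conflicts = {"AAL12/UAL34", "DAL56/SWA78", "JBU90/ASA11"}
flagged_conflicts = {"AAL12/UAL34", "DAL56/SWA78", "FDX22/UPS33"}
conflict_report = flag_metrics(expected_conflicts, flagged_conflicts)
```

Here two of three expected conflicts are flagged, one is missed, and one flag is a false positive, so precision and recall are both 2/3. Reporting the missed count separately matters in this setting because a missed conflict and a false alarm carry very different operational costs.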