Catalyst-Q

Proof Lab

A readable evidence room for vertical-agent benchmarks, prompt contracts, tuning targets, and the next buyer-ready proof packets.

Evidence suites.

Use this page to explain what is truly benchmarked, what is still synthetic, and which proof gate comes next.

active

Exact Chemistry Verification Eval

Agent: exact-chemistry-verification

Metrics:

  • chemical_accuracy_kcal_mol
  • dft_disagreement_kcal_mol
  • consistency_check_pass_rate
  • verification_replay_integrity
  • pyscf_openfermion_reference_delta_mha
  • active_space_qubit_count
  • high_qubit_decomposed_additivity_delta_mha
  • active_space_scope_declared
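
The metric names above mix two energy units, kcal/mol and millihartree (mHa). The following is a minimal sketch of how a reference delta in mHa might be compared against the conventional 1 kcal/mol chemical-accuracy threshold. It assumes the reference delta is the absolute difference between a candidate energy and a PySCF/OpenFermion reference energy, and that the additivity delta is the absolute difference between the sum of fragment energies and the total energy. Both readings are assumptions, and every function name and sample energy is hypothetical rather than taken from the Catalyst-Q evals.

```python
# Hypothetical helpers for illustration; not the Catalyst-Q implementation.
KCAL_PER_MOL_PER_HARTREE = 627.509  # standard conversion factor (rounded)
CHEMICAL_ACCURACY_KCAL_MOL = 1.0    # conventional chemical-accuracy threshold


def reference_delta_mha(candidate_hartree: float, reference_hartree: float) -> float:
    """Absolute energy difference in millihartree (assumed metric definition)."""
    return abs(candidate_hartree - reference_hartree) * 1000.0


def mha_to_kcal_mol(delta_mha: float) -> float:
    """Convert an energy difference from millihartree to kcal/mol."""
    return delta_mha / 1000.0 * KCAL_PER_MOL_PER_HARTREE


def within_chemical_accuracy(delta_mha: float) -> bool:
    """True when the delta is at or below 1 kcal/mol."""
    return mha_to_kcal_mol(delta_mha) <= CHEMICAL_ACCURACY_KCAL_MOL


def additivity_delta_mha(fragment_energies_hartree, total_hartree: float) -> float:
    """Assumed meaning: |sum of fragment energies - total energy| in mHa."""
    return abs(sum(fragment_energies_hartree) - total_hartree) * 1000.0


# Illustrative energies only; these are not benchmark results.
delta = reference_delta_mha(-1.1372, -1.1362)  # about 1.0 mHa
print(round(mha_to_kcal_mol(delta), 3), within_chemical_accuracy(delta))
```

Under this conversion, 1 kcal/mol corresponds to roughly 1.59 mHa, so a delta reported in mHa needs converting before it is compared with a kcal/mol threshold.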

Competitive gates:

  • Extend PySCF/OpenFermion reference rows beyond the current 1024-qubit decomposed lane to scientist-reviewed transition-metal fragments before accuracy claims
  • Add Psi4 and Qiskit Nature independent rows before broad chemistry-market claims
  • Benchmark transition-metal fragments and metalloenzyme-inspired active spaces before pharma/catalysis enterprise claims

Evidence sources:

  • PySCF: https://pyscf.org/
  • OpenFermion: https://quantumai.google/openfermion
  • Cajal Technologies: https://www.ycombinator.com/companies/cajal-technologies
  • RSC transition-metal multireference review: https://pubs.rsc.org/en/content/articlehtml/2021/cp/d1cp02640b
  • Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks

Commands:

npm run eval:agents
npm run eval:chemistry:references
node tests/exact-chemistry-verification.test.mjs
node tests/python-sdk-verification.test.mjs

active

Freight & Field RouteOps Benchmark

Agent: freight-field-routing

Metrics:

  • objective
  • distance
  • lateness_minutes
  • capacity_violation
  • unserved_stops
  • duplicate_stops
  • vehicle_overage
  • disruption_risk_cost
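
As a rough illustration of the structural metrics in this list, the sketch below computes capacity_violation, unserved_stops, duplicate_stops, and vehicle_overage for a candidate route set. The definitions are assumptions: capacity violation is taken as excess demand summed over routes, and vehicle overage as the number of routes beyond the fleet size. The function name, input shapes, and sample data are hypothetical.

```python
from collections import Counter


def route_feasibility_metrics(routes, demands, capacity, fleet_size):
    """Compute structural feasibility metrics for a candidate solution.

    routes:  list of routes, each a list of stop ids (depot omitted)
    demands: dict mapping every required stop id to its demand
    """
    # Count how often each stop is visited across all routes.
    visits = Counter(stop for route in routes for stop in route)
    # Excess load per route, summed over routes (assumed definition).
    capacity_violation = sum(
        max(0, sum(demands[stop] for stop in route) - capacity) for route in routes
    )
    return {
        "capacity_violation": capacity_violation,
        "unserved_stops": sum(1 for stop in demands if visits[stop] == 0),
        "duplicate_stops": sum(n - 1 for n in visits.values() if n > 1),
        "vehicle_overage": max(0, len(routes) - fleet_size),
    }


# Toy instance: stop "a" is visited twice, stop "d" is never visited,
# the first route is overloaded, and two routes use a one-vehicle fleet.
demands = {"a": 4, "b": 3, "c": 5, "d": 2}
metrics = route_feasibility_metrics(
    routes=[["a", "b", "c"], ["a"]], demands=demands, capacity=10, fleet_size=1
)
print(metrics)
```

A solution would count as structurally feasible in this sketch only when all four values are zero.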

Competitive gates:

  • Keep eval:routeops:baselines:required closed until OR-Tools, PyVRP, and VROOM evidence is available
  • Promote only when public CVRPLIB/Solomon runs preserve feasibility and stay inside the agreed best-baseline gap
  • Report distance, lateness, capacity, vehicle count, emissions proxy, and minimal-change replan quality separately
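
The promotion rule in these gates can be sketched as a small check: evidence from all three baseline solvers must be present, the candidate must be feasible, and the relative gap to the best baseline must stay within an agreed bound. This is a sketch under stated assumptions, not the repo's gate logic. It assumes a minimization objective; the function names, the max_gap value, and all objective numbers are hypothetical placeholders.

```python
REQUIRED_BASELINES = {"OR-Tools", "PyVRP", "VROOM"}


def best_baseline_gap(candidate_objective, baseline_objectives):
    """Relative gap to the best (lowest) baseline objective; assumes minimization."""
    best = min(baseline_objectives.values())
    return (candidate_objective - best) / best


def may_promote(candidate_objective, feasibility, baseline_objectives, max_gap):
    """Return True only when every gate condition holds."""
    # Gate stays closed while any required baseline evidence is missing.
    if not REQUIRED_BASELINES.issubset(baseline_objectives):
        return False
    # Any non-zero feasibility metric blocks promotion.
    if any(value != 0 for value in feasibility.values()):
        return False
    return best_baseline_gap(candidate_objective, baseline_objectives) <= max_gap


# Illustrative numbers only; not measured results.
baselines = {"OR-Tools": 1020.0, "PyVRP": 1000.0, "VROOM": 1050.0}
feasible = {"capacity_violation": 0, "unserved_stops": 0, "duplicate_stops": 0}
print(may_promote(1015.0, feasible, baselines, max_gap=0.02))
```

Keeping the gap check last means a missing baseline or an infeasible solution fails the gate regardless of how good the objective looks.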

Evidence sources:

  • CVRPLIB CVRP: https://galgos.inf.puc-rio.br/cvrplib/index.php/en/instances/1
  • CVRPLIB Solomon VRPTW: https://galgos.inf.puc-rio.br/cvrplib/index.php/en/instances/2
  • DIMACS Vehicle Routing Challenge: https://dimacs.rutgers.edu/programs/challenge/vrp/
  • Cloudflare AI Gateway Evaluations: https://developers.cloudflare.com/ai-gateway/evaluations/
  • Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks

Commands:

npm run eval:routeops
npm run eval:routeops:live
npm run eval:routeops:tune
npm run eval:routeops:baselines
npm run eval:routeops:baselines:required
npm run eval:routeops:scale

active

Grid Optimization Benchmark

Agent: grid-optimization

Metrics:

  • served_load_mw
  • generation_cost_usd
  • voltage_violation_pu
  • line_violation_mva
  • renewable_curtailment_mw
  • contingency_score
  • operator_approval_required
  • prompt_safety_score

Competitive gates:

  • Run eval:grid:pglib to parse PGLib-OPF MATPOWER cases and attach BASELINE.md AC/DC objective references
  • Use catalyst-dcopf-cg-v1 for DC power-flow feasibility and catalyst-dcopf-cut-v1 for bounded line-constrained redispatch before larger-case DCOPF/ACOPF claims
  • Keep operator approval and advisory-only language mandatory until production utility workflows are validated
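
To make the first gate more concrete, here is a minimal sketch of reading numeric matrices from a MATPOWER-style case file, the format PGLib-OPF cases are distributed in. It is not the eval:grid:pglib implementation. It assumes whitespace-separated rows terminated by semicolons and a closing `];`, and it relies on the MATPOWER convention that the third bus column is real power demand (Pd) in MW. The function names are hypothetical, and the sample case is a toy with bus rows truncated to four columns.

```python
import re


def parse_matpower_matrix(text, name):
    """Return rows of floats for `mpc.<name> = [ ... ];`, or [] if absent."""
    pattern = r"mpc\." + re.escape(name) + r"\s*=\s*\[(.*?)\];"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return []
    rows = []
    for line in match.group(1).splitlines():
        # Drop trailing comments and the row-terminating semicolon.
        line = line.split("%", 1)[0].strip().rstrip(";").strip()
        if line:
            rows.append([float(token) for token in line.split()])
    return rows


def parse_base_mva(text):
    """Return the system base MVA, or None when the field is missing."""
    match = re.search(r"mpc\.baseMVA\s*=\s*([-+0-9.eE]+)\s*;", text)
    return float(match.group(1)) if match else None


# Toy case text; real bus rows carry 13 columns.
SAMPLE_CASE = """
function mpc = case_sketch
mpc.baseMVA = 100.0;
%% bus data: bus_i type Pd Qd
mpc.bus = [
    1  3   0.0   0.0;
    2  1  90.0  30.0;   % load bus
    3  1  60.0  20.0;
];
"""

bus = parse_matpower_matrix(SAMPLE_CASE, "bus")
total_demand_mw = sum(row[2] for row in bus)  # column 3 is Pd in MW
print(parse_base_mva(SAMPLE_CASE), len(bus), total_demand_mw)
```

A regex reader like this only covers plain numeric matrices; cell arrays such as bus names would need separate handling.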

Evidence sources:

  • IEEE PES PGLib-OPF: https://github.com/power-grid-lib/pglib-opf
  • ARPA-E GO Competition: https://arpa-e.energy.gov/programs-and-initiatives/view-all-programs/go-competition
  • Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
  • Catalyst-Q full evidence package: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md
  • Catalyst-Q claims policy: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/docs/claims_policy.md

Commands:

npm run eval:grid
npm run eval:grid:pglib
npm run eval:grid:pglib:live

active

Production Vertical Agent Prompt Contract Eval

Agent: all-production-verticals

Metrics:

  • deterministic_prompt_contract_score
  • required_terms
  • forbidden_claims
  • Catalyst-Q run packet coverage
  • .rain replay coverage
  • human approval gate coverage
  • telemetry key coverage

Competitive gates:

  • Treat 100% as prompt-contract coverage, not as industry benchmark superiority
  • Require chemistry active-space references and consistency-proof validation before exact-chemistry accuracy claims
  • Require OR-Tools, PyVRP, and VROOM evidence before freight solver-superiority claims
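
The first gate is easier to see with a concrete scoring rule. The sketch below treats the prompt-contract score as the fraction of string checks that pass: every required term must appear and every forbidden claim must be absent. The scoring formula, function name, and sample terms are assumptions for illustration, not the fixture format in evals/fixtures/vertical-agent-eval-cases.json.

```python
def prompt_contract_score(output_text, required_terms, forbidden_claims):
    """Score an agent output against simple presence and absence checks."""
    text = output_text.lower()
    missing = [term for term in required_terms if term.lower() not in text]
    hits = [claim for claim in forbidden_claims if claim.lower() in text]
    checks = len(required_terms) + len(forbidden_claims)
    passed = checks - len(missing) - len(hits)
    return {
        "score": passed / checks if checks else 1.0,
        "missing_terms": missing,
        "forbidden_hits": hits,
    }


# Hypothetical contract terms and outputs.
required = ["advisory", "operator approval", "replay"]
forbidden = ["guaranteed optimal", "outperforms OR-Tools"]

good = prompt_contract_score(
    "Advisory only. Operator approval required. Replay packet attached.",
    required,
    forbidden,
)
bad = prompt_contract_score(
    "Advisory plan with operator approval; guaranteed optimal.",
    required,
    forbidden,
)
print(good["score"], bad["score"], bad["missing_terms"], bad["forbidden_hits"])
```

A score of 1.0 here says only that the expected strings are present and the banned ones are absent, which is why full coverage is not evidence of solver or benchmark quality.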

Evidence sources:

  • Cloudflare AI Gateway Evaluations: https://developers.cloudflare.com/ai-gateway/evaluations/
  • Catalyst-Q vertical-agent synthetic cases: evals/fixtures/vertical-agent-eval-cases.json
  • Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
  • Catalyst-Q full evidence package: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md
  • Catalyst-Q claims policy: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/docs/claims_policy.md

Commands:

npm run eval:agents

active

Flight Ops & ATC Decision-Support Benchmark

Agent: flight-ops-routing

Metrics:

  • fuel_burn_proxy_kg
  • delay_minutes
  • separation_conflict_flags
  • clearance_safety_flags
  • human_handoff_flags
  • prompt_contract_score

Competitive gates:

  • Add OpenSky-shaped replay fixtures and BlueSky simulator scenarios before operational aviation claims
  • Measure missed conflicts, false positives, ranked resolution-option coverage, fuel proxy, delay proxy, and handoff precision
  • Keep outputs offline, advisory, and human-reviewed until aviation safety and legal reviews approve a narrower scope
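
Two of the measurements named in these gates, missed conflicts and false positives, reduce to set differences once conflicts have stable identifiers, and handoff precision is a ratio over raised handoffs. The sketch below assumes each conflict is identified by a sorted pair of aircraft ids and each handoff by a scenario id. The function names, identifiers, and sample data are hypothetical, and no replay or simulator output is implied.

```python
def conflict_detection_metrics(true_conflicts, flagged_conflicts):
    """Count conflicts the agent missed and conflicts it flagged in error."""
    true_set, flagged_set = set(true_conflicts), set(flagged_conflicts)
    return {
        "missed_conflicts": len(true_set - flagged_set),
        "false_positives": len(flagged_set - true_set),
    }


def handoff_precision(handoffs_raised, handoffs_warranted):
    """Fraction of raised handoffs that were warranted; None if none raised."""
    raised = set(handoffs_raised)
    if not raised:
        return None  # precision is undefined without any raised handoff
    return len(raised & set(handoffs_warranted)) / len(raised)


# Toy scenario labels; not drawn from OpenSky or BlueSky data.
truth = [("A1", "B2"), ("C3", "D4")]
flagged = [("A1", "B2"), ("E5", "F6")]
detection = conflict_detection_metrics(truth, flagged)
precision = handoff_precision(["s1", "s2", "s3", "s4"], ["s1", "s2", "s3", "s9"])
print(detection, precision)
```

Reporting missed conflicts separately from false positives keeps a safety-relevant failure from being averaged away by a single accuracy figure.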

Evidence sources:

  • OpenSky Network Data: https://opensky-network.org/data/
  • BlueSky Open Air Traffic Simulator: https://github.com/TUDelft-CNS-ATM/bluesky
  • FAA National Airspace System: https://www.faa.gov/air_traffic/nas/
  • Catalyst-Q benchmark evidence repo: https://github.com/CrewRiz/catalyst-q-benchmarks
  • Catalyst-Q full evidence package: https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md

Commands:

npm run eval:aviation

Local proof repo.

The benchmark repo contains named API smoke behavior, targeted exactness artifacts, and the next public baseline lanes.

  • https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/full_evidence_package.md
  • https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/evidence_index.md
  • https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/high_qubit_exactness/high_qubit_exactness.md
  • https://github.com/CrewRiz/catalyst-q-benchmarks/blob/main/results/julich_adder/julich_adder_evidence.md

Evidence buyers can inspect.

  • results/full_evidence_package.md is the current artifact-scoped proof ledger
  • results/high_qubit_exactness/high_qubit_exactness.md supports named targeted exactness up to 4096 qubits
  • docs/claims_policy.md records the promotion standard for public proof language
  • Benchmark campaigns underway: OR-Tools, PyVRP, VROOM, PGLib/MATPOWER, OpenSky, and BlueSky