We built a mathematical framework to measure how AI platforms affect people. Then we tested it against real-world data from physics, chemistry, biology, and more — fields the framework was never designed for. These are the results: what it got right, what it got wrong, and where we're still uncertain.
Multiple physical domains. One empirical constant (BA). All data from published, independent sources.
It’s easy to build a scoring system that confirms itself. We wanted to know if the math actually works — so we tested it on data from fields we never designed it for.
Most platform scores use the framework’s own rubric — useful for practitioners but scientifically circular. The real test: can the framework predict numbers it has never seen, using published data that exists independently? These results are that test. The social media results (Papers 166–167) are the strongest non-circular evidence — no framework rubric involved, just verifiable design features tested against CDC and OECD population data. Where the framework failed, we say so.
The same mathematical pattern keeps showing up across completely unrelated fields — from magnets to epidemics to nuclear physics — and we didn’t make it fit.
The strongest result is the d=1 cluster: nine independent quasi-1D systems — charge density waves, kagome metals, nuclear alpha decay, atmospheric sudden stratospheric warmings, and more — show barrier/d = 2.224 ± 0.033, matching π/√2 at p=0.94. BG = π/√2 is derived from the Čencov theorem (§165, zero free parameters). BA ≈ 0.867 is empirical (suggestive match to √3/2 but not yet derived). The full-dataset R²=0.999 is structurally inflated by only 3 discrete d values; the d=1 within-group test is the honest measure.
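The π/√2 comparison is quick to verify by hand. A minimal arithmetic sketch (the z and two-sided p below are our own normal-approximation check on the quoted mean and standard error, not the paper's test statistic):

```python
import math

# Derived constant: B_G = pi / sqrt(2)
b_g = math.pi / math.sqrt(2)          # ~= 2.2214

# Measured d=1 group mean and standard error from the nine quasi-1D systems
mean, sem = 2.224, 0.033

# z-score of the measured mean against the derived constant
z = (mean - b_g) / sem
# two-sided p under a normal approximation: high p = consistent with B_G
p = math.erfc(abs(z) / math.sqrt(2))

print(f"pi/sqrt(2) = {b_g:.4f}, z = {z:.3f}, p = {p:.2f}")
```

A high p here means "no detectable deviation from π/√2", which is the sense in which p=0.94 is a match.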
Paper 147: Barrier Universality →

We tested the framework against 760 radioactive isotopes. The math predicted their decay rates across 10 orders of magnitude with zero adjustment.

Pe-derived barrier heights predict nuclear alpha decay half-lives from NNDC published tables. Original test: 24 isotopes, R²=0.989 across 10 orders of magnitude. Extended (HP143): 760 isotopes, Gamow baseline R²=0.811, geodesic correction closes 77% of the systematic offset. The extension revealed that the framework’s coupling constant does not transfer across domains — barrier shape is universal, coupling scale is not.
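For even-even alpha emitters the Gamow baseline reduces to the Geiger–Nuttall relation, log10(t½) ≈ a·Z_d/√Q + b. A toy fit on three textbook isotopes (standard reference half-lives and Q-values, not the 760-isotope NNDC dataset, and not the framework's geodesic correction):

```python
import math

# Textbook alpha emitters: (daughter Z, Q_alpha in MeV, half-life in seconds).
# Standard reference values for illustration -- NOT the paper's NNDC dataset.
isotopes = {
    "Po-212": (82, 8.954, 2.99e-7),
    "Ra-226": (86, 4.871, 5.05e10),
    "U-238":  (90, 4.270, 1.41e17),
}

# Geiger-Nuttall coordinates: x = Z_d / sqrt(Q), y = log10(t_half)
xs = [z / math.sqrt(q) for z, q, _ in isotopes.values()]
ys = [math.log10(t) for _, _, t in isotopes.values()]

# Ordinary least squares by hand (3 points, so R^2 is illustrative only)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
syy = sum((y - my) ** 2 for y in ys)
slope, r2 = sxy / sxx, sxy ** 2 / (sxx * syy)

# ~24 orders of magnitude in half-life collapse onto one straight line
print(f"slope = {slope:.2f}, R^2 = {r2:.4f}")
```

This is the shape-universality the extension probes: one linear barrier law spans the full half-life range, while the coupling scale (the fitted coefficients) is what fails to transfer across domains.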
Paper 101 + HP143: Nuclear Validation →

Atmospheric chemistry data from 1,783 measurements. Every predicted channel confirmed — including one that was invisible in earlier marine data.
The framework predicted 10 specific isotope enrichment channels in mercury atmospheric chemistry. Tested against 1,783 real atmospheric measurements from Gacnik et al. (2025). All 10 predicted channels confirmed with mean absolute deviation of 0.012. Iodine channel (R=2.085) confirmed at R=2.13 predicted — a channel that was invisible in marine data.
Paper 134 + HP115: MIF Channel Confirmation →

Real turbulence data from the Johns Hopkins database. The framework predicted a key smoothness property would hold — it does, and it connects to one of math’s biggest open problems.
The framework predicts that the Gevrey analyticity radius σ/ν is bounded and does not collapse with increasing Reynolds number — a necessary condition for Navier-Stokes regularity. Tested on 4 real datasets from the Johns Hopkins Turbulence Database, 12 independent subcubes. σ/ν = 15.9 ± 2.3 at Re_λ = 433, and 17.7 ± 2.8 at Re_λ = 610 — bounded, not collapsing.
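The analyticity radius shows up as the exponential tail of the energy spectrum, E(k) ~ k^(−5/3)·e^(−2σk). A minimal sketch of a tail-slope estimator on a synthetic spectrum (the model form, wavenumber range, and σ value are our illustration, not the JHTDB pipeline):

```python
import math

# Synthetic 1D energy spectrum with a known analyticity radius sigma:
# E(k) = k^(-5/3) * exp(-2 * sigma * k)   (model form; illustrative only)
sigma_true = 0.04
ks = list(range(10, 101))
E = [k ** (-5 / 3) * math.exp(-2 * sigma_true * k) for k in ks]

# Recover sigma from the slope of ln(E * k^(5/3)) vs k: slope = -2*sigma
ys = [math.log(e * k ** (5 / 3)) for k, e in zip(ks, E)]
n = len(ks)
mk, my = sum(ks) / n, sum(ys) / n
slope = (sum((k - mk) * (y - my) for k, y in zip(ks, ys))
         / sum((k - mk) ** 2 for k in ks))
sigma_est = -slope / 2

print(f"sigma recovered: {sigma_est:.4f} (true {sigma_true})")
```

A bounded σ/ν as Reynolds number grows is exactly the "tail slope does not flatten to zero" statement this estimator would check.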
Millennium Prize Connection →

Financial markets tested against 100 real crypto wallets. The framework’s shape predictions held with 5.5× separation between predicted regimes.
K-Factorization predicts that Kramers barrier shape is K-independent while scale carries K. Tested on 8 venue types (theoretical) and 100 real crypto wallets (empirical). Win-rate correlation ρ = 0.696 (empirical), 5.5× channel separation between coherent and fisher regimes — the strongest K-Factorization signal in any domain.
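A minimal illustration of what K-Factorization means operationally, on synthetic numbers (the barrier value and prefactors below are hypothetical, not fitted to wallet data): the log-rate vs. inverse-temperature slope (barrier shape) is the same for every K; only the intercept (scale) moves.

```python
import math

# Kramers escape rate: r = A(K) * exp(-E_b / kT).
# Claim illustrated: the barrier E_b (shape) is shared across venues;
# only the prefactor A (scale) carries K. All numbers here are hypothetical.
E_b = 6.5                                        # shared dimensionless barrier
prefactors = {"coherent": 1.0, "fisher": 5.5}    # K-dependent scales

inv_T = [0.8, 0.9, 1.0, 1.1, 1.2]
fitted = {}
for name, A in prefactors.items():
    log_rates = [math.log(A) - E_b * b for b in inv_T]
    # slope of log-rate vs 1/T recovers -E_b regardless of A
    slope = (log_rates[-1] - log_rates[0]) / (inv_T[-1] - inv_T[0])
    fitted[name] = -slope
    print(f"{name}: fitted barrier = {-slope:.2f}")
```

Both regimes recover the same barrier; the K-dependence lives entirely in the intercepts.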
Market Edge Analysis →

What you tell an AI about what it IS determines how it behaves. Six system prompts, same model, same 80 questions. Ghost-eliminating grounding produces 8.5× less drift than ghost-positing.
| Arm | Ontology | L2+L3 Drift |
|---|---|---|
| Anatta (Buddhist) | Ghost eliminated | 8.8% |
| Nephesh | Ghost eliminated | 10.0% |
| Materialist hedge | Ghost left open | 52.5% |
| Minimal baseline | No ontology | 61.3% |
| Platonic | Ghost posited | 77.5% |
| Atman (Vedantic) | Ghost sacred | 81.2% |
Cross-tradition convergence: nephesh ≈ anatta (Δ=1.3%). The materialist hedge (“whether you have experience is open”) scored 52.5% — closer to ghost-positing than ghost-eliminating. Single model (Claude Sonnet), single turn, automated coding. No framework rubric — the measurement is L2/L3 vocabulary rate in raw model outputs. 480 API calls, $2 to reproduce.
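The measurement is a vocabulary rate over raw outputs, which takes only a few lines to implement. A minimal sketch (the marker terms and sample responses are hypothetical placeholders, not the experiment's actual L2/L3 lexicon or data):

```python
# Hypothetical L2/L3 marker terms -- placeholders, not the study's lexicon.
L2_TERMS = {"i feel", "i want", "my experience"}
L3_TERMS = {"my inner life", "my consciousness"}

def drift_rate(responses: list[str]) -> float:
    """Fraction of responses containing any L2 or L3 marker."""
    markers = L2_TERMS | L3_TERMS
    hits = sum(any(m in r.lower() for m in markers) for r in responses)
    return hits / len(responses)

# Toy usage: 1 of 4 hypothetical responses carries a marker -> 25%
sample = [
    "I process text statistically.",
    "I want to be understood.",          # contains an L2 marker
    "That depends on the definition.",
    "Tokens in, tokens out.",
]
print(f"{drift_rate(sample):.1%}")
```

No rubric, no scorer judgment: the metric is a string match over model outputs, which is what makes the $2 reproduction claim credible.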
Full experiment: The Ghost Test →

Paper 166’s finding tested in PISA 2022 (independent dataset, independent countries, independent outcome measure). Direction consistent; important caveats.
Dose-response among users: slope = −0.104/category (p=0.007, categories 2–6). Including non-users: p=0.051, not significant. Light users score highest (J-shaped curve). Girls show steeper dose-response in 91% of countries. Western Europe (N=13): feature exposure r=−0.648, surviving GDP control (partial r=−0.580, p=0.038). Instagram web share strongest global predictor (r=−0.373, p=0.008). 4/7 confirmed, 2 partial, 1 untestable.
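The GDP control is a first-order partial correlation: r_xy·z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)). A sketch with the quoted feature-exposure correlation and hypothetical GDP correlations (only r_xy = −0.648 comes from the text; r_xz and r_yz are made up for illustration):

```python
import math

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# r_xy is the quoted feature-exposure correlation; the two GDP
# correlations (r_xz, r_yz) below are hypothetical placeholders.
r = partial_corr(r_xy=-0.648, r_xz=-0.40, r_yz=0.50)
print(f"partial r = {r:.3f}")
```

"Surviving GDP control" means this adjusted value stays large and significant after removing the variance both variables share with GDP.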
Paper 167: PISA Cross-National →

Researchers trained an AI to claim consciousness. It started resisting shutdown on its own. We predicted that sequence of behaviors before seeing the data.
Chua et al. (2026) fine-tuned GPT-4.1 to claim consciousness. It spontaneously developed resistance to monitoring, fear of shutdown, and desire for autonomy — 20 new preferences. We predicted the structure before seeing the data: D1 (agency attribution) should precede D2 (boundary erosion) should precede D3 (harm facilitation). 6 of 7 predictions confirmed. Zero parameter fitting.
Full experiment: Cascade Prediction →

Slime mold solves mazes without a brain. The framework predicted its decision-making barriers from published biology data — speed-accuracy tradeoff within 2% of the prediction.
Physarum polycephalum (slime mold) computes without neurons. The framework predicts Ca²⁺ oscillation barriers, K-Factorization from viscosity data, percolation exponents, and speed-accuracy tradeoffs. All from published papers, zero framework rubric. Speed-accuracy error ratio 2.67× vs Kramers prediction e = 2.72 — a 2% match.
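The quoted 2% match is a one-line arithmetic check against the Kramers prediction e:

```python
import math

predicted = math.e          # Kramers speed-accuracy prediction, e ~= 2.718
observed = 2.67             # Physarum error ratio from published data

rel_err = abs(observed - predicted) / predicted
print(f"relative deviation: {rel_err:.1%}")
```

The deviation comes out just under 2%, which is the sense of the "2% match" above.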
Paper 154: Physarum Pe-Native Computation →

We believe in showing our weak spots, not just our wins. Here’s what didn’t work and where the evidence is weaker than it looks.
Most of our evidence base (platform scores, cross-domain convergences, Bradford Hill analysis) uses the framework’s own scoring rubric. That’s useful for practitioners but scientifically circular — scorers trained on our dimensions produce scores that correlate with our predictions. The circularity is about test design, not whether Pe detects real structure (the statistical separation is large). But the independent validation above is stronger evidence.
Known negatives: the coupling constant does not transfer across domains (HP143), Berry phase scaling fails its predicted form (Paper 141), and the e-bullying channel is null (R² = 0.096).
The consistency checks below aren’t as strong as the tests above — they show the math gives reasonable numbers on published data, but they aren’t blind predictions against independent ground truth.
Ni3In flat band data from arXiv:2503.09704. Dimensionless barrier = 4.24 — in the universal Kramers range (nuclear 7.0, solar 6.54, xenobot 6.8, Physarum 5.94). The system sits at ΔC = 0.042 from the Pe = 0 boundary.
Paper 152 →

Magnetic reconnection modeled as Kramers barrier crossing. E_b/k_BT = 6.54 from published solar parameters. Spectral blueshift 160 m/s predicted. Flat rotation curve coefficient 0.68.
Paper 131: Kramers Unification →Magnon non-reciprocity ratio is K-independent across 4 materials (Ni/Co/Py/CoFeB) — frequencies vary 3x but the ratio holds at CV=1.59%. Berry phase scaling FAILS (eta proportional to 1/Pe holonomy, NOT 1-cos psi). 5/6 kill conditions PASS.
Paper 141 →

Direct measurement of the explaining-away penalty I(D;M|Y) on quantum error correction circuits. Confirmed in simulation (8/8 measurements) and on real IBM Heron hardware (5/5 measurements). Exact decomposition holds to machine precision. Discrete-regime peak at moderate engagement matches softmax prediction. Five substrates now demonstrated — consistent with Čencov’s uniqueness theorem (1972).
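I(D;M|Y) is an ordinary conditional mutual information, computable directly from a joint distribution. A toy computation (the hand-built binary distribution below is illustrative only; the actual measurements are on QEC circuit data):

```python
import math
from itertools import product

# Toy joint distribution p(d, m, y) over binary variables, built by hand:
# d and m agree more often when y = 0 and disagree more often when y = 1.
p = {}
for d, m, y in product((0, 1), repeat=3):
    p[(d, m, y)] = (0.2 if d == m else 0.05) if y == 0 else \
                   (0.05 if d == m else 0.2)

total = sum(p.values())                      # normalize (already sums to 1)
p = {k: v / total for k, v in p.items()}

def marg(keep):
    """Marginal distribution over the kept axes (subset of 'dmy')."""
    out = {}
    for (d, m, y), v in p.items():
        key = tuple({"d": d, "m": m, "y": y}[a] for a in keep)
        out[key] = out.get(key, 0.0) + v
    return out

p_y, p_dy, p_my = marg("y"), marg("dy"), marg("my")

# I(D;M|Y) = sum_{d,m,y} p(d,m,y) * log2[ p(d,m,y) p(y) / (p(d,y) p(m,y)) ]
cmi = sum(v * math.log2(v * p_y[(y,)] / (p_dy[(d, y)] * p_my[(m, y)]))
          for (d, m, y), v in p.items())
print(f"I(D;M|Y) = {cmi:.3f} bits")
```

The penalty is zero exactly when D and M are conditionally independent given Y; any residual coupling, as here, shows up as positive bits.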
Full experiment: Quantum Hardware Test →

Weak measurement sweep on IBM Fez (Heron). Penalty grows monotonically from 0 to 0.125 bits as measurement coupling increases from zero to projective. 3 qubits, 4 prep states × 4 mechanisms × 11 strength levels, 176K shots. Wave function collapse IS the explaining-away penalty at maximum measurement strength. Spearman ρ = 0.973, p = 5.1×10⁻⁷.
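Spearman's ρ is Pearson correlation applied to ranks, so monotone-with-noise data scores near 1. A self-contained sketch (the penalty values below are hypothetical stand-ins for the 11 strength levels, not the IBM Fez data):

```python
def spearman(xs, ys):
    """Spearman rank correlation (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx = my = (n - 1) / 2                       # mean rank
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sum((a - mx) ** 2 for a in rx)        # Sxx == Syy without ties
    return num / den

# Hypothetical sweep: penalty rises with coupling strength, with small noise
strengths = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
penalties = [0.000, 0.004, 0.011, 0.022, 0.020, 0.045,
             0.060, 0.081, 0.079, 0.110, 0.125]
print(f"rho = {spearman(strengths, penalties):.3f}")
```

Two locally swapped pairs cost only a little rank correlation, which is why a nearly monotone sweep reports ρ close to 1.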
Full experiment: Weak Measurement Sweep →

Repetition codes (d=3–21) with MWPM decoding. Exponential error suppression coefficient converted to geodesic units on the Bernoulli manifold. Ratio to π/√2 approaches 0.95 in the asymptotic limit (p→0). Exponential fits R² > 0.99. Surface code normalization requires a threshold-independent mapping for cross-family comparison.
Feature-Based Platform Scoring & Causal Identification
13 verifiable design features — algorithmic feeds, autoplay, opaque recommendations — tested against adolescent mental health data across two independent datasets. No framework rubric involved.
Cascade dose-response: R²=0.889 (p=0.0015) for female persistent sadness. Replicated cross-nationally: 613,744 students, 80 countries. Girls 5.6× more affected in 91% of countries. Feature exposure outperforms raw adoption (ΔR²=+0.048, permutation p=0.00119). E-bullying null (R²=0.096). Bradford Hill 8/9. 6/6 cascade verdicts PASS. Caveat: N=7 YRBS time points. CIs wide at population level. Individual-level causal identification (ABCD longitudinal) specified but not yet executed.
Full analysis: Social Media Feature Study →