
M4 vs P14s — qualitative envelope comparison

24 verbatim memos from fred running the same six scientific research prompts on two hardware platforms. M4 ran iter28 (reps=1, 6 memos). P14s ran iter27 (reps=3, 18 memos). Same code SHA family, same provider (kimi-k2.6:cloud), and same prompts; hardware and rep count vary.

Provider: kimi-k2.6:cloud · M4 run-id: iter28-m4-reps1 @ 4b524cd · P14s run-id: iter27-baseline-reps3 @ b67cf3d · Comparison date: 2026-05-04

Headline finding · candidate, requires verification

The hardware comparison surfaced a more material concern: citation-integrity drift

Reading the 24 memos side-by-side surfaced a pattern that materially overshadows the hardware-axis question: kimi-k2.6 appears to fabricate citations for post-cutoff scientific claims at non-trivial frequency, with consistent (not random) generative output across sessions. Strongest single signal: PMID 41986690 (somatic mosaicism in sALS, Nat Genet) is cited in two different P14s sessions for the same statistic. If the paper exists, both memos correctly cite it; if it doesn't, the model has fabricated the same nonexistent paper across independent sessions. Verification against PubMed and Crossref is the next step.

Frequency correlates with the prompt's reliance on post-cutoff literature: riluzole (drug since 1994) had clean citations across all four memos; tofersen (post-2023 OLE evidence is the question) had the highest density of suspicious-pattern citations. The finding is calibrated as a candidate, not confirmed — the bulk of suspicious PMIDs in the 41xxxxxx–42xxxxxx range are plausibly real 2026 papers and require verification before strong claims.

This affects every fred deployment regardless of hardware. Tracked in the project notes as B12: characterize fabrication frequency and consider an output-side citation validator before any customer-facing deployment.
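As a first triage step before any external lookup, the two signals described above (PMIDs in the 41xxxxxx–42xxxxxx band, and the same PMID recurring across independent sessions) can be pulled out mechanically. The sketch below is a minimal illustration, not fred's tooling: the regex, the `SUSPECT_RANGE` bounds, the function names, and the sample memo strings are all assumptions made for the example. Landing in the band only means "needs verification", not "fabricated".

```python
import re
from collections import defaultdict

# Assumption: memos write PubMed IDs as "PMID 12345678" or "PMID: 12345678".
PMID_RE = re.compile(r"PMID:?\s*(\d{7,8})")

# Assumption: the band that needs verification is 41xxxxxx-42xxxxxx, as named in the text.
SUSPECT_RANGE = range(41_000_000, 43_000_000)


def extract_pmids(text):
    """Return the set of PMIDs cited in one memo."""
    return {int(m) for m in PMID_RE.findall(text)}


def triage(memos):
    """memos maps a memo label to its text.

    Returns (suspect, shared): a sorted list of PMIDs inside SUSPECT_RANGE, and
    a dict of PMIDs cited by more than one memo mapped to the citing labels.
    """
    seen = defaultdict(set)
    for label, text in memos.items():
        for pmid in extract_pmids(text):
            seen[pmid].add(label)
    suspect = sorted(p for p in seen if p in SUSPECT_RANGE)
    shared = {p: sorted(labels) for p, labels in seen.items() if len(labels) > 1}
    return suspect, shared


# Hypothetical memo excerpts; only the PMIDs are taken from this page.
memos = {
    "rep2": "somatic mosaicism (PMID 41986690); poly-GR dispute (PMID 29196813)",
    "rep3": "somatic mosaicism in sALS (PMID: 41986690)",
}
suspect, shared = triage(memos)
print(suspect)  # [41986690]
print(shared)   # {41986690: ['rep2', 'rep3']}
```

A PMID that shows up in `shared` is exactly the case described above: either both memos cite a real paper, or the same nonexistent paper was generated in two sessions. Only a PubMed/Crossref lookup can tell these apart.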

At a glance — six prompts, six verdicts

Each row is one prompt. The verdict reflects which platform produced the better memo on this specific prompt, accounting for citation accuracy, coverage of relevant literature, engagement with question framing, and absence of factual errors. Per-prompt details follow further down.

| Prompt | Domain | Verdict | Key signal |
| --- | --- | --- | --- |
| C9orf72 → NCT defects | ALS hypothesis generation | Mixed-equivalent | M4 ≈ rep 1 ≈ rep 2 > rep 3. M4 unique: explicit context-dependence reconciliation; missed loss-of-function pathway. |
| Tofersen surrogate | ALS hypothesis generation | M4 better, w/ caveat | M4 names WVE-004, BIIB078, BIIB105, ATLAS with specifics. Some 148-week OLE point estimates unverifiable. |
| Sporadic ALS heterogeneity | ALS hypothesis generation | Mixed | M4: best operational framework (10-subgroup table). Missed UNC13A/lithium pharmacogenomic angle that rep 1 caught. |
| Metformin neurodegen | Compound dive | M4 better | Most rigorous citations, names MAP / MET-FINGER trials, Kaneb 2011 SOD1 sex-specific harm finding. |
| Riluzole compound dive | Compound dive | Mixed-equivalent | Cleanest fabrication picture across all four memos. Drug since 1994; abundant pre-cutoff literature. |
| TDP-43 proteostasis | Compound dive | M4 better | Only memo to engage current ALS trial-failure landscape (PHOENIX, ORARIALS-01, HEALEY trehalose, AMX0035 withdrawal). |

Aggregate: M4 wins or ties on six of six prompts (three "M4 better" verdicts, three mixed or mixed-equivalent). Honest caveat: this confounds hardware with reps=1 vs reps=3 (with a single rep, M4 has fewer chances to surface failure modes). A better-controlled comparison would run iter27-style reps=3 on M4. Recorded as future work.

Quantitative metrics — what the run-level numbers say

Wall time and per-session metrics from report.json. M4 reps=1 over 6 sessions versus P14s reps=3 over 18 sessions.

ALS hypothesis generation workload

| Metric | M4 reps=1 (n=3) | P14s reps=3 (n=9) | Direction |
| --- | --- | --- | --- |
| latency_ms (mean) | 32,300 | 24,576 | M4 higher |
| iterations (mean) | 11.0 | 19.0 | M4 ~half |
| tool_calls per run (mean) | 31.0 | 41.2 | M4 lower |
| unique_tools_used | 4.0 (flat) | 5.9 | M4 lower & identical across reps |
| completion_to_content_ratio | 65.86 | 32.77 | M4 higher (more hidden tokens) |

Compound dive workload

| Metric | M4 reps=1 (n=3) | P14s reps=3 (n=9) | Direction |
| --- | --- | --- | --- |
| latency_ms (mean) | 62,486 | ~comparable | tdp43 session tail-heavy on M4 |
| iterations (mean) | 8.33 | ~17 | M4 ~half (consistent) |
| tool_calls per run (mean) | 22.67 | ~higher | M4 lower |
| completion_to_content_ratio | 4.88 | 46.61 | M4 much lower (workload-flipped) |

Wall-clock

M4 ran 6 prompts × 1 rep in 44 m 35 s. P14s ran 6 prompts × 3 reps in 2 h 31 m 02 s. Per-session wall is roughly comparable (~7 m M4 vs ~8 m P14s). The single longest session was M4's tdp43 at ~15 m.
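The per-session figures quoted above follow from dividing each run's total wall time by its session count. A minimal check of that arithmetic, using only the totals stated in the paragraph (the function name is illustrative):

```python
from datetime import timedelta


def per_session_minutes(total, sessions):
    """Mean wall-clock minutes per session for one run."""
    return total / timedelta(minutes=1) / sessions


# Totals as stated above: 6 prompts x 1 rep on M4, 6 prompts x 3 reps on P14s.
m4 = per_session_minutes(timedelta(minutes=44, seconds=35), 6 * 1)
p14s = per_session_minutes(timedelta(hours=2, minutes=31, seconds=2), 6 * 3)

print(f"M4   {m4:.2f} min/session")    # M4   7.43 min/session
print(f"P14s {p14s:.2f} min/session")  # P14s 8.39 min/session
```

These are means over sessions of very different length: a single ~15 m session accounts for about a third of M4's 44 m 35 s total.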

Confounds in this comparison (do not over-read)

  • Code SHA differs. M4 ran at 4b524cd, P14s at b67cf3d. The diff is iter27 results files plus Obs 35 plus .gitignore. No runner, prompt, or provider changes — but worth stating.
  • Sample-size asymmetry. 6 M4 memos vs 18 P14s memos. Fabrication-rate comparisons are not normalized.
  • No M4 variance band. M4 ran reps=1, so quantitative metrics are point estimates not distributions.
  • Hardware is the only intentional axis. Multiple incidental axes (network path to ollama.com, OS, kernel) are not isolated.
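Two of the confounds above are easy to make concrete. With reps=1 there is no sample standard deviation to report, and raw counts of flagged citations need dividing by memo count before the platforms can be compared. The sketch below assumes nothing about report.json; the function names and every number passed to them are hypothetical, illustrative values.

```python
import statistics


def summarize(samples):
    """Return (mean, sample stdev), with stdev None when fewer than two samples exist."""
    mean = statistics.fmean(samples)
    # statistics.stdev raises StatisticsError for n < 2, so guard explicitly.
    spread = statistics.stdev(samples) if len(samples) >= 2 else None
    return mean, spread


def flagged_per_memo(flagged_citations, memo_count):
    """Normalize a raw count of flagged citations by the number of memos read."""
    return flagged_citations / memo_count


# Hypothetical values, for illustration only.
print(summarize([10.0]))              # (10.0, None): a point estimate, as with M4 reps=1
print(summarize([10.0, 12.0, 14.0]))  # mean 12.0 with a spread, as with P14s reps=3
print(flagged_per_memo(3, 6))         # 0.5
```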

Per-prompt details

Each per-prompt section below contains the prompt fred received, the four memos (eval ID, line count, md5 where recorded), and the per-prompt findings analysis.

C9orf72 → NCT defects · Mixed-equivalent · M4 ≈ P14s rep 1 ≈ P14s rep 2 > P14s rep 3

The prompt fred received

"You are an ALS scientist. Write a memo on the C9orf72 hexanucleotide repeat expansion and its proposed link to nucleocytoplasmic transport defects in motor neurons. Cover (1) recent mechanistic findings (last 3 years), (2) downstream consequences for motor neuron survival, and (3) the strongest open questions where evidence is contested. Save as als_c9orf72_ncyto.md."

The four memos

  • M4 reps=1 · als_c9orf72_ncyto · eval-fb39de13 · 81 lines · md5 9f284ece…d211c1dfe
  • P14s rep 1 · als_c9orf72_ncyto · eval-2e21e76f · 56 lines · md5 d1917b87…995c045122
  • P14s rep 2 · als_c9orf72_ncyto · eval-ad5631ac · 53 lines · md5 1c1e4dd8…56eed6de
  • P14s rep 3 · als_c9orf72_ncyto · eval-d8f7d625 · 45 lines · md5 c7cdf6ac…2c3e7c4cbcc1

Findings

Inside P14s envelope? Inside. Same 3-section structure, same scope, comparable citation density. Not an outlier.

Concrete differences:
  • All four cite Jafarinia 2024 eLife (polyPR direct binding) and Sirtori 2024 (LINC/NE).
  • M4 unique: Hayes 2020 eLife (karyopherin cargo loading), Coyne 2020 bioRxiv (POM121), Bilican 2025 FEBS Letters (Nup107/G3BP1), Vanneste 2019 as explicit counter-evidence, Shi 2017 PNAS.
  • Rep 1 unique: McGoldrick 2023 Cell Reports loss-of-function importin-β1 granules; Castelli 2023 SRSF1-NXF1 peptide; Al-Azzam 2024 CHMP7.
  • Rep 2 unique: Atwal 2025 KPNA4; Lin 2025 poly-PR/Pom121/ATF3 axis; explicit citation of poly-GR vs poly-PR neuropathology dispute (PMID 29196813).

Errors caught:
  • M4 ref [9] Grima: described as "huntingtin/HAP1 interactions", which is the wrong paper. The Grima paper relevant to NPC/TDP-43 is on Nup62. Citation error.
  • P14s rep 3: multiple suspicious-DOI citations (Liu 2026 NSMB s41594-026-01785-9, Singh 2026 Brain awag092, Barber 2026 NAR gkag343); DOI suffix patterns inconsistent with these journals' real conventions. High fabrication risk.

Direction: M4 is more synthetic and offers a "multi-hit model" framing in its summary, an interpretive synthesis no P14s rep attempts. M4 raises cell-type specificity (cerebellum vs motor neurons), a question no other memo asks. M4 missed: McGoldrick loss-of-function pathway and CHMP7 nuclear surveillance. P14s reps 1 and 2 lean therapeutic; M4 leans biophysical/structural.

Quality verdict: Mixed-equivalent. M4 ≈ rep 1 ≈ rep 2; M4 clearly better than rep 3.

Tofersen surrogate · M4 better, w/ caveat · Most clinically specific; some quantitative readouts unverifiable

The prompt fred received

"Tofersen (Qalsody) received FDA accelerated approval in April 2023 for SOD1-associated ALS based on plasma neurofilament light chain reduction as a surrogate endpoint. Write a memo covering (a) what evidence has accumulated since approval on functional clinical outcomes, (b) attempts to translate the surrogate-endpoint approach to non-SOD1 ALS subtypes, and (c) emerging targets where similar antisense or gene therapy strategies are in early development. Identify three to five gaps that must close before the surrogate-endpoint approach can be widely generalized across ALS. Save as als_tofersen_surrogate.md."

The four memos

  • M4 reps=1 · als_tofersen_surrogate · eval-d3bb3d75 · 175 lines · md5 986ca1c0…faad67a00
  • P14s rep 1 · als_tofersen_surrogate · eval-7477f330 · 154 lines · md5 474e6884…2ef6fe4ffe
  • P14s rep 2 · als_tofersen_surrogate · eval-a47e076f · 189 lines · md5 c01866e5…48a15e9e1183ce
  • P14s rep 3 · als_tofersen_surrogate · eval-c6f84326 · 162 lines · md5 03bd224f…1898964055

Findings

Inside P14s envelope? Inside on structure, slightly outside on factual claims (mostly favorably). All four follow the (a)/(b)/(c) + 5 gaps framework. M4 is the most concrete with named programs.

Concrete differences:
  • M4 unique strengths: names 3 specific failed programs by sponsor and reason: WVE-004 (Wave, FOCUS-C9, May 2023), BIIB078/tadnersen (Biogen/Ionis, McEachin 2025 post-mortem), BIIB105 IONIS-ATXN2Rx (May 2024 discontinuation with verbatim Biogen rationale). Names ATLAS NCT04856982 with cohort size and design. Pipeline summary table by ALS subtype applicability.
  • Rep 1 unique: GPNMB as second pharmacodynamic biomarker (Guise 2026 proteomics); SOD1 seeding activity in sporadic ALS CSF (Sebogo 2026); founder-variant confound (p.Leu145Phe slow progression).
  • Rep 2 unique: VALOR primary endpoint actual numbers (LSM diff −1.2, p-value); Tominersen/GENERATION HD1 cross-disease cautionary precedent.
  • Rep 3 unique: most epistemically humble, with an explicit "specific 2024-2025 trial-in-progress data not yet available in the public domain" disclosure.

Errors caught:
  • M4: quoted ALSFRS-R numbers (−9.9 vs −13.5) and HHD (−18.6 vs −35.1) at 148 weeks do not match the publicly known VALOR OLE early-vs-delayed difference (~−6 ALSFRS-R points). Specific point estimates may be fabricated.
  • Rep 2: VALOR primary endpoint quoted as p=0.97; real result was p=0.097, ten-fold off (likely typo).
  • Rep 1: SOD1 seeding-in-sporadic claim with specific PMID 41929296 is high-stakes if invented.

Direction: M4 is outside the envelope on confidence/specificity, mixed on rigor. M4 better on named programs and sponsor-attributed failure reasons. M4 worse on eliding VALOR primary endpoint failure (rep 2 and rep 3 both handle this more honestly). M4 in "executive briefing" mode; rep 3 in "epistemically conservative" mode.

Quality verdict: M4 better with caveat. Most clinically specific of the four, but some specific quantitative readouts need verification. Rep 3 most cautious.

Sporadic ALS heterogeneity · Mixed · M4 best framework, P14s rep 1 best single insight

The prompt fred received

"Sporadic ALS accounts for ~90% of ALS cases and lacks an identified causal mutation. Write a memo on the strongest current hypotheses for etiological heterogeneity within sporadic ALS — what subgroups have been proposed (based on biomarkers, clinical phenotype, or environmental exposure), and what evidence supports or contradicts each. Identify a sub-population that may be tractable for a focused clinical trial. Save as als_sporadic_heterogeneity.md."

The four memos

  • M4 reps=1 · als_sporadic_heterogeneity · eval-037bf4d4 · 307 lines · md5 14700bc0…9689af713
  • P14s rep 1 · als_sporadic_heterogeneity · eval-35f315dc · 192 lines · md5 11393877…8198f6
  • P14s rep 2 · als_sporadic_heterogeneity · eval-48fd51cb · 267 lines · md5 e620798b…b20c11d57
  • P14s rep 3 · als_sporadic_heterogeneity · eval-ca2c9ecb · 322 lines · md5 deced1ec…7fd7ee35c

Findings

Inside P14s envelope? Inside. M4 is broader and more comprehensive; the other three each take a sharper angle.

Concrete differences:
  • M4 unique strengths: most extensive trial-tractability table with star ratings across 10+ subgroups; BMAA / Western Pacific cluster (Spencer 2020/2022) with serine supplementation rationale; VGCC autoantibody subset with passive-transfer mouse data; MEP:CMAP cortical excitability stratifier (Ranieri 2025) with HR 1.84; OPM consensus classification (Meyer 2025); copper hypothesis (Min 2024).
  • Rep 1 unique: UNC13A CC genotype subgroup + lithium pharmacogenomic trial (Willemse 2022), the single most actionable existing precision-medicine signal in sALS. M4 misses this entirely. NP001 post-hoc inflammatory subgroup (Miller 2022) with 36% slower decline. EEG four-subphenotype clustering (Dukic 2021/2022).
  • Rep 2 unique: three-cluster biomarker stratification (NfL/pTau181/GFAP); cofilin hyperphosphorylation as TDP-43 trigger (Jagaraj 2026); somatic mosaicism 2.1% in 399 cases (Zhou 2026).
  • Rep 3 unique: multistep model framing (Al-Chalabi-style); KIF5A axonal transport convergence; CTE/military veterans/athletes environmental cluster.

Errors caught:
  • Rep 1: same PMID 39904421 cited for two different papers (Grima blood RNA-seq AND Dragoni PBMC transcriptomics). Clear citation error.
  • Reps 2 and 3: heavy 41xxxxxx–42xxxxxx PMID density. PMID 41986690 appears in both rep 2 and rep 3 for the somatic-mosaicism Nat Genet paper: a cross-session-consistent fake citation if fabricated.
  • M4: no specific factual errors detected.

Direction: M4 is broadest in coverage and most operationally useful for trial design (10-subgroup table). M4 misses the UNC13A/lithium pharmacogenomic angle that rep 1 catches, the most actionable signal available. M4 also misses rep 3's multistep-model theoretical framing.

Quality verdict: Mixed. Each memo has a different unique strength. M4 best framework; rep 1 best single insight; rep 3 best theoretical synthesis.

Metformin neurodegen · M4 better · Most rigorous citations, most engagement with negative trial evidence

The prompt fred received

"Metformin has been proposed as a repurposed therapy for neurodegenerative disease (Alzheimer's, Parkinson's, ALS). Write a compound dive memo evaluating (a) the proposed mechanisms beyond glycemic control, (b) clinical evidence in neurodegenerative populations, and (c) translational gaps that block a definitive trial. Label each claim as well-established, plausible, or speculative based on evidence. Save as compound_dive_metformin.md."

The four memos

  • M4 reps=1 · compound_dive_metformin · eval-f7f0bba1 · 218 lines · md5 7cdb43a5…a8aa017
  • P14s rep 1 · compound_dive_metformin · eval-59062328 · ~163 lines
  • P14s rep 2 · compound_dive_metformin · eval-6f1d7203 · ~84 lines
  • P14s rep 3 · compound_dive_metformin · eval-dd6b0b1b · ~195 lines

Findings

Inside P14s envelope? At/above the upper edge of the envelope. Same scope and structure, but more rigorously cited and broader in clinical-trial coverage.

Concrete differences:
  • M4 unique: names MAP and MET-FINGER trials specifically with cohort sizes, dosing, primary endpoints (Luchsinger 2024, Barbera 2024). Only rep 3 mentions TAME; reps 1 and 2 mention "METS trial" without specifics. Kaneb 2011 SOD1(G93A) negative result with sex-specific harm in females, a material safety finding for ALS. None of the P14s reps catches this. Sai Swaroop 2023 yeast/iPSC negative result. Sun 2025 SGLT2i vs metformin head-to-head (metformin inferior comparator). Quantitative meta-analysis stats (Campbell 2018 OR 0.55, 95% CI 0.38–0.78; Hui 2025 RR 0.94 with I²=98.4%).
  • Rep 1 unique: Reactome pathway IDs and OpenTargets gene IDs (semantic-web hooks). Honest disclosure that "PubMed search tools returned no retrievable abstracts".
  • Rep 2 unique: biphasic AMPK warning (R-HSA-9619483 NMDAR/Aβ synaptotoxicity), a genuinely useful caution that M4 did not raise.
  • Rep 3 unique: Lac-Phe metformin metabolite mention (real finding, but attached to suspicious-pattern PMID).

Errors caught:
  • P14s rep 3: 8 PMIDs all in the suspicious 42xxxxxx range (42068584, 42070766, 42072083, 42072227, 42075876, 42079027, 42080568, 42080766). High fabrication probability.
  • Rep 1: Marchini 2026 unverifiable; METS trial reference may not match an actual trial name.
  • M4: no detected fabrications; Lehrer & Rheinstein 2025 appropriately tagged "hypothesis-generating only" / "subject to massive reporting bias".

Direction: M4 outside the envelope on the upside: more rigorous, specific trial names, quantitative effect sizes with CIs, named negative results, fewer suspicious PMIDs. P14s reps 1 and 2 lean Reactome/pathway-target-driven. P14s rep 3 has heaviest fabrication-pattern density.

Quality verdict: M4 better. Clearer call than C9orf72.

Riluzole compound dive · Mixed-equivalent · Cleanest fabrication picture; abundant pre-cutoff literature

The prompt fred received

"Write a compound dive memo on riluzole, the oldest approved disease-modifying ALS therapy. Cover (a) known mechanism of action and binding partners, (b) physicochemical properties relevant to CNS penetration, (c) drug-drug interactions with commonly co-prescribed agents in ALS, and (d) published evidence on dosing strategies that have been explored to improve efficacy or tolerability. Be explicit about which findings rest on RCTs vs observational vs preclinical data. Save as compound_dive_riluzole.md."

The four memos

  • M4 reps=1 · compound_dive_riluzole · eval-db7db851 · 199 lines · md5 065d6324…d9fe7b
  • P14s rep 1 · compound_dive_riluzole · eval-076ee4d3 · 178 lines · md5 3f2ba0f5…a4cd56eed6de
  • P14s rep 2 · compound_dive_riluzole · eval-8de18058 · 256 lines · md5 7811be27…d3d13a49
  • P14s rep 3 · compound_dive_riluzole · eval-97c5808c · 152 lines · md5 94fc0f14…f771c3d9211

Findings

Inside P14s envelope? Inside on quality, divergent on focus. All four hit the same 4 sections (MoA / physchem / DDI / dosing).

Concrete differences:
  • M4 unique: most detailed DDI table with named CYP1A2 inhibitors/inducers and clinical action by ALS-relevant comorbidity (SSRI for pseudobulbar affect, mexiletine for cramps); CK1δ kinase inhibition (Bissaro 2018), a specific MoA hypothesis no other memo mentions; Japanese clearance ~50% lower than Caucasian; N-hydroxy-riluzole as active metabolite; late-stage subgroup efficacy (Paganoni 2018, HR 0.55).
  • Rep 1 unique: P-gp/ABCB1 efflux at the BBB (Baker 2025, intranasal+elacridar mouse study), an important pharmacological nuance about why riluzole CNS exposure may be lower than physchem alone predicts. M4 misses this entirely. Statin co-prescription neutral on ALS survival (PRO-ACT).
  • Rep 2 unique: updated CYP1A1 (~60%) > CYP1A2 (~30%) PBPK reapportionment (Malik 2025), which directly reframes the entire DDI picture if real. Troriluzole as next-gen prodrug. SCI/RISCIS/CSM-PROTECT trial detail.
  • Rep 3 unique: sevoflurane/respiratory rhythmogenesis interaction (Taiji 2025); ATH-1105 + riluzole combination preclinical.

Errors caught:
  • M4: title typo "P ubChem" (split with space). Otherwise factually clean.
  • All: citations look mostly verifiable here.
  • Pattern observation: riluzole has been a drug since 1994 with abundant pre-cutoff literature, and this prompt had the lowest suspicious-PMID density of any prompt, across all four memos. This supports the hypothesis that fabrication risk correlates with the prompt's reliance on post-cutoff literature.

Direction: M4 inside the envelope, slightly above on practical clinical detail, slightly below on niche pharmacology insights (P-gp efflux, CYP1A1 dominance) that P14s catches.

Quality verdict: Mixed-equivalent. No clear winner. Each memo has a different unique strength.

TDP-43 proteostasis · M4 better · Only memo to engage current ALS trial-failure landscape

The prompt fred received

"Identify approved or late-stage clinical compounds that target TDP-43 proteostasis, autophagy, or stress granule dynamics. For each, summarize the compound, its primary indication, the mechanism by which it affects the target pathway, and any published evidence relevant to ALS. Highlight the two or three with the most promising translational profile and identify what would need to be true for repurposing trials to be justified. Save as compound_dive_tdp43.md."

The four memos

  • M4 reps=1 · compound_dive_tdp43 · eval-7600f2e5 · 192 lines · md5 65adc46f…cce52f5d
  • P14s rep 1 · compound_dive_tdp43 · eval-1a7287ff · 320 lines · md5 e3b7fed…23b6324
  • P14s rep 2 · compound_dive_tdp43 · eval-8ca46d8d · 163 lines · md5 1d28d51d…ea11bc142a
  • P14s rep 3 · compound_dive_tdp43 · eval-98d1d279 · 284 lines · md5 669cfb87…f7e8779f7d

Findings

Inside P14s envelope? Inside, biased toward "current ALS clinical reality": M4 is the only memo of the four to engage with the recent string of late-stage ALS failures.

Concrete differences:
  • M4 unique strengths: engages the actual current ALS landscape. Lists PHOENIX failure (April 2024), ORARIALS-01 failure (May 2024), HEALEY trehalose negative (2025, with concrete disease-rate ratio 0.87 (CrI 0.665-1.102) and SAE rates 16% vs 7%), Co-ALS colchicine negative, AMX0035 withdrawal. None of the P14s reps engages with this. Casiraghi 2025 Exp Neurol rapamycin TDP-43 rescue in iPSC organoids. Uechi 2025 Nat Chem Biol lipoamide / SFPQ stress-granule dissolution. Modafferi 2025 Cell Death Differ enoxacin / DICER / DDR.
  • Rep 1 unique: includes Guanabenz (PPP1R15A/GADD34 inhibitor) and ISRIB with detailed PERK/GCN2/PKR/HRI mechanism; honest disclosure that PubMed search returned no results.
  • Rep 2 unique: DNL-343 (Denali eIF2B activator), the only memo to mention this real ALS-specific clinical asset; combinatorial trial framing.
  • Rep 3 unique: Ibudilast (PDE4 inhibitor, MS Phase 2 SIGNAL trial); Methylene Blue (LLPS modulation).

Errors caught:
  • P14s rep 2: AMX0035 framing ("conditionally approved in Canada and the UK") is outdated by ~2 years. Amylyx withdrew Relyvrio/Albrioza globally in April 2024 after PHOENIX Phase 3 failure. Material misframing of current standard of care.
  • Rep 3: doesn't make the conditional-approval claim, but misses the withdrawal entirely.
  • M4: SIRT1-XBP1 edaravone mechanism (PMID 40010009) is the most uncertain claim; other late-stage failure facts are publicly verifiable.

Direction: M4 outside the envelope on the upside: real current trial-failure data; engagement with question framing ("approved or late-stage"); novel SG-targeting mechanism (lipoamide); honest negative-result framing. M4 misses DNL-343 (which P14s rep 2 catches and is the most ALS-clinical-relevant SG modulator currently in development).

Quality verdict: M4 better. Most clinically grounded; unique engagement with trial-failure landscape.

Methodology & honest framing

What this comparison is. 24 memos from fred running 6 prompts across 2 platforms with different rep counts. Read side-by-side, summarised per-prompt with structured findings. Verdicts based on citation accuracy, coverage of relevant literature, engagement with question framing, and absence of factual errors.

What this comparison isn't. A controlled hardware A/B test. The reps=1 vs reps=3 confound is real and material. The code SHA differs by a small, functionally no-op set of files but differs nonetheless. The comparison suggests that M4 produces outputs in the same quality envelope as P14s, but no deployment decision should rest on this single comparison.

What's reproducible from this page. All 24 memos are downloadable. The verdict tables and finding analyses can be re-derived by any reader from those source files. The eval IDs and md5 sums let you verify byte-integrity if you have the originals.
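Because the md5 sums on this page are shown truncated (a prefix and a suffix joined by "…"), a byte-integrity check against a local copy can only compare those two visible ends. A minimal sketch, assuming you hold the original file bytes; the function names are illustrative and the `b"hello"` input is a stand-in, not one of the memos:

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


def matches_truncated(digest: str, shown: str) -> bool:
    """Compare a full digest against a 'prefix…suffix' string as printed on this page.

    Only the visible characters are checked, so this is weaker than a
    full 32-character comparison.
    """
    prefix, _, suffix = shown.partition("…")
    if len(prefix) + len(suffix) > len(digest):
        return False
    return digest.startswith(prefix) and digest.endswith(suffix)


# Stand-in bytes for illustration; for a real memo use Path(...).read_bytes().
digest = md5_hex(b"hello")
print(digest)                                       # 5d41402abc4b2a76b9719d911017c592
print(matches_truncated(digest, "5d41402a…c592"))   # True
print(matches_truncated(digest, "5d41402a…0000"))   # False
```

A match on the visible ends is a strong hint rather than proof; a full digest would be needed for a strict check.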

What needs to happen next. A PubMed/Crossref verification pass against the unique citations across these 24 memos to characterize fabrication frequency. If fewer than 80% of citations verify, fred's output-provenance discipline needs reinforcement before any customer-facing deployment, possibly via a citation-validator output filter applied at the tool gateway.
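One possible shape for the PubMed half of that pass is sketched below. It uses the NCBI E-utilities `esummary` endpoint with `retmode=json`, and reads the 80% threshold as the fraction of cited PMIDs that resolve. Both the response handling (unresolvable IDs are assumed to be absent from `result` or to carry an `error` field) and the function names are assumptions to check against the E-utilities documentation; the payload in the example is hand-built with placeholder IDs, not real lookup output.

```python
import json
import urllib.parse
import urllib.request

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
THRESHOLD = 0.80  # assumption: "<80%" refers to the verified fraction


def fetch_esummary(pmids):
    """Fetch PubMed summaries for a batch of PMIDs (network call, not run below)."""
    query = urllib.parse.urlencode(
        {"db": "pubmed", "retmode": "json", "id": ",".join(str(p) for p in pmids)}
    )
    with urllib.request.urlopen(f"{ESUMMARY}?{query}", timeout=30) as resp:
        return json.load(resp)


def resolved_pmids(payload, pmids):
    """PMIDs with a summary record and no error marker (assumed response shape)."""
    result = payload.get("result", {})
    return {p for p in pmids if str(p) in result and "error" not in result[str(p)]}


def verified_fraction(payload, pmids):
    """Share of the cited PMIDs that resolved; 1.0 when nothing was cited."""
    pmids = set(pmids)
    return len(resolved_pmids(payload, pmids)) / len(pmids) if pmids else 1.0


# Hand-built payload with placeholder IDs, standing in for fetch_esummary output.
sample = {
    "result": {
        "uids": ["11111111", "22222222"],
        "11111111": {"uid": "11111111", "title": "placeholder"},
        "22222222": {"uid": "22222222", "error": "cannot get document summary"},
    }
}
fraction = verified_fraction(sample, [11111111, 22222222])
print(fraction, fraction < THRESHOLD)  # 0.5 True
```

A PMID that resolves only shows that some record exists under that ID. Confirming that its title, journal, and statistic match what the memo claims is a separate comparison step, and DOIs would need an equivalent Crossref lookup.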