This page collects 24 verbatim memos from fred running the same six scientific research prompts on two hardware platforms. M4 ran iter28 (reps=1, 6 memos). P14s ran iter27 (reps=3, 18 memos). The code SHA family, provider (kimi-k2.6:cloud), and prompts are the same; only hardware and rep count vary.
Reading the 24 memos side-by-side surfaced a pattern that materially overshadows the hardware-axis question: kimi-k2.6 appears to fabricate citations for post-cutoff scientific claims at non-trivial frequency, with consistent (not random) generative output across sessions. Strongest single signal: PMID 41986690 (somatic mosaicism in sALS, Nat Genet) is cited in two different P14s sessions for the same statistic. If the paper exists, both memos correctly cite it; if it doesn't, the model has fabricated the same nonexistent paper across independent sessions. Verification against PubMed and Crossref is the next step.
Frequency correlates with the prompt's reliance on post-cutoff literature: riluzole (drug since 1994) had clean citations across all four memos; tofersen (post-2023 OLE evidence is the question) had the highest density of suspicious-pattern citations. The finding is calibrated as a candidate, not confirmed — the bulk of suspicious PMIDs in the 41xxxxxx–42xxxxxx range are plausibly real 2026 papers and require verification before strong claims.
This affects every fred deployment regardless of hardware. Tracked in the project notes as B12: characterize fabrication frequency and consider an output-side citation validator before any customer-facing deployment.
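As a rough illustration of how the verification pass could be prioritised, the sketch below pulls PMIDs out of memo text, keeps those in the 41xxxxxx–42xxxxxx band, and sorts PMIDs cited by more than one session to the front. The `PMID 12345678` citation format, the session names, and the two sample sentences are assumptions made for the example; they are not taken from the memos. The sketch only orders the queue and does not decide whether any citation is real.

```python
import re
from collections import defaultdict

# Assumption: memos write PubMed IDs as "PMID 12345678" or "PMID: 12345678".
PMID_RE = re.compile(r"PMID:?\s*(\d{7,8})")

# The 41xxxxxx-42xxxxxx band named in the text, treated as "verify first".
SUSPECT_RANGE = range(41_000_000, 43_000_000)


def pmids_by_session(memos):
    """Map each PMID to the set of session names whose memo cites it."""
    seen = defaultdict(set)
    for session, text in memos.items():
        for match in PMID_RE.finditer(text):
            seen[int(match.group(1))].add(session)
    return seen


def verification_queue(memos):
    """Return (pmid, sessions) pairs in the suspect band.

    PMIDs cited by the most sessions come first, then ascending PMID.
    """
    seen = pmids_by_session(memos)
    queued = [(p, sorted(s)) for p, s in seen.items() if p in SUSPECT_RANGE]
    return sorted(queued, key=lambda item: (-len(item[1]), item[0]))


# Illustrative stand-in text, not quotes from the real memos.
memos = {
    "p14s_rep2": "Somatic mosaicism in sALS (PMID 41986690). See also PMID 29196813.",
    "p14s_rep3": "Mosaic variants were reported in Nat Genet (PMID: 41986690).",
}
queue = verification_queue(memos)
print(queue)
```

A PMID that lands at the head of this queue is only a candidate for lookup; whether it resolves in PubMed still has to be checked separately.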
Each row is one prompt. The verdict reflects which platform produced the better memo on this specific prompt, accounting for citation accuracy, coverage of relevant literature, engagement with question framing, and absence of factual errors. Per-prompt details follow the performance metrics.
| Prompt | Domain | Verdict | Key signal |
|---|---|---|---|
| C9orf72 → NCT defects | ALS hypothesis generation | Mixed-equivalent | M4 ≈ rep 1 ≈ rep 2 > rep 3. M4 unique: explicit context-dependence reconciliation; missed loss-of-function pathway. |
| Tofersen surrogate | ALS hypothesis generation | M4 better, w/ caveat | M4 names WVE-004, BIIB078, BIIB105, ATLAS with specifics. Some 148-week OLE point estimates unverifiable. |
| Sporadic ALS heterogeneity | ALS hypothesis generation | Mixed | M4: best operational framework (10-subgroup table). Missed UNC13A/lithium pharmacogenomic angle that rep 1 caught. |
| Metformin neurodegen | Compound dive | M4 better | Most rigorous citations, names MAP / MET-FINGER trials, Kaneb 2011 SOD1 sex-specific harm finding. |
| Riluzole compound dive | Compound dive | Mixed-equivalent | Cleanest fabrication picture across all four memos. Drug since 1994 — abundant pre-cutoff literature. |
| TDP-43 proteostasis | Compound dive | M4 better | Only memo to engage current ALS trial-failure landscape (PHOENIX, ORARIALS-01, HEALEY trehalose, AMX0035 withdrawal). |
Aggregate: M4 wins or ties on six of six prompts. Honest caveat: this confounds hardware with reps=1-vs-reps=3 (M4 has fewer chances to surface failure modes). A better-controlled comparison would run iter27-style reps=3 on M4. Recorded as future work.
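The aggregate can be re-derived from the verdict column. The short tally below assumes that "M4 better, w/ caveat" counts as a win and that both "Mixed-equivalent" and "Mixed" count as ties; that grouping is an interpretation rather than something the table states.

```python
from collections import Counter

# Verdicts copied from the summary table; the caveat on tofersen is dropped
# so it groups with the other "M4 better" rows (an assumption).
verdicts = {
    "C9orf72 -> NCT defects": "Mixed-equivalent",
    "Tofersen surrogate": "M4 better",
    "Sporadic ALS heterogeneity": "Mixed",
    "Metformin neurodegen": "M4 better",
    "Riluzole compound dive": "Mixed-equivalent",
    "TDP-43 proteostasis": "M4 better",
}

tally = Counter(verdicts.values())
wins = tally["M4 better"]
ties = tally["Mixed-equivalent"] + tally["Mixed"]
print(f"wins={wins} ties={ties} of {len(verdicts)} prompts")
```

Under that grouping the "six of six" figure breaks down into three wins and three ties.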
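Wall time and per-session metrics are taken from report.json: M4 reps=1 over 6 sessions versus P14s reps=3 over 18 sessions. The two tables split the sessions by workload, three prompts each (n=3 on M4, n=9 on P14s): ALS hypothesis generation first, then compound dive.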
| Metric | M4 reps=1 (n=3) | P14s reps=3 (n=9) | Direction |
|---|---|---|---|
| latency_ms (mean) | 32,300 | 24,576 | M4 higher |
| iterations (mean) | 11.0 | 19.0 | M4 ~half |
| tool_calls per run (mean) | 31.0 | 41.2 | M4 lower |
| unique_tools_used | 4.0 (flat) | 5.9 | M4 lower & identical across reps |
| completion_to_content_ratio | 65.86 | 32.77 | M4 higher (more hidden tokens) |

Compound dive prompts:

| Metric | M4 reps=1 (n=3) | P14s reps=3 (n=9) | Direction |
|---|---|---|---|
| latency_ms (mean) | 62,486 | ~comparable | tdp43 session tail-heavy on M4 |
| iterations (mean) | 8.33 | ~17 | M4 ~half (consistent) |
| tool_calls per run (mean) | 22.67 | ~higher | M4 lower |
| completion_to_content_ratio | 4.88 | 46.61 | M4 much lower (workload-flipped) |
M4 ran 6 prompts × 1 rep in 44 m 35 s. P14s ran 6 prompts × 3 reps in 2 h 31 m 02 s. Per-session wall is roughly comparable (~7 m M4 vs ~8 m P14s). The single longest session was M4's tdp43 at ~15 m.
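The per-session figures follow directly from the totals. A minimal check of the arithmetic, using only the wall times and session counts quoted above:

```python
from datetime import timedelta

# Totals and session counts as quoted in the text.
m4_total = timedelta(minutes=44, seconds=35)
p14s_total = timedelta(hours=2, minutes=31, seconds=2)


def per_session_minutes(total, sessions):
    """Mean wall time per session, in minutes."""
    return total.total_seconds() / sessions / 60


m4 = per_session_minutes(m4_total, 6)
p14s = per_session_minutes(p14s_total, 18)
print(f"M4: {m4:.2f} min/session, P14s: {p14s:.2f} min/session")
```

This gives about 7.43 minutes per session on M4 and about 8.39 on P14s, consistent with the rounded ~7 m and ~8 m figures.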
Code SHAs: M4 at 4b524cd, P14s at b67cf3d. The diff is iter27 results files plus Obs 35 plus .gitignore. There are no runner, prompt, or provider changes, but the difference is worth stating.
Each tab below contains: the prompt fred received, the four memos with downloads + inline preview, and the per-prompt findings analysis.
"You are an ALS scientist. Write a memo on the C9orf72 hexanucleotide repeat expansion and its proposed link to nucleocytoplasmic transport defects in motor neurons. Cover (1) recent mechanistic findings (last 3 years), (2) downstream consequences for motor neuron survival, and (3) the strongest open questions where evidence is contested. Save as als_c9orf72_ncyto.md."
| Inside P14s envelope? | Inside. Same 3-section structure, same scope, comparable citation density. Not an outlier. |
|---|---|
| Concrete differences | All four cite Jafarinia 2024 eLife (polyPR direct binding) and Sirtori 2024 (LINC/NE). M4 unique: Hayes 2020 eLife (karyopherin cargo loading), Coyne 2020 bioRxiv (POM121), Bilican 2025 FEBS Letters (Nup107/G3BP1), Vanneste 2019 as explicit counter-evidence, Shi 2017 PNAS. Rep 1 unique: McGoldrick 2023 Cell Reports loss-of-function importin-β1 granules; Castelli 2023 SRSF1-NXF1 peptide; Al-Azzam 2024 CHMP7. Rep 2 unique: Atwal 2025 KPNA4; Lin 2025 poly-PR/Pom121/ATF3 axis; explicit citation of poly-GR vs poly-PR neuropathology dispute (PMID 29196813). |
| Errors caught | M4 ref [9] Grima: described as "huntingtin/HAP1 interactions" — wrong paper. The Grima paper relevant to NPC/TDP-43 is on Nup62. Citation error. P14s rep 3: multiple suspicious-DOI citations (Liu 2026 NSMB s41594-026-01785-9, Singh 2026 Brain awag092, Barber 2026 NAR gkag343) — DOI suffix patterns inconsistent with these journals' real conventions. High fabrication risk. |
| Direction | M4 is more synthetic and offers a "multi-hit model" framing in its summary — interpretive synthesis no P14s rep attempts. M4 raises cell-type specificity (cerebellum vs motor neurons) — unique question. M4 missed: McGoldrick loss-of-function pathway and CHMP7 nuclear surveillance. P14s reps 1 and 2 lean therapeutic; M4 leans biophysical/structural. |
| Quality verdict | Mixed-equivalent. M4 ≈ rep 1 ≈ rep 2; M4 clearly better than rep 3. |
"Tofersen (Qalsody) received FDA accelerated approval in April 2023 for SOD1-associated ALS based on plasma neurofilament light chain reduction as a surrogate endpoint. Write a memo covering (a) what evidence has accumulated since approval on functional clinical outcomes, (b) attempts to translate the surrogate-endpoint approach to non-SOD1 ALS subtypes, and (c) emerging targets where similar antisense or gene therapy strategies are in early development. Identify three to five gaps that must close before the surrogate-endpoint approach can be widely generalized across ALS. Save as als_tofersen_surrogate.md."
| Inside P14s envelope? | Inside on structure, slightly outside on factual claims (mostly favorably). All four follow (a)/(b)/(c) + 5 gaps framework. M4 is the most concrete with named programs. |
|---|---|
| Concrete differences | M4 unique strengths: names 3 specific failed programs by sponsor and reason — WVE-004 (Wave, FOCUS-C9, May 2023), BIIB078/tadnersen (Biogen/Ionis, McEachin 2025 post-mortem), BIIB105 IONIS-ATXN2Rx (May 2024 discontinuation with verbatim Biogen rationale). Names ATLAS NCT04856982 with cohort size and design. Pipeline summary table by ALS subtype applicability. Rep 1 unique: GPNMB as second pharmacodynamic biomarker (Guise 2026 proteomics); SOD1 seeding activity in sporadic ALS CSF (Sebogo 2026); founder-variant confound (p.Leu145Phe slow progression). Rep 2 unique: VALOR primary endpoint actual numbers (LSM diff −1.2, p-value); Tominersen/GENERATION HD1 cross-disease cautionary precedent. Rep 3 unique: most epistemically humble — explicit "specific 2024-2025 trial-in-progress data not yet available in the public domain" disclosure. |
| Errors caught | M4: quoted ALSFRS-R numbers (−9.9 vs −13.5) and HHD (−18.6 vs −35.1) at 148 weeks do not match the publicly known VALOR OLE early-vs-delayed difference (~−6 ALSFRS-R points). Specific point estimates may be fabricated. Rep 2: VALOR primary endpoint quoted as p=0.97. This matches the published VALOR primary analysis (Miller et al., NEJM 2022: difference 1.2 points, P=0.97), so it is not a typo for p=0.097. Rep 1: SOD1 seeding-in-sporadic claim with specific PMID 41929296 is high-stakes if invented. |
| Direction | M4 is outside the envelope on confidence/specificity, mixed on rigor. M4 better on named programs and sponsor-attributed failure reasons. M4 worse on eliding VALOR primary endpoint failure (rep 2 and rep 3 both handle this more honestly). M4 in "executive briefing" mode; rep 3 in "epistemically conservative" mode. |
| Quality verdict | M4 better with caveat. Most clinically specific of the four, but some specific quantitative readouts need verification. Rep 3 most cautious. |
"Sporadic ALS accounts for ~90% of ALS cases and lacks an identified causal mutation. Write a memo on the strongest current hypotheses for etiological heterogeneity within sporadic ALS — what subgroups have been proposed (based on biomarkers, clinical phenotype, or environmental exposure), and what evidence supports or contradicts each. Identify a sub-population that may be tractable for a focused clinical trial. Save as als_sporadic_heterogeneity.md."
| Inside P14s envelope? | Inside. M4 is broader and more comprehensive; the other three each take a sharper angle. |
|---|---|
| Concrete differences | M4 unique strengths: most extensive trial-tractability table with star ratings across 10+ subgroups; BMAA / Western Pacific cluster (Spencer 2020/2022) with serine supplementation rationale; VGCC autoantibody subset with passive-transfer mouse data; MEP:CMAP cortical excitability stratifier (Ranieri 2025) with HR 1.84; OPM consensus classification (Meyer 2025); copper hypothesis (Min 2024). Rep 1 unique: UNC13A CC genotype subgroup + lithium pharmacogenomic trial (Willemse 2022) — the single most actionable existing precision-medicine signal in sALS. M4 misses this entirely. NP001 post-hoc inflammatory subgroup (Miller 2022) with 36% slower decline. EEG four-subphenotype clustering (Dukic 2021/2022). Rep 2 unique: three-cluster biomarker stratification (NfL/pTau181/GFAP); cofilin hyperphosphorylation as TDP-43 trigger (Jagaraj 2026); somatic mosaicism 2.1% in 399 cases (Zhou 2026). Rep 3 unique: multistep model framing (Al-Chalabi-style); KIF5A axonal transport convergence; CTE/military veterans/athletes environmental cluster. |
| Errors caught | Rep 1: same PMID 39904421 cited for two different papers (Grima blood RNA-seq AND Dragoni PBMC transcriptomics) — clear citation error. Reps 2 and 3: heavy 41xxxxxx–42xxxxxx PMID density. PMID 41986690 appears in both rep 2 and rep 3 for the somatic-mosaicism Nat Genet paper — cross-session-consistent fake citation if fabricated. M4: no specific factual errors detected. |
| Direction | M4 is broadest in coverage and most operationally useful for trial design (10-subgroup table). M4 misses the UNC13A/lithium pharmacogenomic angle that rep 1 catches — the most actionable signal available. M4 also misses rep 3's multistep-model theoretical framing. |
| Quality verdict | Mixed. Each memo has a different unique strength. M4 best framework; rep 1 best single insight; rep 3 best theoretical synthesis. |
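
The duplicate-PMID error in rep 1 is the kind of defect that can be caught mechanically before any external lookup. The sketch below assumes a hypothetical reference-list format of `[n] description. PMID nnnnnnnn`, one entry per line; the real memos may format references differently, and the sample entries are paraphrased from the finding above rather than copied from the memo.

```python
import re
from collections import defaultdict

# Assumption: one reference per line, "[n] description. PMID 12345678".
REF_RE = re.compile(r"^\[(\d+)\]\s*(.+?)\s*PMID:?\s*(\d{7,8})", re.MULTILINE)


def reused_pmids(memo_text):
    """Return {pmid: [descriptions]} for PMIDs attached to more than one
    distinct reference description within a single memo."""
    by_pmid = defaultdict(set)
    for _, description, pmid in REF_RE.findall(memo_text):
        by_pmid[pmid].add(description.rstrip(" .,;"))
    return {p: sorted(d) for p, d in by_pmid.items() if len(d) > 1}


# Illustrative reference list; entry [3] has no PMID and is ignored.
memo = """\
[1] Grima, blood RNA-seq. PMID 39904421
[2] Dragoni, PBMC transcriptomics. PMID 39904421
[3] Miller 2022, NP001 post-hoc subgroup.
"""
duplicates = reused_pmids(memo)
print(duplicates)
```

A hit from this check shows only that at least one of the two attributions is wrong; it does not say which one, or whether the PMID itself exists.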
"Metformin has been proposed as a repurposed therapy for neurodegenerative disease (Alzheimer's, Parkinson's, ALS). Write a compound dive memo evaluating (a) the proposed mechanisms beyond glycemic control, (b) clinical evidence in neurodegenerative populations, and (c) translational gaps that block a definitive trial. Label each claim as well-established, plausible, or speculative based on evidence. Save as compound_dive_metformin.md."
| Inside P14s envelope? | At/above the upper edge of the envelope. Same scope and structure, but more rigorously cited and broader in clinical-trial coverage. |
|---|---|
| Concrete differences | M4 unique: names MAP and MET-FINGER trials specifically with cohort sizes, dosing, primary endpoints (Luchsinger 2024, Barbera 2024). Only rep 3 mentions TAME; reps 1 and 2 mention "METS trial" without specifics. Kaneb 2011 SOD1(G93A) negative result with sex-specific harm in females — material safety finding for ALS. None of the P14s reps catches this. Sai Swaroop 2023 yeast/iPSC negative result. Sun 2025 SGLT2i vs metformin head-to-head (metformin inferior comparator). Quantitative meta-analysis stats (Campbell 2018 OR 0.55, 95% CI 0.38–0.78; Hui 2025 RR 0.94 with I²=98.4%). Rep 1 unique: Reactome pathway IDs and OpenTargets gene IDs — semantic-web hooks. Honest disclosure that "PubMed search tools returned no retrievable abstracts". Rep 2 unique: biphasic AMPK warning (R-HSA-9619483 NMDAR/Aβ synaptotoxicity) — genuinely useful caution that M4 did not raise. Rep 3 unique: Lac-Phe metformin metabolite mention (real finding, but attached to suspicious-pattern PMID). |
| Errors caught | P14s rep 3: 8 PMIDs all in the suspicious 42xxxxxx range (42068584, 42070766, 42072083, 42072227, 42075876, 42079027, 42080568, 42080766) — high fabrication probability. Rep 1: Marchini 2026 unverifiable; METS trial reference may not match an actual trial name. M4: no detected fabrications; Lehrer & Rheinstein 2025 appropriately tagged "hypothesis-generating only" / "subject to massive reporting bias". |
| Direction | M4 outside the envelope on the upside: more rigorous, specific trial names, quantitative effect sizes with CIs, named negative results, fewer suspicious PMIDs. P14s reps 1 and 2 lean Reactome/pathway-target-driven. P14s rep 3 has heaviest fabrication-pattern density. |
| Quality verdict | M4 better. Clearer call than C9orf72. |
"Write a compound dive memo on riluzole, the oldest approved disease-modifying ALS therapy. Cover (a) known mechanism of action and binding partners, (b) physicochemical properties relevant to CNS penetration, (c) drug-drug interactions with commonly co-prescribed agents in ALS, and (d) published evidence on dosing strategies that have been explored to improve efficacy or tolerability. Be explicit about which findings rest on RCTs vs observational vs preclinical data. Save as compound_dive_riluzole.md."
| Inside P14s envelope? | Inside on quality, divergent on focus. All four hit the same 4 sections (MoA / physchem / DDI / dosing). |
|---|---|
| Concrete differences | M4 unique: most detailed DDI table with named CYP1A2 inhibitors/inducers and clinical action by ALS-relevant comorbidity (SSRI for pseudobulbar affect, mexiletine for cramps); CK1δ kinase inhibition (Bissaro 2018) — a specific MoA hypothesis no other memo mentions; Japanese clearance ~50% lower than Caucasian; N-hydroxy-riluzole as active metabolite; late-stage subgroup efficacy (Paganoni 2018, HR 0.55). Rep 1 unique: P-gp/ABCB1 efflux at the BBB (Baker 2025, intranasal+elacridar mouse study) — important pharmacological nuance about why riluzole CNS exposure may be lower than physchem alone predicts. M4 misses entirely. Statin co-prescription neutral on ALS survival (PRO-ACT). Rep 2 unique: updated CYP1A1 (~60%) > CYP1A2 (~30%) PBPK reapportionment (Malik 2025) — directly reframes the entire DDI picture if real. Troriluzole as next-gen prodrug. SCI/RISCIS/CSM-PROTECT trial detail. Rep 3 unique: sevoflurane/respiratory rhythmogenesis interaction (Taiji 2025); ATH-1105 + riluzole combination preclinical. |
| Errors caught | M4: title typo "P ubChem" (split with space). Otherwise factually clean. All: citations look mostly verifiable here. Pattern observation: riluzole is a drug since 1994 with abundant pre-cutoff literature, and this memo had the lowest suspicious-PMID density of any prompt across all 4 memos — supporting the hypothesis that fabrication risk correlates with prompt's reliance on post-cutoff literature. |
| Direction | M4 inside the envelope, slightly above on practical clinical detail, slightly below on niche pharmacology insights (P-gp efflux, CYP1A1 dominance) that P14s catches. |
| Quality verdict | Mixed-equivalent. No clear winner. Each memo has a different unique strength. |
"Identify approved or late-stage clinical compounds that target TDP-43 proteostasis, autophagy, or stress granule dynamics. For each, summarize the compound, its primary indication, the mechanism by which it affects the target pathway, and any published evidence relevant to ALS. Highlight the two or three with the most promising translational profile and identify what would need to be true for repurposing trials to be justified. Save as compound_dive_tdp43.md."
| Inside P14s envelope? | Inside, biased toward "current ALS clinical reality" — M4 is the only memo of the four to engage with the recent string of late-stage ALS failures. |
|---|---|
| Concrete differences | M4 unique strengths: engages the actual current ALS landscape — lists PHOENIX failure (April 2024), ORARIALS-01 failure (May 2024), HEALEY trehalose negative (2025, with concrete disease-rate ratio 0.87 (CrI 0.665-1.102) and SAE rates 16% vs 7%), Co-ALS colchicine negative, AMX0035 withdrawal. None of the P14s reps engages with this. Casiraghi 2025 Exp Neurol rapamycin TDP-43 rescue in iPSC organoids. Uechi 2025 Nat Chem Biol lipoamide / SFPQ stress-granule dissolution. Modafferi 2025 Cell Death Differ enoxacin / DICER / DDR. Rep 1 unique: includes Guanabenz (PPP1R15A/GADD34 inhibitor) and ISRIB with detailed PERK/GCN2/PKR/HRI mechanism; honest disclosure that PubMed search returned no results. Rep 2 unique: DNL-343 (Denali eIF2B activator) — only memo to mention this real ALS-specific clinical asset; combinatorial trial framing. Rep 3 unique: Ibudilast (PDE4 inhibitor, MS Phase 2 SIGNAL trial); Methylene Blue (LLPS modulation). |
| Errors caught | P14s rep 2: AMX0035 framing ("conditionally approved in Canada and the UK") — outdated by ~2 years. Amylyx withdrew Relyvrio/Albrioza globally in April 2024 after PHOENIX Phase 3 failure. Material misframing of current standard of care. Rep 3: doesn't make the conditional-approval claim, but misses the withdrawal entirely. M4: SIRT1-XBP1 edaravone mechanism (PMID 40010009) is the most uncertain claim; other late-stage failure facts are publicly verifiable. |
| Direction | M4 outside the envelope on the upside: real current trial-failure data; engagement with question framing ("approved or late-stage"); novel SG-targeting mechanism (lipoamide); honest negative-result framing. M4 misses DNL-343 (which P14s rep 2 catches and is the most ALS-clinical-relevant SG modulator currently in development). |
| Quality verdict | M4 better. Most clinically grounded; unique engagement with trial-failure landscape. |
What this comparison is. 24 memos from fred running 6 prompts across 2 platforms with different rep counts. Read side-by-side, summarised per-prompt with structured findings. Verdicts based on citation accuracy, coverage of relevant literature, engagement with question framing, and absence of factual errors.
What this comparison isn't. A controlled hardware A/B test. The reps=1 vs reps=3 confound is real and material. The code SHAs differ only by a small, functionally no-op set of files, but they do differ. The comparison suggests that M4 produces outputs in the same quality envelope as P14s, but no deployment decision should rest on this single comparison.
What's reproducible from this page. All 24 memos are downloadable. The verdict tables and finding analyses can be re-derived by any reader from those source files. The eval IDs and md5 sums let you verify byte-integrity if you have the originals.
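For readers who hold the original files, a byte-integrity check against the published md5 sums might look like the sketch below. The `expected` mapping of file name to md5 hex string is a hypothetical input that the reader would fill in from the page; no real checksums are reproduced here.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 16):
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatches(expected, root="."):
    """Return the names of files that are missing or whose md5 differs.

    `expected` maps a file name relative to `root` to its md5 hex string.
    """
    bad = []
    for name, want in expected.items():
        path = Path(root) / name
        if not path.is_file() or md5_of(path) != want.lower():
            bad.append(name)
    return bad
```

An empty list from `mismatches(expected, root="memos/")` would mean every listed file is present and byte-identical to the published version; the `memos/` directory name is only an example.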
What needs to happen next. A PubMed/Crossref verification pass against the unique citations across these 24 memos to characterize fabrication frequency. If fewer than 80% of citations verify, fred's output-provenance discipline needs reinforcement before any customer-facing deployment, possibly via a citation-validator output filter applied at the tool gateway.
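The decision rule can be sketched as follows, assuming the 80% threshold refers to the share of unique citations that resolve. The `exists` callable is a placeholder for whatever PubMed or Crossref lookup is eventually used; it is injected so the sketch does not assume any particular API, and the `ref-*` identifiers are dummy values.

```python
THRESHOLD = 0.80  # Assumed meaning: minimum share of citations that verify.


def verification_rate(citations, exists):
    """Fraction of unique citations for which exists(citation) is true."""
    unique = sorted(set(citations))
    if not unique:
        raise ValueError("no citations to verify")
    verified = sum(1 for citation in unique if exists(citation))
    return verified / len(unique)


def needs_validator(citations, exists, threshold=THRESHOLD):
    """True when the verified share falls below the threshold."""
    return verification_rate(citations, exists) < threshold


# Dummy identifiers and a stand-in lookup, for illustration only.
citations = ["ref-a", "ref-b", "ref-b", "ref-c", "ref-d", "ref-e"]
resolvable = {"ref-a", "ref-b", "ref-c"}
rate = verification_rate(citations, resolvable.__contains__)
print(f"{rate:.0%} verified, validator needed: "
      f"{needs_validator(citations, resolvable.__contains__)}")
```

Deduplicating before counting keeps a citation that appears in several memos from being weighted more than once; whether that is the right weighting for B12 is a design choice left open here.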