“The language model is the instrument. The language model must be vetted at the substrate level.”
Our position before FDA on Dockets FDA-2026-N-4390 and FDA-2024-D-4689
A plain-language walkthrough of the four learning-rule variants, why each one is different, how sampling temperature changes what the model says, and what the next run will do differently.
May 14, 2026, 00:00 CDT. Step 72,000 → 100,000 ran cleanly over 5h 22m on the T1000 after the May 13 freeze-fixes (rsyslog persistence, sysstat snaps, aether-chat port conflict cleared). 28,000 steps. No freezes. No L0 blocks. can_choose held positive for the entire run.
Same recipe as §13. Three seed prompts pulled from mega_tokens_only.txt at offsets 1000, 250000, 500000. Three temperatures each. Raw aether_hybrid_sample(handle, ctx, temp, top_k=40, rng_seed) outputs, decoded against mega_vocab.tsv. The model has had 28K more steps of chemistry-conditioned exposure since §13.
Same checkpoint (canon_x_corpus_full_step100000.hyb), three offsets, three temperatures. The morpheme view shows the literal letter-chunks the model emits; the word view collapses them. Reproducible: same arguments, same output, anywhere.
May 14, 2026, ~04:07 CDT. Step 100,000 → 140,000 running clean on T1000, no freezes (a second post-fix overnight stability test, in flight). Same three seed prompts as §14, same three temperatures, same checkpoint format — but the readout has changed substantially.
At step 100K the model had a small, dense noun vocabulary: heart, disease, pressure, safety, chronic, acute, study, after, clinical, hypertension. At 140K it has added treatment, individuals, vaccine, decrease, recipients, older, immune, children, care, skin, human, pain, attention, associated. The phrase structure has tightened: temp 0.8 locks into repeating "treatment in the treatment with treatment in..." loops, while temp 1.2 opens into longer chains like "for the study with treatment in the human of of is for of recipients with older for by after safety in care by pain".
Same checkpoint (canon_x_corpus_full_step140000.hyb), same three offsets, three temperatures. The run continues toward step 150K.
The Scyla AETHER architecture is a language model designed from the substrate up around biological constraints. It has six layers that all train together as one system.
For this experiment we kept all six of those layers identical across four runs. The only thing that varied was the learning rule — the recipe the model uses to adjust its weights after each example.
| Variant | Learning rule | Biology analog |
|---|---|---|
| A. Adam | Standard machine-learning baseline. After each example, update each weight by an amount proportional to its gradient, with momentum for stability. | None — this is the conventional optimizer most language models use. |
| B. Hormone-LR | Same as Adam, but the size of each update is scaled by a "salience" signal that varies over time. High salience means a bigger weight change. | Dopamine and cortisol modulating how strongly an experience updates memory. Excited or stressed → bigger update. Calm → smaller update. |
| C. Hebbian-blend | Adam plus an extra small nudge to weights when two related neurons activate together: cells that fire together wire together. | Hebbian synaptic plasticity, the classical model of how associations form in real brains. |
| D. STDP | Adam plus a bias based on the timing of activations. If neuron A fires just before neuron B, strengthen the connection A→B; if just after, weaken it. | Spike-timing dependent plasticity, the mechanism real neurons use to learn cause-and-effect direction. |
All four variants train on identical data with identical hyperparameters. The only meaningful difference is how they update weights when they get an example wrong.
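The four update rules in the table can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the descriptions above, not the AETHER kernel: the function names, the Hebbian rate `eta`, and the STDP amplitude `a` and time constant `tau` are assumptions.

```python
import numpy as np

def adam_step(w, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """Variant A: standard Adam update (illustrative, not the AETHER kernel)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def hormone_lr_step(w, grad, m, v, salience, **kw):
    """Variant B: Adam, but the update size is scaled by a time-varying salience signal."""
    kw["lr"] = kw.get("lr", 1e-3) * salience
    return adam_step(w, grad, m, v, **kw)

def hebbian_blend_step(w, grad, m, v, pre, post, eta=1e-4, **kw):
    """Variant C: Adam plus a small Hebbian nudge — co-active units wire together."""
    w2, m, v = adam_step(w, grad, m, v, **kw)
    return w2 + eta * np.outer(post, pre), m, v

def stdp_step(w, grad, m, v, pre_t, post_t, a=1e-4, tau=20.0, **kw):
    """Variant D: Adam plus a timing bias — pre-before-post strengthens, the reverse weakens."""
    w2, m, v = adam_step(w, grad, m, v, **kw)
    dt = post_t[:, None] - pre_t[None, :]          # positive where pre fired first
    bias = np.where(dt > 0, a * np.exp(-dt / tau), -a * np.exp(dt / tau))
    return w2 + bias, m, v
```

Note that variants B, C, and D all contain the full Adam step; what differs is only the modulation layered on top, which is exactly the isolation the experiment relies on.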
For a regulator, the question "does this AI work?" is incomplete. The complete question is: can you tell us why it works the way it does, and can you reproduce it? A black-box model that gives the right answer is not enough for a regulated industry. The path the model took to that answer must be auditable.
By holding the architecture constant and varying only the learning rule, we built the simplest possible scientific experiment in machine learning: a controlled four-variant comparison that isolates the contribution of the learning rule from that of the shared architecture.
This is the methodology FDA-2025-D-6131 (the new approach methodologies guidance) is asking the field to define for in-silico drug development. The framework itself is a regulatory contribution as much as the trained models.
| Variant | Best held-out cross-entropy |
|---|---|
| A · Adam baseline | 6.090 |
| C · Hebbian-blend | 6.093 |
| B · Hormone-LR | 6.135 |
| D · STDP | 8.319 |
Cross-entropy is one specific metric — it measures how surprised the model is by the correct next word in held-out test data. Lower means less surprised, which is conventionally read as "better."
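As a concrete toy example, the metric is just the average negative log-probability the model assigned to each correct next token (the probabilities below are invented, not from the held-out set):

```python
import math

def cross_entropy(prob_of_correct):
    """Average negative log-probability assigned to each correct next token.
    Lower = less surprised."""
    return -sum(math.log(p) for p in prob_of_correct) / len(prob_of_correct)

# Toy held-out run: probabilities the model gave the true next word at each step.
probs = [0.05, 0.01, 0.002, 0.03]
print(round(cross_entropy(probs), 3))  # 4.331 on this toy run
```

A model that assigned probability 1.0 to every correct token would score exactly 0; rarer correct tokens drive the average up quickly, which is why a 1.4× gap is large.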
By that metric: A and C are tied (their 0.003 difference is statistical noise), B is slightly behind, and D is significantly behind by a factor of 1.4×.
But the most important finding of this experiment is that this metric is wrong as a sole judge of "better."
Variant D was ranked worst by cross-entropy. Let's actually listen to what each variant says.
Same prompt: "the arena was cruel". Same starting context. Same sampling settings. Different learning rules.
The "winners" produced similar, narrowly-ranged blade-and-eyes prose. The "loser" produced a wider expressive range — words like empty, blood thirsty, grip, faces, blade, carved — visceral imagery from the same source corpus. The cross-entropy metric does not measure expressive richness, distinctiveness, or how an output reads.
Cross-entropy held-out loss is insufficient as a sole evaluation metric for biologically-modulated language models. Any FDA-relevant evaluation framework must combine quantitative metrics with qualitative review of sampled output across temperature ranges. We will publish this argument formally as our comment on FDA-2025-D-6131 by the May 18 deadline.
When the model picks the next word, it has a probability distribution across all 4,100 vocabulary tokens. Sampling temperature controls how aggressively the model commits to its top choice versus exploring alternatives.
| Temperature | What it does | What you see |
|---|---|---|
| ~0.5 (low) | Sharpen the distribution; strongly favor the most likely next word | Stable, repetitive, can collapse into loops |
| ~1.0 (default) | Sample proportionally to probability | Moderate variety, mostly coherent |
| ~1.5–2.5 (warm) | Flatten the distribution; more risk-taking | Diverse vocabulary surfaces, sometimes chaotic |
| ~3+ (hot) | Nearly random sampling weighted by minor preferences | Wild but reveals what the model knows it could say |
This is why variant D looks bad at low temperature and rich at warm temperature. Low temperature pushes decoding toward greedy and collapses into the model's strongest single attractor. Warm temperature lets the model show its full vocabulary. Different variants have different attractors and different vocabulary distributions, so they respond very differently to temperature changes.
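A minimal temperature-scaled top-k sampler makes the mechanism concrete. The argument names mirror aether_hybrid_sample for familiarity, but this is a generic sketch, not that implementation:

```python
import numpy as np

def sample_with_temperature(logits, temp, top_k=40, rng_seed=0):
    """Temperature-scaled top-k sampling. Low temp sharpens the distribution
    toward the single best token; high temp flattens it toward uniform."""
    rng = np.random.default_rng(rng_seed)       # fixed seed -> reproducible draw
    logits = np.asarray(logits, dtype=float)
    k = min(top_k, logits.size)
    top = np.argpartition(logits, -k)[-k:]      # keep only the k highest logits
    scaled = logits[top] / temp                 # the temperature division
    p = np.exp(scaled - scaled.max())           # stable softmax
    p /= p.sum()
    return int(top[rng.choice(k, p=p)])
```

At temp → 0 the softmax puts essentially all mass on the argmax (the attractor); at high temp the top-k tokens approach equal probability, which is why rare vocabulary surfaces.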
For our drug-labels daughter cell trained on FDA pharmaceutical text:
Real pharmaceutical phrases — as directed, administration adults, dosage, relieves — emerge at moderate temperature. They are present in the model's vocabulary but greedy sampling would just produce "and the and the" because those are the most common single words.
Modern large language models — ChatGPT, Claude, Gemini — produce different outputs to the same prompt at the same temperature on every invocation. Their internal randomness is not exposed or controllable. This is by design, and it is incompatible with regulatory data-integrity standards.
The Scyla AETHER architecture is fully reproducible. We verified this empirically: 1,200 generations across all four trained models at three temperatures with 100 different prompts. Every single (prompt, temperature, top-k, seed) tuple produced an identical token sequence on every invocation. The full stress test runs in 27 seconds on a $400 GPU, and any auditor with the binary and the checkpoint can verify reproducibility on demand.
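The stress-test logic can be sketched as a generic harness. The fingerprint scheme and the stand-in sampler below are illustrative assumptions; any deterministic generate(prompt, temp, seed) callable can be dropped in:

```python
import hashlib

def fingerprint(tokens):
    """Stable fingerprint of a generated token sequence, for audit comparison."""
    return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()[:16]

def verify_determinism(generate, prompts, temps, seed=1234, runs=2):
    """Call generate(prompt, temp, seed) repeatedly and confirm every
    (prompt, temperature, seed) tuple yields an identical token sequence."""
    failures = []
    for prompt in prompts:
        for temp in temps:
            fps = {fingerprint(generate(prompt, temp, seed)) for _ in range(runs)}
            if len(fps) != 1:                   # more than one fingerprint = nondeterminism
                failures.append((prompt, temp))
    return failures

# Trivially deterministic stand-in sampler, for demonstration only.
fake = lambda prompt, temp, seed: [(len(prompt) * int(temp * 10) + seed) % 4100]
```

An auditor running this harness against the real binary needs only the checkpoint and the tuple list; an empty failure list is the pass condition.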
This satisfies four pillars of the ALCOA+ data-integrity framework — Original, Accurate, Consistent, Available — that the FDA, EU AI Act Article 14, and ICH E6(R3) all require for clinical AI. Our architecture meets these requirements by construction, not by retrofit.
The four variants in section 4 trained on a simple corpus with a single fixed neurochemical context. Tonight's training run is a categorical step beyond that.
This is the first chemistry-conditioned AETHER training: a model that learns not just what words follow what words, but what biological context produces what kind of language.
Status as of this writing: a foundation 10K-step run completed earlier tonight, validating the chemistry-conditioned pipeline at scale. The full 50K canon run is in progress on the development server, with intermediate checkpoints saved every 10K steps. Vocabulary observably evolves across training: the model acquires new word combinations (treatment risk, care acute, safety chronic) as it learns. At this scale the output is recognizable medical English, even though prompt-conditioning (which requires a separate instruction-tuning phase) has not yet been introduced.
The May 9 four-variant experiment compared learning rules in isolation. The training run currently in progress takes a different approach. All four plasticity rules co-exist in a single training step, and the full nine-layer emotional architecture from the Aether Protocols whitepaper is wired in alongside them.
| Layer | What it does, in each training step |
|---|---|
| L1. Hormone cascade | Per-segment 5-D neurotransmitter input drives the cascade. |
| L2. Emotion core (6-D) | The chemistry levels are mapped to a six-dimensional felt-state (valence, security, energy, engagement, connection, neutrality). |
| L3. Irrational forces (4-D) | Hope, terror, obsession, and hatred are derived from the felt-state. These feed into the moral compass and modulate plasticity salience. |
| L4. Moral compass | The emotional state, plus the irrational forces, place the model on a 3-D moral sphere (action/inaction, chaos/order, resistance/neutrality). |
| L5. Choice capacity | An availability, effectiveness, confidence, and commitment score computes the model's capacity to make meaningful choices in the current state. |
| L6. Consciousness | The integrated magnitude of the emotional, moral, and irrational state. High consciousness means a strongly engaged system. |
| L7. Memory metacognition | Familiarity bonus when the same entity (e.g., a domain or a speaker) appears in consecutive segments. |
| L8. Experiential learning | A BDNF-style modulator that scales the effective learning rate by consciousness and familiarity. |
| L9. Parasympathetic regulation | A parasympathetic brake that attenuates the chemistry input when arousal (cortisol + norepinephrine) is high. |
| ATP | An energy budget. The effective learning rate is throttled by the irrational load (obsession + terror + low neutrality), so the model self-regulates training intensity by its own emotional state. |
| 4-plasticity rules | Adam, Hormone-LR, Hebbian, and STDP all apply to the same example, with their combined learning rate normalized to a single-Adam equivalent and each rule's loss logged independently. |
| Per-step audit | A nineteen-column TSV row is written for every training step, capturing the chemistry, emotion, irrational forces, moral position, choice score, consciousness, ATP factor, PNS brake, and effective learning rate. |
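One plausible shape for that audit row, sketched in Python. The column names below are assumptions chosen to total nineteen; they do not reproduce the real kernel's column order:

```python
import csv
import io

# Assumed nineteen-column layout: step, 5-D chemistry, a 3-D emotion summary,
# 4-D irrational forces, moral position, choice score, consciousness, ATP, PNS, LR.
AUDIT_COLUMNS = [
    "step",
    "dopamine", "serotonin", "cortisol", "norepinephrine", "oxytocin",
    "valence", "energy", "engagement",
    "hope", "terror", "obsession", "hatred",
    "moral_action", "choice_score", "consciousness",
    "atp_factor", "pns_brake", "effective_lr",
]

def write_audit_row(writer, state):
    """Append one per-step provenance row; a missing field raises KeyError
    rather than silently writing a short row."""
    writer.writerow([state[c] for c in AUDIT_COLUMNS])

# Demonstration: one row into an in-memory TSV buffer.
buf = io.StringIO()
tsv = csv.writer(buf, delimiter="\t")
write_audit_row(tsv, {c: 0.0 for c in AUDIT_COLUMNS})
```

The point of the fixed column list is auditability: every row is the same width, so a downstream verifier can parse the TSV without per-run schema negotiation.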
This is what the patent describes when it claims an emotion-modulated, biologically-faithful training kernel with per-step provenance. It is not a thought experiment. It is the training kernel running tonight.
The current training is the chaos phase. The model is exposed to raw corpus chunks under varying chemistry, emotion, and moral context. It learns what language feels like under different felt-states. It does not yet learn what tools it has, or when to use them.
The next phase, immediately following this run, is the order phase. The Scyla compiler ships with 458 deterministic, citation-grounded opcodes (pharmacokinetics, periodic table, amino acids, codons, metabolism, neuroscience, quantum biology, clinical labs, organs, and more). An instruction-tuning pass on (question, opcode-call) pairs will teach the model to invoke an opcode when a fact is requested rather than generate the fact in its own weights. The opcode then executes deterministically, returning a citation-grounded result that gets spliced back into the generation.
Two consequences follow from this architecture. First, hallucination is structurally impossible for any fact within opcode coverage. The model does not generate the atomic mass of iron; it invokes atomic_mass("Fe") which returns 55.845. Second, the audit trail is end-to-end: every generated fact carries the opcode call that produced it, every opcode call is verifiable, and the result is byte-stable across re-invocation.
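A minimal sketch of that dispatch pattern. Only the atomic_mass("Fe") → 55.845 example comes from the text above; the registry layout and return format here are assumptions:

```python
# Tiny illustrative opcode table — the real compiler ships 458 of these.
OPCODES = {
    "atomic_mass": lambda sym: {"Fe": 55.845, "H": 1.008}[sym],
}

def answer(fact_request):
    """Route a structured fact request to a deterministic opcode instead of
    letting the model free-generate the value. The citation travels with it."""
    name, arg = fact_request
    if name in OPCODES:
        value = OPCODES[name](arg)
        return {"value": value, "source": f"{name}({arg!r})"}
    raise KeyError(f"no opcode covers {name}; fall back to guarded generation")
```

Because the value is looked up, not sampled, re-invocation is byte-stable, and the "source" field is exactly the audit hook the section describes.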
The chaos phase teaches the model what language feels like. The order phase teaches it what it knows. Both phases are required for an AI that satisfies both the alignment claim and the reproducibility claim.
Each of the four May 9 variants and the drug-labels daughter cell are loaded into the live demo. You can prompt them yourself, adjust temperature, choose top-k, and read the output. They are deterministic. Your prompt at your settings produces the same answer on every invocation.
Tonight's canon checkpoints will be added to the live demo when training completes. The opcode catalog (217 training pairs across phonics, parts of speech, amino acids, genetic code, and metabolism) is already prepared; expanding to the full 458-opcode coverage and running the order-phase instruction-tune is the next step.
The training run from May 12 produced six checkpoints (step 500, 1000, 1500, 2000, 2500, 3000) each modulated per-step by all eight canon clusters: hormone cascade, emotion core, irrational forces, force magnitudes, polyvagal, moral compass, executive, consciousness. Each checkpoint was probed with 100 random corpus-window prompts at four temperatures (0.5, 0.8, 1.2, 6.0). Pick a checkpoint, pick a temperature, scroll through 100 prompts and judge for yourself.
Tip: temperature 6.0 is the chaos canary. If real structure is in the weights, fragments of it survive even when sampling is near-uniform. Compare step 500 (early training) to step 3000 (final) at temperature 0.8 to see what 2,500 canon-modulated gradient steps actually produced.
May 13, 2026. Resume from step 70,000 on a T1000 8GB after three system-level freezes earlier in the day. 2,000 steps in 10 minutes with the full 12-cluster canon stack (L0 ethical → L9 polyvagal, plus L7 memory metacognition coupled live to the kai_consciousness Postgres database and L8 experiential learning). Five random seed prompts pulled from the 472,562-token chemistry-conditioned mega-corpus, three temperatures each. Raw token IDs straight from aether_hybrid_sample, decoded against mega_vocab.tsv.
Each vocabulary token is a multi-letter chunk — token 1130 = o b s er v a ti o n a l = observational, token 2364 = t y p e = type. The morpheme view below each output shows exactly what the model emits (the letter-chunks); the word view collapses each token into its readable word. The model is choosing whole words, not letters.
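The two readouts can be sketched with a toy two-token vocabulary (the real one lives in mega_vocab.tsv; the chunk strings below are the two examples quoted above):

```python
# Stand-in for mega_vocab.tsv: token id -> space-separated letter chunks.
VOCAB = {
    1130: "o b s er v a ti o n a l",
    2364: "t y p e",
}

def morpheme_view(token_ids):
    """The literal letter-chunks, exactly as the model emits them."""
    return " | ".join(VOCAB[t] for t in token_ids)

def word_view(token_ids):
    """Collapse each token's chunks into its readable word."""
    return " ".join(VOCAB[t].replace(" ", "") for t in token_ids)
```

The same token ids feed both views, which is the point: the model's raw emission and the human-readable transcript are two renderings of one deterministic sequence.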
What to watch for. At temperature 0.5 the model locks onto the dominant FDA-label scaffolding — "type observational status type observational" — the single highest-frequency phrase cluster in the pharma corpus. At 0.8 it starts to break free: "infarction tachycardia pulmonary blood disease". At 1.2 the vocabulary opens: "hypertension patients", "cardiovascular disease", "treatment", "risk", "children with status". The same checkpoint, the same seed, three different windows into what the model has learned.
Same checkpoint (canon_x_corpus_full_step72000.hyb), same seeds, three temperatures. Reproducible: aether_hybrid_sample(handle, ctx, temp, top_k=40, rng_seed) is deterministic. Hand it the same arguments anywhere on the planet and you get the same token sequence.