Mycel-LM (79M)
Model will be ungated for open download soon! These models are undertrained and are NOT meant to be finished products. They are research artifacts only.
Mycel-LM is a 79.2M-parameter research language model whose channel-mixing block is not an MLP. It is a differentiable Neighbour-Sensing fungal-colony-growth model: each token is expanded into a colony of hyphal tips that grow in a bounded latent region, sense a shared density field, and steer their own growth — the "MLP" is replaced by a few differentiable steps of colony growth, read back out into the hidden state.
It is part of a family of models that ask a single question: can the generalizing ability of a transformer be carried by an unusual, self-organizing dynamical system in place of the feed-forward block? Mycel-LM keeps the family's tokenizer, traits, and data fixed and swaps only the mixer, so it is a controlled experiment against the sibling Quazimoto models (whose mixer is a bank of coupled Kuramoto oscillators).
⚠️ Research artifact, not a product. At ~79M parameters it is fluent but small: it models the shape of language well and generates coherent, grammatical text, but it is not factual and will confidently hallucinate. See Limitations.
Table of contents
- Highlights
- Architecture
- Repository layout
- Install
- Quickstart
- Command-line usage
- Live visualizer & Space
- Training from scratch
- Fine-tuning (SFT)
- Checkpoints
- Limitations
- Citation / basis
Highlights
- Novel mixer. The per-layer feed-forward block is replaced by a MycelBlock — a differentiable simulation of fungal colony growth (Neighbour-Sensing).
- Self-describing checkpoints. Each .pt embeds a family_config recording the exact geometry, so generate.py / healthcheck.py / visualize.py rebuild the model with no external config.
- KV cache. Incremental decoding is wired through the whole stack (attention presents are threaded per layer); generate() prefills the prompt once and decodes one token per forward.
- Self-speculative decoding. Four MTP draft heads propose the next tokens and the main head verifies them in one parallel forward. The output is bit-identical to greedy decoding, with fewer forwards.
- Live 3-D visualizer. Watch the colony grow token-by-token as a Three.js filament web.
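The draft-then-verify loop behind the self-speculative decoding highlight can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `next_token` and `draft` are hypothetical stand-ins, not the model's main head, its MTP heads, or `forward_drafts()`, and the real verification happens in one batched forward rather than a Python loop. The point it shows is only why accepting the longest verified prefix reproduces greedy decoding exactly.

```python
def next_token(seq):
    # Hypothetical stand-in for the main head's greedy (argmax) choice.
    return (sum(seq) * 31 + len(seq)) % 50

def draft(seq, k=4):
    # Hypothetical stand-in for k draft heads; every third position is
    # deliberately wrong so that rejection is exercised.
    out, ctx = [], list(seq)
    for _ in range(k):
        tok = next_token(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % 50
        out.append(tok)
        ctx.append(tok)
    return out

def greedy(seq, n_new):
    seq = list(seq)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq

def speculative(seq, n_new, k=4):
    seq = list(seq)
    target = len(seq) + n_new
    rounds = 0
    while len(seq) < target:
        rounds += 1  # one verification pass per round
        ctx = list(seq)
        for tok in draft(seq, k):
            verified = next_token(ctx)  # what greedy would emit at this position
            ctx.append(verified)        # the verified token is kept either way
            if verified != tok:
                break                   # first mismatch: drop the remaining drafts
        seq = ctx
    return seq[:target], rounds

prompt = [1, 2, 3]
spec_out, rounds = speculative(prompt, 20)
```

Each round emits at least one verified token, so the loop never needs more rounds than greedy needs steps, and every emitted token is the one greedy would have chosen.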
Architecture
Standard causal Transformer backbone (token-mixing = attention, tied LM head), with the per-layer feed-forward network replaced by a MycelBlock.
The MycelBlock (the novel part)
Based on the Meškauskas / Fricker / Moore (2004) Neighbour-Sensing model of fungal colony growth:
- The hidden state projects to N = 96 hyphal tips per token, each with a position in a bounded 3-D latent region and a growth vector.
- A few differentiable growth steps run: each tip senses the local density field, steers away from the colony's own density (negative autotropism) with persistence, moves, and is re-clamped into the bounded region (the colony can't grow unbounded).
- The final [position, growth-vector, sensed-density] of every tip is read out back into the hidden state, behind a family gate.
The density field is evaluated against 16 learnable field centres (a low-rank sample of the field) so cost is O(N·F) per step, not O(N²) — the same mean-field trick that keeps the sibling oscillator block cheap. Health-checking a trained checkpoint shows the tropism parameter converges strongly negative across layers, i.e. the model genuinely learns the grow-away-from-density behaviour rather than leaving it at init.
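One growth step as described above can be sketched in plain Python. This is not the code in mycel.py: the Gaussian density kernel, the hard clamp, the blend between persistence and steering, and every parameter name (`sigma`, `persistence`, `step`, `bound`) are assumptions chosen to make the sense, steer, move, clamp sequence concrete.

```python
import math

def sense(pos, centres, weights, sigma=0.5):
    """Density and its gradient at `pos`, sampled against F field centres."""
    density, grad = 0.0, [0.0, 0.0, 0.0]
    for centre, w in zip(centres, weights):
        d = [p - c for p, c in zip(pos, centre)]
        k = w * math.exp(-sum(x * x for x in d) / (2 * sigma ** 2))
        density += k
        for i in range(3):
            grad[i] += -k * d[i] / sigma ** 2  # gradient of the Gaussian bump
    return density, grad

def growth_step(tips, vecs, centres, weights,
                tropism=-1.0, persistence=0.8, step=0.1, bound=1.0):
    """Sense, steer, move, and re-clamp every tip once. Cost is O(N * F)."""
    new_tips, new_vecs, sensed = [], [], []
    for pos, vec in zip(tips, vecs):
        density, grad = sense(pos, centres, weights)
        # Negative tropism steers down the density gradient (away from the colony).
        steer = [tropism * g for g in grad]
        v = [persistence * a + (1 - persistence) * s for a, s in zip(vec, steer)]
        # Move, then clamp back into the bounded region.
        p = [max(-bound, min(bound, a + step * b)) for a, b in zip(pos, v)]
        new_tips.append(p)
        new_vecs.append(v)
        sensed.append(density)
    return new_tips, new_vecs, sensed

# A single tip just to the right of a single centre drifts further right.
moved, _, _ = growth_step([[0.1, 0.0, 0.0]], [[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]], [1.0])
```

With the card's N = 96 tips and F = 16 field centres, each step evaluates 96 × 16 = 1,536 tip-centre pairs, versus 96 × 96 = 9,216 tip-tip pairs for a full pairwise field.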
Trait stations (MycelStations): tiny memory specialists sit at fixed anchor positions
in the colony. A tip interacts with a station by proximity — which is emergent from
where the tip grew — so which tips use which trait "comes to be" during growth rather than
being assigned to a fixed index. The stations hold test-time-writable input/output stores
that act as an addressable context memory at inference.
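To make "interacts with a station by proximity" concrete, here is a minimal sketch that assumes the interaction is a softmax over negative squared distances from a tip to the station anchors. The card does not specify the actual weighting, so `station_weights`, `read_stations`, and the `temperature` parameter are all hypothetical.

```python
import math

def station_weights(tip, anchors, temperature=0.25):
    """Soft assignment of one tip to stations: closer anchors get larger weights."""
    scores = [-sum((t - a) ** 2 for t, a in zip(tip, anchor)) / temperature
              for anchor in anchors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def read_stations(tip, anchors, memories):
    """Proximity-weighted mix of the stations' memory vectors."""
    w = station_weights(tip, anchors)
    dim = len(memories[0])
    return [sum(wi * mem[j] for wi, mem in zip(w, memories)) for j in range(dim)]

anchors = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w = station_weights([0.9, 0.0, 0.0], anchors)
```

Because the weights depend only on where the tip ended up, which station a tip reads from is a consequence of growth rather than of a fixed index.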
Attention
Family attention ported from the Quazimoto v2 stack:
- MLA low-rank Q/O projections
- Partial RoPE (nope + rope split), QK-Norm, GQA (4 KV heads)
- optional DERF (erf attention) and XSA (value-subspace removal) — off in this checkpoint
- KV cache for incremental decoding (per-layer (k, v) presents threaded through the stack)
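As a small illustration of the GQA setting, the sketch below maps the 12 query heads onto the 4 KV heads. The head counts come from the config table; the contiguous grouping (heads 0 to 2 share KV head 0, and so on) is the common convention and is assumed here, not confirmed by the card.

```python
N_HEADS, N_KV_HEADS = 12, 4  # from the config table

def kv_head_for(q_head):
    """KV head used by a given query head, assuming contiguous groups."""
    group = N_HEADS // N_KV_HEADS  # 3 query heads share each KV head
    return q_head // group

mapping = [kv_head_for(h) for h in range(N_HEADS)]
cache_fraction = N_KV_HEADS / N_HEADS  # KV cache size relative to one KV head per query head
```

Under this assumption the KV cache holds a third of the key/value heads that one-KV-head-per-query-head attention would store.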
Opt-in family traits (all live in this checkpoint)
| Trait | Role |
|---|---|
| HRM | iterative gated hidden-state refinement (random init state, gates open) |
| MoE | SwiGLU mixture (4 routed + 1 shared, top-2) refining the trunk |
| MTP (×4) | multi-token-prediction draft heads → enables self-speculative decoding |
| JEPA | representation-prediction aux loss (train-only; never runs at inference) |
| Ring Specialists (7/ring) | the trait stations described above |
| Fractal Phase Seed | seeds tip positions from each token's Mandelbrot orbit angles (gated) |
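The MoE row (4 routed + 1 shared, top-2) can be sketched as follows. Scalars stand in for hidden vectors and the experts are toy callables rather than SwiGLU blocks. Renormalising the gate weights over the two selected experts and adding the shared expert ungated are assumptions; the card does not state either.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring routed experts and renormalise their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:2]
    top = max(router_logits[i] for i in order)
    exps = {i: math.exp(router_logits[i] - top) for i in order}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_output(x, routed_experts, shared_expert, router_logits):
    gates = top2_route(router_logits)
    out = shared_expert(x)  # the shared expert always runs
    for i, g in gates.items():
        out += g * routed_experts[i](x)  # only the two selected experts run
    return out

experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
y = moe_output(2.0, experts, lambda x: 10 * x, [0.1, 2.0, -1.0, 2.0])
```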
Config (this checkpoint)
| Setting | Value |
|---|---|
| params | 79.2M |
| layers | 10 |
| d_model | 768 |
| heads | 12 (4 KV) |
| vocab | 16512 (SpikeWhale byte-merge) |
| block size | 2048 |
| tips / token | 96, in a 3-D bounded colony |
| field centres | 16 |
| growth steps | 3 |
| stations | 16 |
The checkpoint is self-describing: family_config inside the .pt records the exact
geometry so the model rebuilds itself on load.
Repository layout
model.py                   QuazimotoLM + QuazimotoConfig: the transformer backbone, attention,
                           KV cache, traits (HRM/MoE/MTP/JEPA), generate() and forward_drafts()
mycel.py                   MycelBlock (Neighbour-Sensing growth mixer) + MycelStations
family.py                  shared family layers (MoE, HRM, specialists, norms, ...)
fractal.py                 hierarchical Mandelbrot phase seeding (FractalSeed trait)
instrument.py              zero-cost capture hooks the visualizer reads from
special_tokens.py          ChatML / control-token definitions
spike_tokenizer.py         SpikeWhale byte-merge tokenizer (subclasses PreTrainedTokenizer)
tokenizer.json             the tokenizer vocab / merges (vocab 16,512)
fractal_phase.pt           precomputed hierarchical Mandelbrot phase table (regenerable)
generate.py                inference harness: KV cache + self-speculative decoding + sampling
healthcheck.py             per-layer weight / gate / PPL diagnostics for a checkpoint
visualize.py               builds the 3-D colony dashboard (viz.html) from a generation
train.py                   pretraining entry point (streamed multi-corpus blend)
train_sft.py               supervised fine-tuning (ChatML, assistant-only loss masking)
chat_sft.py                chat-format rendering / loss masking helpers used by SFT
train_opd.py               OPD (on-policy distillation) training loop
distill_uld.py             universal-logit-distillation utilities
opd_teacher.py             teacher wrapper for distillation
build_fractal_table.py     regenerates fractal_phase.pt
train.bat / train_sft.bat  Windows convenience launchers
chkpt/quazimoto.pt         pretraining checkpoint (step 149,000)
chkpt/quazimoto_sft.pt     SFT checkpoint (step 4,000, ChatML)
Note: the Modal cloud launchers (modal_train.py, modal_sft.py) are intentionally not part of this package. The scripts above run locally on CPU or a single GPU.
Install
pip install -r requirements.txt
Requirements are minimal: torch, numpy, transformers (the tokenizer subclasses
PreTrainedTokenizer). Training additionally uses datasets and huggingface_hub.
Everything below runs on CPU (slow but functional) or a single GPU.
Quickstart
import torch
from model import QuazimotoLM, QuazimotoConfig
from spike_tokenizer import SpikeTokenizer
# Load the checkpoint. weights_only=False unpickles arbitrary objects,
# so only load checkpoints you trust.
ck = torch.load("chkpt/quazimoto.pt", map_location="cpu", weights_only=False)
# Rebuild the model from the config stored in the checkpoint (self-describing).
cfg = QuazimotoConfig(**ck["family_config"])
model = QuazimotoLM(cfg)
model.load_state_dict(ck["model"], strict=False)
model.eval()
# Tokenize the prompt and generate (KV cache on by default).
tok = SpikeTokenizer(vocab_file="tokenizer.json")
ids = torch.tensor([tok.encode("The mycelium spreads through the soil", add_special_tokens=False)])
out = model.generate(ids, n_new=80, temperature=0.8, top_k=40)
print(tok.decode(out[0].tolist(), skip_special_tokens=True))
For a chat turn, load the SFT checkpoint (chkpt/quazimoto_sft.pt), which was trained on
ChatML framing, then wrap the prompt in ChatML and stop on <|im_end|>:
prompt = "<|im_start|><|user|>\nWhat is mycelium?<|im_end|>\n<|im_start|><|assistant|>\n"
ids = torch.tensor([tok.encode(prompt, add_special_tokens=False)])
out = model.generate(ids, n_new=120, temperature=0.7, top_k=40)
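The ChatML framing above can be wrapped in two small helpers. `chat_prompt` reproduces the exact prompt string shown in the example; `cut_at_end` trims a decoded reply at the first <|im_end|>. Both names are hypothetical and not part of the repository, and trimming after the fact is only a sketch for the case where the generated text runs past the end marker. To find the marker in the decoded text you would decode with skip_special_tokens=False, assuming the tokenizer treats <|im_end|> as a special token.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def chat_prompt(user_text):
    """Single user turn followed by an open assistant turn, in the card's ChatML form."""
    return f"{IM_START}<|user|>\n{user_text}{IM_END}\n{IM_START}<|assistant|>\n"

def cut_at_end(reply_text):
    """Keep only the text before the first end-of-turn marker."""
    return reply_text.split(IM_END, 1)[0]

prompt = chat_prompt("What is mycelium?")
reply = cut_at_end("Fungal threads.<|im_end|>\n<|im_start|><|user|>")
```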
Command-line usage
# plain completion (KV cache on by default)
python generate.py --ckpt chkpt/quazimoto.pt --prompt "In the beginning" --max_new_tokens 80
# chat turn (ChatML framing + stop on <|im_end|>)
python generate.py --ckpt chkpt/quazimoto_sft.pt --chat --prompt "Hello, who are you?"
# interactive REPL
python generate.py --ckpt chkpt/quazimoto_sft.pt --interactive
# self-speculative decoding (MTP heads draft, main head verifies; report acceptance)
python generate.py --ckpt chkpt/quazimoto.pt --speculative --spec_stats
# disable the KV cache (full recompute each step — for comparison)
python generate.py --ckpt chkpt/quazimoto.pt --no_cache
# per-layer diagnostics (weights / gates / PPL)
python healthcheck.py --ckpt chkpt/quazimoto.pt
Sampling knobs: --temperature, --top_k, --top_p, --repetition_penalty, --seed.
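For readers unfamiliar with these knobs, the following sketch shows the standard meaning of temperature, top-k, and top-p on a plain list of logits. It is a generic illustration, not the code in generate.py; in particular, the order in which the filters are applied there is not documented in this card and is assumed here.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    scaled = [l / temperature for l in logits]  # lower temperature sharpens the distribution
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep the k most likely tokens
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalise the survivors

dist = filter_probs([2.0, 1.0, 0.0, -1.0], temperature=0.8, top_k=2)
```

A sampler would then draw the next token from the returned distribution, for example with `random.choices(list(dist), weights=list(dist.values()))`.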
Live visualizer & Space
visualize.py renders the colony growing in 3-D as the model generates, token by token —
hyphal tips linked into a filament web, coloured by local density, with the trait stations
shown as orange wire-spheres. It writes a self-contained viz.html (Three.js from a CDN):
python visualize.py --ckpt chkpt/quazimoto_sft.pt --prompt "the mycelium spreads" --tokens 50
A companion Hugging Face Space (Mycel-LM v1) wraps the same architecture in an
interactive chat — KV-cache decoding drives the reply while the 3-D colony visualizer
animates the growth for the generated tokens.
Training from scratch
python train.py --device cuda --steps 160000 --batch 12 --block 2048 --amp \
--use-hrm --use-moe --use-mtp --use-jepa --use-ring-specialists --use-fractal-phase-seed \
--stream --math-frac 0.25 --out chkpt/quazimoto.pt --ckpt-every 500 --resume
- Tokenizer: SpikeWhale byte-merge, vocab 16,512. (Byte-merge perplexity is tokenizer-inflated; bits/byte is the honest metric.)
- Pretraining blend: 35% Ultra-FineWeb-L3 / 25% FineWeb-Edu / 25% FineMath / 15% Quazim0t0/PretrainNew, streamed. Streamed datasets are pulled with datasets; gated corpora need huggingface-cli login.
- --resume continues from the checkpoint at --out. The growth loop is activation-heavy, so keep the batch modest; --amp gives a bf16 speedup on GPU.
Pass --help to train.py for the full trait / optimiser / schedule surface.
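The tokenizer note above says bits/byte is the honest metric. The conversion is simple: total negative log-likelihood in nats, divided by ln 2 and by the number of UTF-8 bytes in the evaluated text. The sketch below uses made-up numbers, not results from this model, to show two hypothetical tokenizers that assign the same total likelihood to the same text yet report very different per-token perplexities.

```python
import math

def bits_per_byte(mean_nll_nats, n_tokens, n_bytes):
    """Convert mean per-token NLL (nats) into bits per byte of raw text."""
    return mean_nll_nats * n_tokens / (math.log(2) * n_bytes)

def perplexity(mean_nll_nats):
    """Per-token perplexity; depends on how the tokenizer splits the text."""
    return math.exp(mean_nll_nats)

# Illustrative numbers only: the same 1000-byte text under two tokenizers.
coarse = bits_per_byte(3.0, n_tokens=250, n_bytes=1000)    # fewer, harder tokens
fine = bits_per_byte(1.875, n_tokens=400, n_bytes=1000)    # more, easier tokens
```

Both settings come out at the same bits/byte, while the per-token perplexities differ by roughly a factor of three, which is why comparing perplexity across tokenizers is misleading.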
Fine-tuning (SFT)
python train_sft.py --init chkpt/quazimoto.pt --out chkpt/quazimoto_sft.pt \
--steps 4000 --batch 8 --block 2048 --amp
- Renders a chat mix in ChatML with assistant-only loss masking (chat_sft.py).
- SFT blend: ultrachat_200k_sft + ultrafeedback-sft + UltraData-SFT-2605/Knowledge + OpenThoughts2-1M-ShortThink.
- The bundled SFT checkpoint is only 4k steps — the chat format transferred but the model is still shallow.
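Assistant-only loss masking, mentioned in the list above, is usually implemented by giving non-assistant tokens a label the loss function ignores. A minimal sketch, assuming the common convention of -100 (the default ignore_index of PyTorch's cross-entropy); `build_example` is a hypothetical helper, not the function in chat_sft.py, and the next-token shift between inputs and labels is left out for brevity.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy

def build_example(turns):
    """turns: list of (role, token_ids). Only assistant tokens keep their labels."""
    input_ids, labels = [], []
    for role, ids in turns:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked out of the loss
    return input_ids, labels

input_ids, labels = build_example([("user", [5, 6, 7]), ("assistant", [8, 9])])
```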
The distributed checkpoints carry weights only (optimizer state stripped to keep the download small). Fine-tuning starts a fresh optimizer from them, which is the normal path; only exact resumption of the original pretraining run would need the optimizer state.
Checkpoints
- chkpt/quazimoto.pt: pretraining checkpoint, step 149,000
- chkpt/quazimoto_sft.pt: SFT checkpoint, step 4,000 (ChatML, early)
Both embed family_config (self-describing) and load with strict=False so future trait
additions stay backward-compatible.
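With strict=False, PyTorch's load_state_dict does not raise on key mismatches; it returns a result listing missing_keys and unexpected_keys. The plain-Python sketch below mirrors that bookkeeping so the backward-compatibility behaviour is easy to see. The key names are invented for illustration and do not come from this model.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Mimic what a non-strict load reports for two sets of parameter names."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        # in the model but absent from the checkpoint: left at their initial values
        "missing_keys": sorted(model_keys - checkpoint_keys),
        # in the checkpoint but unknown to the model: skipped
        "unexpected_keys": sorted(checkpoint_keys - model_keys),
    }

# Hypothetical names: a newer model adds a trait the old checkpoint lacks.
report = diff_keys(
    model_keys=["embed.weight", "blocks.0.attn.q.weight", "new_trait.gate"],
    checkpoint_keys=["embed.weight", "blocks.0.attn.q.weight"],
)
```

When loading a real checkpoint, it is worth printing the returned missing and unexpected keys once, since a non-strict load will also silently skip a misnamed parameter.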
Limitations
- Not factual. Small-model behaviour: fluent and grammatical, but it invents facts ("the capital of France is the largest and most important part of the world").
- SFT is early (4k steps) — answers follow the chat format but hallucinate.
- No safety tuning. No RLHF/guardrails; do not deploy in user-facing settings.
- Custom architecture: cannot be loaded with AutoModel; use the bundled model.py.
- This is an experiment in architecture, released to study whether a self-organizing growth process can carry a transformer's generalization. Treat outputs accordingly.
Citation / basis
Neighbour-Sensing model of hyphal growth: Meškauskas, Fricker & Moore (2004), Simulating colonial growth of fungi with the Neighbour-Sensing model of hyphal growth, Mycological Research 108(11).
License: Apache-2.0.
Citation
If you use this model, please cite:
@misc{mycellm79m,
title = {Mycel-LM-79M: A ~79M-parameter Neighbour-Sensing fungal-colony language model},
author = {Dean Byrne (Quazim0t0)},
year = {2026},
howpublished = {HuggingFace, \url{https://huggingface.co/Quazim0t0/Mycel-LM-79M}},
note = {Quazim0t0/Mycel-LM-79M}
}