Hey all!
I built an experimental variant of LeCun’s JEPA (Joint Embedding Predictive Architecture) inspired by aphantasia—the inability to visualize mental images. The core hypothesis: Forcing the model to predict transitions in a fully abstract latent space (no rich RGB reconstructions, just edge maps + relational blocks) might lead to more robust, efficient world models.
Key changes from standard JEPA (toy sketch below):
- Decoder-free: predicts latents directly (no pixel reconstruction loss).
- Input: Canny edges instead of raw RGB, for abstraction.
- RelationalBlock: a custom module for relation-aware processing in latents.
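To give a feel for the setup, here is a stripped-down PyTorch sketch of the idea. It is not the exact code in the repo; module names, sizes, and the RelationalBlock internals here are simplified stand-ins:

```python
# Minimal sketch of the decoder-free, edge-input JEPA idea above.
# Names (EdgeEncoder, LatentPredictor) and dimensions are illustrative,
# and this RelationalBlock is a plain self-attention stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeEncoder(nn.Module):
    """Encodes a 1-channel edge map (e.g. Canny output) into latent tokens."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1),
        )

    def forward(self, x):                      # x: (B, 1, H, W) edge map in [0, 1]
        z = self.conv(x)                       # (B, dim, H/8, W/8)
        return z.flatten(2).transpose(1, 2)    # (B, N, dim) latent tokens

class RelationalBlock(nn.Module):
    """Relation-aware mixing of latent tokens (self-attention used here)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        h = self.norm(z)
        z = z + self.attn(h, h, h)[0]
        return z + self.mlp(self.norm(z))

class LatentPredictor(nn.Module):
    """Predicts next-step latent tokens directly; no pixel decoder anywhere."""
    def __init__(self, dim=128):
        super().__init__()
        self.block = RelationalBlock(dim)
        self.head = nn.Linear(dim, dim)

    def forward(self, z_t):
        return self.head(self.block(z_t))

def jepa_step(encoder, target_encoder, predictor, edges_t, edges_t1):
    """One training step: predict the latents of frame t+1 from frame t.
    target_encoder would typically be an EMA copy of encoder (JEPA-style)."""
    z_t = encoder(edges_t)
    with torch.no_grad():
        z_t1 = target_encoder(edges_t1)        # stop-gradient target latents
    z_pred = predictor(z_t)
    # Latent-space loss only (cosine here; MSE is another common choice).
    return 1 - F.cosine_similarity(z_pred, z_t1, dim=-1).mean()
```

The only training signal is the latent-space prediction error; there is no decoder and no pixel loss anywhere.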
Early results on bouncing balls (temporal prediction) + CIFAR-10 linear eval are in the repo.
It’s super early/experimental (small-scale, toy-ish datasets), but the gains surprised me and align with ideas around abstract “world models” (shoutout Yann LeCun’s recent pushes).
Code is fully open + reproducible (PyTorch, quick train scripts).
Would love feedback—repros, ablations (e.g., hybrid RGB+edges?), critiques, or ideas for scaling to video/embodied tasks.
Thanks!
https://github.com/brodiedellis-sys/aiphant
I read your README and this is one of the cleaner JEPA variants I’ve seen recently. The aphantasia constraint isn’t just a nice analogy: stripping the decoder/pixel objective and forcing transition prediction on structural inputs (edges) plus relation-aware latents is exactly how you push the model toward dynamics and invariants instead of texture shortcuts. The fact you’re seeing the combo of lower drift, tighter OOD generalisation, fewer parameters, and faster inference is the pattern you’d hope for when the representation is doing real world-modelling work.

A few suggestions that would make the results land even harder (and scale more safely):

1) Tighten the evidence with capacity-matched controls. Right now A-JEPA is also smaller. That’s a win, but reviewers will ask whether robustness is coming from abstraction or simply regularisation via reduced capacity. If you can add one baseline where V-JEPA is parameter-matched (~0.39M) and another where A-JEPA is scaled toward the V-JEPA budget (~1.49M), you’ll isolate the causal factor cleanly.

2) Make drift and OOD metrics unambiguous. Your drift numbers are impressive (near-zero at horizon 10). That will get questioned unless the metric definition is watertight. I’d include the exact formula, aggregation method, and a plot of similarity vs horizon (mean±std across seeds); one possible definition is sketched at the end of this post. Same for OOD: show the distribution across multiple runs (3–10 seeds), not just a single-point result.

3) Ablations that will likely teach you something real: edges-only vs edges + low-frequency RGB (keep layout/colour, suppress texture), Canny vs Sobel/Laplacian vs a learned edge front-end, RelationalBlock swaps (attention vs message passing vs low-rank factorisation), and a prediction-target swap (latent→latent vs latent→relational tokens) to see what actually carries generalisation.

4) Stress-test the “structure bias” with standard robustness buckets. If you separate corruption robustness (noise/blur/contrast/occlusion) from natural distribution shift (dataset-to-dataset), you’ll learn whether you’re buying anti-texture invariance or true compositional generalisation.

Final practical suggestion (boring but high impact): ship each experiment as an evidence bundle (config, seed, dataset hash, git commit, metrics timeline, and checkpoint IDs). That turns “92% better OOD” into something other people can independently verify end-to-end, rather than a number they have to take on faith, and it massively reduces “I got different results” noise as others fork and scale this.
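To illustrate what I mean in (2): here is one way the drift metric could be pinned down. This is a hypothetical definition and aggregation, not necessarily what your code does, but it shows the level of detail that makes the claim reviewable:

```python
# One possible drift definition, purely illustrative: drift at horizon h is
# 1 - cosine similarity between the h-step open-loop rollout latent and the
# encoder latent of the true frame at t + h.
import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def drift_curve(encoder, predictor, frames, horizon=10):
    """frames: (T, 1, H, W) edge maps for one rollout; returns drift per horizon."""
    z = encoder(frames[0:1])                   # latent of the first frame
    drifts = []
    for h in range(1, horizon + 1):
        z = predictor(z)                       # open-loop rollout, no re-encoding
        z_true = encoder(frames[h:h + 1])
        sim = F.cosine_similarity(z.flatten(1), z_true.flatten(1), dim=-1)
        drifts.append(1.0 - sim.item())
    return np.array(drifts)                    # shape (horizon,)

def aggregate(curves):
    """curves: list of (horizon,) arrays across sequences and seeds."""
    stacked = np.stack(curves)                 # (runs, horizon)
    return stacked.mean(axis=0), stacked.std(axis=0)   # mean ± std per horizon
```

Plot the mean curve with a ±std band per model and the horizon-10 number becomes self-explanatory instead of contestable.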
Hey I really appreciate this!
I just implemented your suggestions and the results turned out great! I did a couple of extra things too, and my hypothesis still holds up. The repo is updated with all the changes and results.
Hello @bellis444. One thing I would advise is cleaning up the repo a little. Right now there are large artifacts (checkpoints) and output files in there, along with stray `__pycache__` and `.DS_Store` files, and they create noisy diffs. Set up a `.gitignore` to exclude those files from the repo, and offload the large items to Zenodo, get a DOI, and reference them; the checkpoints can then be an optional download. The reason is that a repo coming in at >150 MB may limit cloning, and to maximize adoption you want to clear all on-ramps. If you need to version the weights, use Git LFS. Add a LICENSE file as well: declaring the license only in the README causes issues downstream. And if you are looking for discovery, flesh out the description and repo metadata, or you will have a hard time being found in search.
The evidence bundle is good, but tighten it: every reported number must be traceable to its config (all hyperparameters), seed, git commit hash, dataset generation parameters plus a deterministic seed or hash set, and metrics + plots.
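To make that concrete, something as small as the helper below would do it. The field names are only a suggestion, not something already in the repo:

```python
# Hypothetical "evidence bundle" writer: one JSON record per run that ties
# every reported number back to its exact provenance.
import hashlib, json, subprocess, time
from pathlib import Path

def dataset_hash(data_dir):
    """Hash the raw dataset files so results can be matched to exact data."""
    h = hashlib.sha256()
    for f in sorted(Path(data_dir).rglob("*")):
        if f.is_file():
            h.update(f.read_bytes())
    return h.hexdigest()

def write_evidence_bundle(out_dir, config, seed, metrics, checkpoint_id, data_dir):
    bundle = {
        "timestamp": time.time(),
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip(),
        "config": config,                # all hyperparameters, verbatim
        "seed": seed,
        "dataset_sha256": dataset_hash(data_dir),
        "metrics": metrics,              # e.g. per-epoch and per-horizon values
        "checkpoint_id": checkpoint_id,  # e.g. a Zenodo DOI or Git LFS object id
    }
    out = Path(out_dir) / f"evidence_seed{seed}.json"
    out.write_text(json.dumps(bundle, indent=2))
    return out
```

Commit one of these per reported run and anyone can line a number in the README up with the exact commit, data, and checkpoint that produced it.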
I think this is a cool project, and bravo for exploring world models; I too have been following Yann’s work and find it fascinating. I made a little toy space on here, but I won’t bother linking it as it’s trivial. Regarding your output (I hope this does not come off as rude; I am not criticizing, just providing review):
- The headline delta-to-noise ratio looks small: 50.7±1.9 vs 4.3±3.3 for v3. That could be real, but people will ask for more seeds, confidence intervals, effect sizes, and capacity-matched baselines (since there is a parameter difference of ~600%?).
- For the results to be peer-proof: a parameter-matched V-JEPA (smaller, ~442k), a scaled A-JEPA (grown toward the V-JEPA budget), and some kind of ablation grid (toy version sketched below) would be nice, isolating each “v3 special” ingredient individually.
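For the ablation grid, even something this simple is enough. The ingredient names below are placeholders based on the README-level changes; swap in whatever the actual “v3 special” pieces are:

```python
# Toy ablation grid: every on/off combination of the candidate ingredients,
# each to be run with several seeds. Ingredient names are placeholders.
from itertools import product

INGREDIENTS = {
    "edge_input":       [True, False],   # Canny edges vs raw RGB
    "relational_block": [True, False],   # RelationalBlock vs plain MLP mixing
    "decoder_free":     [True, False],   # latent targets vs pixel reconstruction
}

def ablation_grid():
    keys = list(INGREDIENTS)
    for values in product(*INGREDIENTS.values()):
        yield dict(zip(keys, values))

for cfg in ablation_grid():
    print(cfg)   # launch one training run (with several seeds) per config
```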
I forgot to use numbers for the rest of the post. Hope that’s helpful, and thank you for sharing your work!