Difference between JEPA and Masked Auto Encoders

Hey, I was reading about the Joint Embedding Predictive Architecture (JEPA). Based on my understanding, the architecture works like this: there are two frames, x and y, where y comes after x. The x frame is masked so that it keeps only the important features we can understand about the world of x. From that masked x, the model is asked to predict y, and the prediction is called y’.

How is this different from Masked Auto Encoders, if that is the case?


Your description matches the high-level pattern “hide part of the input, predict something about what’s missing.” Both JEPA-style methods and MAE do that. The difference is what they predict and where the loss is applied.

What MAE is doing (Masked Autoencoder)

Goal: reconstruct the original signal (usually pixels) from a heavily masked input.

  • Input: an image split into patches. Most patches are masked.
  • Encoder: processes only the visible patches.
  • Decoder: takes the encoded representation plus mask tokens and tries to reconstruct the missing pixels.
  • Loss: mean-squared error in pixel space, computed only on the masked patches. The MAE paper states this directly: “mask random patches … and reconstruct the missing pixels,” using an asymmetric encoder-decoder. (CVF Open Access)

So MAE is a generative reconstruction objective: it forces the model to preserve enough low-level detail to reproduce the input.
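As a toy sketch of that objective (a deliberate simplification: the real MAE is a ViT encoder-decoder, and the mean-patch “model” below is a hypothetical stand-in for the decoder), note that the loss lives in pixel space and is computed only on the masked patches:

```python
import random

def mse(pred, target):
    # mean-squared error between two flat pixel vectors
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mae_step(patches, mask_ratio=0.75, seed=0):
    rng = random.Random(seed)
    n = len(patches)
    masked_idx = set(rng.sample(range(n), int(n * mask_ratio)))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    # hypothetical stand-in "model": predict every masked patch as the
    # mean of the visible patches (the real MAE uses a ViT decoder here)
    dim = len(patches[0])
    mean_patch = [sum(p[d] for p in visible) / len(visible) for d in range(dim)]
    # MAE-style loss: pixel-space MSE averaged over the MASKED patches only
    return sum(mse(mean_patch, patches[i]) for i in masked_idx) / len(masked_idx)

# identical patches are trivially reconstructable, so the loss is zero
print(mae_step([[1.0, 2.0]] * 8))  # 0.0
```

The point of the sketch is the shape of the objective, not the model: whatever fills in the masked slots is graded directly against raw pixels.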

What JEPA is doing (Joint Embedding Predictive Architecture)

JEPA is a family of self-supervised objectives proposed as part of a broader “world model” direction.
A JEPA trains an encoder so that, given one view/context, it can predict the representation (embedding) of another view/target.

I-JEPA (the common concrete instantiation in vision)

I-JEPA is explicitly described as non-generative and as predicting in representation space rather than pixels.

Mechanically (simplified):

  • Take one image.

  • Choose:

    • a context block (visible subset of patches),
    • several target blocks (patch regions you want to predict).
  • Context encoder produces embeddings for the visible context.

  • Predictor network outputs predicted embeddings for the target blocks.

  • A separate target encoder produces the target embeddings.

  • Loss: L2 distance between predicted target embeddings and target encoder embeddings.

  • The target encoder is updated via exponential moving average (EMA) of the context encoder weights (a teacher-student style stabilizer).
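The loss and the EMA update from the list above can be sketched as follows (a minimal illustration: the real encoders are ViTs with many weights, and `encode`, `ijepa_loss`, and `ema_update` are hypothetical names, not the paper’s API):

```python
def encode(w, patch):
    # toy "encoder": scale every pixel by a single weight w
    return [w * x for x in patch]

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ijepa_loss(context_w, target_w, predictor, context, targets):
    ctx_emb = [encode(context_w, p) for p in context]
    preds = [predictor(ctx_emb, pos) for pos in range(len(targets))]
    # target embeddings come from the SEPARATE target encoder
    # (no gradient flows through it in the real method)
    tgt_emb = [encode(target_w, p) for p in targets]
    # the loss is L2 distance in EMBEDDING space, never in pixels
    return sum(l2(p, t) for p, t in zip(preds, tgt_emb)) / len(targets)

def ema_update(target_w, context_w, momentum=0.996):
    # teacher weights slowly track the student: w_t <- m*w_t + (1-m)*w_c
    return momentum * target_w + (1 - momentum) * context_w
```

A training step would compute `ijepa_loss`, backpropagate only into `context_w` and the predictor, and then call `ema_update` to refresh the teacher.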

Two details that often clear up confusion:

  1. JEPA does not try to reconstruct pixels. It tries to predict embeddings that (ideally) capture semantics.
  2. In I-JEPA, the targets are defined at the level of the target encoder’s outputs (masking is applied to its output embeddings, not by corrupting its input), and the paper notes this distinction is “crucial” for obtaining semantic targets.

About your “two frames x and y”

That framing is more natural for video JEPA variants (past context predicts future embeddings). JEPA is intended to generalize to time as well as space. The “world model” motivation is explicit in LeCun’s position paper.
But the most-cited JEPA implementation in vision, I-JEPA, uses one image and predicts masked regions’ representations, not future frames’ pixels.

The core difference in one sentence

  • MAE: predict missing pixels (reconstruct the input).
  • JEPA: predict missing/future embeddings (predict in latent space).

Why that difference matters

1) Semantics vs surface detail

  • Pixel reconstruction (MAE) forces retention of many details that are not semantically important (textures, exact colors).
  • Representation prediction (JEPA) is designed to let the model drop unpredictable or unnecessary low-level detail and focus on what helps prediction in an abstract space. The I-JEPA paper motivates this explicitly: predicting in representation space can eliminate unnecessary pixel-level details and lead to more semantic features.

2) Architecture and compute

  • MAE needs a decoder that maps back to pixels. (CVF Open Access)
  • I-JEPA replaces the pixel decoder with a predictor that outputs embeddings and uses a teacher (EMA target encoder).

3) Collapse avoidance is handled differently

  • MAE: collapse is not the typical failure mode because reconstructing pixels is a strong constraint.
  • JEPA/I-JEPA: because you match embeddings, you generally need stabilization to avoid degenerate solutions. I-JEPA uses an EMA-updated target encoder (teacher-student style).
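A tiny numerical illustration of why embedding matching needs stabilization (hypothetical linear “encoder” f(x) = w·x, not anything from the papers): a collapsed encoder with w = 0 maps every input to the same embedding and gets a perfect loss for free.

```python
def embed(w, x):
    # degenerate one-parameter "encoder"
    return [w * xi for xi in x]

def match_loss(w_student, w_teacher, pairs):
    # L2 between student embedding of x and teacher embedding of y
    return sum(sum((a - b) ** 2 for a, b in zip(embed(w_student, x),
                                                embed(w_teacher, y)))
               for x, y in pairs)

pairs = [([1.0, 2.0], [3.0, 4.0]), ([5.0, 6.0], [7.0, 8.0])]
print(match_loss(0.0, 0.0, pairs))      # 0.0 -- degenerate but "perfect"
print(match_loss(1.0, 1.0, pairs) > 0)  # a non-collapsed encoder pays a cost
```

If both sides were trained freely on this loss, gradient descent could happily drive both toward w = 0; freezing the teacher and updating it only via EMA removes that shortcut.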

4) “Masked x has only important features” is not quite right

In both MAE and I-JEPA, the mask is usually not “importance-based.” It is a training trick to create a prediction task.

  • In MAE it is typically random patches. (CVF Open Access)
  • In I-JEPA it is a structured sampling of context and target blocks to push semantics (large targets, distributed context).

“Important features” are what you hope emerges in the learned representation, not what is explicitly kept by the mask.
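To make the masking contrast concrete, here is a sketch on a toy 8×8 patch grid (the actual I-JEPA block sampling involves random scales and aspect ratios; this only shows scattered random patches versus one contiguous block):

```python
import random

def random_mask(n_patches=64, ratio=0.75, seed=0):
    # MAE-style: drop individual patches uniformly at random
    rng = random.Random(seed)
    return set(rng.sample(range(n_patches), int(n_patches * ratio)))

def block_mask(grid=8, top=2, left=2, h=3, w=3):
    # I-JEPA-style: one contiguous rectangular target block of patches
    return {r * grid + c for r in range(top, top + h)
                         for c in range(left, left + w)}

print(len(random_mask()))  # 48 scattered patch indices
print(len(block_mask()))   # 9 contiguous patch indices
```

Neither mask consults image content; both are sampling procedures that define the prediction task.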

Side-by-side comparison

| Aspect | MAE | JEPA (e.g., I-JEPA) |
| --- | --- | --- |
| Prediction target | Pixels (reconstruct missing patches) (CVF Open Access) | Embeddings of target blocks (latent prediction) |
| Output space | Pixel / patch values | Representation space |
| Needs decoder to pixels | Yes (CVF Open Access) | Not for the SSL loss |
| Typical masking | Random high-ratio masking (CVF Open Access) | Context block + semantic target blocks |
| “Generative?” | Yes (reconstruction) | Non-generative (predict embeddings) |
| Stabilization | Reconstruction provides constraint | Often uses EMA target encoder, etc. |

Links (primary sources)

Notes are above. URLs are here.

I-JEPA paper (arXiv): https://arxiv.org/abs/2301.08243
I-JEPA paper (PDF): https://arxiv.org/pdf/2301.08243
MAE paper (arXiv): https://arxiv.org/abs/2111.06377
MAE paper (CVPR PDF): https://openaccess.thecvf.com/content/CVPR2022/papers/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.pdf
LeCun “A Path Towards Autonomous Machine Intelligence” (OpenReview PDF): https://openreview.net/pdf?id=BZ5a1r-kVsf
Official MAE code (Meta): https://github.com/facebookresearch/mae
Official I-JEPA code (Meta): https://github.com/facebookresearch/ijepa
Meta blog on I-JEPA: https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/

Summary

  • MAE reconstructs pixels from masked input. (CVF Open Access)
  • JEPA predicts embeddings of a target from context, not pixels.
  • I-JEPA uses one image with context and target blocks, plus an EMA target encoder.
  • Your “x then y frame” story fits video JEPA, but the defining difference stays the same: latent prediction vs pixel reconstruction.

Hey, thanks for the answer. I think I will get more clarity after reading the I-JEPA paper, based on your answer. I will reach out if I have more questions. Thank you for your time.
