Collections including paper arxiv:2503.02130

- STEM: Scaling Transformers with Embedding Modules
  Paper • 2601.10639 • Published • 1
- Deep Delta Learning
  Paper • 2601.00417 • Published • 34
- mHC: Manifold-Constrained Hyper-Connections
  Paper • 2512.24880 • Published • 299
- VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse
  Paper • 2512.14531 • Published • 14

- Forgetting Transformer: Softmax Attention with a Forget Gate
  Paper • 2503.02130 • Published • 32
- L^2M: Mutual Information Scaling Law for Long-Context Language Modeling
  Paper • 2503.04725 • Published • 21
- Transformers without Normalization
  Paper • 2503.10622 • Published • 170
- I-Con: A Unifying Framework for Representation Learning
  Paper • 2504.16929 • Published • 30

- LM2: Large Memory Models
  Paper • 2502.06049 • Published • 31
- Titans: Learning to Memorize at Test Time
  Paper • 2501.00663 • Published • 29
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
  Paper • 2501.17161 • Published • 124
- You Do Not Fully Utilize Transformer's Representation Capacity
  Paper • 2502.09245 • Published • 37

- LLM Pruning and Distillation in Practice: The Minitron Approach
  Paper • 2408.11796 • Published • 58
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
  Paper • 2408.09174 • Published • 52
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
  Paper • 2408.10914 • Published • 45
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
  Paper • 2408.11878 • Published • 63

- You Do Not Fully Utilize Transformer's Representation Capacity
  Paper • 2502.09245 • Published • 37
- LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
  Paper • 2502.15007 • Published • 174
- Transformers without Normalization
  Paper • 2503.10622 • Published • 170
- Forgetting Transformer: Softmax Attention with a Forget Gate
  Paper • 2503.02130 • Published • 32

- RuCCoD: Towards Automated ICD Coding in Russian
  Paper • 2502.21263 • Published • 133
- Unified Reward Model for Multimodal Understanding and Generation
  Paper • 2503.05236 • Published • 123
- Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
  Paper • 2503.05179 • Published • 46
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
  Paper • 2503.05592 • Published • 27

- What Matters in Transformers? Not All Attention is Needed
  Paper • 2406.15786 • Published • 31
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
  Paper • 2410.17243 • Published • 92
- Forgetting Transformer: Softmax Attention with a Forget Gate
  Paper • 2503.02130 • Published • 32
- Transformers without Normalization
  Paper • 2503.10622 • Published • 170

- Depth Anything V2
  Paper • 2406.09414 • Published • 103
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
  Paper • 2406.09415 • Published • 51
- Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
  Paper • 2406.04338 • Published • 39
- SAM 2: Segment Anything in Images and Videos
  Paper • 2408.00714 • Published • 120