SSE (Stable Static Embedding): Unlocking the Potential of Static Embeddings, A Dynamic Tanh Normalization Approach without Speed Penalty
Rikka Botan
Independent Researcher, Japan
https://rikka-botan.github.io
Abstract
1 Main Contributions
1. We analyze limitations commonly observed in static embedding models, particularly gradient instability and inter-dimensional imbalance, which can negatively affect embedding quality.
2. We propose Separable DyT (Dynamic Tanh normalization), a simple normalization mechanism that stabilizes training and improves the structure of the embedding space.
3. We introduce SSE (Stable Static Embedding), a parameter-efficient static embedding framework, and demonstrate through extensive experiments on NanoBEIR that it achieves strong retrieval performance while maintaining fast inference and low computational cost.
Figure 1 | (a) Retrieval performance (nDCG@10) across NanoBEIR English tasks. (b) Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on TREC-COVID and Quora using an Intel® Core™ Ultra 7 265K (3.90 GHz) with batch size 32.
Figure 2 | (a) Retrieval performance (nDCG@10) across NanoBEIR Japanese tasks. (b) Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on Miracl using an Intel® Core™ Ultra 7 265K (3.90 GHz) with batch size 32.
2 Greeting
As plum blossoms light upon slender branches and tie ribbons into crystalline air—how are you doing today?
My name is Rikka Botan, nice to meet you.
This article provides technical insights into static embedding models.
If you are interested, please follow my account. I share updates on my research progress as well as everyday stories.
3 Introduction
Dense vector representations have become a fundamental component of modern information retrieval and retrieval-augmented language systems. However, a significant trade-off exists between accuracy and efficiency in current architectures. While contextual embedding models based on transformer architectures achieve strong semantic performance, their computational cost during inference remains a major bottleneck for large-scale and latency-sensitive applications. In contrast, static embedding models offer extremely fast inference and low memory consumption due to their simple and deterministic structure. These properties make them particularly attractive for large-scale search, recommendation, and real-time retrieval systems. Consequently, there is a widening gap between static embeddings and contextual encoders: the former provide speed but often lack accuracy, while the latter provide accuracy but are computationally expensive. Bridging this gap requires new techniques that enhance the representational quality of embedding models without introducing significant computational overhead.
The development of word representations has evolved significantly to address these needs since the pioneering work on Word2Vec (Mikolov et al., 2013), which established distributed word embeddings based on the distributional hypothesis. This was followed by GloVe (Pennington et al., 2014), which introduced global co-occurrence statistics for more stable learning, and FastText (Bojanowski et al., 2017), which incorporated subword information to improve robustness for rare words. These models enabled practical deployment in large-scale systems through their fixed token representations and lightweight composition functions. Most notably, recent developments such as static-similarity-mrl-multilingual-v1 and static-retrieval-mrl-en-v1 (sentence-transformers, 2025) have marked a significant milestone, demonstrating that static embeddings, rather than contextual encoders, can achieve practical retrieval performance while being 400 times faster than transformer-based models (Aarsen, 2025).
Nevertheless, despite such remarkable efficiency gains, the rapid growth of web-scale corpora and retrieval-augmented generation pipelines continues to drive demand for embedding models that are even faster and more accurate. The existing gap between static embeddings and contextual encoders still requires further closing to meet industrial standards. More critically, static embedding approaches suffer from a fundamental limitation in structural expressiveness that hinders this progress. Because they rely on fixed token representations and lightweight composition functions, their ability to capture complex semantic relationships is inherently constrained. Specifically, anisotropy has been widely observed as an issue in embedding spaces: the representation space tends to develop directional bias where dimensions exhibit uneven variance, leading to representational imbalance across features. This phenomenon is often associated with gradient instability during training, which causes non-uniform development of representation capacity and degrades generalization performance. Previous attempts to improve embedding quality, including recent advances in Matryoshka Embeddings (Kusupati et al., 2024), have focused primarily on compression or incremental objectives, often failing to address this core representational imbalance during the training process itself.
In this work, we propose SSE (Stable Static Embedding), a simple yet effective framework for improving the performance of static embedding models. SSE adopts Separable DyT (Dynamic Tanh normalization), itself a derivative of DyT (Zhu et al., 2025), a lightweight normalization mechanism that stabilizes gradient flow and suppresses inter-dimensional imbalance during training. By dynamically controlling the scale and saturation of embedding activations, Separable DyT mitigates overfitting and improves the uniformity of the embedding space. This results in more discriminative and robust representations without increasing model complexity. We demonstrate through extensive experiments that SSE significantly outperforms conventional static embedding methods while maintaining a compact parameter size. Despite having only 16 million parameters, SSE achieves a mean NanoBEIR (English) nDCG@10 score of 0.512, surpassing several larger and more complex baselines. Furthermore, SSE requires only half the number of parameters compared to prior approaches with comparable performance, highlighting its efficiency advantage.
4 Method
4.1 Structure
The core component of SSE (Stable Static Embedding) is Separable DyT (Dynamic Tanh normalization), a lightweight normalization module that introduces magnitude-adaptive gradient flow for each embedding dimension. Separable DyT operates directly on embedding vectors and can be inserted as a post-projection normalization layer without introducing significant computational overhead.
Given an input embedding vector x ∈ ℝ^d, SSE applies Separable DyT independently to each dimension, producing a normalized representation y ∈ ℝ^d. This transformation reshapes the geometry of the embedding space by suppressing unstable high-magnitude dimensions and preserving informative low-magnitude features.
Figure 3 | SSE (Stable Static Embedding) Architecture
4.2 Separable DyT (Dynamic Tanh normalization)
For each embedding dimension i, Separable DyT computes the output as:

y_i = β_i · tanh(α_i · x_i + b_i)

where α_i, b_i, and β_i are learnable parameters that control scaling, shifting, and output amplitude respectively.

The derivative with respect to the input dimension is:

∂y_i/∂x_i = α_i · β_i · sech²(α_i · x_i + b_i)

This formulation introduces magnitude-dependent gradient gating. The gradient magnitude is modulated by the squared hyperbolic secant function:

sech²(z) = 4 / (e^z + e^{−z})²

For saturated dimensions (|α_i · x_i + b_i| ≫ 1), we have sech²(α_i · x_i + b_i) ≈ 4e^{−2|α_i · x_i + b_i|}, which yields exponential decay. Consequently, gradients vanish: ∂y_i/∂x_i → 0.

For non-saturated dimensions (|α_i · x_i + b_i| ≪ 1), we have sech²(α_i · x_i + b_i) ≈ 1, preserving near-constant gradients: ∂y_i/∂x_i ≈ α_i · β_i.
Thus, Separable DyT adaptively attenuates gradients for large-magnitude (often noisy or overfitted) dimensions, while maintaining full gradient flow for small-magnitude, information-rich dimensions.
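The gating behavior described above can be checked numerically. The following is a minimal sketch of the analytic gradient (the parameter values are illustrative defaults, not trained values):

```python
import math

def dyt_grad(x: float, alpha: float = 0.5, beta: float = 1.0, bias: float = 0.0) -> float:
    """Analytic derivative of y = beta * tanh(alpha * x + bias) w.r.t. x:
    dy/dx = alpha * beta * sech^2(alpha * x + bias)."""
    z = alpha * x + bias
    return alpha * beta / math.cosh(z) ** 2

g_small = dyt_grad(0.1)   # non-saturated: stays close to alpha * beta = 0.5
g_large = dyt_grad(20.0)  # saturated: exponentially small, effectively gated off
```

As the input magnitude grows, the gradient decays exponentially, while small-magnitude dimensions keep near-constant gradient flow.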
4.3 Implicit Regularization via Magnitude-Adaptive Gating
The key property of Separable DyT is that it acts as an implicit regularizer without introducing additional loss terms or hyperparameters. The magnitude-dependent gating mechanism selectively suppresses unstable feature directions during optimization, thereby:
- Reducing inter-dimensional imbalance in the embedding space
- Preventing gradient explosion and over-amplification
- Mitigating overfitting by dampening extreme activations
- Improving representation uniformity and isotropy
Unlike standard normalization techniques (e.g., layer normalization or L2 normalization), Separable DyT does not globally rescale embeddings. Instead, it performs dimension-wise adaptive modulation, allowing each feature to learn its own dynamic range and sensitivity.
4.4 Integration into Static Embedding Models
Separable DyT is applied to the output of an EmbeddingBag layer, which aggregates token embeddings into a fixed-dimensional representation. Let E ∈ ℝ^{V×d} denote the embedding matrix and let a sentence be represented by a set of token indices T = {t_1, …, t_n}. The EmbeddingBag layer computes a pooled representation:

h = (1/n) · Σ_{j=1}^{n} E[t_j]

where the aggregation is typically performed using mean pooling over the selected embeddings.

Separable DyT is then applied to the aggregated representation:

z = SeparableDyT(h)

where z denotes the final sentence embedding.
Because Separable DyT operates element-wise on the aggregated vector and introduces only a small number of learnable parameters per dimension, it integrates seamlessly into existing static embedding architectures without altering their structural simplicity.
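As a self-contained illustration of this pipeline, here is a pure-Python sketch of mean pooling followed by the element-wise transform (the toy embedding table and parameter values are invented for illustration only):

```python
import math

def mean_pool(embedding_table, token_ids):
    """EmbeddingBag-style mean pooling: average the rows selected by token_ids."""
    dim = len(embedding_table[0])
    pooled = [0.0] * dim
    for t in token_ids:
        for j in range(dim):
            pooled[j] += embedding_table[t][j]
    return [v / len(token_ids) for v in pooled]

def separable_dyt(h, alpha, beta, bias):
    """Element-wise y_j = beta_j * tanh(alpha_j * h_j + b_j)."""
    return [b * math.tanh(a * x + c) for x, a, b, c in zip(h, alpha, beta, bias)]

# Toy vocabulary of 4 tokens with 3-dimensional embeddings (illustrative values);
# the third dimension is deliberately large to show tanh saturation.
E = [[0.2, -0.1, 4.0],
     [0.4,  0.3, 4.0],
     [0.0, -0.5, 4.0],
     [0.6,  0.1, 4.0]]
h = mean_pool(E, [0, 1, 3])                              # pooled sentence vector
z = separable_dyt(h, alpha=[0.5] * 3, beta=[1.0] * 3, bias=[0.0] * 3)
```

Note how the large-magnitude third dimension is squashed toward the tanh asymptote while the small-magnitude dimensions pass through almost linearly.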
4.5 Computational Efficiency
SSE preserves the primary advantage of static embedding models: extremely fast inference. The EmbeddingBag operation performs efficient aggregation of token embeddings, avoiding the need for sequential computation or deep contextual encoding.
The Separable DyT transformation consists of element-wise affine transformations followed by a tanh activation, both of which are computationally inexpensive and highly parallelizable. The number of additional parameters introduced by Separable DyT grows linearly with the embedding dimension and remains negligible compared to contextual encoders.
Consequently, SSE maintains the constant-time inference characteristics of static embedding models while improving the stability and expressiveness of the embedding space.
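The parameter overhead can be made concrete with a quick count. Separable DyT adds exactly three learnable vectors of size d (α, β, bias) on top of the embedding table (the vocabulary size below is illustrative, not the released model's actual value):

```python
def sse_param_count(vocab_size: int, hidden_dim: int) -> tuple[int, int]:
    """Parameters of the EmbeddingBag table vs. the Separable DyT module."""
    table_params = vocab_size * hidden_dim   # V x d lookup table
    dyt_params = 3 * hidden_dim              # alpha, beta, bias: one vector each
    return table_params, dyt_params

# Illustrative sizes only:
table, dyt = sse_param_count(vocab_size=32_000, hidden_dim=1024)
overhead = dyt / table  # equals 3 / vocab_size, well under 0.01% here
```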
5 Experiments
5.1 Training Configuration
We train our embedding model using contrastive learning combined with Matryoshka Loss to support variable-dimensional embeddings. Following the training protocol of "Train 400x faster Static Embedding Models with Sentence Transformers" (Aarsen, 2025), we optimize for multiple projection dimensions simultaneously: 32, 64, 128, 256, and 512. This allows the model to maintain retrieval performance across different embedding sizes without retraining.
We utilize the AdamW optimizer with mixed-precision training enabled via bf16. The learning rate is set to 0.1 with a cosine decay schedule and a warmup ratio of 0.1. For efficient gradient accumulation, we employ a per-device batch size of 512 with 8 gradient accumulation steps, resulting in an effective batch size of 4,096. We train for exactly one epoch (num_train_epochs=1) and evaluate the model at regular step intervals. To prevent overfitting on duplicate pairs during contrastive learning, we apply a no_duplicates batch sampler.
The specific non-default hyperparameters used in our experiments are summarized in Table 1.
Table 1 | Training Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW (beta2: 0.9999, epsilon: 1e-10) |
| Learning Rate | 0.1 |
| LR Scheduler | Cosine Decay |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 512 |
| Gradient Accumulation Steps | 8 |
| Training Epochs | 1 |
| Precision | BF16 (bf16: True) |
| Evaluation Strategy | Steps |
| Dataloader Workers | 4 |
| Batch Sampler | no_duplicates |
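A sketch of the Matryoshka-style evaluation implied by this setup: embeddings are truncated to each target dimension and re-normalized before computing similarities. This is a minimal illustration, not the training loss itself; the toy 8-dimensional vector stands in for the real 512-dimensional embedding:

```python
import math

def truncate_and_normalize(vec: list[float], dim: int) -> list[float]:
    """Matryoshka-style truncation: keep the first `dim` components,
    then L2-normalize so cosine similarity stays well-defined."""
    v = vec[:dim]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

sentence_vec = [0.3, -0.1, 0.8, 0.05, 0.2, -0.4, 0.0, 0.1]  # toy 8-d embedding
# Each truncation level yields a usable embedding (stands in for 32/64/.../512)
views = {d: truncate_and_normalize(sentence_vec, d) for d in (2, 4, 8)}
```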
5.2 Training Datasets
To ensure robust generalization across various retrieval and semantic similarity tasks, we train on a diverse collection of 15 datasets. These datasets cover question answering (QA), natural language inference (NLI), and information retrieval (IR) domains. All datasets are processed with the Matryoshka Loss function during training. The complete list of training corpora is provided in Table 2.
Table 2 | Training Datasets
| Dataset | Domain | Ratio |
|---|---|---|
| squad | Question Answering | ~2% |
| trivia_qa | Question Answering | ~2% |
| allnli | Natural Language Inference | ~2% |
| pubmedqa | Scientific QA | ~2% |
| hotpotqa | Multi-hop QA | ~2% |
| miracl | Multilingual IR | ~2% |
| mr_tydi | Multilingual IR | ~2% |
| msmarco | Web Search IR | ~5% |
| msmarco_10m | Large-scale IR | ~45% |
| msmarco_hard | Hard Negative Mining | ~2% |
| mldr | Long Document Retrieval | ~2% |
| s2orc | Scientific Text | ~14% |
| swim_ir | Semantic Web IR | ~2% |
| paq | Question Answering | ~14% |
| nq | Natural Questions | ~2% |
5.3 Experimental Models
We compared SSE against two baselines:
- Static Embedding (no DyT): The standard baseline without DyT layers.
- Static Embedding + DyT: A variant incorporating Dynamic Tanh normalization.
5.4 Training Results
Figure 4 presents a comparative analysis of training loss and gradient-related metrics across three model variants.
Loss Convergence (Figure 4 (a)): All three models demonstrate nearly identical loss trajectories, rapidly decreasing from approximately 19 to stabilize around a plateau between 2 and 3. This convergence pattern indicates that the core objective function is effectively satisfied by all variants. Consequently, the observed performance differences are not attributable to variations in primary loss minimization speed or final loss values.
Gradient Dynamics (Figure 4 (b)): Distinct behaviors emerge when examining gradient magnitude metrics. The Static Embedding baseline (purple) maintains consistently low values throughout training, suggesting potential issues such as vanishing gradients or excessive feature suppression. In contrast, the SSE model (pink) displays higher initial variance before settling at a moderate level of approximately 0.2. This trajectory reflects an active modulation of gradient flow, which prevents the saturation of activation functions by large-magnitude inputs. By sustaining non-zero gradients for informative dimensions, SSE facilitates continuous parameter updates during the training process, unlike the baseline where feature learning appears to stagnate early on.
Figure 4 | Comparison of (a) Loss and (b) Gradient Norm Across Training Steps.
6 Evaluations
6.1 NanoBEIR mean nDCG@10 progression by steps
Figure 5 illustrates the training dynamics and final performance of SSE compared to the baselines. The results demonstrate that SSE consistently outperforms both baseline models throughout the training process, particularly in later stages (after step 1k).
- Superior Convergence: As shown by the pink line, SSE achieves a peak mean nDCG@10 of approximately 0.5124, surpassing the standard Static Embedding baseline (purple line), which plateaus around 0.5068. This represents a relative improvement of roughly 1.1% in retrieval accuracy.
- Necessity of Separability: Notably, the variant with standard DyT (cyan line) underperforms compared to the baseline in the long run, fluctuating between 0.497 and 0.503. This suggests that applying a global or non-separable normalization can inadvertently suppress informative features or fail to address dimension-specific instability. In contrast, SSE's Separable DyT mechanism successfully solves these issues by adapting gradients per dimension.
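As we read the distinction, standard DyT shares a single scalar α across all dimensions, whereas Separable DyT learns one α_i per dimension. A toy sketch of this difference (simplified: the output shift is folded inside the tanh here, so this is illustrative rather than a faithful reproduction of either paper's exact formulation):

```python
import math

def dyt_shared(x, alpha, beta, bias):
    """Shared-alpha variant: one scalar alpha for every dimension."""
    return [b * math.tanh(alpha * v + c) for v, b, c in zip(x, beta, bias)]

def dyt_separable(x, alpha, beta, bias):
    """Separable variant: each dimension i carries its own alpha_i."""
    return [b * math.tanh(a * v + c) for v, a, b, c in zip(x, alpha, beta, bias)]

x = [0.1, 2.0, -3.0]
beta, bias = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
# With all alpha_i equal, the two coincide; per-dimension alphas let each
# feature choose its own saturation point independently.
same = dyt_separable(x, [0.5, 0.5, 0.5], beta, bias) == dyt_shared(x, 0.5, beta, bias)
```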
Figure 5 | NanoBEIR mean nDCG@10 Across Training Steps.
Table 3 | NanoBEIR English nDCG@10 comparison.
| Model | NanoArguAna nDCG@10 | NanoClimateFEVER nDCG@10 | NanoDBPedia nDCG@10 | NanoFEVER nDCG@10 | NanoFiQA2018 nDCG@10 | NanoHotpotQA nDCG@10 | NanoMSMARCO nDCG@10 | NanoNFCorpus nDCG@10 | NanoNQ nDCG@10 | NanoQuoraRetrieval nDCG@10 | NanoSCIDOCS nDCG@10 | NanoSciFact nDCG@10 | NanoTouche2020 nDCG@10 | Mean nDCG@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSE(Static Embedding + Separable DyT) | 0.4105 | 0.2998 | 0.5493 | 0.6808 | 0.3744 | 0.7021 | 0.4132 | 0.2982 | 0.4652 | 0.9094 | 0.3381 | 0.6176 | 0.6029 | 0.5124 |
| Static Embedding + DyT | 0.3615 | 0.2717 | 0.5632 | 0.6716 | 0.3393 | 0.6765 | 0.4367 | 0.3142 | 0.4674 | 0.9094 | 0.3267 | 0.6129 | 0.5816 | 0.5025 |
| Static Embedding (No DyT) | 0.3884 | 0.3005 | 0.5552 | 0.7125 | 0.3573 | 0.6783 | 0.4219 | 0.2955 | 0.4638 | 0.8979 | 0.3264 | 0.6076 | 0.5834 | 0.5068 |
6.2 Matryoshka Evaluations
To validate the effectiveness of our proposed Stable Static Embedding (SSE) framework, we conducted comprehensive experiments on the NanoBEIR benchmark, measuring retrieval performance via mean nDCG@10 across varying embedding dimensions ranging from 32 to 512. We compared SSE against standard static embeddings and a strong reference model (static-retrieval-mrl-en-v1). The results, summarized in Table 4 and visualized in Figure 6, demonstrate that SSE consistently achieves superior retrieval performance while maintaining computational efficiency.
As illustrated in Figure 6, the SSE model (pink line) exhibits a distinct advantage over all baselines as embedding dimensions increase. While the reference model (static-retrieval-mrl-en-v1) initially outperforms SSE at dimension 32 with an nDCG@10 of 0.3532 compared to SSE's 0.3448, this trend reverses rapidly starting from dimension 64. At dimension 64, SSE achieves a score of 0.4275, surpassing the reference model by approximately 2.4%. This performance gap widens significantly as the embedding capacity grows; at dimension 512, SSE reaches a peak nDCG@10 of 0.5124, exceeding the reference model's 0.4957 and the standard "Static Embedding + DyT" baseline (0.5025). These results confirm that the proposed architecture scales effectively with embedding dimensionality, leveraging additional capacity more efficiently than conventional methods.
The ablation study within Table 4 further isolates the contribution of our core component, Separable DyT. By comparing SSE against a variant using non-separable Dynamic Tanh normalization ("Static Embedding + DyT"), we observe that SSE consistently outperforms this baseline across all dimensions (e.g., at dim 512: 0.5124 vs. 0.5025). This performance gain validates the hypothesis presented in Section 4.2: applying normalization independently to each dimension allows for more precise magnitude-adaptive gradient gating. Specifically, Separable DyT suppresses unstable high-magnitude dimensions while preserving informative low-magnitude features, thereby constructing a more robust embedding geometry than standard normalization techniques which treat dimensions collectively.
Finally, the results highlight the favorable efficiency-accuracy trade-off of SSE. Even at lower dimensions (e.g., 64 or 128), where computational cost is minimal, SSE maintains state-of-the-art performance relative to larger models. This suggests that the Separable DyT module effectively compresses information density into fewer dimensions without sacrificing retrieval quality. Consequently, SSE offers a compelling solution for efficient semantic search applications, delivering high accuracy in resource-constrained environments while scaling robustly as dimensionality increases.
Figure 6 | NanoBEIR English mean nDCG@10 vs Matryoshka Embedding Truncation.
Table 4 | NanoBEIR English mean nDCG@10 vs Matryoshka Embedding Truncation.
| Model | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|
| SSE (Static Embedding + Separable DyT) | 0.3448 | 0.4275 | 0.4659 | 0.4969 | 0.5124 | - |
| Static Embedding + DyT | 0.3338 | 0.4134 | 0.4622 | 0.4919 | 0.5025 | - |
| Static Embedding (no DyT) | 0.3367 | 0.4161 | 0.4625 | 0.4912 | 0.5068 | - |
| static-retrieval-mrl-en-v1 (For reference) | 0.3532 | 0.4176 | 0.4622 | 0.4819 | 0.4957 | 0.5031 |
6.3 Performance Analysis: English Retrieval Task
In the English retrieval task (Figure 7), SSE exhibits even more pronounced advantages over existing baselines. The scatter plot reveals that while Transformer models achieve the highest absolute accuracy (>0.60 nDCG@10), they suffer from severe latency constraints (<10,000 QPS). Conversely, most static embedding methods struggle to reach an nDCG@10 of 0.50.
Our SSE variants (stable-static-embedding-fast-retrieval-mrl-en) achieve a Mean nDCG@10 above 0.50 with a throughput exceeding 50,000 QPS. Notably, the results indicate a 2.16x speedup over Static Embedding approaches while maintaining competitive accuracy. This substantial efficiency gain confirms that the magnitude-adaptive gradient flow introduced by Separable DyT prevents the saturation of embedding dimensions, thereby preserving information richness even in high-dimensional static representations.
Figure 7 | NanoBEIR English Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on TREC-COVID and Quora using an Intel® Core™ Ultra 7 265K (3.90 GHz) with batch size 32.
Figure 8 | Retrieval performance (nDCG@10) across NanoBEIR English tasks.
6.4 Performance Analysis: Japanese Retrieval Task
Figure 9 illustrates the performance distribution on the Japanese retrieval benchmark. The results clearly delineate a trade-off frontier where Transformer-based models occupy the high-accuracy, low-throughput region (left side), while standard static embeddings are clustered in the lower-performance zone despite their speed.
Our proposed SSE models (stable-static-embedding-fast-retrieval-mrl-ja and bilingual variants) successfully bridge this gap, positioning themselves in the top-right quadrant of the Pareto frontier. Specifically, our method achieves a Mean nDCG@10 exceeding 0.45 while maintaining a throughput of approximately 60,000 QPS. This performance demonstrates that SSE not only outperforms standard static embeddings by a significant margin but also offers a 1.46x speedup compared to Static Embedding baselines for comparable accuracy levels.
Figure 9 | NanoBEIR Japanese Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on Miracl using an Intel® Core™ Ultra 7 265K (3.90 GHz) with batch size 32.
Figure 10 | Retrieval performance (nDCG@10) across NanoBEIR Japanese tasks.
6.5 Spectral Analysis of Learned Embeddings
The spectral analysis via PCA of the learned embedding matrices provides insights into the representation geometry underlying SSE's superior performance. As illustrated in Figure 11, the baseline Static Embedding exhibits a smooth and gradual decay of normalized eigenvalues across all 512 dimensions, suggesting that semantic variance is distributed across many directions and the effective rank remains relatively high.
In contrast, Standard DyT exhibits an abrupt spectral cliff around PC ≈ 480, indicating a sudden collapse of variance in the tail dimensions. This behavior suggests an unstable compression effect where high-dimensional variance is abruptly suppressed, potentially disrupting the embedding geometry.
SSE, on the other hand, demonstrates an earlier yet smoother eigenvalue decay around PC ≈ 430, after which the remaining dimensions decay rapidly toward zero. This pattern suggests that SSE implicitly performs a form of low-rank regularization, concentrating semantic variance into a more compact subspace while suppressing noise-dominated directions. Such controlled compression likely improves the stability of cosine similarity and distance geometry, which may contribute to the enhanced retrieval performance observed in downstream evaluations.
Figure 11 | PCA Spectrum on the 13 NanoBEIR English Datasets: Normalized Eigenvalue Decay (a) Linear Scale, (b) Logarithmic Scale.
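For reference, a sketch of how such a normalized eigenvalue spectrum can be computed (synthetic anisotropic data stands in for the actual learned embedding matrices):

```python
import numpy as np

def normalized_eigenvalue_spectrum(embeddings: np.ndarray) -> np.ndarray:
    """Eigenvalues of the embedding covariance matrix in descending order,
    normalized by the largest eigenvalue (the quantity plotted in Figure 11)."""
    X = embeddings - embeddings.mean(axis=0)
    cov = (X.T @ X) / (len(X) - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return eigvals / eigvals[0]

rng = np.random.default_rng(0)
# Toy anisotropic embeddings: per-dimension scales decaying from 2.0 to 0.1
emb = rng.standard_normal((1000, 32)) * np.linspace(2.0, 0.1, 32)
spectrum = normalized_eigenvalue_spectrum(emb)
```

A smooth, gradual decay indicates variance spread over many directions (high effective rank); an early cliff indicates variance concentrated in a compact subspace.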
7 Discussion
7.1 Scope and Limitations
It is important to note that the validity of SSE has been empirically confirmed specifically within the context of static text embeddings. In our current framework, Separable DyT operates on pre-defined embedding vectors (e.g., word or item IDs) retrieved via lookup tables. While this setting represents a critical bottleneck in many recommendation and NLP systems, the method's applicability to other forms of representation remains an open question.
7.2 Future Work
Extending SSE to broader contexts is a promising direction for future research, and we expect subsequent work to clarify and advance the following areas:
- Dynamic and Contextual Embeddings: Separable DyT could be evaluated for integration into Transformer-based architectures (e.g., as an alternative or complement to LayerNorm), aiming to stabilize training in deep networks where hidden states vary dynamically with input context.
- Cross-Modal Generalization: The magnitude-adaptive principle is not inherently tied to text data. Future work could explore applying SSE to image patch embeddings, graph node representations, and other modalities where high-dimensional vector stability is crucial.
- Optimization Dynamics: Future studies could analyze how Separable DyT interacts with various optimizers (e.g., AdamW vs. SGD) in non-static settings, and determine whether the learnable parameters require different initialization or regularization strategies when applied beyond static lookup tables.
8 Published Models
8.1 SSE series
RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en
- SSE for Retrieval MRL English version
Table 5 | NanoBEIR English Evaluation.
| Dataset | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| NanoBEIR Mean | 0.5124 | 0.5640 | 0.4317 |
| NanoClimateFEVER | 0.2998 | 0.3611 | 0.2344 |
| NanoDBPedia | 0.5493 | 0.7492 | 0.4247 |
| NanoFEVER | 0.6808 | 0.6318 | 0.6105 |
| NanoFiQA2018 | 0.3744 | 0.4197 | 0.3162 |
| NanoHotpotQA | 0.7021 | 0.7679 | 0.6273 |
| NanoMSMARCO | 0.4132 | 0.3537 | 0.3733 |
| NanoNFCorpus | 0.2982 | 0.4889 | 0.1091 |
| NanoNQ | 0.4652 | 0.3992 | 0.4028 |
| NanoQuoraRetrieval | 0.9094 | 0.9122 | 0.8847 |
| NanoSCIDOCS | 0.3381 | 0.5509 | 0.2604 |
| NanoArguAna | 0.4105 | 0.3193 | 0.3325 |
| NanoSciFact | 0.6176 | 0.5933 | 0.5824 |
| NanoTouche2020 | 0.6029 | 0.7852 | 0.4539 |
RikkaBotan/stable-static-embedding-fast-retrieval-mrl-ja
- SSE for Retrieval MRL Japanese version
Table 6 | NanoBEIR Japanese Evaluation.
| Dataset | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| NanoBEIR Mean | 0.4507 | 0.5090 | 0.3695 |
| NanoClimateFEVER | 0.3110 | 0.4208 | 0.2347 |
| NanoDBPedia | 0.5596 | 0.7652 | 0.4000 |
| NanoFEVER | 0.5611 | 0.5003 | 0.4923 |
| NanoFiQA2018 | 0.3247 | 0.3731 | 0.2692 |
| NanoHotpotQA | 0.4795 | 0.5758 | 0.4182 |
| NanoMSMARCO | 0.3845 | 0.3191 | 0.3335 |
| NanoNFCorpus | 0.2736 | 0.4544 | 0.1014 |
| NanoNQ | 0.4218 | 0.3658 | 0.3572 |
| NanoQuoraRetrieval | 0.7786 | 0.7750 | 0.7428 |
| NanoSCIDOCS | 0.3026 | 0.4850 | 0.2192 |
| NanoArguAna | 0.3521 | 0.2686 | 0.2793 |
| NanoSciFact | 0.6372 | 0.6100 | 0.5990 |
| NanoTouche2020 | 0.4731 | 0.7036 | 0.3572 |
RikkaBotan/stable-static-embedding-fast-retrieval-mrl-bilingual-ja-en
- SSE for Retrieval MRL Bilingual version (English & Japanese)
Table 7 | NanoBEIR English Evaluation.
| Dataset | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| NanoBEIR Mean | 0.5073 | 0.5563 | 0.4207 |
| NanoClimateFEVER | 0.3239 | 0.4045 | 0.2612 |
| NanoDBPedia | 0.5647 | 0.7321 | 0.4262 |
| NanoFEVER | 0.6450 | 0.5790 | 0.5514 |
| NanoFiQA2018 | 0.3374 | 0.3838 | 0.2766 |
| NanoHotpotQA | 0.6897 | 0.7505 | 0.6177 |
| NanoMSMARCO | 0.4463 | 0.3621 | 0.3740 |
| NanoNFCorpus | 0.2844 | 0.4456 | 0.1071 |
| NanoNQ | 0.4851 | 0.4217 | 0.4186 |
| NanoQuoraRetrieval | 0.8554 | 0.8540 | 0.8202 |
| NanoSCIDOCS | 0.3376 | 0.5482 | 0.2566 |
| NanoArguAna | 0.3941 | 0.3154 | 0.3279 |
| NanoSciFact | 0.6185 | 0.5977 | 0.5881 |
| NanoTouche2020 | 0.6123 | 0.8369 | 0.4432 |
Table 8 | NanoBEIR Japanese Evaluation.
| Dataset | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| NanoBEIR Mean | 0.4511 | 0.5141 | 0.3772 |
| NanoClimateFEVER | 0.2979 | 0.4005 | 0.2353 |
| NanoDBPedia | 0.5429 | 0.7633 | 0.4059 |
| NanoFEVER | 0.5133 | 0.4643 | 0.4661 |
| NanoFiQA2018 | 0.3174 | 0.3669 | 0.2619 |
| NanoHotpotQA | 0.5000 | 0.5672 | 0.4234 |
| NanoMSMARCO | 0.4372 | 0.3865 | 0.4022 |
| NanoNFCorpus | 0.2866 | 0.5185 | 0.1177 |
| NanoNQ | 0.3987 | 0.3500 | 0.3527 |
| NanoQuoraRetrieval | 0.7944 | 0.8100 | 0.7685 |
| NanoSCIDOCS | 0.3153 | 0.5127 | 0.2322 |
| NanoArguAna | 0.3721 | 0.2873 | 0.2990 |
| NanoSciFact | 0.6216 | 0.5904 | 0.5804 |
| NanoTouche2020 | 0.4662 | 0.6656 | 0.3589 |
8.2 Quantized SSE series
RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-en
- Quantized SSE for Retrieval MRL English version
Table 9 | NanoBEIR English Evaluation.
| Dataset | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| NanoBEIR Mean | 0.5110 | 0.5645 | 0.4312 |
| NanoClimateFEVER | 0.3127 | 0.3822 | 0.2439 |
| NanoDBPedia | 0.5472 | 0.7440 | 0.4252 |
| NanoFEVER | 0.6870 | 0.6402 | 0.6191 |
| NanoFiQA2018 | 0.3750 | 0.4155 | 0.3129 |
| NanoHotpotQA | 0.6927 | 0.7572 | 0.6205 |
| NanoMSMARCO | 0.4105 | 0.3504 | 0.3694 |
| NanoNFCorpus | 0.3063 | 0.4989 | 0.1148 |
| NanoNQ | 0.4523 | 0.3884 | 0.3941 |
| NanoQuoraRetrieval | 0.9147 | 0.9222 | 0.8944 |
| NanoSCIDOCS | 0.3345 | 0.5562 | 0.2622 |
| NanoArguAna | 0.4154 | 0.3151 | 0.3257 |
| NanoSciFact | 0.5972 | 0.5774 | 0.5703 |
| NanoTouche2020 | 0.5979 | 0.7910 | 0.4526 |
RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-ja
- Quantized SSE for Retrieval MRL Japanese version
Table 10 | NanoBEIR Japanese Evaluation.
| Dataset | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| NanoBEIR Mean | 0.4477 | 0.5088 | 0.3675 |
| NanoClimateFEVER | 0.3152 | 0.4258 | 0.2417 |
| NanoDBPedia | 0.5554 | 0.7767 | 0.3962 |
| NanoFEVER | 0.5536 | 0.4907 | 0.4827 |
| NanoFiQA2018 | 0.3160 | 0.3614 | 0.2653 |
| NanoHotpotQA | 0.4722 | 0.5669 | 0.4136 |
| NanoMSMARCO | 0.3929 | 0.3237 | 0.3371 |
| NanoNFCorpus | 0.2686 | 0.4584 | 0.0962 |
| NanoNQ | 0.4170 | 0.3607 | 0.3571 |
| NanoQuoraRetrieval | 0.7768 | 0.7750 | 0.7393 |
| NanoSCIDOCS | 0.2939 | 0.4774 | 0.2197 |
| NanoArguAna | 0.3471 | 0.2617 | 0.2727 |
| NanoSciFact | 0.6387 | 0.6127 | 0.6001 |
| NanoTouche2020 | 0.4732 | 0.7240 | 0.3560 |
RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-bilingual-ja-en
- Quantized SSE for Retrieval MRL Bilingual version (English & Japanese)
Table 11 | NanoBEIR English Evaluation.
| Dataset | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| NanoBEIR Mean | 0.5049 | 0.5526 | 0.4197 |
| NanoClimateFEVER | 0.3166 | 0.3874 | 0.2511 |
| NanoDBPedia | 0.5604 | 0.7321 | 0.4244 |
| NanoFEVER | 0.6511 | 0.5871 | 0.5595 |
| NanoFiQA2018 | 0.3179 | 0.3541 | 0.2617 |
| NanoHotpotQA | 0.6840 | 0.7459 | 0.6191 |
| NanoMSMARCO | 0.4417 | 0.3616 | 0.3748 |
| NanoNFCorpus | 0.2939 | 0.4535 | 0.1202 |
| NanoNQ | 0.4952 | 0.4287 | 0.4251 |
| NanoQuoraRetrieval | 0.8528 | 0.8533 | 0.8190 |
| NanoSCIDOCS | 0.3335 | 0.5460 | 0.2551 |
| NanoArguAna | 0.3978 | 0.3202 | 0.3326 |
| NanoSciFact | 0.6076 | 0.5842 | 0.5733 |
| NanoTouche2020 | 0.6105 | 0.8298 | 0.4406 |
Table 12 | NanoBEIR Japanese Evaluation.
| Dataset | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| NanoBEIR Mean | 0.4493 | 0.5083 | 0.3744 |
| NanoClimateFEVER | 0.2883 | 0.3860 | 0.2218 |
| NanoDBPedia | 0.5458 | 0.7632 | 0.4048 |
| NanoFEVER | 0.4956 | 0.4403 | 0.4421 |
| NanoFiQA2018 | 0.3224 | 0.3667 | 0.2640 |
| NanoHotpotQA | 0.4866 | 0.5444 | 0.4117 |
| NanoMSMARCO | 0.4578 | 0.4085 | 0.4226 |
| NanoNFCorpus | 0.2731 | 0.4844 | 0.1138 |
| NanoNQ | 0.3944 | 0.3406 | 0.3436 |
| NanoQuoraRetrieval | 0.8003 | 0.8179 | 0.7766 |
| NanoSCIDOCS | 0.3156 | 0.5133 | 0.2325 |
| NanoArguAna | 0.3635 | 0.2758 | 0.2871 |
| NanoSciFact | 0.6341 | 0.6020 | 0.5903 |
| NanoTouche2020 | 0.4628 | 0.6646 | 0.3566 |
9 Implementations
9.1 Modeling
We implemented SSE using PyTorch within the sentence-transformers framework. Our model inherits from the library's InputModule, ensuring full compatibility with standard input processing and inference workflows.
"""
coding = utf-8
Copyright 2026 Rikka Botan. All rights reserved
Licensed under "MIT License"
Stable Static Embedding official PyTorch implementation
"""
from __future__ import annotations
import os
from pathlib import Path
from safetensors.torch import save_file as save_safetensors_file
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Dict
from dataclasses import dataclass
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
from sentence_transformers.models.InputModule import InputModule
class SeparableDyT(nn.Module):
def __init__(
self,
hidden_dim: int,
alpha_init: float = 0.5
):
super().__init__()
self.alpha = nn.Parameter(alpha_init*torch.ones(hidden_dim))
self.beta = nn.Parameter(torch.ones(hidden_dim))
self.bias = nn.Parameter(torch.zeros(hidden_dim))
def forward(
self,
x: torch.Tensor
) -> torch.Tensor:
x = self.beta * F.tanh(self.alpha * x + self.bias)
return x
class SSE(InputModule):
"""
Stable Static Embedding (SSE)
StaticEmbedding-compatible Sentence-Transformers module
"""
def __init__(
self,
tokenizer: Tokenizer | PreTrainedTokenizerFast,
vocab_size: int,
hidden_dim: int = 1024,
**kwargs,
):
super().__init__()
if isinstance(tokenizer, PreTrainedTokenizerFast):
tokenizer = tokenizer._tokenizer
elif not isinstance(tokenizer, Tokenizer):
raise ValueError("Tokenizer must be a fast (Rust) tokenizer")
self.tokenizer: Tokenizer = tokenizer
self.tokenizer.no_padding()
self.embedding = nn.EmbeddingBag(vocab_size, hidden_dim)
self.dyt = SeparableDyT(hidden_dim)
self.embedding_dim = hidden_dim
# For model card compatibility
self.base_model = kwargs.get("base_model", None)
# Tokenization (StaticEmbedding-compatible)
def tokenize(
self,
texts: list[str],
**kwargs
) -> dict[str, torch.Tensor]:
encodings = self.tokenizer.encode_batch(texts, add_special_tokens=False)
encodings_ids = [encoding.ids for encoding in encodings]
offsets = torch.from_numpy(
np.cumsum(
[0] + [len(token_ids) for token_ids in encodings_ids[:-1]]
)
)
input_ids = torch.tensor(
[token_id for token_ids in encodings_ids for token_id in token_ids],
dtype=torch.long
)
return {
"input_ids": input_ids,
"offsets": offsets
}
# Forward
def forward(
self,
features: Dict[str, torch.Tensor],
**kwargs,
) -> Dict[str, torch.Tensor]:
x = self.embedding(features["input_ids"], features["offsets"])
x = self.dyt(x)
features["sentence_embedding"] = x
return features
# Required APIs
def get_sentence_embedding_dimension(self) -> int:
return self.embedding_dim
@property
def max_seq_length(self) -> int:
return torch.inf
def save(
self,
output_path: str,
*args,
safe_serialization: bool = True,
**kwargs,
) -> None:
os.makedirs(output_path, exist_ok=True)
if safe_serialization:
save_safetensors_file(
self.state_dict(),
os.path.join(output_path, "model.safetensors"),
)
else:
torch.save(
self.state_dict(),
os.path.join(output_path, "pytorch_model.bin"),
)
self.tokenizer.save(
str(Path(output_path) / "tokenizer.json")
)
@classmethod
def load(
cls,
model_name_or_path: str,
**kwargs,
):
allowed_keys = {
"cache_dir",
"local_files_only",
"force_download",
}
filtered_kwargs = {
k: v for k, v in kwargs.items() if k in allowed_keys
}
tokenizer_path = cls.load_file_path(
model_name_or_path,
filename="tokenizer.json",
**filtered_kwargs,
)
tokenizer = Tokenizer.from_file(tokenizer_path)
weights = cls.load_torch_weights(
model_name_or_path=model_name_or_path,
**filtered_kwargs,
)
hidden_dim = weights["embedding.weight"].size(1)
vocab_size = weights["embedding.weight"].size(0)
model = cls(
tokenizer=tokenizer,
vocab_size=vocab_size,
hidden_dim=hidden_dim,
)
model.load_state_dict(weights)
return model
@dataclass
class SSESforzandoConfig:
hidden_dim: int = 512
vocab_size: int = 30522
@dataclass
class SSEForzandoConfig:
hidden_dim: int = 384
vocab_size: int = 30522
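To see why Separable DyT stabilizes the embedding space, note that each output dimension is squashed independently into the interval bounded by its own |beta|, so no single dimension can blow up and dominate cosine similarity. The per-dimension transform can be illustrated without PyTorch; this is a plain-Python sketch mirroring the `SeparableDyT` module above, with the same default `alpha_init = 0.5`:

```python
import math

def separable_dyt(x, alpha, beta, bias):
    """Element-wise Dynamic Tanh: beta_d * tanh(alpha_d * x_d + bias_d)."""
    return [b * math.tanh(a * xi + c) for xi, a, b, c in zip(x, alpha, beta, bias)]

dim = 4
alpha = [0.5] * dim  # alpha_init = 0.5, as in the module
beta = [1.0] * dim   # initialized to ones
bias = [0.0] * dim   # initialized to zeros

# even a wildly out-of-scale input dimension is squashed into [-|beta|, |beta|]
x = [1000.0, -1000.0, 0.1, -0.1]
y = separable_dyt(x, alpha, beta, bias)
print(all(abs(v) <= 1.0 for v in y))  # True: every dimension is bounded
```

Because alpha, beta, and bias are learned per dimension, the model can rebalance dimensions during training, which addresses the inter-dimensional imbalance discussed earlier.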
9.2 Inference
The model can be used with code as simple as the following.
"""
coding = utf-8
Copyright 2026 Rikka Botan. All rights reserved
Licensed under "MIT License"
Stable Static Embedding inference implementation
"""
import torch
from sentence_transformers import SentenceTransformer
# load (remote code enabled)
model = SentenceTransformer(
"RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en",
trust_remote_code=True,
device="cuda" if torch.cuda.is_available() else "cpu",
truncate_dim=256,
)
# inference
query = "What is Stable Static Embedding?"
sentences = [
"SSE: Stable Static embedding works without attention.",
"Stable Static Embedding is a fast embedding method designed for retrieval tasks.",
"Static embeddings are often compared with transformer-based sentence encoders.",
"I cooked pasta last night while listening to jazz music.",
"Large language models are commonly trained using next-token prediction objectives.",
"Instruction tuning improves the ability of LLMs to follow human-written prompts.",
]
with torch.no_grad():
embeddings = model.encode(
[query] + sentences,
convert_to_tensor=True,
normalize_embeddings=True,
batch_size=32
)
print("embeddings shape:", embeddings.shape)
# cosine similarity
similarities = model.similarity(embeddings[0], embeddings[1:])
for i, similarity in enumerate(similarities[0].tolist()):
print(f"{similarity:.05f}: {sentences[i]}")
9.3 Quantized Modeling
We built a 4-bit quantized variant (SSEQ) to optimize resource utilization. By reducing parameter precision, we achieve significant savings in data size and storage overhead while maintaining retrieval performance comparable to the original model.
"""
coding = utf-8
Copyright 2026 Rikka Botan. All rights reserved
Licensed under "MIT License"
Stable Static Embedding official PyTorch implementation
"""
from __future__ import annotations
import os
from pathlib import Path
from safetensors.torch import save_file as save_safetensors_file
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Dict
from dataclasses import dataclass
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
from sentence_transformers.models.InputModule import InputModule
from safetensors.torch import load_file
def quantize_q4(weight: torch.Tensor):
"""
weight: (vocab, dim)
returns: packed uint8 + scale + zero
"""
w = weight.detach().cpu().numpy().astype(np.float32)
scales = np.max(np.abs(w), axis=1, keepdims=True) + 1e-8
w_norm = w / scales
q = np.clip(np.round((w_norm + 1) * 7.5), 0, 15).astype(np.uint8)
# pack 2x4bit -> uint8
packed = (q[:, 0::2] << 4) | q[:, 1::2]
return {
"packed": packed,
"scales": scales.astype(np.float32),
}
def dequantize_q4(packed: np.ndarray, scales: np.ndarray):
hi = (packed >> 4) & 0xF
lo = packed & 0xF
q = np.empty((packed.shape[0], packed.shape[1]*2), dtype=np.uint8)
q[:, 0::2] = hi
q[:, 1::2] = lo
w = (q.astype(np.float32) / 7.5) - 1.0
w = w * scales
return torch.from_numpy(w)
class SeparableDyT(nn.Module):
def __init__(
self,
hidden_dim: int,
alpha_init: float = 0.5
):
super().__init__()
self.alpha = nn.Parameter(alpha_init*torch.ones(hidden_dim))
self.beta = nn.Parameter(torch.ones(hidden_dim))
self.bias = nn.Parameter(torch.zeros(hidden_dim))
def forward(
self,
x: torch.Tensor
) -> torch.Tensor:
x = self.beta * F.tanh(self.alpha * x + self.bias)
return x
class SSEQ(InputModule):
"""
Stable Static Embedding (SSE)
StaticEmbedding-compatible Sentence-Transformers module
"""
def __init__(
self,
tokenizer: Tokenizer | PreTrainedTokenizerFast,
vocab_size: int,
hidden_dim: int = 1024,
**kwargs,
):
super().__init__()
if isinstance(tokenizer, PreTrainedTokenizerFast):
tokenizer = tokenizer._tokenizer
elif not isinstance(tokenizer, Tokenizer):
raise ValueError("Tokenizer must be a fast (Rust) tokenizer")
self.tokenizer: Tokenizer = tokenizer
self.tokenizer.no_padding()
self.embedding = nn.EmbeddingBag(vocab_size, hidden_dim)
self.dyt = SeparableDyT(hidden_dim)
self.embedding_dim = hidden_dim
# For model card compatibility
self.base_model = kwargs.get("base_model", None)
# Tokenization (StaticEmbedding-compatible)
def tokenize(
self,
texts: list[str],
**kwargs
) -> dict[str, torch.Tensor]:
encodings = self.tokenizer.encode_batch(texts, add_special_tokens=False)
encodings_ids = [encoding.ids for encoding in encodings]
offsets = torch.from_numpy(
np.cumsum(
[0] + [len(token_ids) for token_ids in encodings_ids[:-1]]
)
)
input_ids = torch.tensor(
[token_id for token_ids in encodings_ids for token_id in token_ids],
dtype=torch.long
)
return {
"input_ids": input_ids,
"offsets": offsets
}
# Forward
def forward(
self,
features: Dict[str, torch.Tensor],
**kwargs,
) -> Dict[str, torch.Tensor]:
x = self.embedding(features["input_ids"], features["offsets"])
x = self.dyt(x)
features["sentence_embedding"] = x
return features
# Required APIs
def get_sentence_embedding_dimension(self) -> int:
return self.embedding_dim
@property
def max_seq_length(self) -> int:
return torch.inf
def save(self, output_path: str):
os.makedirs(output_path, exist_ok=True)
state = self.state_dict()
emb = state["embedding.weight"]
q = quantize_q4(emb)
del state["embedding.weight"]
save_safetensors_file(
state,
os.path.join(output_path, "model_rest.safetensors"),
)
with open(os.path.join(output_path, "embedding.q4_k_m.bin"), "wb") as f:
f.write(q["packed"].tobytes())
f.write(q["scales"].tobytes())
self.tokenizer.save(
str(Path(output_path) / "tokenizer.json")
)
@classmethod
def load(cls, model_path: str):
tokenizer = Tokenizer.from_file(
os.path.join(model_path, "tokenizer.json")
)
state = load_file(
os.path.join(model_path, "model_rest.safetensors"),
device="cpu"
)
# read q4 binary
bin_path = os.path.join(model_path, "embedding.q4_k_m.bin")
with open(bin_path, "rb") as f:
raw = f.read()
hidden = state["dyt.alpha"].shape[0]
total_uint8 = len(raw)
bytes_per_row = hidden // 2 + 4
vocab = total_uint8 // bytes_per_row
packed_size = vocab * hidden // 2
packed = np.frombuffer(raw[:packed_size], dtype=np.uint8)
scales = np.frombuffer(raw[packed_size:], dtype=np.float32)
packed = packed.reshape(vocab, hidden // 2)
scales = scales.reshape(vocab, 1)
emb = dequantize_q4(packed, scales)
# rebuild model
model = cls(
tokenizer=tokenizer,
vocab_size=emb.shape[0],
hidden_dim=emb.shape[1]
)
state["embedding.weight"] = emb
model.load_state_dict(state)
return model
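The scheme above is a simple symmetric per-row quantization: each embedding row is scaled by its maximum absolute value, mapped onto 16 levels (0 to 15), and two 4-bit codes are packed per byte. Its behavior can be sanity-checked without torch or numpy; this plain-Python round-trip sketch (our own helper names, mirroring `quantize_q4` / `dequantize_q4` for a single row) confirms that the reconstruction error per element stays within half a quantization step, i.e. scale / 15:

```python
def quantize_row_q4(row):
    """Symmetric 4-bit quantization of one embedding row (levels 0..15)."""
    scale = max(abs(v) for v in row) + 1e-8
    codes = [min(15, max(0, round((v / scale + 1.0) * 7.5))) for v in row]
    # pack two 4-bit codes per byte, high nibble first
    packed = [(codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2)]
    return packed, scale

def dequantize_row_q4(packed, scale):
    """Inverse mapping: unpack two 4-bit codes per byte and rescale."""
    out = []
    for byte in packed:
        for code in ((byte >> 4) & 0xF, byte & 0xF):
            out.append((code / 7.5 - 1.0) * scale)
    return out

row = [0.8, -0.3, 0.05, -0.75]
packed, scale = quantize_row_q4(row)
restored = dequantize_row_q4(packed, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(max_err <= scale / 15.0 + 1e-6)  # True: within half a quantization step
```

Note that the hidden dimension must be even for the 2-codes-per-byte packing, which holds for the 1024-, 512-, and 384-dimensional configurations used here.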
9.4 Quantization
The following script quantizes the weights. By uploading the resulting weights together with the SSEQ implementation to Hugging Face, the model can be used just like any other sentence-transformers-compatible model.
"""
coding = utf-8
Copyright 2026 Rikka Botan. All rights reserved
Licensed under "MIT License"
Quantization implementation
"""
import os
from tokenizers import Tokenizer
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer
from SSE import SSE
from SSE_quantize import SSEQ
def quantize_and_save_sse_from_hf(
hf_model_name: str,
output_path: str,
):
print(f"[1] Loading HF model: {hf_model_name}")
st_model = SentenceTransformer(hf_model_name)
sseq_module = None
for m in st_model.modules():
if isinstance(m, SSE):
sseq_module = m
break
if sseq_module is None:
raise ValueError("SSE module not found in the model")
print("[2] Extract tokenizer")
# tokenizer
try:
hf_tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
tokenizer = Tokenizer.from_str(hf_tokenizer.backend_tokenizer.to_str())
except Exception:
tokenizer = sseq_module.tokenizer
print("[3] Rebuild SSEQ model")
# embedding weight
emb_weight = sseq_module.embedding.weight.detach().cpu()
model = SSEQ(
tokenizer=tokenizer,
vocab_size=emb_weight.shape[0],
hidden_dim=emb_weight.shape[1],
base_model=hf_model_name
)
state = sseq_module.state_dict()
model.load_state_dict(state)
print("[4] Quantize & Save")
os.makedirs(output_path, exist_ok=True)
model.save(output_path)
print(f"[✓] Quantized model saved to: {output_path}")
- Simple usage
quantize_and_save_sse_from_hf(
"RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en",
"./sse-q4")
10 Application Example
RikkaBotan/Stable-Static-Embedding-Semantic-Web-Search-Japanese
RikkaBotan/Stable-Static-Embedding-Semantic-Web-Search-Bilingual-ja-en
These applications use the SSE model to provide high-performance semantic search. They are built with an emphasis on both speed and accuracy, enabling fast searches without compromising relevance, and are well suited to real-time or large-scale search tasks.
Acknowledgements
The author acknowledges the support of Saldra, Witness, and Lumina Logic Minds for providing the computational resources used in this work.
Our interest in this topic originated from reading Tom Aarsen's seminal article, Train 400x faster Static Embedding Models with Sentence Transformers, which motivated us to investigate static embeddings.
I thank the developers of sentence-transformers, Python, and PyTorch.
I thank all the researchers for their efforts to date.
I thank Japan's high standard of education.
And most of all, thank you for your interest in this blog.
About us
A Japanese independent researcher with a shy and pampered personality. Twin-tail hair is a charm point. Interested in NLP. Usually works with Python and C.
Please contact us if you have any requests for joint research, writing, speaking engagements, or employment.