SSE (Stable Static Embedding): Unlocking the Potential of Static Embeddings, A Dynamic Tanh Normalization Approach without Speed Penalty

Community Article Published March 14, 2026

Rikka Botan
Independent Researcher, Japan
https://rikka-botan.github.io

Abstract

Static embedding models enable fast inference due to their simple architecture, but it is well known that improving their structural expressiveness is challenging. At the same time, as corpora continue to grow in scale, the demand for both higher efficiency and higher accuracy in embedding models has increased significantly. In this work, we propose a simple yet effective method called SSE (Stable Static Embedding), which incorporates Separable DyT (Dynamic Tanh normalization). We demonstrate that SSE achieves higher retrieval performance than prior approaches while using only half the number of parameters. Despite having only 16 million parameters, SSE attains a mean NanoBEIR (English) nDCG@10 score of 0.512. By leveraging Separable DyT, SSE effectively regulates gradient flow and suppresses inter-dimensional imbalance and overfitting, thereby improving generalization performance. Our method provides a new perspective on static embedding models and offers a pathway toward faster and more accurate retrieval systems.

1 Main Contributions

1. We analyze limitations commonly observed in static embedding models, particularly gradient instability and inter-dimensional imbalance, which can negatively affect embedding quality.

2. We propose Separable DyT (Dynamic Tanh normalization), a simple normalization mechanism that stabilizes training and improves the structure of the embedding space.

3. We introduce SSE (Stable Static Embedding), a parameter-efficient static embedding framework, and demonstrate through extensive experiments on NanoBEIR that it achieves strong retrieval performance while maintaining fast inference and low computational cost.


Figure 1 | (a) Retrieval performance (nDCG@10) across NanoBEIR English tasks. (b) Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on TREC-COVID and Quora using an Intel® Core™ Ultra 7 265K (3.90 GHz) with batch size 32.


Figure 2 | (a) Retrieval performance (nDCG@10) across NanoBEIR Japanese tasks. (b) Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on Miracl using an Intel® Core™ Ultra 7 265K (3.90 GHz) with batch size 32.

2 Greeting

As plum blossoms light upon slender branches and tie ribbons into crystalline air—how are you doing today?

My name is Rikka Botan, nice to meet you.

This article provides technical insights into static embedding models.

If you are interested, please follow my account. I share updates on my research progress as well as everyday stories.

My X (Twitter) account

3 Introduction

Dense vector representations have become a fundamental component of modern information retrieval and retrieval-augmented language systems. However, a significant trade-off exists between accuracy and efficiency in current architectures. While contextual embedding models based on transformer architectures achieve strong semantic performance, their computational cost during inference remains a major bottleneck for large-scale and latency-sensitive applications. In contrast, static embedding models offer extremely fast inference and low memory consumption due to their simple and deterministic structure. These properties make them particularly attractive for large-scale search, recommendation, and real-time retrieval systems. Consequently, there is a widening gap between static embeddings and contextual encoders: the former provide speed but often lack accuracy, while the latter provide accuracy but are computationally expensive. Bridging this gap requires new techniques that enhance the representational quality of embedding models without introducing significant computational overhead.

The development of word representations has evolved significantly to address these needs since the pioneering work on Word2Vec (Mikolov et al., 2013), which established distributed word embeddings based on the distributional hypothesis. This was followed by GloVe (Pennington et al., 2014), which introduced global co-occurrence statistics for more stable learning, and FastText (Bojanowski et al., 2017), which incorporated subword information to improve robustness for rare words. These models enabled practical deployment in large-scale systems through their fixed token representations and lightweight composition functions. Most notably, recent releases such as static-similarity-mrl-multilingual-v1 and static-retrieval-mrl-en-v1 (sentence-transformers, 2025) marked a significant milestone, demonstrating that static embeddings, rather than contextual encoders, can achieve practical retrieval performance while being 400 times faster than transformer-based models (Aarsen, 2025).

Nevertheless, despite such remarkable efficiency gains, the rapid growth of web-scale corpora and retrieval-augmented generation pipelines continues to drive demand for embedding models that are even faster and more accurate. The existing gap between static embeddings and contextual encoders still requires further closing to meet industrial standards. More critically, static embedding approaches suffer from a fundamental limitation in structural expressiveness that hinders this progress. Because they rely on fixed token representations and lightweight composition functions, their ability to capture complex semantic relationships is inherently constrained. Specifically, anisotropy has been widely observed as an issue in embedding spaces: the representation space tends to develop directional bias where dimensions exhibit uneven variance, leading to representational imbalance across features. This phenomenon is often associated with gradient instability during training, which causes non-uniform development of representation capacity and degrades generalization performance. Previous attempts to improve embedding quality, including recent advances in Matryoshka Embeddings (Kusupati et al., 2024), have focused primarily on compression or incremental objectives, often failing to address this core representational imbalance during the training process itself.

In this work, we propose SSE (Stable Static Embedding), a simple yet effective framework for improving the performance of static embedding models. SSE adopts Separable DyT (Dynamic Tanh normalization), itself a derivative of DyT (Zhu et al., 2025), a lightweight normalization mechanism that stabilizes gradient flow and suppresses inter-dimensional imbalance during training. By dynamically controlling the scale and saturation of embedding activations, Separable DyT mitigates overfitting and improves the uniformity of the embedding space. This results in more discriminative and robust representations without increasing model complexity. We demonstrate through extensive experiments that SSE significantly outperforms conventional static embedding methods while maintaining a compact parameter size. Despite having only 16 million parameters, SSE achieves a mean NanoBEIR (English) nDCG@10 score of 0.512, surpassing several larger and more complex baselines. Furthermore, SSE requires only half the number of parameters compared to prior approaches with comparable performance, highlighting its efficiency advantage.

4 Method

4.1 Structure

The core component of SSE (Stable Static Embedding) is Separable DyT (Dynamic Tanh normalization), a lightweight normalization module that introduces magnitude-adaptive gradient flow for each embedding dimension. Separable DyT operates directly on embedding vectors and can be inserted as a post-projection normalization layer without introducing significant computational overhead.

Given an input embedding vector $\mathbf{x} \in \mathbb{R}^d$, SSE applies Separable DyT independently to each dimension, producing a normalized representation $\mathbf{y} \in \mathbb{R}^d$. This transformation reshapes the geometry of the embedding space by suppressing unstable high-magnitude dimensions and preserving informative low-magnitude features.

$$
\begin{aligned}
& \text{------------------------------------------------------------------------------} \\
& \textbf{Algorithm 1: SSE (Stable Static Embedding)} \\
& \text{------------------------------------------------------------------------------} \\
& \textbf{Input: } x: (B, S) \\
& \textbf{Output: } y: (B, E) \\
& \quad 1:\quad x_T: (B, S) \leftarrow \mathrm{Tokenizer}(x) \\
& \quad 2:\quad x_S: (B, E) \leftarrow \mathrm{EmbeddingBag}(x_T) \\
& \quad 3:\quad y: (B, E) \leftarrow \mathrm{SeparableDyT}(x_S) \\
& \quad 4:\quad \textbf{return } y \\
& \text{------------------------------------------------------------------------------}
\end{aligned}
$$

Figure 3 | SSE (Stable Static Embedding) Architecture

4.2 Separable DyT (Dynamic Tanh normalization)

For each embedding dimension $x_k$, Separable DyT computes the output as:

$$
y_k = \gamma_k \tanh(\alpha_k x_k + \beta_k),
$$

where $\alpha_k$, $\beta_k$, and $\gamma_k$ are learnable parameters that control scaling, shifting, and output amplitude, respectively.

$$
\begin{aligned}
& \text{------------------------------------------------------------------------------} \\
& \textbf{Algorithm 2: Separable Dynamic Tanh normalization} \\
& \text{------------------------------------------------------------------------------} \\
& \textbf{Input: } x: (B, E) \\
& \textbf{Output: } y: (B, E) \\
& \textbf{Parameters: } \alpha, \beta, \gamma: (E) \\
& \quad 1:\quad x \leftarrow \alpha \cdot x + \beta \\
& \quad 2:\quad y: (B, E) \leftarrow \gamma \cdot \mathrm{Tanh}(x) \\
& \quad 3:\quad \textbf{return } y \\
& \text{------------------------------------------------------------------------------}
\end{aligned}
$$
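As a concrete reference, the element-wise computation of Algorithm 2 fits in a few lines of NumPy. This is a minimal sketch, not the released implementation; the function name and the identity-style initialization below are illustrative choices.

```python
import numpy as np

def separable_dyt(x, alpha, beta, gamma):
    """Separable DyT: y_k = gamma_k * tanh(alpha_k * x_k + beta_k).
    alpha, beta, gamma each have shape (E,): one value per dimension."""
    return gamma * np.tanh(alpha * x + beta)

# With an identity-like initialization (alpha = gamma = 1, beta = 0),
# small activations pass through almost linearly while large ones are
# squashed into (-1, 1).
E = 4
alpha, beta, gamma = np.ones(E), np.zeros(E), np.ones(E)
x = np.array([[0.1, -0.2, 5.0, -5.0]])   # one (B=1, E=4) embedding
y = separable_dyt(x, alpha, beta, gamma)
```

Because the transform is purely element-wise, it broadcasts over any batch shape and adds only $3E$ parameters.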

The derivative with respect to the input dimension $x_k$ is:

$$
\frac{\partial y_k}{\partial x_k} = \gamma_k \alpha_k \, \mathrm{sech}^2(\alpha_k x_k + \beta_k).
$$

This formulation introduces magnitude-dependent gradient gating. The gradient magnitude is modulated by the squared hyperbolic secant function:

  • For saturated dimensions ($|x_k| > 1$), we have $|\alpha_k x_k + \beta_k| \gg 1$, which yields exponential decay: $\mathrm{sech}^2(z) \sim 4 e^{-2|z|}$. Consequently, the gradient vanishes: $\frac{\partial y_k}{\partial x_k} \rightarrow 0$.

  • For non-saturated dimensions ($|x_k| \ll 1$), we have $\mathrm{sech}^2(z) \approx 1$, preserving near-constant gradients: $\frac{\partial y_k}{\partial x_k} \approx \gamma_k \alpha_k$.

Thus, Separable DyT adaptively attenuates gradients for large-magnitude (often noisy or overfitted) dimensions, while maintaining full gradient flow for small-magnitude, information-rich dimensions.
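The two regimes above can be checked numerically. The sketch below evaluates the closed-form derivative at a small and a large input (with $\alpha_k = \gamma_k = 1$, $\beta_k = 0$ as toy values) and confirms that the asymptotic bound $\mathrm{sech}^2(z) \sim 4 e^{-2|z|}$ is already tight at $z = 5$.

```python
import numpy as np

def dyt_grad(x, alpha=1.0, beta=0.0, gamma=1.0):
    """d y_k / d x_k = gamma_k * alpha_k * sech^2(alpha_k * x_k + beta_k)."""
    z = alpha * x + beta
    return gamma * alpha / np.cosh(z) ** 2   # sech^2(z) = 1 / cosh^2(z)

g_small = dyt_grad(0.05)   # non-saturated dimension: gradient stays near gamma * alpha
g_large = dyt_grad(5.0)    # saturated dimension: gradient collapses toward zero

# Asymptotic approximation used in the text: sech^2(z) ~ 4 * exp(-2|z|).
approx = 4.0 * np.exp(-2.0 * 5.0)
```

Running this gives a gradient close to 1 for the small input and on the order of $10^{-4}$ for the saturated one, matching the exponential-decay bound.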

4.3 Implicit Regularization via Magnitude-Adaptive Gating

The key property of Separable DyT is that it acts as an implicit regularizer without introducing additional loss terms or hyperparameters. The magnitude-dependent gating mechanism selectively suppresses unstable feature directions during optimization, thereby:

  • Reducing inter-dimensional imbalance in the embedding space
  • Preventing gradient explosion and over-amplification
  • Mitigating overfitting by dampening extreme activations
  • Improving representation uniformity and isotropy

Unlike standard normalization techniques (e.g., layer normalization or L2 normalization), Separable DyT does not globally rescale embeddings. Instead, it performs dimension-wise adaptive modulation, allowing each feature to learn its own dynamic range and sensitivity.

4.4 Integration into Static Embedding Models

Separable DyT is applied to the output of an EmbeddingBag layer, which aggregates token embeddings into a fixed-dimensional representation. Let $\mathbf{E} \in \mathbb{R}^{V \times d}$ denote the embedding matrix and let a sentence be represented by a set of token indices $\{t_1, \dots, t_n\}$. The EmbeddingBag layer computes a pooled representation:

$$
\mathbf{z} = \mathrm{EmbeddingBag}(t_1, \dots, t_n)
$$

where the aggregation is typically performed using mean pooling over the selected embeddings.

Separable DyT is then applied to the aggregated representation:

$$
\mathbf{s} = \mathrm{SeparableDyT}(\mathbf{z})
$$

where $\mathbf{s} \in \mathbb{R}^d$ denotes the final sentence embedding.

Because Separable DyT operates element-wise on the aggregated vector and introduces only a small number of learnable parameters per dimension, it integrates seamlessly into existing static embedding architectures without altering their structural simplicity.
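Putting the pieces together, the whole inference path after tokenization is a table lookup, a mean, and an element-wise transform. The NumPy sketch below mirrors what `torch.nn.EmbeddingBag` computes in its `'mean'` mode; the toy sizes and identity-style parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E = 100, 8                           # toy vocabulary size and embedding dim
emb_matrix = rng.normal(size=(V, E))    # embedding matrix E in R^{V x d}

# Per-dimension Separable DyT parameters (identity-style initialization).
alpha, beta, gamma = np.ones(E), np.zeros(E), np.ones(E)

def encode(token_ids):
    """Mean pooling over token embeddings (EmbeddingBag in 'mean' mode)
    followed by the element-wise Separable DyT transform."""
    z = emb_matrix[token_ids].mean(axis=0)        # pooled representation z
    return gamma * np.tanh(alpha * z + beta)      # sentence embedding s

s = encode([3, 17, 42])
```

Note that nothing in this path depends on sequence order or on neighboring tokens, which is exactly why inference remains as cheap as for a plain static embedding model.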

4.5 Computational Efficiency

SSE preserves the primary advantage of static embedding models: extremely fast inference. The EmbeddingBag operation performs efficient aggregation of token embeddings, avoiding the need for sequential computation or deep contextual encoding.

The Separable DyT transformation consists of element-wise affine transformations followed by a tanh activation, both of which are computationally inexpensive and highly parallelizable. The number of additional parameters introduced by Separable DyT grows linearly with the embedding dimension and remains negligible compared to contextual encoders.

Consequently, SSE maintains the constant-time inference characteristics of static embedding models while improving the stability and expressiveness of the embedding space.

5 Experiments

5.1 Training Configuration

We train our embedding model using contrastive learning combined with Matryoshka Loss to support variable-dimensional embeddings. Following the training protocol of "Train 400x faster Static Embedding Models with Sentence Transformers" (Aarsen, 2025), we optimize for multiple projection dimensions simultaneously: 32, 64, 128, 256, and 512. This allows the model to maintain retrieval performance across different embedding sizes without retraining.

We use the AdamW optimizer with mixed-precision training enabled via bf16. The learning rate is set to 0.1 with a cosine decay schedule and a warmup ratio of 0.1. For efficient gradient accumulation, we employ a per-device batch size of 512 with 8 gradient accumulation steps, resulting in an effective batch size of 4,096. We train for exactly one epoch (num_train_epochs=1) and evaluate the model at fixed step intervals. To prevent overfitting on duplicate pairs during contrastive learning, we apply a no_duplicates batch sampler.

The specific non-default hyperparameters used in our experiments are summarized in Table 1.

Table 1 | Training Hyperparameters

Parameter Value
Optimizer AdamW (beta2: 0.9999, epsilon: 1e-10)
Learning Rate 0.1
LR Scheduler Cosine Decay
Warmup Ratio 0.1
Batch Size (per device) 512
Gradient Accumulation Steps 8
Training Epochs 1
Precision BF16 (bf16: True)
Evaluation Strategy Steps
Dataloader Workers 4
Batch Sampler no_duplicates
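For readers reproducing this setup, the settings in Table 1 map onto a Sentence Transformers trainer configuration roughly like the sketch below. Argument names follow recent sentence-transformers releases (v3+); the output path is illustrative, and the Matryoshka loss wiring is indicated only in a comment since it requires a model instance.

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="sse-static-retrieval",          # illustrative path
    num_train_epochs=1,
    per_device_train_batch_size=512,
    gradient_accumulation_steps=8,              # effective batch size 4,096
    learning_rate=0.1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    adam_beta2=0.9999,
    adam_epsilon=1e-10,
    bf16=True,
    eval_strategy="steps",
    dataloader_num_workers=4,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # no duplicate pairs per batch
)

# The Matryoshka dimensions from Section 5.1 would wrap the base contrastive
# loss, e.g.:
# loss = MatryoshkaLoss(model, base_loss,
#                       matryoshka_dims=[512, 256, 128, 64, 32])
```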

5.2 Training Datasets

To ensure robust generalization across various retrieval and semantic similarity tasks, we train on a diverse collection of 15 datasets. These datasets cover question answering (QA), natural language inference (NLI), and information retrieval (IR) domains. All datasets are processed with the Matryoshka Loss function during training. The complete list of training corpora is provided in Table 2.

Table 2 | Training Datasets

Dataset Domain Ratio
squad Question Answering ~2%
trivia_qa Question Answering ~2%
allnli Natural Language Inference ~2%
pubmedqa Scientific QA ~2%
hotpotqa Multi-hop QA ~2%
miracl Multilingual IR ~2%
mr_tydi Multilingual IR ~2%
msmarco Web Search IR ~5%
msmarco_10m Large-scale IR ~45%
msmarco_hard Hard Negative Mining ~2%
mldr Long Document Retrieval ~2%
s2orc Scientific Text ~14%
swim_ir Semantic Web IR ~2%
paq Question Answering ~14%
nq Natural Questions ~2%

5.3 Experimental Models

We compared SSE against two baselines:

  • Static Embedding (no DyT): The standard baseline without DyT layers.
  • Static Embedding + DyT: A variant incorporating Dynamic Tanh normalization.

$$
\begin{aligned}
& \text{------------------------------------------------------------------------------} \\
& \textbf{Algorithm 3: Dynamic Tanh normalization} \\
& \text{------------------------------------------------------------------------------} \\
& \textbf{Input: } x: (B, E) \\
& \textbf{Output: } y: (B, E) \\
& \textbf{Parameters: } \alpha: (1) \quad \beta, \gamma: (E) \\
& \quad 1:\quad x \leftarrow \mathrm{Tanh}(\alpha \cdot x) \\
& \quad 2:\quad y: (B, E) \leftarrow \gamma \cdot x + \beta \\
& \quad 3:\quad \textbf{return } y \\
& \text{------------------------------------------------------------------------------}
\end{aligned}
$$
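The structural difference between the two variants is where the shift is applied: standard DyT (Algorithm 3) adds $\beta$ after the tanh, so outputs can leave $(-\gamma, \gamma)$, while Separable DyT (Algorithm 2) keeps the full affine map inside the tanh and therefore stays bounded per dimension. A small NumPy check with deliberately large toy values makes this concrete:

```python
import numpy as np

x = np.array([0.5, -2.0, 3.0])
beta = np.full(3, 2.0)        # exaggerated shift, chosen for illustration
gamma = np.ones(3)

# Standard DyT (Algorithm 3): scalar alpha, shift applied *after* tanh.
y_dyt = gamma * np.tanh(0.8 * x) + beta

# Separable DyT (Algorithm 2): per-dimension alpha, shift *inside* tanh.
alpha = np.full(3, 0.8)
y_sep = gamma * np.tanh(alpha * x + beta)
```

With this shift, every standard-DyT output exceeds 1, while every Separable DyT output remains strictly inside the unit interval; the per-dimension $\alpha_k$ additionally lets each feature learn its own saturation point.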

5.4 Training Results

Figure 4 presents a comparative analysis of training loss and gradient-related metrics across three model variants.

Loss Convergence (Figure 4 (a)): All three models demonstrate nearly identical loss trajectories, rapidly decreasing from approximately 19 to stabilize around a plateau between 2 and 3. This convergence pattern indicates that the core objective function is effectively satisfied by all variants. Consequently, the observed performance differences are not attributable to variations in primary loss minimization speed or final loss values.

Gradient Dynamics (Figure 4 (b)): Distinct behaviors emerge when examining gradient magnitude metrics. The Static Embedding baseline (purple) maintains consistently low values throughout training, suggesting potential issues such as vanishing gradients or excessive feature suppression. In contrast, the SSE model (pink) displays higher initial variance before settling at a moderate level of approximately 0.2. This trajectory reflects an active modulation of gradient flow, which prevents the saturation of activation functions by large-magnitude inputs. By sustaining non-zero gradients for informative dimensions, SSE facilitates continuous parameter updates during the training process, unlike the baseline where feature learning appears to stagnate early on.


Figure 4 | Comparison of (a) Loss and (b) Gradient Norm Across Training Steps.

6 Evaluations

6.1 NanoBEIR mean nDCG@10 progression by steps

Figure 5 illustrates the training dynamics and final performance of SSE compared to the baselines. The results demonstrate that SSE consistently outperforms both baseline models throughout the training process, particularly in later stages (after step 1k).

  • Superior Convergence: As shown by the pink line, SSE achieves a peak mean nDCG@10 of approximately 0.5124, surpassing the standard Static Embedding baseline (purple line), which plateaus around 0.5068. This represents a relative improvement of roughly 1.3% in retrieval accuracy.
  • Necessity of Separability: Notably, the variant with standard DyT (cyan line) underperforms compared to the baseline in the long run, fluctuating between 0.497 and 0.503. This suggests that applying a global or non-separable normalization can inadvertently suppress informative features or fail to address dimension-specific instability. In contrast, SSE's Separable DyT mechanism successfully solves these issues by adapting gradients per dimension.


Figure 5 | NanoBEIR mean nDCG@10 Across Training Steps.

Table 3 | NanoBEIR English nDCG@10 comparison.

Model NanoArguAna NanoClimateFEVER NanoDBPedia NanoFEVER NanoFiQA2018 NanoHotpotQA NanoMSMARCO NanoNFCorpus NanoNQ NanoQuoraRetrieval NanoSCIDOCS NanoSciFact NanoTouche2020 Mean (all values are nDCG@10)
SSE(Static Embedding + Separable DyT) 0.4105 0.2998 0.5493 0.6808 0.3744 0.7021 0.4132 0.2982 0.4652 0.9094 0.3381 0.6176 0.6029 0.5124
Static Embedding + DyT 0.3615 0.2717 0.5632 0.6716 0.3393 0.6765 0.4367 0.3142 0.4674 0.9094 0.3267 0.6129 0.5816 0.5025
Static Embedding (No DyT) 0.3884 0.3005 0.5552 0.7125 0.3573 0.6783 0.4219 0.2955 0.4638 0.8979 0.3264 0.6076 0.5834 0.5068

6.2 Matryoshka Evaluations

To validate the effectiveness of our proposed Stable Static Embedding (SSE) framework, we conducted comprehensive experiments on the NanoBEIR benchmark, measuring retrieval performance via mean nDCG@10 across varying embedding dimensions ranging from 32 to 512. We compared SSE against standard static embeddings and a strong reference model (static-retrieval-mrl-en-v1). The results, summarized in Table 4 and visualized in Figure 1, demonstrate that SSE consistently achieves superior retrieval performance while maintaining computational efficiency.

As illustrated in Figure 6, the SSE model (pink line) exhibits a distinct advantage over all baselines as embedding dimensions increase. While the reference model (static-retrieval-mrl-en-v1) initially outperforms SSE at dimension 32 with an nDCG@10 of 0.3532 compared to SSE's 0.3448, this trend reverses rapidly starting from dimension 64. At dimension 64, SSE achieves a score of 0.4275, surpassing the reference model by approximately 2.4%. This performance gap widens significantly as the embedding capacity grows; at dimension 512, SSE reaches a peak nDCG@10 of 0.5124, exceeding the reference model's 0.4957 and the standard "Static Embedding + DyT" baseline (0.5025). These results confirm that the proposed architecture scales effectively with embedding dimensionality, leveraging additional capacity more efficiently than conventional methods.

The ablation study within Table 4 further isolates the contribution of our core component, Separable DyT. By comparing SSE against a variant using non-separable Dynamic Tanh normalization ("Static Embedding + DyT"), we observe that SSE consistently outperforms this baseline across all dimensions (e.g., at dim 512: 0.5124 vs. 0.5025). This performance gain validates the hypothesis presented in Section 4.2: applying normalization independently to each dimension allows for more precise magnitude-adaptive gradient gating. Specifically, Separable DyT suppresses unstable high-magnitude dimensions while preserving informative low-magnitude features, thereby constructing a more robust embedding geometry than standard normalization techniques which treat dimensions collectively.

Finally, the results highlight the favorable efficiency-accuracy trade-off of SSE. Even at lower dimensions (e.g., 64 or 128), where computational cost is minimal, SSE maintains state-of-the-art performance relative to larger models. This suggests that the Separable DyT module effectively compresses information density into fewer dimensions without sacrificing retrieval quality. Consequently, SSE offers a compelling solution for efficient semantic search applications, delivering high accuracy in resource-constrained environments while scaling robustly as dimensionality increases.
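The Matryoshka evaluation above can be reproduced mechanically: truncate each embedding to its first $d$ coordinates and re-normalize before scoring. A minimal sketch, assuming L2-normalized cosine scoring and random toy vectors in place of real sentence embeddings:

```python
import numpy as np

def truncate_and_normalize(vecs, dim):
    """Keep the first `dim` coordinates and L2-normalize, so cosine
    similarity reduces to a plain dot product at every truncation level."""
    v = vecs[..., :dim]
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 512))          # two toy 512-d "sentence embeddings"
sims = {d: float(truncate_and_normalize(full, d)[0]
                 @ truncate_and_normalize(full, d)[1])
        for d in (32, 64, 128, 256, 512)}
```

Because the Matryoshka objective concentrates information in the leading coordinates, the same prefix-truncation procedure applied to a trained model yields the dimension-versus-nDCG@10 curve reported in Table 4.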


Figure 6 | NanoBEIR English mean nDCG@10 vs Matryoshka Embedding Truncation.

Table 4 | NanoBEIR English mean nDCG@10 vs Matryoshka Embedding Truncation.

Model 32 64 128 256 512 1024
SSE (Static Embedding + Separable DyT) 0.3448 0.4275 0.4659 0.4969 0.5124 -
Static Embedding + DyT 0.3338 0.4134 0.4622 0.4919 0.5025 -
Static Embedding (no DyT) 0.3367 0.4161 0.4625 0.4912 0.5068 -
static-retrieval-mrl-en-v1 (For reference) 0.3532 0.4176 0.4622 0.4819 0.4957 0.5031

6.3 Performance Analysis: English Retrieval Task

In the English retrieval task (Figure 7), SSE exhibits even more pronounced advantages over existing baselines. The scatter plot reveals that while Transformer models achieve the highest absolute accuracy (>0.60 nDCG@10), they suffer from severe latency constraints (<10,000 QPS). Conversely, most static embedding methods struggle to reach an nDCG@10 of 0.50.

Our SSE variants (stable-static-embedding-fast-retrieval-mrl-en) achieve a Mean nDCG@10 above 0.50 with a throughput exceeding 50,000 QPS. Notably, the results indicate a 2.16x speedup over Static Embedding approaches while maintaining competitive accuracy. This substantial efficiency gain confirms that the magnitude-adaptive gradient flow introduced by Separable DyT prevents the saturation of embedding dimensions, thereby preserving information richness even in high-dimensional static representations.


Figure 7 | NanoBEIR English Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on TREC-COVID and Quora using an Intel® Core™ Ultra 7 265K (3.90 GHz) with batch size 32.


Figure 8 | Retrieval performance (nDCG@10) across NanoBEIR English tasks.

6.4 Performance Analysis: Japanese Retrieval Task

Figure 9 illustrates the performance distribution on the Japanese retrieval benchmark. The results clearly delineate a trade-off frontier where Transformer-based models occupy the high-accuracy, low-throughput region (left side), while standard static embeddings are clustered in the lower-performance zone despite their speed.

Our proposed SSE models (stable-static-embedding-fast-retrieval-mrl-ja and bilingual variants) successfully bridge this gap, positioning themselves in the top-right quadrant of the Pareto frontier. Specifically, our method achieves a Mean nDCG@10 exceeding 0.45 while maintaining a throughput of approximately 60,000 QPS. This performance demonstrates that SSE not only outperforms standard static embeddings by a significant margin but also offers a 1.46x speedup compared to Static Embedding baselines for comparable accuracy levels.


Figure 9 | NanoBEIR Japanese Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on Miracl using an Intel® Core™ Ultra 7 265K (3.90 GHz) with batch size 32.


Figure 10 | Retrieval performance (nDCG@10) across NanoBEIR Japanese tasks.

6.5 Spectral Analysis of Learned Embeddings

The spectral analysis via PCA of the learned embedding matrices provides insights into the representation geometry underlying SSE's superior performance. As illustrated in Figure 11, the baseline Static Embedding exhibits a smooth and gradual decay of normalized eigenvalues across all 512 dimensions, suggesting that semantic variance is distributed across many directions and the effective rank remains relatively high.

In contrast, Standard DyT exhibits an abrupt spectral cliff around PC ≈ 480, indicating a sudden collapse of variance in the tail dimensions. This behavior suggests an unstable compression effect where high-dimensional variance is abruptly suppressed, potentially disrupting the embedding geometry.

SSE, on the other hand, demonstrates an earlier yet smoother eigenvalue decay around PC ≈ 430, after which the remaining dimensions decay rapidly toward zero. This pattern suggests that SSE implicitly performs a form of low-rank regularization, concentrating semantic variance into a more compact subspace while suppressing noise-dominated directions. Such controlled compression likely improves the stability of cosine similarity and distance geometry, which may contribute to the enhanced retrieval performance observed in downstream evaluations.


Figure 11 | PCA Spectrum on the 13 NanoBEIR English Datasets: Normalized Eigenvalue Decay (a) Linear Scale, (b) Logarithmic Scale.

7 Discussion

7.1 Scope and Limitations

It is important to note that the validity of SSE has been empirically confirmed specifically within the context of static text embeddings. In our current framework, Separable DyT operates on pre-defined embedding vectors (e.g., word or item IDs) retrieved via lookup tables. While this setting represents a critical bottleneck in many recommendation and NLP systems, the method's applicability to other forms of representation remains an open question.

7.2 Future Work

Extending SSE to broader contexts represents a promising direction for future research, which we hope other groups will take up and advance. Specifically, the following areas remain to be explored:

  • Dynamic and Contextual Embeddings: It is expected that Separable DyT will be evaluated for integration into Transformer-based architectures (e.g., as an alternative or complement to LayerNorm) by future researchers. This would aim to stabilize training in deep networks where hidden states vary dynamically with input context.
  • Cross-Modal Generalization: The magnitude-adaptive principle is not inherently tied to text data. Future work is anticipated to explore the application of SSE to image patch embeddings, graph node representations, and other modalities where high-dimensional vector stability is crucial.
  • Optimization Dynamics: It is expected that future studies will analyze how Separable DyT interacts with various optimizers (e.g., AdamW vs. SGD) in non-static settings. This research aims to determine whether the learnable parameters $\alpha_k, \beta_k, \gamma_k$ require different initialization or regularization strategies when applied beyond static lookup tables.

8 Published Models

8.1 SSE series

RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en

  • SSE for Retrieval MRL English version

Table 5 | NanoBEIR English Evaluation.

Dataset nDCG@10 MRR@10 MAP@100
NanoBEIR Mean 0.5124 0.5640 0.4317
NanoClimateFEVER 0.2998 0.3611 0.2344
NanoDBPedia 0.5493 0.7492 0.4247
NanoFEVER 0.6808 0.6318 0.6105
NanoFiQA2018 0.3744 0.4197 0.3162
NanoHotpotQA 0.7021 0.7679 0.6273
NanoMSMARCO 0.4132 0.3537 0.3733
NanoNFCorpus 0.2982 0.4889 0.1091
NanoNQ 0.4652 0.3992 0.4028
NanoQuoraRetrieval 0.9094 0.9122 0.8847
NanoSCIDOCS 0.3381 0.5509 0.2604
NanoArguAna 0.4105 0.3193 0.3325
NanoSciFact 0.6176 0.5933 0.5824
NanoTouche2020 0.6029 0.7852 0.4539

RikkaBotan/stable-static-embedding-fast-retrieval-mrl-ja

  • SSE for Retrieval MRL Japanese version

Table 6 | NanoBEIR Japanese Evaluation.

Dataset nDCG@10 MRR@10 MAP@100
NanoBEIR Mean 0.4507 0.5090 0.3695
NanoClimateFEVER 0.3110 0.4208 0.2347
NanoDBPedia 0.5596 0.7652 0.4000
NanoFEVER 0.5611 0.5003 0.4923
NanoFiQA2018 0.3247 0.3731 0.2692
NanoHotpotQA 0.4795 0.5758 0.4182
NanoMSMARCO 0.3845 0.3191 0.3335
NanoNFCorpus 0.2736 0.4544 0.1014
NanoNQ 0.4218 0.3658 0.3572
NanoQuoraRetrieval 0.7786 0.7750 0.7428
NanoSCIDOCS 0.3026 0.4850 0.2192
NanoArguAna 0.3521 0.2686 0.2793
NanoSciFact 0.6372 0.6100 0.5990
NanoTouche2020 0.4731 0.7036 0.3572

RikkaBotan/stable-static-embedding-fast-retrieval-mrl-bilingual-ja-en

  • SSE for Retrieval MRL Bilingual version (English & Japanese)

Table 7 | NanoBEIR English Evaluation.

Dataset nDCG@10 MRR@10 MAP@100
NanoBEIR Mean 0.5073 0.5563 0.4207
NanoClimateFEVER 0.3239 0.4045 0.2612
NanoDBPedia 0.5647 0.7321 0.4262
NanoFEVER 0.6450 0.5790 0.5514
NanoFiQA2018 0.3374 0.3838 0.2766
NanoHotpotQA 0.6897 0.7505 0.6177
NanoMSMARCO 0.4463 0.3621 0.3740
NanoNFCorpus 0.2844 0.4456 0.1071
NanoNQ 0.4851 0.4217 0.4186
NanoQuoraRetrieval 0.8554 0.8540 0.8202
NanoSCIDOCS 0.3376 0.5482 0.2566
NanoArguAna 0.3941 0.3154 0.3279
NanoSciFact 0.6185 0.5977 0.5881
NanoTouche2020 0.6123 0.8369 0.4432

Table 8 | NanoBEIR Japanese Evaluation.

Dataset nDCG@10 MRR@10 MAP@100
NanoBEIR Mean 0.4511 0.5141 0.3772
NanoClimateFEVER 0.2979 0.4005 0.2353
NanoDBPedia 0.5429 0.7633 0.4059
NanoFEVER 0.5133 0.4643 0.4661
NanoFiQA2018 0.3174 0.3669 0.2619
NanoHotpotQA 0.5000 0.5672 0.4234
NanoMSMARCO 0.4372 0.3865 0.4022
NanoNFCorpus 0.2866 0.5185 0.1177
NanoNQ 0.3987 0.3500 0.3527
NanoQuoraRetrieval 0.7944 0.8100 0.7685
NanoSCIDOCS 0.3153 0.5127 0.2322
NanoArguAna 0.3721 0.2873 0.2990
NanoSciFact 0.6216 0.5904 0.5804
NanoTouche2020 0.4662 0.6656 0.3589

8.2 Quantized SSE series

RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-en

  • Quantized SSE for Retrieval MRL English version

Table 9 | NanoBEIR English Evaluation.

Dataset nDCG@10 MRR@10 MAP@100
NanoBEIR Mean 0.5110 0.5645 0.4312
NanoClimateFEVER 0.3127 0.3822 0.2439
NanoDBPedia 0.5472 0.7440 0.4252
NanoFEVER 0.6870 0.6402 0.6191
NanoFiQA2018 0.3750 0.4155 0.3129
NanoHotpotQA 0.6927 0.7572 0.6205
NanoMSMARCO 0.4105 0.3504 0.3694
NanoNFCorpus 0.3063 0.4989 0.1148
NanoNQ 0.4523 0.3884 0.3941
NanoQuoraRetrieval 0.9147 0.9222 0.8944
NanoSCIDOCS 0.3345 0.5562 0.2622
NanoArguAna 0.4154 0.3151 0.3257
NanoSciFact 0.5972 0.5774 0.5703
NanoTouche2020 0.5979 0.7910 0.4526

RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-ja

  • Quantized SSE for Retrieval MRL Japanese version

Table 10 | NanoBEIR Japanese Evaluation.

Dataset nDCG@10 MRR@10 MAP@100
NanoBEIR Mean 0.4477 0.5088 0.3675
NanoClimateFEVER 0.3152 0.4258 0.2417
NanoDBPedia 0.5554 0.7767 0.3962
NanoFEVER 0.5536 0.4907 0.4827
NanoFiQA2018 0.3160 0.3614 0.2653
NanoHotpotQA 0.4722 0.5669 0.4136
NanoMSMARCO 0.3929 0.3237 0.3371
NanoNFCorpus 0.2686 0.4584 0.0962
NanoNQ 0.4170 0.3607 0.3571
NanoQuoraRetrieval 0.7768 0.7750 0.7393
NanoSCIDOCS 0.2939 0.4774 0.2197
NanoArguAna 0.3471 0.2617 0.2727
NanoSciFact 0.6387 0.6127 0.6001
NanoTouche2020 0.4732 0.7240 0.3560

RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-bilingual-ja-en

  • Quantized SSE for Retrieval MRL Bilingual version (English & Japanese)

Table 11 | NanoBEIR English Evaluation.

Dataset nDCG@10 MRR@10 MAP@100
NanoBEIR Mean 0.5049 0.5526 0.4197
NanoClimateFEVER 0.3166 0.3874 0.2511
NanoDBPedia 0.5604 0.7321 0.4244
NanoFEVER 0.6511 0.5871 0.5595
NanoFiQA2018 0.3179 0.3541 0.2617
NanoHotpotQA 0.6840 0.7459 0.6191
NanoMSMARCO 0.4417 0.3616 0.3748
NanoNFCorpus 0.2939 0.4535 0.1202
NanoNQ 0.4952 0.4287 0.4251
NanoQuoraRetrieval 0.8528 0.8533 0.8190
NanoSCIDOCS 0.3335 0.5460 0.2551
NanoArguAna 0.3978 0.3202 0.3326
NanoSciFact 0.6076 0.5842 0.5733
NanoTouche2020 0.6105 0.8298 0.4406

Table 12 | NanoBEIR Japanese Evaluation.

Dataset nDCG@10 MRR@10 MAP@100
NanoBEIR Mean 0.4493 0.5083 0.3744
NanoClimateFEVER 0.2883 0.3860 0.2218
NanoDBPedia 0.5458 0.7632 0.4048
NanoFEVER 0.4956 0.4403 0.4421
NanoFiQA2018 0.3224 0.3667 0.2640
NanoHotpotQA 0.4866 0.5444 0.4117
NanoMSMARCO 0.4578 0.4085 0.4226
NanoNFCorpus 0.2731 0.4844 0.1138
NanoNQ 0.3944 0.3406 0.3436
NanoQuoraRetrieval 0.8003 0.8179 0.7766
NanoSCIDOCS 0.3156 0.5133 0.2325
NanoArguAna 0.3635 0.2758 0.2871
NanoSciFact 0.6341 0.6020 0.5903
NanoTouche2020 0.4628 0.6646 0.3566

9 Implementations

9.1 Modeling

We implemented SSE using PyTorch within the sentence-transformers framework. Our model inherits from the library's InputModule, ensuring full compatibility with standard input processing and inference workflows.

SSE Modeling
"""
coding = utf-8
Copyright 2026 Rikka Botan. All rights reserved
Licensed under "MIT License"
Stable Static Embedding official PyTorch implementation
"""

from __future__ import annotations
import os
from pathlib import Path
from safetensors.torch import save_file as save_safetensors_file
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Dict
from dataclasses import dataclass
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
from sentence_transformers.models.InputModule import InputModule


class SeparableDyT(nn.Module):
    def __init__(
        self,
        hidden_dim: int,
        alpha_init: float = 0.5
    ):
        super().__init__()
        self.alpha = nn.Parameter(alpha_init*torch.ones(hidden_dim))
        self.beta = nn.Parameter(torch.ones(hidden_dim))
        self.bias = nn.Parameter(torch.zeros(hidden_dim))
    
    def forward(
        self,
        x: torch.Tensor
    ) -> torch.Tensor:
        x = self.beta * F.tanh(self.alpha * x + self.bias)
        return x


class SSE(InputModule):
    """
    Stable Static Embedding (SSE)
    StaticEmbedding-compatible Sentence-Transformers module
    """

    def __init__(
        self,
        tokenizer: Tokenizer | PreTrainedTokenizerFast,
        vocab_size: int,
        hidden_dim: int = 1024,
        **kwargs,
    ):
        super().__init__()

        if isinstance(tokenizer, PreTrainedTokenizerFast):
            tokenizer = tokenizer._tokenizer
        elif not isinstance(tokenizer, Tokenizer):
            raise ValueError("Tokenizer must be a fast (Rust) tokenizer")

        self.tokenizer: Tokenizer = tokenizer
        self.tokenizer.no_padding()

        self.embedding = nn.EmbeddingBag(vocab_size, hidden_dim)
        self.dyt = SeparableDyT(hidden_dim)

        self.embedding_dim = hidden_dim

        # For model card compatibility
        self.base_model = kwargs.get("base_model", None)

    # Tokenization (StaticEmbedding-compatible)
    def tokenize(
        self,
        texts: list[str],
        **kwargs
    ) -> dict[str, torch.Tensor]:
        encodings = self.tokenizer.encode_batch(texts, add_special_tokens=False)
        encodings_ids = [encoding.ids for encoding in encodings]

        offsets = torch.from_numpy(
            np.cumsum(
                [0] + [len(token_ids) for token_ids in encodings_ids[:-1]]
            )
        )
        input_ids = torch.tensor(
            [token_id for token_ids in encodings_ids for token_id in token_ids],
            dtype=torch.long
        )
        return {
            "input_ids": input_ids,
            "offsets": offsets
        }

    # Forward
    def forward(
        self,
        features: Dict[str, torch.Tensor],
        **kwargs,
    ) -> Dict[str, torch.Tensor]:
        x = self.embedding(features["input_ids"], features["offsets"])
        x = self.dyt(x)
        features["sentence_embedding"] = x
        return features

    # Required APIs
    def get_sentence_embedding_dimension(self) -> int:
        return self.embedding_dim

    @property
    def max_seq_length(self) -> float:
        return torch.inf
    
    def save(
        self,
        output_path: str,
        *args,
        safe_serialization: bool = True,
        **kwargs,
    ) -> None:
        os.makedirs(output_path, exist_ok=True)

        if safe_serialization:
            save_safetensors_file(
                self.state_dict(),
                os.path.join(output_path, "model.safetensors"),
            )
        else:
            torch.save(
                self.state_dict(),
                os.path.join(output_path, "pytorch_model.bin"),
            )

        self.tokenizer.save(
            str(Path(output_path) / "tokenizer.json")
        )

    @classmethod
    def load(
        cls,
        model_name_or_path: str,
        **kwargs,
    ):
        allowed_keys = {
            "cache_dir",
            "local_files_only",
            "force_download",
        }
        filtered_kwargs = {
            k: v for k, v in kwargs.items() if k in allowed_keys
        }
    
        tokenizer_path = cls.load_file_path(
            model_name_or_path,
            filename="tokenizer.json",
            **filtered_kwargs,
        )
        tokenizer = Tokenizer.from_file(tokenizer_path)
    
        weights = cls.load_torch_weights(
            model_name_or_path=model_name_or_path,
            **filtered_kwargs,
        )
    
        hidden_dim = weights["embedding.weight"].size(1)
        vocab_size = weights["embedding.weight"].size(0)
    
        model = cls(
            tokenizer=tokenizer,
            vocab_size=vocab_size,
            hidden_dim=hidden_dim,
        )
    
        model.load_state_dict(weights)
        return model
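Before moving on to the configs, it is worth making the SeparableDyT behavior concrete: each dimension gets its own input gain alpha, output scale beta, and bias, so every output dimension is independently squashed into the bounded range [-|beta_i|, |beta_i|]. A standalone sanity check of that bound (the module restated, using torch.tanh in place of F.tanh):

```python
import torch
import torch.nn as nn


class SeparableDyT(nn.Module):
    """Per-dimension Dynamic Tanh: beta * tanh(alpha * x + bias)."""

    def __init__(self, hidden_dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(alpha_init * torch.ones(hidden_dim))
        self.beta = nn.Parameter(torch.ones(hidden_dim))
        self.bias = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.beta * torch.tanh(self.alpha * x + self.bias)


dyt = SeparableDyT(hidden_dim=8)
x = 100.0 * torch.randn(4, 8)          # deliberately large activations
y = dyt(x)

# Outputs never exceed |beta| per dimension (all ones at initialization),
# which is what keeps gradients and activation scales under control.
print(y.abs().max().item() <= 1.0)     # True
```

Because the bound is reached smoothly rather than by clipping, gradients stay non-zero everywhere, which is the stabilizing property SSE relies on.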
SSE Configs
@dataclass
class SSESforzandoConfig:
    hidden_dim: int = 512
    vocab_size: int = 30522


@dataclass
class SSEForzandoConfig:
    hidden_dim: int = 384
    vocab_size: int = 30522
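The tokenize method above returns a single flattened input_ids tensor plus per-sentence start offsets rather than a padded batch; this is exactly the input contract of nn.EmbeddingBag, which mean-pools each contiguous segment in one fused call. A toy sketch of that contract, with made-up token IDs in place of real tokenizer output:

```python
import torch
import torch.nn as nn

# Toy setup: vocab of 10 tokens, 4-dim embeddings, mean pooling.
emb = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

# Two "sentences", [1, 2, 3] and [4, 5], flattened into one tensor,
# with offsets marking where each sentence starts.
input_ids = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])

pooled = emb(input_ids, offsets)       # shape (2, 4): one vector per sentence
print(pooled.shape)                    # torch.Size([2, 4])

# Equivalent to averaging the per-token embeddings manually:
manual = torch.stack([
    emb.weight[[1, 2, 3]].mean(dim=0),
    emb.weight[[4, 5]].mean(dim=0),
])
print(torch.allclose(pooled, manual))  # True
```

Skipping padding entirely is one reason SSE's tokenization and forward pass stay so fast.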

9.2 Inference

The model can be used with code as simple as the following.

"""
coding = utf-8
Copyright 2026 Rikka Botan. All rights reserved
Licensed under "MIT License"
Stable Static Embedding inference implementation
"""

import torch
from sentence_transformers import SentenceTransformer

# load (remote code enabled)
model = SentenceTransformer(
    "RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en",
    trust_remote_code=True,
    device="cuda" if torch.cuda.is_available() else "cpu",
    truncate_dim=256,
)

# inference
query = "What is Stable Static Embedding?"
sentences = [
    "SSE: Stable Static embedding works without attention.",
    "Stable Static Embedding is a fast embedding method designed for retrieval tasks.",
    "Static embeddings are often compared with transformer-based sentence encoders.",
    "I cooked pasta last night while listening to jazz music.",
    "Large language models are commonly trained using next-token prediction objectives.",
    "Instruction tuning improves the ability of LLMs to follow human-written prompts.",
]


with torch.no_grad():
    embeddings = model.encode(
        [query] + sentences,
        convert_to_tensor=True,
        normalize_embeddings=True,
        batch_size=32
    )

print("embeddings shape:", embeddings.shape)

# cosine similarity
similarities = model.similarity(embeddings[0], embeddings[1:])
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.05f}: {sentences[i]}")
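The truncate_dim=256 argument works because the models are trained with Matryoshka Representation Learning (MRL), so the leading dimensions of each embedding form a usable lower-dimensional embedding on their own. Mechanically, truncation is just slicing followed by re-normalization; a minimal sketch with random tensors standing in for real embeddings (assuming a 512-dim base, as in SSESforzandoConfig):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
full = torch.randn(4, 512)             # stand-ins for 512-dim embeddings

# MRL truncation: keep the leading 256 dims, then re-normalize so that
# cosine similarity still behaves as a dot product of unit vectors.
truncated = F.normalize(full[:, :256], p=2, dim=1)

print(truncated.shape)                 # torch.Size([4, 256])
print(truncated.norm(dim=1))           # each ~1.0
```

Halving the dimension this way halves index size and similarity-search cost, at a small accuracy cost that MRL training is designed to minimize.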

9.3 Quantized Modeling

We built a 4-bit quantized model (SSEQ) to optimize resource utilization. Reducing the parameter precision significantly compresses data size and storage overhead while maintaining retrieval performance comparable to the original model.
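The savings are easy to estimate. Assuming the SSESforzandoConfig shapes (vocab_size = 30522, hidden_dim = 512) and ignoring file-format overhead, packing two 4-bit codes per byte plus one float32 scale per row gives roughly an 8x reduction:

```python
vocab_size, hidden_dim = 30522, 512           # SSESforzandoConfig shapes

fp32_bytes = vocab_size * hidden_dim * 4      # float32: 4 bytes per weight
q4_bytes = vocab_size * (hidden_dim // 2)     # two 4-bit codes per byte
scale_bytes = vocab_size * 4                  # one float32 scale per row

print(f"fp32:      {fp32_bytes / 2**20:.1f} MiB")                  # 59.6 MiB
print(f"4-bit:     {(q4_bytes + scale_bytes) / 2**20:.1f} MiB")    # 7.6 MiB
print(f"reduction: {fp32_bytes / (q4_bytes + scale_bytes):.1f}x")  # 7.9x
```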

SSEQ Modeling
"""
coding = utf-8
Copyright 2026 Rikka Botan. All rights reserved
Licensed under "MIT License"
Stable Static Embedding official PyTorch implementation
"""

from __future__ import annotations
import os
from pathlib import Path
from safetensors.torch import save_file as save_safetensors_file
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Dict
from dataclasses import dataclass
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
from sentence_transformers.models.InputModule import InputModule
from safetensors.torch import load_file


def quantize_q4(weight: torch.Tensor):
    """
    weight: (vocab, dim)
    returns: packed uint8 codes (two 4-bit values per byte) and per-row scales
    """
    w = weight.detach().cpu().numpy().astype(np.float32)

    scales = np.max(np.abs(w), axis=1, keepdims=True) + 1e-8
    w_norm = w / scales

    q = np.clip(np.round((w_norm + 1) * 7.5), 0, 15).astype(np.uint8)

    # pack 2x4bit -> uint8
    packed = (q[:, 0::2] << 4) | q[:, 1::2]

    return {
        "packed": packed,
        "scales": scales.astype(np.float32),
    }


def dequantize_q4(packed: np.ndarray, scales: np.ndarray):
    hi = (packed >> 4) & 0xF
    lo = packed & 0xF

    q = np.empty((packed.shape[0], packed.shape[1]*2), dtype=np.uint8)
    q[:, 0::2] = hi
    q[:, 1::2] = lo

    w = (q.astype(np.float32) / 7.5) - 1.0
    w = w * scales
    return torch.from_numpy(w)


class SeparableDyT(nn.Module):
    def __init__(
        self,
        hidden_dim: int,
        alpha_init: float = 0.5
    ):
        super().__init__()
        self.alpha = nn.Parameter(alpha_init*torch.ones(hidden_dim))
        self.beta = nn.Parameter(torch.ones(hidden_dim))
        self.bias = nn.Parameter(torch.zeros(hidden_dim))
    
    def forward(
        self,
        x: torch.Tensor
    ) -> torch.Tensor:
        x = self.beta * F.tanh(self.alpha * x + self.bias)
        return x


class SSEQ(InputModule):
    """
    Quantized Stable Static Embedding (SSEQ)
    StaticEmbedding-compatible Sentence-Transformers module
    """

    def __init__(
        self,
        tokenizer: Tokenizer | PreTrainedTokenizerFast,
        vocab_size: int,
        hidden_dim: int = 1024,
        **kwargs,
    ):
        super().__init__()

        if isinstance(tokenizer, PreTrainedTokenizerFast):
            tokenizer = tokenizer._tokenizer
        elif not isinstance(tokenizer, Tokenizer):
            raise ValueError("Tokenizer must be a fast (Rust) tokenizer")

        self.tokenizer: Tokenizer = tokenizer
        self.tokenizer.no_padding()

        self.embedding = nn.EmbeddingBag(vocab_size, hidden_dim)
        self.dyt = SeparableDyT(hidden_dim)

        self.embedding_dim = hidden_dim

        # For model card compatibility
        self.base_model = kwargs.get("base_model", None)

    # Tokenization (StaticEmbedding-compatible)
    def tokenize(
        self,
        texts: list[str],
        **kwargs
    ) -> dict[str, torch.Tensor]:
        encodings = self.tokenizer.encode_batch(texts, add_special_tokens=False)
        encodings_ids = [encoding.ids for encoding in encodings]

        offsets = torch.from_numpy(
            np.cumsum(
                [0] + [len(token_ids) for token_ids in encodings_ids[:-1]]
            )
        )
        input_ids = torch.tensor(
            [token_id for token_ids in encodings_ids for token_id in token_ids],
            dtype=torch.long
        )
        return {
            "input_ids": input_ids,
            "offsets": offsets
        }

    # Forward
    def forward(
        self,
        features: Dict[str, torch.Tensor],
        **kwargs,
    ) -> Dict[str, torch.Tensor]:
        x = self.embedding(features["input_ids"], features["offsets"])
        x = self.dyt(x)
        features["sentence_embedding"] = x
        return features

    # Required APIs
    def get_sentence_embedding_dimension(self) -> int:
        return self.embedding_dim

    @property
    def max_seq_length(self) -> float:
        return torch.inf
    
    def save(self, output_path: str, *args, **kwargs) -> None:
        os.makedirs(output_path, exist_ok=True)

        state = self.state_dict()

        emb = state["embedding.weight"]
        q = quantize_q4(emb)

        del state["embedding.weight"]

        save_safetensors_file(
            state,
            os.path.join(output_path, "model_rest.safetensors"),
        )

        with open(os.path.join(output_path, "embedding.q4_k_m.bin"), "wb") as f:
            f.write(q["packed"].tobytes())
            f.write(q["scales"].tobytes())

        self.tokenizer.save(
            str(Path(output_path) / "tokenizer.json")
        )
    
    @classmethod
    def load(cls, model_path: str):

        tokenizer = Tokenizer.from_file(
            os.path.join(model_path, "tokenizer.json")
        )

        state = load_file(
            os.path.join(model_path, "model_rest.safetensors"),
            device="cpu"
        )

        # read q4 binary
        bin_path = os.path.join(model_path, "embedding.q4_k_m.bin")
        with open(bin_path, "rb") as f:
            raw = f.read()

        hidden = state["dyt.alpha"].shape[0]
        total_uint8 = len(raw)

        bytes_per_row = hidden // 2 + 4
        vocab = total_uint8 // bytes_per_row

        packed_size = vocab * hidden // 2

        packed = np.frombuffer(raw[:packed_size], dtype=np.uint8)
        scales = np.frombuffer(raw[packed_size:], dtype=np.float32)

        packed = packed.reshape(vocab, hidden // 2)
        scales = scales.reshape(vocab, 1)

        emb = dequantize_q4(packed, scales)

        # rebuild model
        model = cls(
            tokenizer=tokenizer,
            vocab_size=emb.shape[0],
            hidden_dim=emb.shape[1]
        )

        state["embedding.weight"] = emb
        model.load_state_dict(state)

        return model
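A quick round-trip check confirms the error behavior of the scheme above: with per-row absmax scaling and 16 uniform levels over [-1, 1], the reconstruction error per element is at most half a quantization step, i.e. about scale/15. A condensed restatement of the two functions plus the check (not the article's exact code):

```python
import numpy as np
import torch

def quantize_q4(weight: torch.Tensor):
    # Per-row absmax scaling, then a uniform 4-bit grid over [-1, 1].
    w = weight.detach().cpu().numpy().astype(np.float32)
    scales = np.max(np.abs(w), axis=1, keepdims=True) + 1e-8
    q = np.clip(np.round((w / scales + 1) * 7.5), 0, 15).astype(np.uint8)
    packed = (q[:, 0::2] << 4) | q[:, 1::2]   # two 4-bit codes per byte
    return packed, scales.astype(np.float32)

def dequantize_q4(packed: np.ndarray, scales: np.ndarray) -> torch.Tensor:
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    q[:, 0::2] = (packed >> 4) & 0xF
    q[:, 1::2] = packed & 0xF
    return torch.from_numpy((q.astype(np.float32) / 7.5 - 1.0) * scales)

torch.manual_seed(0)
w = torch.randn(100, 64)
packed, scales = quantize_q4(w)
w_hat = dequantize_q4(packed, scales)

# Per-element error, normalized by each row's scale, never exceeds
# half a quantization step: 0.5 / 7.5 = 1/15.
rel_err = ((w - w_hat).abs() / torch.from_numpy(scales)).max().item()
print(rel_err <= 1 / 15 + 1e-5)               # True
```

Because the embedding matrix dominates SSE's parameter count, bounding this per-weight error is what keeps retrieval quality nearly unchanged after quantization.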

9.4 Quantization

The following script quantizes the weights. After uploading the resulting weights and the SSEQ implementation to Hugging Face, the model can be used just like any other sentence-transformers compatible model.

"""
coding = utf-8
Copyright 2026 Rikka Botan. All rights reserved
Licensed under "MIT License"
Quantization implementation
"""

import os
from tokenizers import Tokenizer
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer
from SSE import SSE
from SSE_quantize import SSEQ


def quantize_and_save_sse_from_hf(
    hf_model_name: str,
    output_path: str,
):

    print(f"[1] Loading HF model: {hf_model_name}")

    st_model = SentenceTransformer(hf_model_name)

    sse_module = None
    for m in st_model.modules():
        if isinstance(m, SSE):
            sse_module = m
            break

    if sse_module is None:
        raise ValueError("SSE module not found in the model")

    print("[2] Extract tokenizer")

    # tokenizer
    try:
        hf_tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
        tokenizer = Tokenizer.from_str(hf_tokenizer.backend_tokenizer.to_str())
    except Exception:
        tokenizer = sse_module.tokenizer

    print("[3] Rebuild SSEQ model")

    # embedding weight
    emb_weight = sse_module.embedding.weight.detach().cpu()

    model = SSEQ(
        tokenizer=tokenizer,
        vocab_size=emb_weight.shape[0],
        hidden_dim=emb_weight.shape[1],
        base_model=hf_model_name
    )

    state = sse_module.state_dict()
    model.load_state_dict(state)

    print("[4] Quantize & Save")

    os.makedirs(output_path, exist_ok=True)
    model.save(output_path)

    print(f"[✓] Quantized model saved to: {output_path}")
  • Simple usage
quantize_and_save_sse_from_hf(
    "RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en",
    "./sse-q4")

10 Application Example

Acknowledgements

The author acknowledges the support of Saldra, Witness, and Lumina Logic Minds for providing the computational resources used in this work.

Our interest in this topic originated from reading Tom Aarsen's seminal article, Train 400x faster Static Embedding Models with Sentence Transformers, which motivated us to investigate static embeddings.

I thank the developers of sentence-transformers, Python, and PyTorch.

I thank all the researchers for their efforts to date.

I thank Japan's high standard of education.

And most of all, thank you for your interest in this blog.

About us

A Japanese independent researcher with a shy and pampered personality. Twin-tail hair is a charm point. Interested in NLP. Usually works with Python and C.

Please contact us if you have any requests for joint research, writing, speaking engagements, or employment.
