A fine-tuned vision-language embedding model by Racine AI for multimodal military & defense document retrieval.
What is natotan?
natotan is a domain-adapted vision-language embedding model built for multimodal military and defense document retrieval in English and French. It was created by applying LoRA (Low-Rank Adaptation) fine-tuning to Qwen/Qwen3-VL-Embedding-2B and merging the adapter weights into the base model for seamless deployment. On a custom retrieval benchmark of 5,428 query-document pairs spanning NATO and French defense publications, natotan achieves a 9.0% improvement in NDCG@1 and a 6.8% improvement in MRR over the unmodified base model, while outperforming Google's Gemini multimodalembedding@001 by over 230% in NDCG@10.
Key Findings
natotan demonstrates consistent retrieval improvements across both languages and nearly all document categories evaluated. The largest gains occur at the top of the ranking — NDCG@1 improves by 9.0% — which directly impacts the user experience in search applications where the first result matters most. Recall@5 improves from 0.843 to 0.893, meaning the correct document appears in the top 5 results for 89.3% of queries compared to 84.3% with the base model.
French-language retrieval benefits even more than English, with NDCG@1 improving by 12.3% (from 0.344 to 0.387) versus 5.8% for English (from 0.361 to 0.382). This is notable because many multimodal embedding models underperform on non-English content.
In comparison with Google's proprietary Gemini multimodalembedding@001 (1408-dimensional embeddings), natotan achieves an NDCG@10 of 0.699 versus 0.212, a difference of over 3.3x. Gemini's French performance is particularly weak at 0.132 NDCG@10, compared to 0.697 for natotan — a 5.3x gap. These results suggest that domain-adapted open-source models can substantially outperform general-purpose proprietary embeddings on specialized retrieval tasks.
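For context, these relative figures are plain ratios of the NDCG@10 scores reported in the tables below:

$$\frac{0.6990}{0.2118} \approx 3.30 \ (+230\%), \qquad \frac{0.6966}{0.1318} \approx 5.29 \ (\approx 5.3\times)$$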
Among the 16 document categories tested, natotan improves NDCG@1 in 11, with the largest relative gains in medot (+200.0%), ajmedp (+46.4%), ft (+43.4%), and un_manuals (+39.7%). Five categories show minor regressions, primarily those with small sample sizes (modern: n=14, lexicons: n=82) or specialized academic content (cahiers_pensee). Overall ranking quality (NDCG@10) improves in 13 of 16 categories.
Model Overview
| Property | Value |
|---|---|
| Architecture | Qwen3-VL (Vision-Language Transformer) |
| Fine-tuning | LoRA (Low-Rank Adaptation), merged into base weights |
| Task | Multimodal embedding / document retrieval |
| Languages | English (2,714 samples), French (2,714 samples) |
| Domain | NATO & French defense publications |
| Format | safetensors — ready for direct inference, no adapter loading needed |
Evaluation Dataset
The benchmark uses 5,428 query-document pairs from held-out documents not seen during training, split evenly across English and French. The corpus covers NATO and French military sources across 16 document categories, ranging from tactical field manuals (1,016 samples) and allied medical publications (1,138 samples) to strategic doctrine (48 samples) and UN training manuals (200 samples).
| Category | Samples | Category | Samples |
|---|---|---|---|
| amedp | 1,138 | tta | 1,100 |
| tactical | 1,016 | ajp | 916 |
| ajmedp | 224 | un_manuals | 200 |
| ft | 154 | pia | 136 |
| irsem | 132 | cahiers_pensee | 124 |
| dia | 92 | lexicons | 82 |
| strategic | 48 | other | 46 |
| modern | 14 | medot | 6 |
Source themes: French military (3,104 samples), NATO (2,324 samples).
Models Compared
Three models were evaluated on the same benchmark to provide context on both open-source and proprietary baselines.
| Model | Type | Embedding Dim |
|---|---|---|
| Gemini multimodalembedding@001 | Google proprietary, multimodal | 1408 |
| Base Qwen/Qwen3-VL-Embedding-2B | Open-source, vision-language | 2048 |
| natotan (this model) | Base + LoRA merge, domain-adapted | 2048 |
Overall Results — 3-Way Comparison
NDCG (Normalized Discounted Cumulative Gain)
| Cutoff | Gemini | Base | natotan |
|---|---|---|---|
| @1 | 0.0925 | 0.3524 | 0.3841 |
| @3 | 0.1662 | 0.6020 | 0.6456 |
| @5 | 0.1880 | 0.6362 | 0.6802 |
| @10 | 0.2118 | 0.6575 | 0.6990 |
| @20 | 0.2328 | 0.6677 | 0.7064 |
| @50 | 0.2549 | 0.6734 | 0.7097 |
| @5428 | 0.3108 | 0.6769 | 0.7104 |
Recall
| Cutoff | Gemini | Base | natotan |
|---|---|---|---|
| @1 | 0.0925 | 0.3524 | 0.3841 |
| @3 | 0.2159 | 0.7612 | 0.8106 |
| @5 | 0.2690 | 0.8430 | 0.8930 |
| @10 | 0.3427 | 0.9079 | 0.9501 |
| @20 | 0.4259 | 0.9479 | 0.9790 |
| @50 | 0.5368 | 0.9764 | 0.9954 |
| @5428 | 1.0000 | 1.0000 | 1.0000 |
MRR & MAP
| Metric | Gemini | Base | natotan |
|---|---|---|---|
| MRR | 0.1823 | 0.5785 | 0.6179 |
| MAP | 0.1823 | 0.5785 | 0.6179 |

MRR and MAP are identical here because each query has exactly one relevant document (note that Recall@1 equals NDCG@1 throughout); with a single relevant item, average precision reduces to reciprocal rank.
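In symbols, writing $r_q$ for the rank of query $q$'s single relevant document:

$$\mathrm{MAP} = \frac{1}{|Q|}\sum_{q\in Q}\mathrm{AP}(q) = \frac{1}{|Q|}\sum_{q\in Q}\frac{1}{r_q} = \mathrm{MRR}$$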
Results by Language
NDCG@10
| Language | Gemini | Base | natotan |
|---|---|---|---|
| English | 0.2917 | 0.6623 | 0.7013 |
| French | 0.1318 | 0.6527 | 0.6966 |
Recall@10
| Language | Gemini | Base | natotan |
|---|---|---|---|
| English | 0.4591 | 0.9094 | 0.9562 |
| French | 0.2262 | 0.9064 | 0.9440 |
Gemini's French performance is notably poor (NDCG@10 of 0.13 vs 0.29 in English), while both Base and natotan maintain near-parity across languages.
Full Language Breakdown — natotan vs Base
French (2,714 samples) — all metrics
| Metric | Cutoff | Base | natotan | Delta |
|---|---|---|---|---|
| NDCG | @1 | 0.3441 | 0.3865 | +0.0424 |
| NDCG | @3 | 0.5948 | 0.6442 | +0.0494 |
| NDCG | @5 | 0.6319 | 0.6779 | +0.0460 |
| NDCG | @10 | 0.6527 | 0.6966 | +0.0439 |
| NDCG | @20 | 0.6630 | 0.7050 | +0.0420 |
| NDCG | @50 | 0.6690 | 0.7085 | +0.0395 |
| Recall | @1 | 0.3441 | 0.3865 | +0.0424 |
| Recall | @3 | 0.7542 | 0.8069 | +0.0527 |
| Recall | @5 | 0.8427 | 0.8869 | +0.0442 |
| Recall | @10 | 0.9064 | 0.9440 | +0.0376 |
| Recall | @20 | 0.9469 | 0.9768 | +0.0298 |
| Recall | @50 | 0.9768 | 0.9941 | +0.0173 |
| MRR | — | 0.5727 | 0.6171 | +0.0445 |
| MAP | — | 0.5727 | 0.6171 | +0.0445 |
English (2,714 samples) — all metrics
| Metric | Cutoff | Base | natotan | Delta |
|---|---|---|---|---|
| NDCG | @1 | 0.3607 | 0.3817 | +0.0210 |
| NDCG | @3 | 0.6092 | 0.6470 | +0.0377 |
| NDCG | @5 | 0.6406 | 0.6826 | +0.0420 |
| NDCG | @10 | 0.6623 | 0.7013 | +0.0390 |
| NDCG | @20 | 0.6724 | 0.7077 | +0.0354 |
| NDCG | @50 | 0.6778 | 0.7109 | +0.0331 |
| Recall | @1 | 0.3607 | 0.3817 | +0.0210 |
| Recall | @3 | 0.7682 | 0.8143 | +0.0461 |
| Recall | @5 | 0.8434 | 0.8990 | +0.0556 |
| Recall | @10 | 0.9094 | 0.9562 | +0.0468 |
| Recall | @20 | 0.9488 | 0.9812 | +0.0324 |
| Recall | @50 | 0.9761 | 0.9967 | +0.0206 |
| MRR | — | 0.5843 | 0.6187 | +0.0344 |
| MAP | — | 0.5843 | 0.6187 | +0.0344 |
Results by Document Category
Performance varies by document type. natotan achieves the strongest gains on categories where the base model was weakest, such as tactical documents and UN manuals, while a small number of low-sample categories show minor regressions.
NDCG@10 — 3-Way Comparison (sorted by natotan score)
| Category | n | Gemini | Base | natotan |
|---|---|---|---|---|
| medot | 6 | 0.2103 | 0.427 | 0.815 |
| un_manuals | 200 | 0.1356 | 0.667 | 0.764 |
| modern | 14 | 0.5694 | 0.791 | 0.757 |
| ajmedp | 224 | 0.3231 | 0.653 | 0.750 |
| other | 46 | 0.5336 | 0.723 | 0.737 |
| lexicons | 82 | 0.1972 | 0.712 | 0.727 |
| strategic | 48 | 0.2222 | 0.633 | 0.726 |
| ft | 154 | 0.0360 | 0.655 | 0.720 |
| ajp | 916 | 0.2817 | 0.698 | 0.714 |
| tta | 1,100 | 0.1875 | 0.647 | 0.706 |
| amedp | 1,138 | 0.2589 | 0.685 | 0.694 |
| cahiers_pensee | 124 | 0.2505 | 0.682 | 0.678 |
| pia | 136 | 0.1965 | 0.656 | 0.674 |
| tactical | 1,016 | 0.1274 | 0.597 | 0.669 |
| irsem | 132 | 0.2426 | 0.654 | 0.644 |
| dia | 92 | 0.0610 | 0.612 | 0.627 |
Recall@10 — 3-Way Comparison (sorted by natotan score)
| Category | n | Gemini | Base | natotan |
|---|---|---|---|---|
| medot | 6 | 0.3333 | 0.667 | 1.000 |
| strategic | 48 | 0.4167 | 0.896 | 1.000 |
| cahiers_pensee | 124 | 0.4194 | 0.960 | 1.000 |
| lexicons | 82 | 0.3659 | 0.963 | 1.000 |
| modern | 14 | 0.8571 | 1.000 | 1.000 |
| ajmedp | 224 | 0.4866 | 0.929 | 0.978 |
| un_manuals | 200 | 0.2500 | 0.920 | 0.975 |
| pia | 136 | 0.3088 | 0.956 | 0.963 |
| other | 46 | 0.7826 | 0.957 | 0.957 |
| tta | 1,100 | 0.3064 | 0.884 | 0.956 |
| ft | 154 | 0.0844 | 0.948 | 0.955 |
| ajp | 916 | 0.4421 | 0.931 | 0.952 |
| amedp | 1,138 | 0.4156 | 0.944 | 0.936 |
| tactical | 1,016 | 0.2156 | 0.842 | 0.935 |
| irsem | 132 | 0.3939 | 0.924 | 0.932 |
| dia | 92 | 0.0870 | 0.880 | 0.924 |
MRR by Category (sorted by natotan score)
| Category | n | Base | natotan | Delta |
|---|---|---|---|---|
| medot | 6 | 0.354 | 0.750 | +0.396 |
| un_manuals | 200 | 0.587 | 0.694 | +0.107 |
| modern | 14 | 0.721 | 0.676 | -0.046 |
| ajmedp | 224 | 0.565 | 0.675 | +0.110 |
| other | 46 | 0.647 | 0.661 | +0.015 |
| ft | 154 | 0.561 | 0.644 | +0.083 |
| ajp | 916 | 0.624 | 0.637 | +0.013 |
| strategic | 48 | 0.551 | 0.637 | +0.086 |
| lexicons | 82 | 0.631 | 0.636 | +0.005 |
| tta | 1,100 | 0.572 | 0.625 | +0.053 |
| amedp | 1,138 | 0.601 | 0.616 | +0.014 |
| tactical | 1,016 | 0.523 | 0.585 | +0.062 |
| pia | 136 | 0.560 | 0.581 | +0.020 |
| cahiers_pensee | 124 | 0.595 | 0.572 | -0.023 |
| irsem | 132 | 0.570 | 0.555 | -0.014 |
| dia | 92 | 0.528 | 0.533 | +0.005 |
NDCG@1 by Category — natotan vs Base (sorted by improvement)
| Category | n | Base | natotan | Delta | Relative |
|---|---|---|---|---|---|
| medot | 6 | 0.167 | 0.500 | +0.333 | +200.0% |
| un_manuals | 200 | 0.365 | 0.510 | +0.145 | +39.7% |
| ajmedp | 224 | 0.308 | 0.451 | +0.143 | +46.4% |
| ft | 154 | 0.299 | 0.429 | +0.130 | +43.4% |
| strategic | 48 | 0.313 | 0.417 | +0.104 | +33.3% |
| tta | 1,100 | 0.350 | 0.388 | +0.038 | +10.9% |
| tactical | 1,016 | 0.324 | 0.356 | +0.032 | +10.0% |
| other | 46 | 0.391 | 0.413 | +0.022 | +5.6% |
| amedp | 1,138 | 0.358 | 0.373 | +0.016 | +4.4% |
| pia | 136 | 0.331 | 0.338 | +0.007 | +2.2% |
| ajp | 916 | 0.394 | 0.401 | +0.007 | +1.7% |
| dia | 92 | 0.315 | 0.304 | -0.011 | -3.4% |
| irsem | 132 | 0.341 | 0.326 | -0.015 | -4.4% |
| lexicons | 82 | 0.427 | 0.390 | -0.037 | -8.6% |
| modern | 14 | 0.500 | 0.429 | -0.071 | -14.3% |
| cahiers_pensee | 124 | 0.387 | 0.306 | -0.081 | -20.9% |
Quick Start
natotan is a fully merged model that requires no adapter loading. It can be used as a drop-in replacement for the base Qwen3-VL-Embedding-2B model with the same API.
```python
from transformers import AutoModel, AutoTokenizer

# natotan is fully merged; load it like any standard Hugging Face checkpoint.
model = AutoModel.from_pretrained(
    "racineai/natotan",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "racineai/natotan",
    trust_remote_code=True,
)
```
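From here, embeddings can be compared with cosine similarity. Below is a minimal sketch for text queries, assuming mean-pooled, L2-normalized hidden states; consult the base model's card for its canonical pooling and prompt format, and route document page images through the model's processor in a real multimodal pipeline.

```python
import torch
import torch.nn.functional as F

# Hypothetical example queries (one English, one French).
texts = [
    "NATO allied joint medical doctrine",
    "NATO allied joint medical doctrine (in French)",
]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# Assumption: mean pooling over non-padded tokens yields the embedding.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
emb = F.normalize(emb, dim=-1)  # 2048-dim, cosine-ready

print(emb @ emb.T)  # pairwise cosine similarities
```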
Reproducing the Merge
The LoRA adapter can be merged into the base model using the provided script. The resulting model is self-contained and does not require the adapter at inference time.
```bash
python3 merge_lora.py \
    --base_model Qwen/Qwen3-VL-Embedding-2B \
    --adapter ./lora_adapters \
    --output_dir ./merged \
    --trust_remote_code
```
Output contents:

- `config.json`
- `model.safetensors` (or sharded weights + index)
- Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, etc.)
- `MERGED_FROM_LORA.txt` (provenance marker)
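For reference, the merge itself amounts to the standard peft flow. A sketch, assuming the adapter was trained with peft (which `merge_and_unload()` requires):

```python
from transformers import AutoModel
from peft import PeftModel

# Load the base model, attach the LoRA adapter, fold it into the weights.
base = AutoModel.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B", trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "./lora_adapters").merge_and_unload()

# Persist a self-contained safetensors checkpoint.
merged.save_pretrained("./merged")
```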
Frequently Asked Questions
What is natotan?
natotan is a LoRA-fine-tuned and merged version of Qwen3-VL-Embedding-2B, optimized for multimodal military and defense document retrieval in English and French. It produces 2048-dimensional embeddings and is evaluated on a benchmark of 5,428 query-document pairs from NATO and French military publications.

How much does natotan improve over the base model?
On the custom retrieval benchmark, natotan improves NDCG@1 by 9.0% (from 0.352 to 0.384), Recall@5 by 5.9% (from 0.843 to 0.893), and MRR by 6.8% (from 0.578 to 0.618) compared to the unmodified Qwen3-VL-Embedding-2B.
How does natotan compare to Gemini multimodal embeddings?
natotan outperforms Google's Gemini multimodalembedding@001 by over 230% in NDCG@10 (0.699 vs 0.212) on the same benchmark. The gap is especially large for French-language queries, where natotan scores 0.697 NDCG@10 versus 0.132 for Gemini.
Does natotan work for both English and French?
Yes. The evaluation dataset is split evenly between 2,714 English and 2,714 French query-document pairs. natotan improves retrieval in both languages, with slightly larger gains in French (NDCG@1 +12.3%) than English (NDCG@1 +5.8%).
Do I need to load a LoRA adapter separately?
No. The adapter weights have been merged into the base model. natotan can be loaded directly with AutoModel.from_pretrained() exactly like any standard Hugging Face model, with no additional dependencies.
What types of documents does natotan work best on?
The model was evaluated on 16 categories of military documents. Among the larger categories, the biggest NDCG@10 improvements appear on tactical field manuals (+12.1%), UN training manuals (+14.6%), and allied joint medical publications (+14.9%). A small number of categories with very few samples (modern: n=14) show minor regressions.

Can natotan be used for non-military retrieval tasks?
natotan inherits the general-purpose capabilities of Qwen3-VL-Embedding-2B. While it was fine-tuned specifically on defense documents, the LoRA adaptation is lightweight and the base model's broad capabilities are preserved. Performance on out-of-domain tasks has not been formally evaluated.
Citation
```bibtex
@misc{natotan2025,
  title  = {natotan: LoRA-tuned Qwen3-VL-Embedding-2B for multimodal defense document retrieval},
  author = {Racine AI},
  year   = {2025},
  url    = {https://huggingface.co/racineai/natotan}
}
```