update
README.md
CHANGED
|
@@ -2603,12 +2603,15 @@ pipeline_tag: sentence-similarity
|
|
| 2603 |
---
|
| 2604 |
|
| 2605 |
|
|
|
|
|
|
|
| 2606 |
<h1 align="center">FlagEmbedding</h1>
|
| 2607 |
|
| 2608 |
|
| 2609 |
<h4 align="center">
|
| 2610 |
<p>
|
| 2611 |
<a href=#model-list>Model List</a> |
|
|
|
|
| 2612 |
<a href=#usage>Usage</a> |
|
| 2613 |
<a href="#evaluation">Evaluation</a> |
|
| 2614 |
<a href="#train">Train</a> |
|
|
@@ -2628,8 +2631,8 @@ And it also can be used in vector databases for LLMs.
|
|
| 2628 |
|
| 2629 |
************* 🌟**Updates**🌟 *************
|
| 2630 |
- 09/12/2023: New Release:
|
| 2631 |
-
- **New reranker model**: release
|
| 2632 |
-
- **update embedding model**: release bge-*-v1.5 embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
|
| 2633 |
- 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): Add script to mine hard negatives and support adding instruction during fine-tuning.
|
| 2634 |
- 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
|
| 2635 |
- 08/05/2023: Release base-scale and small-scale models, with the **best performance among models of the same size 🤗**
|
|
@@ -2661,7 +2664,7 @@ And it also can be used in vector databases for LLMs.
|
|
| 2661 |
|
| 2662 |
\*: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
|
| 2663 |
|
| 2664 |
-
\**: To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
|
| 2665 |
For example, use a bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
|
| 2666 |
|
| 2667 |
|
|
@@ -2673,7 +2676,7 @@ For examples, use bge embedding model to retrieve top 100 relevant documents, an
|
|
| 2673 |
<!-- ### How to fine-tune bge embedding model? -->
|
| 2674 |
Follow this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
|
| 2675 |
Some suggestions:
|
| 2676 |
-
- Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#
|
| 2677 |
- If you pre-train bge on your data, the pre-trained model cannot be used directly to calculate similarity; it must first be fine-tuned with contrastive learning.
|
| 2678 |
- If the accuracy of the fine-tuned model is still not high enough, we recommend using or fine-tuning the cross-encoder model (bge-reranker) to re-rank the top-k results. Hard negatives are also needed to fine-tune the reranker.
|
| 2679 |
|
|
@@ -2957,8 +2960,8 @@ Cross-encoder will perform full-attention over the input pair,
|
|
| 2957 |
which is more accurate than an embedding model (i.e., a bi-encoder) but also more time-consuming.
|
| 2958 |
Therefore, it can be used to re-rank the top-k documents returned by the embedding model.
|
| 2959 |
We train the cross-encoder on multilingual pair data.
|
| 2960 |
-
The data format is the same as embedding model, so you can fine-tune it easily following our example.
|
| 2961 |
-
For more details, please refer to [./FlagEmbedding/reranker/README.md](
|
| 2962 |
|
| 2963 |
|
| 2964 |
## Contact
|
|
|
|
| 2603 |
---
|
| 2604 |
|
| 2605 |
|
| 2606 |
+
**We recommend switching to the newest bge-base-en-v1.5, which has a more reasonable similarity distribution and is used in the same way.**
|
| 2607 |
+
|
| 2608 |
<h1 align="center">FlagEmbedding</h1>
|
| 2609 |
|
| 2610 |
|
| 2611 |
<h4 align="center">
|
| 2612 |
<p>
|
| 2613 |
<a href=#model-list>Model List</a> |
|
| 2614 |
+
<a href=#frequently-asked-questions>FAQ</a> |
|
| 2615 |
<a href=#usage>Usage</a> |
|
| 2616 |
<a href="#evaluation">Evaluation</a> |
|
| 2617 |
<a href="#train">Train</a> |
|
|
|
|
| 2631 |
|
| 2632 |
************* 🌟**Updates**🌟 *************
|
| 2633 |
- 09/12/2023: New Release:
|
| 2634 |
+
- **New reranker model**: release the cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
|
| 2635 |
+
- **Updated embedding model**: release the `bge-*-v1.5` embedding models to alleviate the similarity distribution issue and improve retrieval ability without an instruction.
|
| 2636 |
- 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): Add script to mine hard negatives and support adding instruction during fine-tuning.
|
| 2637 |
- 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
|
| 2638 |
- 08/05/2023: Release base-scale and small-scale models, with the **best performance among models of the same size 🤗**
|
|
|
|
| 2664 |
|
| 2665 |
\*: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
|
| 2666 |
|
| 2667 |
+
\**: Unlike an embedding model, the reranker is a cross-encoder and cannot be used to generate embeddings. To balance accuracy and time cost, a cross-encoder is widely used to re-rank the top-k documents retrieved by other, simpler models.
|
| 2668 |
For example, use a bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
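
The two-stage flow described in this footnote can be sketched as follows. This is a minimal illustration under stated assumptions, not the FlagEmbedding API: `retrieve_then_rerank`, `toy_embed`, and `toy_pair_score` are hypothetical names, and the two toy functions only stand in for a bge embedding model and a bge reranker.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query: str,
    passages: List[str],
    embed: Callable[[str], Sequence[float]],
    pair_score: Callable[[str, str], float],
    retrieve_k: int = 100,
    final_k: int = 3,
) -> List[str]:
    """Hypothetical helper: cheap embedding retrieval, then pairwise re-ranking."""
    # Stage 1: score every passage with the (cheap) embedding similarity.
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(p)), p) for p in passages]
    scored.sort(key=lambda item: item[0], reverse=True)
    candidates = [p for _, p in scored[:retrieve_k]]

    # Stage 2: score only the surviving candidates with the (expensive)
    # pairwise scorer, which sees the query and passage together.
    reranked = sorted(candidates, key=lambda p: pair_score(query, p), reverse=True)
    return reranked[:final_k]


def toy_embed(text: str) -> List[float]:
    """Stand-in for an embedding model: counts of a tiny fixed vocabulary."""
    vocab = ["cat", "dog", "fish"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


def toy_pair_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: number of shared words."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


if __name__ == "__main__":
    docs = ["cat cat", "dog", "cat fish", "fish"]
    print(retrieve_then_rerank("cat fish", docs, toy_embed, toy_pair_score,
                               retrieve_k=3, final_k=1))
```

In a real setup `embed` would wrap an embedding model and `pair_score` a cross-encoder. The pairwise scorer only ever sees `retrieve_k` candidates, which is what keeps the overall cost manageable.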
|
| 2669 |
|
| 2670 |
|
|
|
|
| 2676 |
<!-- ### How to fine-tune bge embedding model? -->
|
| 2677 |
Follow this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
|
| 2678 |
Some suggestions:
|
| 2679 |
+
- Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve retrieval performance.
|
| 2680 |
- If you pre-train bge on your data, the pre-trained model cannot be used directly to calculate similarity; it must first be fine-tuned with contrastive learning.
|
| 2681 |
- If the accuracy of the fine-tuned model is still not high enough, we recommend using or fine-tuning the cross-encoder model (bge-reranker) to re-rank the top-k results. Hard negatives are also needed to fine-tune the reranker.
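
As a rough illustration of the hard-negative suggestion above: one common approach is to take passages that a retriever ranks highly for a query but that are not labeled as positives. The sketch below assumes you already have a ranked list per query. `sample_hard_negatives`, `sample_range`, and `num_negatives` are hypothetical names chosen for this example and are not taken from the linked script.

```python
import random
from typing import List, Set, Tuple


def sample_hard_negatives(
    ranked_passages: List[str],
    positives: Set[str],
    sample_range: Tuple[int, int] = (2, 200),
    num_negatives: int = 15,
    seed: int = 0,
) -> List[str]:
    """Hypothetical helper: pick negatives from a window of a ranked list.

    ranked_passages: passages for one query, best-ranked first.
    positives: passages labeled as relevant; never returned as negatives.
    sample_range: [start, end) rank window to draw from. Skipping the very
        top ranks is an assumption made here to reduce the chance of
        sampling unlabeled positives.
    """
    start, end = sample_range
    window = ranked_passages[start:end]
    pool = [p for p in window if p not in positives]

    # If the pool is small, return all of it rather than failing.
    if len(pool) <= num_negatives:
        return pool

    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    return rng.sample(pool, num_negatives)


if __name__ == "__main__":
    ranked = [f"p{i}" for i in range(10)]
    print(sample_hard_negatives(ranked, {"p3"}, sample_range=(2, 8), num_negatives=3))
```

The resulting negatives can be stored next to each query and its positives in the fine-tuning data, for either the embedding model or the reranker.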
|
| 2682 |
|
|
|
|
| 2960 |
which is more accurate than an embedding model (i.e., a bi-encoder) but also more time-consuming.
|
| 2961 |
Therefore, it can be used to re-rank the top-k documents returned by the embedding model.
|
| 2962 |
We train the cross-encoder on multilingual pair data.
|
| 2963 |
+
The data format is the same as for the embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
|
| 2964 |
+
For more details, please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker).
|
| 2965 |
|
| 2966 |
|
| 2967 |
## Contact
|