E5-small
Text Embeddings by Weakly-Supervised Contrastive Pre-training.
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022
This model has 12 layers and the embedding size is 384.
Usage
The example below encodes queries and passages from the MS-MARCO passage ranking dataset.
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small')
model = AutoModel.from_pretrained('intfloat/e5-small')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
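The printed score matrix has one row per query and one column per passage. As a minimal sketch, the prefix convention and the reading of that matrix could be wrapped as below. The helper names `with_prefix` and `best_passage_per_query` are hypothetical and not part of the model's API, and the score values are made-up placeholders rather than real model output.

```python
from typing import List, Sequence


def with_prefix(texts: Sequence[str], kind: str) -> List[str]:
    """Prepend the "query: " or "passage: " prefix the model expects.

    Hypothetical helper for illustration only. Texts that already start
    with the requested prefix are returned unchanged.
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    prefix = f"{kind}: "
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def best_passage_per_query(scores: Sequence[Sequence[float]]) -> List[int]:
    """For each query row, return the column index of the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


queries = with_prefix(["how much protein should a female eat", "summit define"], "query")

# Placeholder matrix shaped like `scores.tolist()` (2 queries x 2 passages).
# These numbers are invented for the example, not produced by the model.
example_scores = [[90.0, 70.0], [68.0, 91.0]]

print(queries)
print(best_passage_per_query(example_scores))
```

In a real pipeline the placeholder matrix would be replaced by the output of `scores.tolist()` from the example above.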
Training Details
Please refer to our paper at https://arxiv.org/pdf/2212.03533.pdf.
Benchmark Evaluation
Check out unilm/e5 to reproduce the evaluation results on the BEIR and MTEB benchmarks.
Citation
If you find our paper or models helpful, please consider citing them as follows:
@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}
Limitations
This model only works for English texts. Long texts will be truncated to at most 512 tokens.
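One common workaround for the 512-token limit, which the source does not describe, is to split a long text into overlapping windows and embed each window separately. The sketch below covers only the windowing step, applied to a plain list of token IDs. The helper name `split_into_windows`, the 510-token window (an assumption that two positions are reserved for special tokens), and the 64-token overlap are illustrative choices, not recommendations from the authors.

```python
from typing import List, Sequence


def split_into_windows(token_ids: Sequence[int],
                       window: int = 510,
                       overlap: int = 64) -> List[List[int]]:
    """Split token IDs into consecutive windows that share `overlap` tokens.

    Hypothetical helper for illustration. Each returned window holds at
    most `window` tokens; the last one may be shorter.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if len(token_ids) == 0:
        return []

    step = window - overlap  # how far the window start advances each time
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Stand-in for real token IDs: 1200 consecutive integers.
fake_ids = list(range(1200))
print([len(w) for w in split_into_windows(fake_ids)])
```

The "query: " or "passage: " prefix also consumes tokens, so a real pipeline would need to reserve room for it in each window. How the per-window embeddings are combined (for example by averaging) is a separate design decision that this sketch leaves open.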