

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.


WIT : Wikipedia-based Image Text Dataset

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

Key Advantages

A few unique advantages of WIT:

  • The largest multimodal dataset (publicly available at the time of this writing) by the number of image-text examples.
  • A massively multilingual dataset (first of its kind) with coverage for 108 languages.
  • The first image-text dataset with page-level metadata and contextual information.
  • A diverse collection of concepts and real-world entities.
  • Challenging real-world test sets.

You can learn more about WIT Dataset from our arXiv paper.

Latest Updates

2021 April: We are happy to share the good news that our paper was accepted at the SIGIR conference. From the ACM site, you can find our paper, slides, and presentation.

2021 September: The WIT Image-Text Competition is live on Kaggle. Our collaborators from Wikimedia Research blogged about it and have made the raw pixels and ResNet-50 embeddings available for the images in this set. Here is our Google AI blog post.

2022 April: We are happy to share that the WIT paper and dataset were awarded the Wikimedia Foundation's Research Award of the Year (tweet 1, tweet 2). We are deeply honored and thank you for the recognition.

2022 May: We have released the WIT validation set and test set. Please see the data page for download links.

2022 Oct: Our Authoring Tools for Multimedia Content proposal was accepted at TREC 2023.

2023 Apr: AToMiC accepted at SIGIR 2023.

2023 Apr: WikiWeb2M Dataset released.

2023 May: Accepted submissions at WikiWorkshop 2023.

  • WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset (pdf, arXiv)
  • Building Authoring Tools for Multimedia Content with Human-in-the-loop Relevance Annotations (pdf)
  • Characterizing Image Accessibility on Wikipedia across Languages (pdf)

WIT Example

Wikipedia Page

For example, let's take the Wikipedia page for Half Dome, Yosemite in CA.

WIT Wikipedia Half Dome Image

From the Wikipedia page for Half Dome : Photo by DAVID ILIFF. License: CC BY-SA 3.0

Wikipedia Page with Annotations of what we can extract

From this page, we highlight the key pieces of data that we can extract: images, their respective text snippets, and some contextual metadata.

WIT Half Dome Page with Annotations

By extracting and filtering these carefully, we get a clean, high quality image-text example that can be used in multimodal modeling.

Motivation

Multimodal visio-linguistic models rely on rich datasets to learn the relationship between images and text. Large image-text datasets have been shown to significantly improve performance, as recent work demonstrates. Furthermore, the lack of language coverage in existing datasets (which are mostly English-only) impedes research in the multilingual multimodal space; we consider this a lost opportunity given the potential of images, as a language-agnostic medium, to help improve multilingual textual understanding.

To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.
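To make the extraction idea concrete, here is a toy sketch (not the authors' actual pipeline) of pulling each image together with its nearby caption text from Wikipedia-style HTML, using only the standard library. The tag names (`figure`, `img`, `figcaption`) and the sample markup are illustrative assumptions.

```python
# Toy sketch of the extraction idea: collect each image's src, alt text,
# and figure caption from simple Wikipedia-style HTML. NOT the authors'
# production pipeline, which also applies rigorous quality filtering.
from html.parser import HTMLParser

class ImageCaptionExtractor(HTMLParser):
    """Collect (image_src, alt_text, caption) triples from simple HTML."""
    def __init__(self):
        super().__init__()
        self.pairs = []           # extracted (src, alt, caption) triples
        self._in_caption = False
        self._caption_parts = []
        self._pending = None      # (src, alt) of the image awaiting its caption

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self._pending = (attrs.get("src", ""), attrs.get("alt", ""))
        elif tag == "figcaption":
            self._in_caption = True
            self._caption_parts = []

    def handle_data(self, data):
        if self._in_caption:
            self._caption_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "figcaption" and self._in_caption:
            self._in_caption = False
            # normalize whitespace in the collected caption text
            caption = " ".join("".join(self._caption_parts).split())
            if self._pending:
                self.pairs.append((*self._pending, caption))
                self._pending = None

# Hypothetical snippet echoing the Half Dome example above.
html = """
<figure>
  <img src="Half_Dome.jpg" alt="Half Dome from the valley floor">
  <figcaption>Half Dome as viewed from the valley</figcaption>
</figure>
"""
parser = ImageCaptionExtractor()
parser.feed(html)
print(parser.pairs)
```

A real pipeline would additionally extract page- and section-level context and filter aggressively for quality, as the paper describes.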

The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

WIT: Dataset Numbers

Type            Train    Val      Test     Total / Unique
Rows / Tuples   37.13M   261.8K   210.7K   37.6M
Unique Images   11.4M    58K      57K      11.5M
Ref. Text       16.9M    150K     104K     17.2M / 16.7M
Attr. Text      34.8M    193K     200K     35.2M / 10.9M
Alt Text        5.3M     29K      29K      5.4M / 5.3M
Context Texts   -        -        -        119.8M

WIT: Image-Text Stats by Language

Image-text pairs   # Langs     Unique images   # Langs
total > 1M         9           images > 1M     6
total > 500K       10          images > 500K   12
total > 100K       36          images > 100K   35
total > 50K        15          images > 50K    17
total > 14K        38          images > 13K    38

Get WIT

We believe that such a powerful, diverse dataset will help researchers build better multimodal multilingual models and identify better learning and representation techniques, improving machine learning models on real-world tasks over visio-linguistic data.

WIT Dataset is now available for download. Please check the data page.
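The dataset is distributed as tab-separated files. Below is a minimal stdlib-only sketch of reading one such file and filtering by language; the column names used here (e.g. `language`, `caption_reference_description`) are assumptions drawn from the field descriptions on the data page, so check the header of the file you actually download.

```python
# Minimal sketch of reading a WIT-style .tsv with the standard library.
# Column names are assumptions; verify against the downloaded file's header.
import csv
import io

# Tiny inline stand-in for a real WIT shard (hypothetical rows).
sample_tsv = (
    "language\timage_url\tcaption_reference_description\n"
    "en\thttps://upload.wikimedia.org/x.jpg\tHalf Dome as viewed from the valley\n"
    "de\thttps://upload.wikimedia.org/y.jpg\tHalf Dome vom Talboden aus\n"
)

def rows_for_language(tsv_file, lang):
    """Yield rows whose 'language' column matches lang."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        if row["language"] == lang:
            yield row

english = list(rows_for_language(io.StringIO(sample_tsv), "en"))
print(len(english))  # → 1
```

For the real files, open each downloaded shard (decompressing if needed) and pass the file object to `rows_for_language` in place of the `StringIO` stand-in.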

Citing WIT

If you use the WIT dataset, you can cite our work as follows.

@inproceedings{10.1145/3404835.3463257,
author = {Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
title = {WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning},
year = {2021},
isbn = {9781450380379},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3404835.3463257},
doi = {10.1145/3404835.3463257},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2443–2449},
numpages = {7},
keywords = {dataset, multimodal, machine learning, wikipedia, multilingual, image-text retrieval, neural networks},
location = {Virtual Event, Canada},
series = {SIGIR '21}
}

License

This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Projects using WIT

See MURAL (Multimodal, Multitask Retrieval Across Languages), a paper accepted at EMNLP 2021.

Contact

For any questions, please contact wit-dataset@google.com. For questions addressed to the first author, Krishna, please see their personal page, krishna2.com, for contact information.

If the WIT dataset is useful to you, please write to us about it. Whether it is a blog post, a research project, or a paper, we are delighted to learn about it.

久久综合色一综合色88| 亚洲精品亚洲人成人网在线播放| 美女在线视频一区| 欧美久久久久中文字幕| 亚洲三级小视频| 岛国一区二区三区| 国产视频一区二区在线观看| 日本vs亚洲vs韩国一区三区二区 | 国产精品久久午夜| 国产综合成人久久大片91| 欧美精品在线视频| 免费高清视频精品| 欧美成人猛片aaaaaaa| 婷婷开心久久网| 精品国产凹凸成av人导航| 国产精品动漫网站| 日本欧美在线观看| 久久久99久久精品欧美| 国产成人8x视频一区二区| 久久蜜桃av一区二区天堂| 粉嫩久久99精品久久久久久夜| 欧美激情中文不卡| 欧美另类z0zxhd电影| 日韩精品电影在线| 国产精品嫩草影院com| 欧美精品久久久久久久久老牛影院| 久久99精品久久久久久久久久久久 | 精品国产区一区| 国产**成人网毛片九色| 亚洲二区视频在线| 国产亚洲一区二区三区在线观看 | 欧美日韩精品是欧美日韩精品| 日韩国产欧美三级| 国产午夜精品久久久久久久 | 欧美va在线播放| 色哦色哦哦色天天综合| 国产精品456露脸| 热久久久久久久|