
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.


google-research-datasets/wit


WIT : Wikipedia-based Image Text Dataset

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

Key Advantages

A few unique advantages of WIT:

  • The largest publicly available multimodal dataset (at the time of writing) by number of image-text examples.
  • The first massively multilingual image-text dataset, covering 108 languages.
  • The first image-text dataset with page-level metadata and contextual information.
  • A diverse collection of concepts and real-world entities.
  • Challenging real-world test sets.

You can learn more about WIT Dataset from our arXiv paper.

Latest Updates

2021 April: We are happy to share that our paper was accepted at the SIGIR conference. The paper, slides, and presentation are available from the ACM site.

2021 September: The WIT Image-Text Competition is live on Kaggle. Our collaborators at Wikimedia Research blogged about the competition and have made available the raw pixels and ResNet-50 embeddings for the images in this set. See also our Google AI blog post.

2022 April: We are happy to share that the WIT paper and dataset were awarded the Wikimedia Foundation's Research Award of the Year (tweet 1, tweet 2). We are deeply honored and thankful for the recognition.

2022 May: We have released the WIT validation set and test set. Please see the data page for download links.

2022 Oct: Authoring Tools for Multimedia Content proposal accepted at TREC 2023

2023 Apr: AToMiC accepted at SIGIR 2023.

2023 Apr: WikiWeb2M Dataset released.

2023 May: Accepted submissions at WikiWorkshop 2023.

  • WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset (pdf, arXiv)
  • Building Authoring Tools for Multimedia Content with Human-in-the-loop Relevance Annotations (pdf)
  • Characterizing Image Accessibility on Wikipedia across Languages (pdf)

WIT Example

Wikipedia Page

For example, let's take the Wikipedia page for Half Dome, Yosemite in CA.

WIT Wikipedia Half Dome Image

From the Wikipedia page for Half Dome : Photo by DAVID ILIFF. License: CC BY-SA 3.0

Wikipedia Page with Annotations of what we can extract

From this page, we highlight the key pieces of data we can extract: images, their associated text snippets, and some contextual metadata.

WIT Half Dome Page with Annotations

By carefully extracting and filtering these, we obtain clean, high-quality image-text examples that can be used in multimodal modeling.

Motivation

Multimodal visio-linguistic models rely on rich datasets to learn the relationship between images and texts. Large image-text datasets can significantly improve performance, as recent work has shown. Furthermore, the lack of language coverage in existing datasets (which are mostly English-only) impedes research in the multilingual multimodal space – a lost opportunity, given the potential of images (as a language-agnostic medium) to help improve multilingual textual understanding.

To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.
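The filtering step can be illustrated with a small sketch. The heuristics below (a minimum caption length, rejecting captions that are just file names) are illustrative assumptions only, not the actual rules used to construct WIT:

```python
# Illustrative sketch of image-text quality filtering.
# NOTE: these thresholds and rules are assumptions for illustration;
# they are NOT the exact filters used to build the WIT dataset.

def is_high_quality(caption: str, min_tokens: int = 3) -> bool:
    """Keep a caption only if it looks like descriptive text."""
    tokens = caption.split()
    if len(tokens) < min_tokens:
        return False  # too short to be informative
    if caption.lower().endswith((".jpg", ".jpeg", ".png", ".svg")):
        return False  # bare file name, not a real caption
    return True

examples = [
    {"image": "half_dome.jpg", "text": "Half Dome as seen from the valley floor"},
    {"image": "img001.jpg", "text": "img001.jpg"},
    {"image": "photo.png", "text": "A photo"},
]
kept = [ex for ex in examples if is_high_quality(ex["text"])]
```

Only the first example survives this sketch: the second is a bare file name and the third is too short to carry useful signal.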

The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

WIT: Dataset Numbers

| Type          | Train  | Val    | Test   | Total / Unique |
| ------------- | ------ | ------ | ------ | -------------- |
| Rows / Tuples | 37.13M | 261.8K | 210.7K | 37.6M          |
| Unique Images | 11.4M  | 58K    | 57K    | 11.5M          |
| Ref. Text     | 16.9M  | 150K   | 104K   | 17.2M / 16.7M  |
| Attr. Text    | 34.8M  | 193K   | 200K   | 35.2M / 10.9M  |
| Alt Text      | 5.3M   | 29K    | 29K    | 5.4M / 5.3M    |
| Context Texts | -      | -      | -      | 119.8M         |

WIT: Image-Text Stats by Language

| Image-Text   | # Lang | Uniq. Images  | # Lang |
| ------------ | ------ | ------------- | ------ |
| total > 1M   | 9      | images > 1M   | 6      |
| total > 500K | 10     | images > 500K | 12     |
| total > 100K | 36     | images > 100K | 35     |
| total > 50K  | 15     | images > 50K  | 17     |
| total > 14K  | 38     | images > 13K  | 38     |

Get WIT

We believe that such a powerful and diverse dataset will aid researchers in building better multimodal multilingual models and in identifying better learning and representation techniques, improving machine learning models on real-world visio-linguistic tasks.

WIT Dataset is now available for download. Please check the data page.
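The released files are gzip-compressed TSV shards. The following self-contained sketch parses one with the standard library; the column names used here (`language`, `image_url`, `caption_reference_description`) are assumptions based on the dataset description, so check the data page for the authoritative schema:

```python
import csv
import gzip
import io

# Build a tiny in-memory gzipped TSV standing in for one WIT shard.
# Column names are assumed from the dataset description; verify them
# against the official data page before relying on this layout.
header = "language\timage_url\tcaption_reference_description\n"
row = "en\thttps://example.org/half_dome.jpg\tHalf Dome from the valley floor\n"
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    f.write(header + row)

# Re-read the shard exactly as one would read a downloaded .tsv.gz file.
buf.seek(0)
with gzip.open(buf, "rt", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    rows = list(reader)
```

For a real download, replace the in-memory buffer with `gzip.open("wit_v1.train.all-0000x-of-0000y.tsv.gz", "rt")` (file names here are illustrative).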

Citing WIT

If you use the WIT dataset, you can cite our work as follows.

@inproceedings{10.1145/3404835.3463257,
author = {Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
title = {WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning},
year = {2021},
isbn = {9781450380379},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3404835.3463257},
doi = {10.1145/3404835.3463257},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2443–2449},
numpages = {7},
keywords = {dataset, multimodal, machine learning, wikipedia, multilingual, image-text retrieval, neural networks},
location = {Virtual Event, Canada},
series = {SIGIR '21}
}

License

This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Projects using WIT

MURAL (Multimodal, Multitask Retrieval Across Languages), a paper accepted at EMNLP 2021, builds on WIT.

Contact

For any questions, please contact wit-dataset@google.com. For questions to the first author, Krishna, contact information is available on his personal page, krishna2.com.

If the WIT dataset is useful to you, please do write to us about it. Be it a blog post, a research project, or a paper, we are delighted to learn about it.
