The Pile

    An 800GB Dataset of Diverse Text for Language Modeling

    What is the Pile?

    The Pile is an 825 GiB diverse, open-source language modeling dataset that consists of 22 smaller, high-quality datasets combined together.

    Download

    The Pile is hosted by the Eye.

    The format of the Pile is jsonlines data compressed using zstandard.
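
    Because the data is jsonlines compressed with zstandard, it can be read as a stream without decompressing the whole file first. The following is a minimal sketch, not an official loader: it assumes the third-party `zstandard` Python package is installed, and the helper names `iter_jsonl` and `open_zst_text`, the file path, and the `text` field in the usage note are illustrative assumptions rather than anything specified on this page.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines rather than failing on them
            yield json.loads(line)


def open_zst_text(path: str) -> TextIO:
    """Open a zstandard-compressed file as a UTF-8 text stream.

    Assumption: the third-party `zstandard` package is available.
    It is imported lazily so that iter_jsonl works without it.
    """
    import zstandard

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")
```

    A hypothetical use would be `for record in iter_jsonl(open_zst_text("some_shard.jsonl.zst")): print(record.get("text"))`, where both the file name and the `text` key are placeholders to check against the actual download.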

    Have a model that uses or evaluates on the Pile? Let us know!

    Why is the Pile a good training set?

    Recent work has shown that, especially for large models, diversity in data sources improves the model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, models trained on the Pile not only show moderate improvements on traditional language modeling benchmarks but also show significant improvements on Pile BPB.

    Why is the Pile a good benchmark?

    To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.
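
    As a rough illustration of what a bits-per-byte score measures, the sketch below converts a model's total negative log-likelihood over a text into bits and divides by the text's length in UTF-8 bytes. It assumes the loss is reported in nats (natural log), as is common in deep learning frameworks; the function name `bits_per_byte` is hypothetical, and this is not presented as the exact evaluation procedure behind the leaderboard.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    total_nll_nats: sum of per-token negative log-likelihoods over `text`.
    The result is normalized by UTF-8 byte count, so it does not depend
    on how the tokenizer splits the text.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / n_bytes
```

    Normalizing by bytes rather than tokens is what lets models with different tokenizers be compared on the same scale; lower values are better.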

    Citing

    If you use the Pile or any of its components, please cite us!

    @article{pile,
      title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
      author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
      journal={arXiv preprint arXiv:2101.00027},
      year={2020}
    }
                    

    Leaderboard

    * indicates potential test-set overlap. Zero-shot indicates that not all of the components of the Pile were present in the training data.

    Rank  Date         Model               Organization  Test BPB
    1.    Jan 1, 2021  GPT-3 (Zero-Shot)*  OpenAI        0.7177
    2.    Jan 1, 2021  GPT-2 (Zero-Shot)*  OpenAI        1.2253
