The Pile

    An 800GB Dataset of Diverse Text for Language Modeling

    What is the Pile?

    The Pile is an 825 GiB diverse, open-source language modeling dataset composed of 22 smaller, high-quality datasets.

    Download

    The Pile is hosted by the Eye.

    The format of the Pile is jsonlines data compressed using zstandard.
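
    As a rough illustration of this format, the sketch below streams records from a zstandard-compressed JSON Lines file. It assumes the third-party `zstandard` Python package is installed; the function names and the `"text"` field in the sample are illustrative assumptions, not something this page specifies.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_zst_jsonl(path: str) -> Iterator[dict]:
    """Stream records from a zstandard-compressed JSON Lines file.

    Assumption: the third-party `zstandard` package is available.
    It is imported lazily so the plain JSON Lines parser above
    works without it.
    """
    import zstandard

    with open(path, "rb") as fh:
        # Decompress incrementally instead of loading the whole file.
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from iter_jsonl(io.TextIOWrapper(reader, encoding="utf-8"))


# Small in-memory sample (hypothetical records) for the uncompressed parser.
sample = io.StringIO('{"text": "first document"}\n\n{"text": "second document"}\n')
records = list(iter_jsonl(sample))
```

    Streaming matters here because the dataset is far too large to decompress into memory at once; `iter_zst_jsonl` would be called with the path to a downloaded shard.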

    Have a model that uses or evaluates on the Pile? Let us know!

    Why is the Pile a good training set?

    Recent work has shown that, especially for large models, diversity in data sources improves a model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, models trained on the Pile show not only moderate improvements on traditional language modeling benchmarks but also significant improvements on Pile BPB.

    Why is the Pile a good benchmark?

    To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.
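
    For readers unfamiliar with the metric, here is a minimal sketch of how bits per byte can be computed. It assumes the model's loss is available as a summed negative log-likelihood in nats over the whole text, and that bytes are counted on the UTF-8 encoding of that text; the function name and example values are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the byte count
    normalizes by text length independently of the tokenizer.
    """
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return total_nll_nats / (num_bytes * math.log(2))


# Hypothetical example: a summed loss of 12.0 nats over a short string.
text = "The Pile"
num_bytes = len(text.encode("utf-8"))  # byte count, not token count
bpb = bits_per_byte(12.0, num_bytes)
```

    Normalizing by bytes rather than tokens is what allows models with different tokenizers to be compared on one scale.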

    Citing

    If you use the Pile or any of the components, please cite us!

    @article{pile,
      title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
      author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
      journal={arXiv preprint arXiv:2101.00027},
      year={2020}
    }
                    

    Leaderboard

    * indicates potential test-set overlap. Zero-shot indicates that not all of the components of the Pile were present in the training data.

    Rank  Date         Model               Organization  Test BPB
    1.    Jan 1, 2021  GPT-3 (Zero-Shot)*  OpenAI        0.7177
    2.    Jan 1, 2021  GPT-2 (Zero-Shot)*  OpenAI        1.2253
