AI News

EleutherAI releases massive AI training dataset of licensed and open domain text | TechCrunch

Jun 7, 2025
EleutherAI, an AI research organization, has unveiled the Common Pile v0.1, a massive 8-terabyte dataset of licensed and open-domain text for training AI models. Developed over two years in collaboration with AI startups and academic institutions, the dataset is a response to the ongoing legal battles over AI companies' use of copyrighted material for training.

The organization aims to provide a transparent alternative to the often-opaque data sourcing practices of major AI players, which have been criticized for limiting the accessibility of research and obscuring how models work and where they fail. The Common Pile v0.1 was assembled in consultation with legal experts and draws on sources such as public-domain books and audio content transcribed with OpenAI's Whisper speech-to-text model.
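The curation step described above, keeping only text whose license permits open redistribution, can be sketched as a simple filter. To be clear, the record layout, license names, and helper functions below are illustrative assumptions for this sketch, not the Common Pile's actual schema or tooling:

```python
# Hypothetical sketch of license-based curation; the record layout and
# the set of accepted licenses are assumptions, not the Common Pile's
# actual schema.

# Licenses treated as "open" for the purposes of this sketch.
OPEN_LICENSES = {"public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0", "mit"}

def is_openly_licensed(record: dict) -> bool:
    """Return True if the record's license tag is in the accepted set."""
    return record.get("license", "").lower() in OPEN_LICENSES

def curate(records: list[dict]) -> list[dict]:
    """Keep only records whose license permits open redistribution."""
    return [r for r in records if is_openly_licensed(r)]

docs = [
    {"text": "A public-domain novel.", "license": "public-domain"},
    {"text": "An all-rights-reserved article.", "license": "proprietary"},
    {"text": "A CC-BY-4.0 blog post.", "license": "CC-BY-4.0"},
]

kept = curate(docs)  # drops the proprietary record
```

In practice, a real pipeline would also need provenance tracking and legal review of edge cases, which is why EleutherAI reportedly built the dataset in consultation with lawyers rather than relying on license tags alone.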

The dataset was used to train two new AI models, Comma v0.1-1T and Comma v0.1-2T, which, according to EleutherAI, perform comparably to models trained on unlicensed, copyrighted data. The release is intended to demonstrate that high-performing models can be built from carefully curated, openly licensed text, challenging the notion that unlicensed material is essential for competitive performance.

EleutherAI's move is partly a correction of its past practices, as the organization previously released The Pile, which included copyrighted material and faced criticism as a result. The organization believes that increased transparency and the use of openly licensed data are crucial for the advancement of AI research.

By releasing the Common Pile v0.1, EleutherAI aims to provide a viable alternative for training AI models, thereby promoting openness and accessibility within the AI community. Looking ahead, EleutherAI plans to release open datasets more frequently in collaboration with its partners.

This commitment underscores the organization's dedication to transparency and collaboration in AI research. The dataset and the associated models were developed with a wide range of partners, including the University of Toronto, which led the research efforts.