> 11:30

Is there a text in this dataset?

A large language model requires vast and diverse amounts of data to function as a credible natural language generator. One of the datasets most widely used to this end is The Pile, whose components include Books3, the largest repository of .txt files to train models on. Its adoption was meant to refine the prose writing of LLMs, enabling them to generate believable texts from textual data. But the labor and methods that went into composing it, together with the authorship and ownership of the texts, call for scrutiny. This talk aims to peer into some material, technical and semiotic aspects of text-based datasets for NLP, to highlight the political, as well as artistic, issues that they raise.

Programme (2023)

Thursday 23 November

“Introduction” from 9:30 to 10:00 with the conference organisers Pierre Cassou‐Noguès, Stéphane Degoutin, Arnaud Regnaud and Gwenola Wagon

