Is there a text in this dataset?
A large language model requires vast and diverse amounts of data to function as a credible natural language generator. One of the most widely used datasets for this purpose is The Pile, whose components include Books3, the largest repository of .txt files for training models. Its adoption was meant to refine the prose of LLMs, so that they could generate believable texts from textual data. But the labor and methods that went into composing it, together with the authorship and ownership of its texts, call for inquiry. This talk peers into some material, technical and semiotic aspects of text-based datasets for NLP, to highlight the political, as well as artistic, issues that they raise.
Niccolò Monti is a PhD candidate under joint supervision at the Universities of Turin and Paris 8. He takes a historical and semiotic approach to the study of automatic methods in literary writing, from the Surrealists to AI. He has written on electronic literature and prompting, the creativity of automatism, and cybernetics and semantics. He also does research on the European literary avant-gardes, especially the Modernist works of James Joyce and Samuel Beckett. He is a member of the literary collective Montag, which experiments with digital simultaneous writing, textual generation and speculative fiction.
https://linktr.ee/icareide