Meta Accused Of Training AI On Copyrighted Material

Meta, the parent company of Instagram and Facebook, faces allegations from prominent authors, including Sarah Silverman, Michael Chabon, and Ta-Nehisi Coates. The authors claim that Meta has been using their copyrighted books to train its artificial intelligence (AI) models despite being warned by the company’s legal team.

The complaint, which consolidates two copyright cases, accuses Meta of training its large language models, Llama 1 and Llama 2, on an online resource that contained the authors’ works without their permission.

Large language models require substantial data for training. In the case of Llama 1, Meta openly admitted to using the Books3 section of The Pile, a publicly available dataset consisting of nearly 200,000 books.

The authors argue that Meta was fully aware of the potential legal issues surrounding the use of this dataset. They cite a series of messages between a Meta AI researcher and researchers associated with EleutherAI, the organization responsible for assembling The Pile.

During these conversations, Meta researcher Tim Dettmers expressed interest in using The Pile but asked about any legal concerns. While one EleutherAI researcher suggested there might be a fair use argument, Dettmers later stated that Meta’s lawyers had recommended avoiding Books3, as it was clear the data could not be used or published due to legal complications.

Despite this advice, Meta went ahead and used Books3 to train Llama 1. The authors further allege that Meta also used Books3 to train Llama 2, although the company declined to disclose the training datasets for that model, citing competitive reasons.

However, the authors argue that Meta’s explanation for withholding the training data is likely pretextual. They suggest that the real reason behind Meta’s decision was to avoid legal scrutiny from those whose copyrighted works were used without permission during the training of Llama 2.

The lawsuit further claims that Meta deliberately concealed the training dataset for Llama 2 to evade potential litigation regarding the use of copyrighted materials, which the company had already recognized as legally problematic.

These allegations highlight the ongoing debate over the ethical and legal implications of training AI on copyrighted materials. As the case unfolds, its outcome could help shape the future landscape of AI development and copyright protection.