
Ex-OpenAI Researcher Exposes Copyright Violations in ChatGPT's Training Data
A former OpenAI researcher has revealed troubling details about the company's data collection practices, particularly alleged copyright violations in the training of ChatGPT.

OpenAI logo on laptop display
Suchir Balaji, who spent four years at OpenAI collecting data for large language models (LLMs), disclosed that the company indiscriminately scraped data from various sources, including:
- Copyrighted books from pirate sites
- Paywalled news content
- User-generated content from platforms like Reddit
In 2022, Balaji concluded that OpenAI's data collection methods violated copyright law and potentially harmed the internet ecosystem. He specifically noted that the training process involves making unauthorized copies of copyrighted data, which may not qualify as "fair use" under current law.
Key concerns raised by Balaji:
- AI systems are threatening the commercial viability of content creators
- The training process likely violates copyright law
- Popular websites like Stack Overflow are experiencing significant traffic drops
- Current "fair use" arguments by AI companies may not hold up legally
Despite OpenAI securing licensing agreements with some newspapers, the company still faces lawsuits from authors who never consented to their works being used for AI training. Balaji ultimately left OpenAI in August 2024, stating that regulation is necessary to address these issues.
