
Ex-OpenAI Researcher Exposes Copyright Violations in ChatGPT's Training Data
A former OpenAI researcher has revealed troubling details about the company's data collection practices, particularly alleged copyright violations in the training of ChatGPT.

OpenAI logo on laptop display
Suchir Balaji, who spent four years at OpenAI collecting data for large language models (LLMs), disclosed that the company indiscriminately scraped data from various sources, including:
- Copyrighted books from pirate sites
- Paywalled news content
- User-generated content from platforms like Reddit
In 2022, Balaji concluded that OpenAI's data collection methods violated copyright law and potentially harmed the internet ecosystem. He specifically noted that the training process involves making unauthorized copies of copyrighted data, which may not qualify as "fair use" under current law.
Key concerns raised by Balaji:
- AI systems are threatening the commercial viability of content creators
- The training process likely violates copyright law
- Popular websites like Stack Overflow are experiencing significant traffic drops
- Current "fair use" arguments by AI companies may not hold up legally
Despite OpenAI securing licensing agreements with some newspapers, the company still faces lawsuits from authors who never consented to their works being used for AI training. Balaji ultimately left OpenAI in August 2024, stating that regulation is necessary to address these issues.
