Ex-OpenAI Researcher Exposes Copyright Violations in ChatGPT's Training Data

By Marcus Bennett

November 16, 2024 at 09:43 PM

A former OpenAI researcher has described the company's data collection practices in detail, alleging that they violated copyright law during ChatGPT's training process.

OpenAI logo on laptop display

Suchir Balaji, who spent four years at OpenAI collecting data for large language models (LLMs), disclosed that the company indiscriminately scraped data from various sources, including:

  • Copyrighted books from pirate sites
  • Paywalled news content
  • User-generated content from platforms like Reddit

In 2022, Balaji concluded that OpenAI's data collection methods violated copyright law and potentially harmed the internet ecosystem. He specifically noted that the training process involves making unauthorized copies of copyrighted data, which may not qualify as "fair use" under current law.

Key concerns raised by Balaji:

  • AI systems are threatening the commercial viability of content creators
  • The training process likely violates copyright law
  • Popular websites like Stack Overflow are experiencing significant traffic drops
  • Current "fair use" arguments by AI companies may not hold up legally

Although OpenAI has secured licensing agreements with some newspapers, the company still faces lawsuits from authors who never consented to their works being used for AI training. Balaji left OpenAI in August 2024, stating that regulation is necessary to address these issues.
