OpenAI recently asserted before a UK parliamentary committee that developing leading AI systems like ChatGPT would be “impossible” without using vast amounts of copyrighted data. The company argued that advanced AI tools require such extensive training that adhering strictly to copyright laws would be unfeasible.
In written testimony, OpenAI emphasized that because copyright now covers “virtually every sort of human expression,” nearly all material useful for training is protected. From news articles to forum comments and digital images, most online content is under copyright and cannot be used freely without permission or a legal exception.
OpenAI claimed that building capable AI systems without copyrighted material would be futile: “Limiting training data to public domain books and drawings created more than a century ago … would not provide AI systems that meet the needs of today’s citizens.”
While maintaining that its practices comply with legal standards, OpenAI acknowledged that partnerships and compensation schemes with publishers might be necessary to “support and empower creators.” However, the company did not indicate plans to significantly restrict its data collection, including from paywalled journalism and literature.
OpenAI’s data practices have drawn multiple lawsuits alleging copyright infringement, including one from The New York Times. The company appears unwilling to fundamentally alter how it collects data and trains its models, given what it calls the “impossible” constraints of strict copyright adherence. Instead, it aims to rely on a broad interpretation of fair use to justify training on vast amounts of copyrighted data.
As advanced AI continues to emulate human expression, legal experts anticipate intense courtroom battles over alleged infringement by systems designed to absorb large volumes of protected text, media, and other creative output. For now, OpenAI is betting against strict copyright enforcement, favoring extensive data use to drive AI development.