AI firms have scraped thousands of YouTube videos

The Future. Investigations by Wired and Proof News have found that firms like Nvidia, Anthropic, Apple, and Salesforce have trained their systems on copyrighted YouTube videos. If YouTube decides to take any of these companies to court, it could radically alter the development of AI systems and put power back in the hands of creators.

Swipe software
AI companies have been busy scraping content off the internet on the pretext that it is “publicly available.”

  • Wired and Proof News discovered that top AI firms have trained their systems on a dataset that includes the plain-text subtitles of 173,536 YouTube videos from 48,000 channels in various languages.
  • Those channels include Khan Academy, MIT, Harvard, WSJ, NPR, The Late Show with Stephen Colbert, Jimmy Kimmel Live!, MrBeast, and PewDiePie.
  • The dataset, called “YouTube Subtitles,” was collated by EleutherAI and included in a release called “The Pile” — a dataset that Big Tech has admitted to using for AI training.
  • Additionally, OpenAI has been vague on whether its upcoming video generator, Sora, was trained on video scraped from YouTube.

Of course, all of this is against YouTube’s terms of service and likely against copyright law (several lawsuits are making their way through the courts).

But it looks like AI firms now consider your content fair game if it is anywhere online.

David Vendrell

Born and raised a stone’s-throw away from the Everglades, David left the Florida swamp for the California desert. Over-caffeinated, he stares at his computer too long either writing the TFP newsletter or screenplays. He is repped by Anonymous Content.
