Investigation Shows Tech Companies Trained AI on YouTube Transcripts

Artificial intelligence isn’t magical – it’s in the name: “artificial.” The content it learns from has to originate somewhere. An investigation showed that some of the big names in tech, including Apple, trained their AI technology on transcripts from YouTube videos – all without permission.

Investigation Shows YouTube Transcripts Used

Proof News conducted an investigation that included a search tool for finding YouTube videos in the training dataset. The investigation determined that subtitles from nearly 175,000 YouTube videos, drawn from more than 48,000 channels, were used by tech companies.

The videos that were used included late-night TV episodes from The Late Show with Stephen Colbert and Jimmy Kimmel Live. Also showing up in the investigation were videos by MrBeast, PewDiePie, and Marques Brownlee.

The dataset came from “the Pile,” a collection described in 2020 as a mix of 22 datasets from EleutherAI, a nonprofit.

A Google spokesperson said in an email to CNET that the company stands by its previous statements, pointing to a comment from April. YouTube CEO Neal Mohan said at that time that he didn’t know whether OpenAI had used YouTube videos for AI training, but that if it had, it would be a violation of YouTube’s terms of service.

Where Else Does the AI Content Come From?

Nearly every tech company has announced recently that it is developing or has developed an AI system. As stated initially, we know it’s not magical and that the content comes from somewhere. It just wasn’t expected that the content was coming from YouTube transcripts.

OpenAI, the creator of ChatGPT, has mentioned previously that it was getting more difficult to find datasets to train AI, and that this led it to make deals with Reddit and News Corp. for their content. Google has said it has an agreement with content creators that allows it to use YouTube content in its AI training. AI Overview was recently added to Google Search. Learn how to turn AI Overview off if it isn’t your cup of tea.

Yet an Anthropic spokesperson acknowledged to Proof News that the company used the Pile to train Claude, its AI assistant. The spokesperson also acknowledged that there are some YouTube subtitles in the Pile.

Whether you use Claude, ChatGPT, or another AI technology, it was trained on a dataset. The question is whether it was trained on content from willing providers, like Reddit, or whether the search for training data expanded to content that was used without the creators’ knowledge. It’s definitely something to consider the next time you use an AI chatbot.

Laura Tucker
