Researchers found child abuse material in the largest AI image generation dataset

Researchers from the Stanford Internet Observatory say that a dataset used to train AI image generation tools contains at least 1,008 validated instances of child sexual abuse material. The Stanford researchers note that the presence of CSAM in the dataset could allow AI models that were trained on the data to generate new and even realistic instances of CSAM.

LAION, the non-profit that created the dataset, told 404 Media that it "has a zero tolerance policy for illegal content and in an abundance of caution, we are temporarily taking down the LAION datasets to ensure they are safe before republishing them." The organization added that, before publishing its datasets in the first place, it created filters to detect and remove illegal content from them. However, 404 Media points out that LAION leaders have been aware since at least 2021 that there was a possibility of their systems picking up CSAM as they vacuumed up billions of images from the internet.

According to previous reports, the LAION-5B dataset in question contains "millions of images of pornography, violence, child nudity, racist memes, hate symbols, copyrighted art and works scraped from private company websites." Overall, it includes more than 5 billion images and associated descriptive captions. LAION founder Christoph Schuhmann said earlier this year that while he was not aware of any CSAM in the dataset, he hadn't examined the data in great depth.

It's illegal for most institutions in the US to view CSAM for verification purposes. As such, the Stanford researchers used several techniques to look for potential CSAM. According to their paper, they employed "perceptual hash-based detection, cryptographic hash-based detection, and nearest-neighbors analysis leveraging the image embeddings in the dataset itself." They found 3,226 entries that contained suspected CSAM. Many of those images were confirmed as CSAM by third parties such as PhotoDNA and the Canadian Centre for Child Protection.

Stability AI founder Emad Mostaque trained Stable Diffusion using a subset of LAION-5B data. Google's Imagen text-to-image model was trained on a subset of LAION-5B as well as internal datasets. A Stability AI spokesperson told Bloomberg that it prohibits the use of its text-to-image systems for illegal purposes, such as creating or editing CSAM. "This report focuses on the LAION-5B dataset as a whole," the spokesperson said. "Stability AI models were trained on a filtered subset of that dataset. In addition, we fine-tuned these models to mitigate residual behaviors."

Stable Diffusion 2 (a more recent version of Stability AI's image generation tool) was trained on data that substantially filtered out 'unsafe' materials from the dataset. That, Bloomberg notes, makes it more difficult for users to generate explicit images. However, it's claimed that Stable Diffusion 1.5, which is still available on the internet, does not have the same protections. "Models based on Stable Diffusion 1.5 that have not had safety measures applied to them should be deprecated and distribution ceased where feasible," the Stanford paper's authors wrote.

This article originally appeared on Engadget at https://www.engadget.com/researchers-found-child-abuse-material-in-the-largest-ai-image-generation-dataset-154006002.html?src=rss
