Cleaned Dataset Released by LAION Sparks Controversy

LAION, a prominent German research organization, recently announced the release of a new dataset named Re-LAION-5B. This dataset is a revamped version of the older LAION-5B, which has undergone extensive "fixes" following recommendations from organizations including the Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection, and the Stanford Internet Observatory. The primary aim of the cleanup was to eliminate links to suspected child sexual abuse material (CSAM) from the dataset.

The Re-LAION-5B dataset is available in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe. The Research-Safe version goes a step further, removing additional NSFW content on top of the scrubbed CSAM links. LAION emphasized its commitment to removing any illegal content from its datasets as soon as it is identified, in line with its stated principles on data integrity and ethics.

It is noteworthy that LAION's datasets do not actually contain images themselves; rather, they consist of indexes pointing to images along with their accompanying alt text. These links are curated from Common Crawl, a large public archive of scraped web pages. LAION's datasets therefore serve as repositories of image links rather than actual image files.
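To make that structure concrete, here is a minimal sketch of what a record in a LAION-style index looks like. The field names (`url`, `text`) and the sample rows are hypothetical illustrations, not the dataset's actual schema; the point is simply that each entry pairs an image URL with alt text, and the image itself would only ever be fetched on demand.

```python
# Hypothetical record shape for a LAION-style index: each row holds an
# image URL plus its alt text -- the image bytes are NOT stored.
records = [
    {"url": "https://example.com/cat.jpg", "text": "a cat sitting on a sofa"},
    {"url": "https://example.com/dog.png", "text": "a dog playing fetch"},
]

def resolve(record):
    """Return the link and caption; downloading the image from the URL
    would be a separate, on-demand step performed by the dataset user."""
    return record["url"], record["text"]

for rec in records:
    url, caption = resolve(rec)
    print(f"{caption} -> {url}")
```

This is why "removing content" from such a dataset means deleting rows from the index, not deleting image files.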

Controversy and Actions Taken

The release of Re-LAION-5B follows a prior controversy surrounding the LAION-5B dataset. An investigation conducted by the Stanford Internet Observatory in December 2023 found that the dataset included numerous links to illegal images sourced from social media and adult websites. The report also highlighted the presence of other inappropriate content, such as pornographic imagery and offensive language, within the dataset.

In response to these findings, LAION temporarily withdrew the LAION-5B dataset and initiated corrective measures to address the problematic content. The Stanford report recommended discontinuing the use and distribution of models trained on LAION-5B to prevent further generation of harmful output. Around the same time, AI startup Runway took down its Stable Diffusion 1.5 model, which was trained on LAION data, from the hosting platform Hugging Face.

The Re-LAION-5B dataset, containing approximately 5.5 billion text-image pairs, was released under an Apache 2.0 license. LAION specified that third parties could utilize the metadata from this dataset to scrub existing copies of LAION-5B, thus aiding in the removal of illicit content matching the identified patterns. The organization reiterated that its datasets are intended for research purposes and should not be exploited for commercial gain.
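LAION has not published the exact matching mechanism in this article, but the idea of using released metadata to scrub an existing local copy can be sketched as a simple blocklist filter. Everything here is an assumption for illustration: the `scrub` function name, the use of SHA-256 over the raw URL, and the sample rows are all hypothetical, standing in for whatever identifiers the real metadata provides.

```python
import hashlib

def scrub(rows, blocked_hashes):
    """Keep only rows whose URL hash is absent from the removal list.
    (Hashing scheme is an assumption for this sketch, not LAION's spec.)"""
    kept = []
    for row in rows:
        digest = hashlib.sha256(row["url"].encode("utf-8")).hexdigest()
        if digest not in blocked_hashes:
            kept.append(row)
    return kept

# A toy local copy of the index, plus a removal list of hashed links.
rows = [
    {"url": "https://example.com/ok.jpg", "text": "a landscape"},
    {"url": "https://example.com/bad.jpg", "text": "flagged item"},
]
blocked = {hashlib.sha256(b"https://example.com/bad.jpg").hexdigest()}

clean = scrub(rows, blocked)
print(len(clean))  # 1 -- the flagged link is dropped, the rest survive
```

Distributing hashes rather than the offending links themselves lets third parties remove matching rows from their copies without the removal list itself ever exposing the flagged URLs in plain text.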

Despite these guidelines, past instances have revealed that some entities, including tech giants like Google, have utilized LAION datasets for training image-generating AI models. LAION reported that over 2,200 links to suspected CSAM were eliminated after cross-referencing with partner-provided lists of illicit content, underscoring the ongoing efforts to uphold data integrity standards.

The release of the Re-LAION-5B dataset signifies a proactive step towards cleaner data and greater accountability in AI research. By correcting past lapses and implementing safeguards, LAION aims to uphold ethical standards and promote responsible data usage within the research community.
