Clearing Rights For A ‘Non-Infringing’ Collection Of AI Training Media Is Hard

Mike Masnick

from Techdirt on 2024-05-31 20:41 (#6N6Z5)

In response to a number of copyright lawsuits about AI training datasets, we are starting to see efforts to build 'non-infringing' collections of media for training AI. While I continue to believe that most AI training is covered by fair use in the US and therefore inherently 'non-infringing', I think these efforts to build 'safe' or 'clean' or whatever other word one might use data sets are quite interesting. One reason they are interesting is that they can help illustrate why trying to build such a data set at scale is such a challenge.

That's why I was excited to read about Source.Plus (via a post from Open Future). Source.Plus is a tool from Spawning that purports to aggregate "over 37 million public domain and CC0 images integrated from dozens of libraries and museums." That's a lot less than are used to train current generative models, but still a lot of images that could be used for all sorts of useful things.

However, it didn't take too much poking around on the site to find an illustration of why accurately aggregating nominally openly licensed images at scale can be such a challenge.

The site has plenty of OpenGLAM images that are clearly old enough to be in the public domain. It also has a number of newer images (like photographs) that are said to be licensed under CC0. Curious, I clicked on the first photograph I found on the Source.Plus home page:

According to the image page on Source.Plus, the image was from Wikimedia Commons and licensed under a CC0 public domain dedication. It listed the creator as Pixabay and the uploader (to Wikimedia) as Philipslearning.

Clicking through to the Wikimedia page reveals that the original source for the image was Pixabay, and that it was uploaded on March 9, 2023 by Philipslearning (an account that appears to no longer exist, for whatever that is worth). The file metadata says that the image itself was taken on May 18, 2016.

Clicking through to the Pixabay page for the image reveals that the image is available under the Pixabay Content License. That license is fairly permissive, but does state:

You cannot sell or distribute Content (either in digital or physical form) on a Standalone basis. Standalone means where no creative effort has been applied to the Content and it remains in substantially the same form as it exists on our website.
If Content contains any recognisable trademarks, logos or brands, you cannot use that Content for commercial purposes in relation to goods and services. In particular, you cannot print that Content on merchandise or other physical products for sale.
You cannot use Content in any immoral or illegal way, especially Content which features recognisable people.
You cannot use Content in a misleading or deceptive way.
You cannot use any of the Content as part of a trade-mark, design-mark, trade-name, business name or service mark.

Which is to say, not CC0.

However, further investigation into the Pixabay Wikipedia page suggests that images uploaded to Pixabay before January 9, 2019 are actually released under CC0. Section 4 of the Pixabay terms confirms that. The additional information on the image's Pixabay page confirms that it was uploaded on May 17, 2016 (which matches the metadata added by the unknown Philipslearning on the image's Wikimedia page).

All of which means that this image is, in all likelihood, available under a CC0 public domain dedication. Which is great! Everything was right!
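
The date-based reasoning above can be sketched as a small check. This is only an illustration: the function names are hypothetical, the cutoff date is the one cited above, and the sketch assumes the Pixabay upload date is already known and accurate. It says nothing about whether the original uploader actually had the right to release the image.

```python
from datetime import date

# Cutoff cited in the article: Pixabay uploads before this date are described as CC0.
PIXABAY_CC0_CUTOFF = date(2019, 1, 9)


def likely_pixabay_license(uploaded_to_pixabay: date) -> str:
    """Return the license that probably applies to a Pixabay image (hypothetical helper)."""
    if uploaded_to_pixabay < PIXABAY_CC0_CUTOFF:
        return "CC0"
    return "Pixabay Content License"


def claim_is_consistent(claimed_license: str, uploaded_to_pixabay: date) -> bool:
    """Check an aggregator's claimed license against the date-based rule."""
    return claimed_license == likely_pixabay_license(uploaded_to_pixabay)


# The image from the article: uploaded to Pixabay on May 17, 2016 and listed as CC0.
print(claim_is_consistent("CC0", date(2016, 5, 17)))  # prints True
```

Even this toy version depends on a chain of metadata (the claimed license, the upload date, the identity of the source) being correct at every hop.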

At the same time, the accuracy of that status feels a bit fragile. That level of fragility is workable in the context of Wikipedia, or if you are looking for a handful of openly licensed images. Is it likely to hold up at training set scale across tens of millions of images? Maybe? What does it mean to be 'good enough' in this case? If trainers do require permission from rightsholders to train, and one relied on Source.Plus/Wikimedia for the CC0 status of a work, and that status turned out to be incorrect, should the fact that they thought they were using a CC0 image be relevant to their liability?

Michael Weinberg is the Executive Director of NYU's Engelberg Center for Innovation Law and Policy. This post is republished from his blog under its CC BY-SA 4.0 license. Hero Image: Interieur van de Bodleian Library te Oxford
