Talking to Creative Commons’ Ryan Merkley about CC Search and Structured Data on Commons

Creative Commons’ Ryan Merkley and Wikimedia Foundation Exec Director Katherine Maher at Mozfest 2017 – Image by Jwslubbock CC BY-SA 4.0

CC Search beta was launched in February. This new tool incorporates ‘list-making features, and simple, one-click attribution to make it easier to credit the source of any image you discover.’ Its developer, Liza Daly, describes it as ‘a front door to the universe of openly licensed content.’

As a small organisation, Creative Commons did not have the resources to start by indexing all of the 1.1 billion openly licensed works that it estimates are available in the Commons. Liza Daly decided to start with a representative sample of about 1% of the known Commons content online, and chose about 10 million images rather than a cross-section of all media types, because the majority of CC content is images.

One issue they encountered was making sure that all the content they included was CC licensed when a provider (like Flickr) hosted a mix of CC licensed and commercially licensed content. They also decided to defer the use of material from Wikimedia Commons, saying that:

‘Wikimedia Commons represents a large and rich corpus of material, but rights information is not currently well-structured. The Wikimedia Foundation recently announced that a $3 million grant from the Sloan Foundation will be applied to work on this problem, but that work has just begun.’

The Wikimedia Foundation understands that the resources available through Wikimedia Commons are not as accessible as they could be, because much of the metadata attached to the files people have uploaded is ad hoc. For example, one common query is ‘Why can’t I search Commons by date?’ The problem here is ‘which date?’ Is it the stated date that the photo was taken (which could be incorrect), or the date that the file was created, which could be different?
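To make the ‘which date?’ problem concrete, here is a minimal Python sketch of what separate, structured date fields might look like. The class and field names are hypothetical and chosen only for illustration; they are not the schema used by Wikimedia Commons or Wikidata.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class MediaDates:
    """Hypothetical structured dates for one file (illustrative names only)."""
    date_taken: Optional[date]   # stated by the uploader; may be wrong or missing
    date_file_created: date      # when the file itself was created


def taken_in_year(files: List[MediaDates], year: int) -> List[MediaDates]:
    """Search by the date the photo was taken, skipping files that lack one."""
    return [
        f for f in files
        if f.date_taken is not None and f.date_taken.year == year
    ]
```

Once the two dates live in separate fields, a search can say which one it means, rather than guessing from free text.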

This is why Structured Data is so important. The $3 million grant that the WMF has received to implement structured data on Commons, similar to the way data is structured on Wikidata, will allow for much better searching and indexing of media files.

CC Search wants to make CC content more discoverable, regardless of where it is hosted online. To do this, they decided to import the metadata from the selected works that they are currently indexing: title, creator name, and any known tags or descriptions. This data will link directly back to the original source so you can view and download the media. It seems that in its current, unstructured state, Wikimedia Commons is not well suited to systematically importing this kind of metadata.
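As a rough illustration of the kind of record described here, the Python sketch below models one indexed work with a title, creator, tags, description and a link back to the source. Everything in it is an assumption made for the example: the `WorkRecord` class, its field names, the `license` field and the `is_indexable` check are hypothetical and do not describe how CC Search is actually built.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkRecord:
    """Hypothetical index entry; not CC Search's real schema."""
    title: str
    creator: str
    source_url: str                  # links back to the original host
    license: str                     # assumed field, e.g. "CC BY-SA 4.0"
    tags: List[str] = field(default_factory=list)
    description: Optional[str] = None


def is_indexable(record: WorkRecord) -> bool:
    """Keep only records that state a CC licence and link back to a source."""
    has_cc_licence = record.license.strip().upper().startswith("CC")
    return has_cc_licence and bool(record.source_url)
```

A check like this only works when the rights information is a clean, predictable field, which is the gap the Structured Data project is meant to close for Commons.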

Creative Commons also appears to be looking at the possibility of using some kind of blockchain-like ledger system to record and track reuse of CC licensed works. However, this remains a longer term goal.

I asked Creative Commons CEO Ryan Merkley some questions about how the project had been progressing since its announcement and how it might work.

WMUK: How much progress has been made on CC Search since the start of 2017? Have you indexed many more than the original 10 million media items?

RM: CC has hired a Director of Product Engineering, Paola Villarreal, to lead the project. We’re staffing up the team, with a Data Engineer starting soon. In addition, we’ll be pushing a series of enhancements, including adding new content, by the end of the year.

WMUK: Will you have to wait until the end of the Structured Data on Commons project to index Wikimedia content? Or does the tool only require basic metadata categories like Title, Creator, Description and Category Tags, meaning it would be possible to start this before the end of the project?

RM: We’re happy to work with the Wikimedia Commons community on the project. In our initial conversations, we mutually decided to wait until some of that work was further along. We want to make sure our work is complementary.

WMUK: Is it still an ultimate ambition to use some kind of blockchain architecture to record reuse? Or is that potentially a goal that would require more resources than will likely be available for the foreseeable future?

RM: Not necessarily. There’s a lot of interesting work going on with the blockchain and distributed ledger projects. What’s most important to us is a complete, updated, and enhanced catalog of works and metadata that is fast and accessible.

WMUK: Can you explain how ledger entries would be created when someone reused a CC licensed work?

RM: The tools to track remix don’t exist right now. It’s something we’re really interested in, and our community wants as well. It will require new tools, and collaboration with platforms and creators.

There are so many incredible applications possible for all the data on Wikimedia Commons, and we hope that after the content is structured properly, it will become a valuable source which can be searched along with other CC content online using Creative Commons’ CC Search tool. Like a lot of the changes we would like to see in the way the Wikimedia products work, this will likely take some time, but we are hopeful that the wait will be worth it.
