From Spreadsheet to Wikidata with QuickStatements

QuickStatements logo – Wikimedia Commons CC BY-SA 4.0

By Charles Matthews, Wikimedian in Residence at ContentMine

With the end of October, Wikidata’s birthday comes round once more, and on the 29th it will be six years old. With the passing of time Wikimedia’s structured data site grows, is supported by an increasingly mature set of key tools, and is applied in new directions.

Fundamental is the SPARQL query tool at query.wikidata.org, an exemplary product of Wikimedia Foundation engineering. But I wanted to talk here about its “partner in crime”, the QuickStatements tool by Magnus Manske, which is less well known and certainly less well documented. QuickStatements, simply put, allows you to make batch edits that add hundreds or thousands of statements to Wikidata.

So QuickStatements is a bot, but importantly you don’t need to be a bot operator to use it. You do need to have an account on Wikidata (which is automatic if you have a Wikipedia account). And you do need to allow QuickStatements to edit through your account. That can be carried out by means of a WiDar login. For that you simply need to go to https://tools.wmflabs.org/widar/ and click the button.

So far, so good. Now we need to look at your “use case”: the data you have that you think should be in Wikidata. How is it held, and how far have you got in translating it into Wikidata terms? Are you envisaging simple Wikidata statements, or are you reckoning on adding qualifiers, or references, or both? One of the issues with the documentation I come across is that “or both” may be the underlying assumption, but it can make it harder to see the wood for the trees.

Charles Matthews and Jimmy Wales at Wikimania 2018 – Wikimedia Commons CC BY-SA 4.0

A further question that is fundamental is whether you are adding statements to existing items, or creating new items with statements. In the first case, without qualifiers or references (though referencing matters greatly, on Wikidata as on Wikipedia), we can say straightforwardly that you’ll need three columns of data. At the very least, understanding this case is the natural place to start.

Let the first column be the items you’ve identified that need to have statements added. Getting this far may indeed be the most important step. If you have a list of people, or of places, they need to be matched correctly to Wikidata Q-numbers. Proper names are very often ambiguous: for example, looking up Springfield shows 41 places of that name. (Springfield is where, fictionally, The Simpsons live; the idea that there is one in every state is an urban myth, it turns out.) Matching into Wikidata is a cottage industry in its own right, around the mix’n’match tool.

Suppose then you have your first column in good shape. You now need properties (basically the predicates in the statements), and either items as objects, to complete the statements, or other values such as strings, depending on the type of property. For example, if what you have is a list of people born in Birmingham, UK, you need a second column for P19, “place of birth”, and a third column of Q2256. For the population of a place you need P1082; for books where you are adding a publication date there is property P577. You always need a second column which is filled with the property code, and then a third column giving the “object” data.
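As a minimal sketch of that three-column layout, assuming the Q-numbers in the first column have already been matched, the Python below builds one tab-separated line per statement. The function name `make_rows` and the sample Q-numbers are made up for illustration; P19 and Q2256 are the codes named above.

```python
def make_rows(items, prop, value):
    """Return one tab-separated line per item: item, property, value."""
    return ["\t".join([item, prop, value]) for item in items]


# Made-up Q-numbers standing in for a real, already-matched first column.
people = ["Q100000001", "Q100000002"]

# Second column: P19 ("place of birth"); third column: Q2256 (Birmingham).
rows = make_rows(people, "P19", "Q2256")
print("\n".join(rows))
```

The same lines could equally be produced in a spreadsheet; the point is only that each statement is one row of three cells.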

So the assumption is that you are now manipulating the data in a spreadsheet. I find filling a column in Google Sheets can be troublesome, because it wants to increment numbers, so I use a dummy word to fill and then apply find-and-replace.

To avoid disappointment, you also really need to read the instructions, some time or other. These explain that string values, including identifiers made up of digits, need to be in “quotes”, and that dates need to have a precision code appended. More spreadsheet skills may therefore be needed, to wrangle the data, but such is modern life.
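To illustrate the kind of wrangling involved, here is a small Python sketch of the two value formats, as I understand the Version 1 conventions: a string goes inside double quotes, and a date is written as a timestamp with a precision code after a slash (11 for a day, 9 for a year). The helper names are hypothetical, and the details should be checked against the instructions before a real run.

```python
from datetime import date


def quote_string(text):
    """Wrap a string value in double quotes."""
    return '"' + text + '"'


def format_day(d):
    """Format a date to day precision; the trailing /11 is the precision code."""
    return "+" + d.isoformat() + "T00:00:00Z/11"


def format_year(year):
    """Format a year-only date; the trailing /9 marks year precision."""
    return "+{:04d}-01-01T00:00:00Z/9".format(year)


print(quote_string("some text"))
print(format_day(date(1967, 1, 17)))
print(format_year(1967))
```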

The payoff comes in being able to paste from the spreadsheet columns into QuickStatements. That introduces the tab characters spoken of in the documentation.

Actually this is not the pro way to use the tool, but it does fine anyway: it is officially the “Version 1 format” of the “old interface” of QuickStatements2. Under the “Import commands” menu select Version 1, and paste into the “Import V1 commands” box. Click the “Import” button for a preview, and then the “Run” button. You should definitely run a small test first.

Charles Matthews and Martin Poulter at WikidataCon2017 – Wikimedia Commons CC BY-SA 4.0

QuickStatements runs quite slowly for a bot, taking about a second over each statement. Since the edits are credited to your account, you can see them happening through the “Contributions” link you have when logged in on Wikidata. A top tip is to use the analytics tool: put the property number in the “Pattern” field and set the approximate times of the run.

There is quite a lot more to learn, obviously. For example, for populations of towns, a qualifier with P585 for “point in time” is the first request anyone would make, and a reference perhaps the second. So more data work, but the same process of creation.
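A hedged sketch of what that extra data work might look like: in the Version 1 format, as I understand it, a qualifier is added as a further property and value pair on the same line, and a reference uses an S prefix in place of P. The Python below assumes S248 (“stated in”) for the reference; the function name and the Q-numbers are made up for illustration.

```python
def population_row(item, population, year, source_item):
    """Build one line: population statement, point-in-time qualifier, reference."""
    return "\t".join([
        item,
        "P1082",                                   # population
        str(population),                           # quantity, written as a plain number
        "P585",                                    # qualifier: point in time
        "+{:04d}-01-01T00:00:00Z/9".format(year),  # year precision
        "S248",                                    # reference: "stated in" (assumed choice)
        source_item,
    ])


# Made-up Q-numbers for a town and for the source being cited.
print(population_row("Q100000003", 12345, 2011, "Q100000004"))
```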

QuickStatements is a workhorse behind numerous other Wikidata tools that create items or add statements to them. In my Wikimedian in Residence work on the ScienceSource project we will use it both on our own wiki to move in text-mining data, and for exporting referenced facts from biomedical articles to Wikidata itself. For more about Wikidata and that project, there is a Wikidata workshop in Cambridge on 20 October.
