Reconcile your metadata
The goal of reconciliation is to connect your collection-specific vocabulary to a controlled vocabulary on the Web.
For example, does your label instruments refer to musical instruments, measuring instruments, or even aeronautical instruments?
Reconciliation is about giving meaning to field values, making your metadata interpretable by the whole wide world.
Follow along with this screencast and/or the steps below.
The reconciliation screencast is coming soon. Stay tuned!
Get the cleaned Powerhouse Museum metadata
In the previous screencast, we cleaned the Powerhouse Museum metadata.
To start at the same spot, download the cleaned OpenRefine project or create a new project using the cleaned metadata, released under a CC-BY-SA license.
Install the RDF extension
OpenRefine has many reconciliation possibilities.
Here, we are going to reconcile against an RDF data source. Therefore, we need the RDF Extension for OpenRefine. Download and install it.
Example reconciliation steps
Pick a column to reconcile
The Powerhouse Museum collection contains a Categories field. This field is a good candidate for connecting to a controlled vocabulary, because:
- it refers to well-defined external concepts
- it offers an important subdivision of the collection
Pick a vocabulary to reconcile with
There are several controlled vocabularies online. Which one to choose depends on your metadata domain and the digital availability of the vocabulary.
We chose the Library of Congress Subject Headings (LCSH), since it provides an established vocabulary and is made available through a SPARQL endpoint. This means that computer programs can automatically access and browse it.
Tell OpenRefine about the vocabulary
In order to use a certain vocabulary, OpenRefine needs to know about it. This is done by adding a reconciliation service. This step is somewhat technical, but needs to be performed only once per vocabulary and we'll guide you through it.
First, make sure you have installed the RDF extension.
Then, click the RDF button in the top right corner, select Add reconciliation service, Based on SPARQL endpoint. In the dialog, you tell OpenRefine how to access the controlled vocabulary. For LCSH, provide the following parameters:
- Endpoint URL
- Graph URI
- Label properties
- check only
This instructs OpenRefine to create an LCSH reconciliation service that reads the vocabulary identified by
http://id.loc.gov/authorities/subjects from our endpoint. The vocabulary is loaded in Virtuoso (a database type) and uses SKOS for labels.
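To give an idea of what "reading the vocabulary from an endpoint" can look like, here is a minimal Python sketch that builds a SPARQL label lookup. It is an illustration only: the queries the RDF extension actually sends are not shown in this tutorial, and the choice of skos:prefLabel as the label property, as well as the function names, are assumptions. Only the graph URI comes from the text above. The sketch builds the query text and does not contact any endpoint.

```python
# Illustrative sketch only: this is NOT the query the RDF extension sends.
GRAPH_URI = "http://id.loc.gov/authorities/subjects"  # graph named in the text
# Assumption: labels are stored as skos:prefLabel.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_label_query(label: str, limit: int = 5) -> str:
    """Build a query that finds concepts whose label equals `label`,
    ignoring case (hypothetical helper)."""
    escaped = escape_literal(label)
    return (
        "SELECT ?concept ?label\n"
        f"FROM <{GRAPH_URI}>\n"
        "WHERE {\n"
        f"  ?concept <{SKOS_PREF_LABEL}> ?label .\n"
        f'  FILTER (lcase(str(?label)) = lcase("{escaped}"))\n'
        "}\n"
        f"LIMIT {limit}"
    )


query = build_label_query("Botanical specimens")
print(query)
```

An exact, case-insensitive comparison is the simplest possible matching rule. A real reconciliation service typically also scores near matches, which is why OpenRefine can offer several candidates for one category.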
Start the reconciliation process
Click the triangle in front of the Categories column and choose Reconcile, Start reconciling. OpenRefine now presents all available reconciliation services. Select the newly created LCSH service.
OpenRefine will now use some examples of your dataset to see how they map to the LCSH vocabulary. As a result, you'll see that it recognizes the Categories as SKOS concepts. Choose Start Reconciling to let OpenRefine perform the hard work.
Understanding the reconciliation results
After it has finished reconciling, OpenRefine will automatically create facets for its judgement (matched or not) and best candidate's score. A green bar underneath the column name indicates the success percentage.
A matched category assignment is shown as a dark blue link. For example, click the Botanical specimens link to see which vocabulary item it has been reconciled to. Unreconciled categories are black. For some of them, OpenRefine found several alternatives (in light blue) but was unable to select the best one. You could click the correct candidate for each category manually, but this involves a lot of work.
Interpreting the reconciliation results
To interpret the results, you should be familiar with the two OpenRefine modes:
- Rows mode
- Each element is a single category assignment belonging to a collection item. Each assignment is manipulated individually.
Example: Item 7 is assigned category Botanical specimens.
- Records mode
- Each element is a collection item with (possibly) multiple category assignments. They are manipulated as a whole.
Example: Item 7 is assigned categories Botanical specimens and Numismatics.
First, switch to rows mode. You can see there are 167,016 category assignments. Use the judgement facet to select all 20,239 matched rows. You can see that each category is indeed dark blue, indicating a successful match. This means that 20,239 category assignments have been reconciled to the LCSH vocabulary.
With the matched rows still selected, switch to records mode. OpenRefine informs us that there are 19,100 matched records, meaning that this many records have at least one reconciled category assignment. This should make clear the distinction between rows and records.
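If the difference between the two counts is still fuzzy, the following Python sketch reproduces it on a tiny made-up dataset. The item numbers, categories, and match flags are invented for illustration and are not taken from the Powerhouse Museum metadata.

```python
# Made-up sample: one tuple per category assignment (a "row" in OpenRefine).
# Fields: (item id, category, reconciled?)
rows = [
    (7, "Botanical specimens", True),
    (7, "Numismatics", False),
    (8, "Numismatics", False),
    (9, "Botanical specimens", True),
    (9, "Photographs", True),
]

# Rows mode: every category assignment is counted on its own.
matched_rows = [row for row in rows if row[2]]

# Records mode: an item counts once if at least one of its rows matched.
matched_records = {item_id for item_id, _category, matched in rows if matched}
all_records = {item_id for item_id, _category, _matched in rows}

print(len(rows), "rows,", len(matched_rows), "matched")
print(len(all_records), "records,", len(matched_records), "with at least one match")
```

Item 9 contributes two matched rows but only one matched record, which is why the number of matched records can never exceed the number of matched rows.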
Retrying the reconciliation process
At this point, you may feel that a 25% record reconciliation rate is rather low. Indeed, this implies that 75% of records cannot be linked. This is due to the way the RDF extension uses the LCSH vocabulary.
As a workaround, we provide a preprocessed version of the LCSH vocabulary, which does not have this limitation. Let's retry reconciliation with that endpoint.
First, undo the reconciliation by going to the Undo / Redo tab, and clicking on the step before Reconcile cells. Then, go back to the Facet / Filter tab and remove the facets by clicking Remove All. To add the new endpoint, click the RDF button, select Add reconciliation service, Based on SPARQL endpoint. Its parameters are:
- LCSH (preprocessed)
- Endpoint URL
- Graph URI
- Label properties
- check only
Now, restart the reconciliation process for the Categories column with this new endpoint (Reconcile, Start reconciling, LCSH [preprocessed], Start Reconciling).
Interpreting the new reconciliation results
Reconciliation gives far better results now: almost 75% of records have been reconciled. To get an overview of the non-reconciled categories, in rows mode, select the none facet under judgement.
You can get an even better view of them by creating a facet on the Categories column (Facet, Text Facet). If there are too many choices, click Facet by choice counts and select a more appropriate range. You can reconcile some of the largest categories manually if necessary.
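Conceptually, a text facet restricted by choice counts is a frequency table with a cut-off. The Python sketch below shows the idea on a made-up list of unreconciled categories; the values and the threshold are hypothetical, and this is not how OpenRefine implements facets internally.

```python
from collections import Counter

# Made-up sample: the category of each unreconciled row.
unmatched = [
    "Numismatics",
    "Photographs",
    "Numismatics",
    "Glass plate negatives",
    "Numismatics",
    "Photographs",
]

# Text facet: how often does each distinct value occur?
counts = Counter(unmatched)

# Facet by choice counts: keep only values that occur often enough
# to be worth reconciling by hand (threshold chosen arbitrarily).
min_count = 2
frequent = [(category, n) for category, n in counts.most_common() if n >= min_count]

for category, n in frequent:
    print(category, n)
```

Working from the top of this list first gives the largest gain in reconciled rows for the least manual effort.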
Obtaining the reconciled URL
Now that reconciliation has been performed, you probably want to know the reconciled URL.
Remove all facets by clicking the Remove All button, and switch to rows mode. Then, on the Categories column, choose Edit column, Add column based on this column. Name this column Category URL and enter the GREL expression
cell.recon.match.id. It takes the reconciliation information of the cell, looks for the match, and extracts the ID, which is the URL. Then click OK.
You can now use these URLs to link your metadata to the controlled vocabulary.
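To make the GREL expression less of a black box, here is a rough Python equivalent. It assumes a cell is represented as a nested dictionary with a recon entry that holds a match entry; this shape, the helper name, and the example.org URL are assumptions made for illustration, not OpenRefine's documented data model.

```python
from typing import Optional


def match_id(cell: dict) -> Optional[str]:
    """Mimic cell.recon.match.id: return the ID of the matched
    vocabulary item, or None if the cell has no match."""
    recon = cell.get("recon") or {}
    match = recon.get("match") or {}
    return match.get("id")


# Hypothetical cells; the URL is a placeholder, not a real LCSH identifier.
matched_cell = {
    "value": "Botanical specimens",
    "recon": {
        "judgment": "matched",
        "match": {"id": "http://example.org/concept/123", "name": "Botanical specimens"},
    },
}
unmatched_cell = {"value": "Numismatics", "recon": {"judgment": "none", "match": None}}

print(match_id(matched_cell))
print(match_id(unmatched_cell))
```

Cells without a match yield no URL, so the new Category URL column stays empty for unreconciled categories.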
You have successfully completed your first reconciliation.
Time to reconcile your own metadata!
If you want to have a look at the result without following all steps, download the finished OpenRefine project. Use the Undo / Redo history to review each step.