Clean up your metadata
No matter how much effort you have put into creating it, tiny errors always creep into a large metadata set. Cleaning is the process that tries to find and correct those errors in a semi-automated way.
The easiest way to get started, is to follow through an example. We use the collection metadata provided by the Sydney Powerhouse Museum and walk you through Google Refine, right from the start.
Follow along with this screencast and/or the steps below.
Get the Powerhouse Museum metadata
You can download the collection metadata compressed or uncompressed. Alternatively, you can download it directly from the Powerhouse Museum. This metadata is released under a CC-BY-SA license.
Install and start Google Refine
Download Google Refine and follow the instructions. Refine then opens in your browser.
Example cleaning steps
Import the collection into Google Refine
On the Refine start page, create a new project using the downloaded data file. Do not forget to tick the Ignore quotation marks checkbox, since the quotes inside the file do not have any meaning to Refine. If all goes well, you see 75.814 rows.
Alternatively, you can download the initial Refine project directly.
Get to know your data
The first thing to do is to look around and get to know your data. You can inspect the different data values by displaying them in facets. Click the triangle in front of the column name, select Facet, and create a facet of your choice.
Remove blank rows
One thing you notice when creating a numeric facet for the Record ID column, is that three rows are empty. You can find them by unticking the Numeric checkbox, leaving only Non-numeric values. To remove these rows, click the triangle in front of the All column, select Edit rows, and then Remove all matching rows. Close the numeric facet to see the remaining 75.811 rows.
Remove duplicate rows
To eliminate duplicate rows, we have to bring them together. This is most easily done by sorting them on a unique value, such as the Record ID. To do this, click the triangle of the Record ID column and sort the cell values as numbers. In Refine, sorting is only a visual aid, unless you make the order permanent. To do this, click the Sort menu at the top and choose Reorder rows permanently.
Identical rows are now adjacent to each other. Next, blank the Record ID of rows that have the same Record ID as the row above them, marking duplicates. Click on the Record ID triangle, choose Edit cells, Blank down. The status message tells you that 84 columns were affected. Eliminate those rows by creating a facet on blank cells (Facet, Customized facets, Facet by blank), selecting the blank rows by clicking on true, and removing them using the All triangle (Edit rows, Remove all matching rows). Upon closing the facet, you see 75.727 unique rows.
Split cells with multiple values
In some datasets, several cells can contain multiple values. This is the case with the Categories field. To have a better view, click the Categories triangle and select View, Collapse all other columns. You can see that categories are separated by a pipe character
|. Choose Edit cells, Split multi-valued cells, entering
| as the value separator. Refine informs you that you now have 170.167 rows.
It is important to fully understand the rows/records paradigm. Make the Record ID column visible to see what is going on. In the rows view, each row represents a couple of a Record ID and a single Category, enabling manipulation of each one individually. The records view has an entry for each Record ID, which can have different Categories, but each record is manipulated as a whole. Concretely, now there are 170.167 category assignments, spread over 75.736 collection items.
Remove blank cells
As it turns out, some of the Categories are empty. Since we are not interested in these values, you can select them by creating a facet for blanks (Facet, Customized facets, Facet by blank, set filter to true) and remove them (All triangle, Edit rows, Remove all matching rows). Make sure you are in rows mode: in records mode, you would remove all records with at least one blank category. Instead, we want to remove only rows with empty categories, leaving 169.706 rows or 75.275 records.
An amazing feature of Refine is its ability to automatically cluster similar values. This helps to correct typos and minor spelling differences. Click the Categories triangle and choose Edit cells, Cluster and edit. Using standard settings (key collision with fingerprint), we find 27 clusters, which concern casing differences. Choose Select All and Merge Selected & Re-Cluster to merge them.
Clustering again by key collision using an ngram-fingerprint of size 2 remedies spacing differences. After a quick manual review, we decided not to cluster Shirts and T-shirts, but you can merge everything else. Other methods, such as nearest neighbor using Levenshtein distance, do not bring helpful suggestions, except for singular/plural variations. You can choose to accept those manually if you like.
Remove double category values
You may notice that some of the records contain the same Category value twice. However, removing duplicates here is not as simple as before, since the values should only be unique per record instead of globally. We found a way to achieve the desired result, but it uses some of the more advanced Refine functions.
The idea is to bring the values together, keep the unique ones, and separate them again. So first, we join the multi-valued cells using Edit cells, Join multi-valued cells, OK. The category field is now a single connected piece of text. Then, a custom expression will remove duplicates in this text. Click Edit cells, Transform and enter this expression:
value.split(", ").uniques().join(", "), confirming with OK. This code is written in the Google Refine Expression Language (GREL). It inspects the comma-separated parts of the text individually, keeps only the unique ones, and joins them back together in the cell. Afterwards, we split the multi-valued Categories field again by choosing Edit cells, Split multi-valued cells, OK.
This leaves you with 75.275 records and 167.016 rows. These numbers my vary slightly, depending on your choices in the clustering step.
You have successfully cleaned your first dataset. Time to free your own metadata!
If you want to have a look at the result without following all steps, download the finished Google Refine project. Use the Undo / Redo history to review each step.
Take your metadata to the next level
Now, you're ready to reconcile this metadata and connect it to the Linked Data Cloud. These simple steps show you how.