We've mentioned the prominent role that OpenRefine has taken in our legacy metadata cleanup on the blog before, but have not yet gone into detail about how we've integrated it into our legacy EAD cleanup project. Before we get into the specifics of using OpenRefine, we'll first need to get our dates into a format that is easy to work with in the tool.
OpenRefine supports TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents. Although EADs are XML documents, we did not want to have to normalize dates in each XML document individually in OpenRefine. Instead, we opted to output all of our non-normalized dates to a single CSV using Python, so that we could clean up our dates in OpenRefine in spreadsheet format.
The following Python script is what we've been using to output the non-normalized dates to a CSV:
This script outputs the following information for each non-normalized date to each row in the CSV: the filename of the EAD, the XPath (the unique position in the XML) of the
Once the dates are in a CSV, a project can be created in OpenRefine, and the date normalization can begin.
And on and on and on. There is no way for us to automate the normalization of all of these various forms of dates, but with OpenRefine we can at least break them down into manageable and uniform chunks.
For example, let's say we want to normalize all dates like "May 1909-July 1968," or of the form "Month YYYY-Month YYYY." First we click the triangle next to the column name 'expression', and select "Text filter" from the drop down menu. This gives us a text box to enter either exact text or a regular expression to match on. Almost all of the filtering we do is with regular expressions; it's worth becoming familiar with them. The form of date we're after (at least for these purposes) can be expressed in a regular expression as follows:
Ultimately, what we'll want to end up with for this particular form of date is a normalized version of the form YYYY-MM/YYYY-MM (our example from earlier, for instance, should end up as 1909-10/1968-07). To make this a little more manageable, we can use OpenRefine to split our existing column into four separate columns (begin month, begin year, end month, end year), transform each of those values into the desired format, and then rejoin the columns in the proper order and with the correct punctuation. To split the column into four columns, do the following:
- Click the triangle next to the column name and select "Split into several columns..." from the "Edit column" drop down.
- In the pop up box, enter "-" (a single dash, no quotes) as the separator and uncheck the boxes for "Guess cell type" and "Remove this column." The column should now be split into the first set of Month YYYY and the end Month YYYY (keeping up with our example, we should now have one column for May 1909 and another for July 1968).
- Split each of these new columns into several columns, this time entering " " (a single space, no quotes) as the separator and leaving the "Remove this column" box checked.
- Rename the resulting columns something short and easily identifiable. I've renamed mine "bm," "by," "em," and "ey" (for begin month, begin year, end month, and end year respectively).
Our spreadsheet now looks like this:
The next step is to make sure that some of the assumptions we've made about our data so far are correct, in particular the assumption that all of the strings of letters we captured with our regular expression signify months. To find this out, again click the triangle next to the column names for months (in my case, 'bm' and 'em') and select "Text facet" from the "Facet" drop down. A list of all values and the amount of times they appear in that column will now display on the left of the project. In this example, the text facet has revealed that our column containing months has 19 values (i.e., 7 more than we would assume), mostly due to abbreviations. Each facet can be clicked on or edited from within the facet pane, making it very simple to change all instances of, for example, "Aug" to "August" at once.
Once the column values contain only the 12 correct and complete spellings of months that we're after, the next step is to transform all of the alphabetical representations of months into their respective numeric representations. To do this, select "Transform" from the "Edit cells" drop down within the columns menu and enter the following replacement formula:
This is a change I make quite frequently, so I have this transformation formula saved for easy reference. Here it is as plain text for easy copying and pasting:
value.replace('January', '01').replace('February','02').replace('March', '03').replace('April', '04').replace('May', '05').replace('June', '06').replace('July','07').replace('August', '08').replace('September','09').replace('October', '10').replace('November','11').replace('December', '12')
Once we do that with our begin and end month columns, we're ready to join our months and years back together as the final normalized version. Open the drop down menu for any column and select "Add column based on this column..." from the "Edit column" drop down. Enter the name for the new column (I'll call mine "normal"), and enter the following in the "Expression" field: